In complement to
https://github.com/neondatabase/tokio-epoll-uring/pull/56.
## Problem
We want to make tokio-epoll-uring slots waiters queue depth observable
via Prometheus.
## Summary of changes
- Add `pageserver_tokio_epoll_uring_slots_submission_queue_depth`
metrics as a `Histogram`.
- Each thread-local tokio-epoll-uring system is given a `LocalHistogram`
to observe the metrics.
- Keep a list of `Arc<ThreadLocalMetrics>` used on-demand to flush data
to the shared histogram.
- Extend `Collector::collect` to report
`pageserver_tokio_epoll_uring_slots_submission_queue_depth`.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Manual testing of the changes in #7160 revealed that, if the
thread-local destructor ever runs (it apparently doesn't in our test
suite runs, otherwise #7160 would not have auto-merged), we can
encounter an `abort()` due to a double-panic in the tracing code.
This github comment here contains the stack trace:
https://github.com/neondatabase/neon/pull/7160#issuecomment-2003778176
This PR reverts #7160 and uses a atomic counter to identify the
thread-local in log messages, instead of the memory address of the
thread local, which may be re-used.
The PR #7141 added log message
```
ThreadLocalState is being dropped and id might be re-used in the future
```
which was supposed to be emitted when the thread-local is destroyed.
Instead, it was emitted on _each_ call to `thread_local_system()`,
ie.., on each tokio-epoll-uring operation.
Testing
-------
Reproduced the issue locally and verified that this PR fixes the issue.
refs https://github.com/neondatabase/neon/issues/7136
Problem
-------
Before this PR, we were using
`tokio_epoll_uring::thread_local_system()`,
which panics on tokio_epoll_uring::System::launch() failure
As we've learned in [the
past](https://github.com/neondatabase/neon/issues/6373#issuecomment-1905814391),
some older Linux kernels account io_uring instances as locked memory.
And while we've raised the limit in prod considerably, we did hit it
once on 2024-03-11 16:30 UTC.
That was after we enabled tokio-epoll-uring fleet-wide, but before
we had shipped release-5090 (c6ed86d3d0)
which did away with the last mass-creation of tokio-epoll-uring
instances as per
commit 3da410c8fe
Author: Christian Schwarz <christian@neon.tech>
Date: Tue Mar 5 10:03:54 2024 +0100
tokio-epoll-uring: use it on the layer-creating code paths (#6378)
Nonetheless, it highlighted that panicking in this situation is probably
not ideal, as it can leave the pageserver process in a semi-broken
state.
Further, due to low sampling rate of Prometheus metrics, we don't know
much about the circumstances of this failure instance.
Solution
--------
This PR implements a custom thread_local_system() that is
pageserver-aware
and will do the following on failure:
- dump relevant stats to `tracing!`, hopefully they will be useful to
understand the circumstances better
- if it's the locked memory failure (or any other ENOMEM): abort() the
process
- if it's ENOMEM, retry with exponential back-off, capped at 3s.
- add metric counters so we can create an alert
This makes sense in the production environment where we know that
_usually_, there's ample locked memory allowance available, and we know
the failure rate is rare.