rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-19 14:10:37 +00:00

Author	SHA1	Message	Date
Arpad Müller	920040e402	Update storage components to edition 2024 (#10919) Updates storage components to edition 2024. We like to stay on the latest edition if possible. There are no functional changes, however some code changes had to be done to accommodate the edition's breaking changes. The PR has two commits: * the first commit updates storage crates to edition 2024 and appeases `cargo clippy` by changing code. I accidentally ran the formatter on some files that had other edits. * the second commit performs a `cargo fmt` I would recommend a closer review of the first commit and a less close review of the second one (as it just runs `cargo fmt`). part of https://github.com/neondatabase/neon/issues/10918	2025-02-25 23:51:37 +00:00
Yuchen Liang	85b954f449	pageserver: add tokio-epoll-uring slots waiters queue depth metrics (#9482) In complement to https://github.com/neondatabase/tokio-epoll-uring/pull/56. ## Problem We want to make tokio-epoll-uring slots waiters queue depth observable via Prometheus. ## Summary of changes - Add the `pageserver_tokio_epoll_uring_slots_submission_queue_depth` metric as a `Histogram`. - Each thread-local tokio-epoll-uring system is given a `LocalHistogram` to observe the metric. - Keep a list of `Arc<ThreadLocalMetrics>` used on-demand to flush data to the shared histogram. - Extend `Collector::collect` to report `pageserver_tokio_epoll_uring_slots_submission_queue_depth`. Signed-off-by: Yuchen Liang <yuchen@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-10-25 21:30:57 +01:00
Christian Schwarz	2bc2fd9cfd	fixup(#7160 / tokio_epoll_uring_ext): double-panic caused by info! in thread-local's drop() (#7164) Manual testing of the changes in #7160 revealed that, if the thread-local destructor ever runs (it apparently doesn't in our test suite runs, otherwise #7160 would not have auto-merged), we can encounter an `abort()` due to a double-panic in the tracing code. This GitHub comment contains the stack trace: https://github.com/neondatabase/neon/pull/7160#issuecomment-2003778176 This PR reverts #7160 and uses an atomic counter to identify the thread-local in log messages, instead of the memory address of the thread-local, which may be re-used.	2024-03-18 16:12:01 +01:00
Christian Schwarz	db749914d8	fixup(#7141 / tokio_epoll_uring_ext): high frequency log message (#7160) The PR #7141 added the log message `ThreadLocalState is being dropped and id might be re-used in the future` which was supposed to be emitted when the thread-local is destroyed. Instead, it was emitted on _each_ call to `thread_local_system()`, i.e., on each tokio-epoll-uring operation. Testing ------- Reproduced the issue locally and verified that this PR fixes the issue.	2024-03-18 12:29:20 +00:00
Christian Schwarz	0694ee9531	tokio-epoll-uring: retry on launch failures due to locked memory (#7141) refs https://github.com/neondatabase/neon/issues/7136 Problem ------- Before this PR, we were using `tokio_epoll_uring::thread_local_system()`, which panics on `tokio_epoll_uring::System::launch()` failure. As we've learned in [the past](https://github.com/neondatabase/neon/issues/6373#issuecomment-1905814391), some older Linux kernels account io_uring instances as locked memory. And while we've raised the limit in prod considerably, we did hit it once on 2024-03-11 16:30 UTC. That was after we enabled tokio-epoll-uring fleet-wide, but before we had shipped release-5090 (`c6ed86d3d0`) which did away with the last mass-creation of tokio-epoll-uring instances as per commit `3da410c8fe` Author: Christian Schwarz <christian@neon.tech> Date: Tue Mar 5 10:03:54 2024 +0100 tokio-epoll-uring: use it on the layer-creating code paths (#6378) Nonetheless, it highlighted that panicking in this situation is probably not ideal, as it can leave the pageserver process in a semi-broken state. Further, due to the low sampling rate of Prometheus metrics, we don't know much about the circumstances of this failure instance. Solution -------- This PR implements a custom thread_local_system() that is pageserver-aware and will do the following on failure: - dump relevant stats to `tracing!`, hopefully they will be useful to understand the circumstances better - if it's the locked memory failure (or any other ENOMEM): retry with exponential back-off, capped at 3s. - on any other failure: abort() the process - add metric counters so we can create an alert This makes sense in the production environment where we know that _usually_, there's ample locked memory allowance available, and we know such failures are rare.	2024-03-15 19:46:15 +00:00
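
The retry policy described in commit 0694ee9531 (retry on ENOMEM with exponential back-off, capped at 3s) might look roughly like the sketch below. It is not the pageserver's code: `LaunchError`, `backoff_delay`, `launch_with_retry`, and the 100 ms starting delay are all hypothetical names and values chosen for illustration. Only the 3 s cap comes from the commit message. Non-ENOMEM errors are simply returned to the caller here so the sketch stays testable.

```rust
use std::time::Duration;

/// Hypothetical error type standing in for a launch failure.
#[derive(Debug, PartialEq)]
enum LaunchError {
    /// ENOMEM, e.g. the locked-memory limit was hit.
    OutOfMemory,
    /// Any other failure.
    Other,
}

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `cap`.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}

/// Calls `launch` until it succeeds, sleeping between ENOMEM failures.
/// `sleep` is injected so the policy can be tested without real waiting.
fn launch_with_retry<T>(
    mut launch: impl FnMut() -> Result<T, LaunchError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, LaunchError> {
    // Assumed starting delay; the commit message only specifies the cap.
    let base = Duration::from_millis(100);
    let cap = Duration::from_secs(3);
    let mut attempt: u32 = 0;
    loop {
        match launch() {
            Ok(system) => return Ok(system),
            Err(LaunchError::OutOfMemory) => {
                sleep(backoff_delay(attempt, base, cap));
                attempt = attempt.saturating_add(1);
            }
            // Not retryable in this sketch: hand the error back to the caller.
            Err(other) => return Err(other),
        }
    }
}
```

In real use, `sleep` would be something like `std::thread::sleep`, and `launch` would wrap the actual system launch call.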

5 Commits
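
Commit 2bc2fd9cfd identifies each thread-local by an atomic counter rather than by its memory address, since an address may be re-used after the thread-local is dropped. A minimal sketch of that idea follows, using only the standard library. `ThreadLocalState` is the name that appears in the log message quoted in #7160, but the field `id`, the static `NEXT_ID`, and the function `current_thread_state_id` are assumptions made for this example, not the pageserver's actual definitions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counter; every new thread-local state takes the next value.
/// (Hypothetical name.)
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// Per-thread state tagged with an id that is never handed out twice,
/// unlike a memory address, which the allocator may recycle.
struct ThreadLocalState {
    id: u64,
}

impl ThreadLocalState {
    fn new() -> Self {
        // Relaxed is enough: we only need uniqueness, not ordering
        // relative to other memory operations.
        Self {
            id: NEXT_ID.fetch_add(1, Ordering::Relaxed),
        }
    }
}

thread_local! {
    // Lazily initialized on first access from each thread.
    static STATE: ThreadLocalState = ThreadLocalState::new();
}

/// Returns the id of the calling thread's state, e.g. for use in log messages.
fn current_thread_state_id() -> u64 {
    STATE.with(|state| state.id)
}
```

Calling `current_thread_state_id()` repeatedly on one thread yields the same value, while a freshly spawned thread observes a different one.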