rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 05:52:55 +00:00

Author	SHA1	Message	Date
Folke Behrens	23ad228310	pgxn: Increase the pageserver response timeout a bit (#11339 ) Increase the PS response timeout slightly but noticeably, so it does not coincide with the default TCP_RTO_MAX.	2025-03-21 14:21:53 +00:00
Folke Behrens	d0102a473a	pgxn: Include local port in no-response log messages (#11321 ) ## Problem Now that stuck connections are quickly terminated it's not easy to quickly find the right port from the pid to correlate the connection with the one seen on pageserver side. ## Summary of changes Call getsockname() and include the local port number in the no-response-from-pageserver log messages.	2025-03-20 16:06:00 +00:00
Alex Chi Z.	78502798ae	feat(compute_ctl): pass compute type to pageserver with pg_options (#11287 ) ## Problem second try of https://github.com/neondatabase/neon/pull/11185, part of https://github.com/neondatabase/cloud/issues/24706 ## Summary of changes Tristan reminded me of the `options` field of the pg wire protocol, which can be used to pass configurations. This patch adds the parsing on the pageserver side, and supplies `neon.endpoint_type` as part of the `options`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-20 15:48:40 +00:00
Konstantin Knizhnik	3da70abfa5	Fix pageserver_try_receive (#11096 ) ## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1741176713523469 The problem is that this function is using `PQgetCopyData(shard->conn, &resp_buff.data, 1 /* async = true */)` to try to fetch next message. But this function returns 0 if the whole message is not present in the buffer. And input buffer may contain only part of message so result is not fetched. ## Summary of changes Use `PQisBusy` + `WaitEventSetWait` to check if data is available and `PQgetCopyData(shard->conn, &resp_buff.data, 0)` to read whole message in this case. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-20 15:21:00 +00:00
Suhas Thalanki	5589efb6de	moving LastWrittenLSNCache to Neon Extension (#11031 ) ## Problem We currently have this code duplicated across different PG versions. Moving this to an extension would reduce duplication and simplify maintenance. ## Summary of changes Moving the LastWrittenLSN code from PG versions to the Neon extension and linking it with hooks. Related Postgres PR: https://github.com/neondatabase/postgres/pull/590 Closes: https://github.com/neondatabase/neon/issues/10973 --------- Co-authored-by: Tristan Partin <tristan@neon.tech>	2025-03-19 17:29:40 +00:00
Konstantin Knizhnik	24f41bee5c	Update LFC in case of unlogged build (#11262 ) ## Problem Unlogged build is used for GIST/SPGIST/GIN/HNSW indexes. In this mode we first change the relation class to `RELPERSISTENCE_UNLOGGED` and save the pages on local disk, but we do not save unlogged relations in LFC. This may cause an incorrect value to be fetched from LFC if the relfilenode is reused. ## Summary of changes Save modified pages in LFC during the second stage of the unlogged build (when modified pages are WAL-logged). There is no need to save pages in LFC during the first phase because they will in any case be overwritten with the assigned LSN in the second phase. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-17 19:06:42 +00:00
Konstantin Knizhnik	15e63afe7d	Support DEBUG_COMPARE_LOCAL mode for unlogged index build (#11257 ) ## Problem In an unlogged index build (used for GIST/SPGIST/GIN indexes), files are created on disk and then removed at the end. This contradicts the logic of DEBUG_COMPARE_LOCAL mode. ## Summary of changes Do not create and unlink files in unlogged build in DEBUG_COMPARE_LOCAL mode. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-17 06:07:24 +00:00
Konstantin Knizhnik	398d2794eb	Handle DEBUG_COMPARE_LOCAL mode in neon_zeroextend (#11220 ) ## Problem DEBUG_COMPARE_LOCAL is not supported in neon_zeroextend added in PG16 ## Summary of changes Add support of DEBUG_COMPARE_LOCAL in neon_zeroextend Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-13 16:30:32 +00:00
Konstantin Knizhnik	f60ffe3021	Rebase compare local debug mode (#11174 ) ## Problem DEBUG_COMPARE_LOCAL mode is broken See https://neondb.slack.com/archives/C03QLRH7PPD/p1732862608323269?thread_ts=1732711054.862919&cid=C03QLRH7PPD ## Summary of changes Fix compile errors and unlogged build issues. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-12 05:52:18 +00:00
Arseny Sher	359c64c779	walproposer: pre generations refactoring (#11060 ) ## Problem https://github.com/neondatabase/neon/issues/10851 ## Summary of changes Do some refactoring before making walproposer generations aware. - Rename SS_VOTING to SS_WAIT_VOTING, SS_IDLE to SS_WAIT_ELECTED - Continue to get rid of epochs: rename GetEpoch to GetLastLogTerm, donorEpoch to donorLastLogTerm - Instead of counting n_votes, n_connected, introduce explicit WalProposerState (collecting terms / voting / elected). Refactor out TermsCollected and VotesCollected; they will determine state transition differently depending whether generations are enabled or not. There is no new logic in this PR and thus no new tests.	2025-03-11 14:01:00 +00:00
Konstantin Knizhnik	fb1957936c	Fix calculation of LFC used_pages (#11095 ) ## Problem The async prefetch in LFC PR causes incorrect calculation of LFC `used_pages` when a page is overwritten. ## Summary of changes Decrement `used_pages` if the page is overwritten. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Matthias van de Meent <matthias@neon.tech>	2025-03-10 18:28:55 +00:00
Matthias van de Meent	bc052fd0fc	Add configuration options to disable prevlink checks (#11138 ) This allows for improved decoding of otherwise broken WAL. ## Problem Currently, if (or when) a WAL record has a wrong prevptr, that breaks decoding. With this, we don't have to break on that if we decide it's OK to proceed after that. ## Summary of changes Use a Neon GUC to allow the system to enable the NEON-specific skip_lsn_checks option in XLogReader.	2025-03-10 17:02:30 +00:00
Konstantin Knizhnik	c87d307e8c	Print state of connection buffer when no response is received from PS for a long time (#11145 ) ## Problem See https://neondb.slack.com/archives/C08DE6Q9C3B Sometimes compute is not able to receive responses from PS for a long time (minutes). I do not think that the problem is on the compute side, but in order to exclude this possibility I want to see more information about the connection state on the compute side, particularly the amount of data cached in the connection buffer. ## Summary of changes Right now we are dumping the state of the socket buffer. This PR adds printing of the connection buffer state. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-09 18:36:36 +00:00
Em Sharnoff	1fe23fe8d2	compute/lfc: Add chunk size to neon_lfc_stats (#11100 ) This PR adds a new key to neon.neon_lfc_stats — 'file_cache_chunk_size_pages'. It just returns the value of BLOCKS_PER_CHUNK from the LFC implementation. The new value should (eventually) allow changing the chunk size without breaking any places that rely on LFC stats values measured in number of chunks. See neondatabase/cloud#25170 for more.	2025-03-05 20:35:08 +00:00
Konstantin Knizhnik	438f7bb726	Check response status in prefetch_lookup (#11080 ) ## Problem New async prefetch introduces the `prefetch_lookup` function, which is called before LFC lookup to check if a prefetch request is already completed. This function currently does not check that the response is actually `T_NeonGetPageResponse` (and not an error). ## Summary of changes Add checks for response tag. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-05 10:03:09 +00:00
Arseny Sher	ef2b50994c	walproposer: basic infra to enable generations (#11002 ) ## Problem Preparation for https://github.com/neondatabase/neon/issues/10851 ## Summary of changes Add walproposer `safekeepers_generations` field which can be set by prefixing `neon.safekeepers` GUC with `g#n:`. Non zero value (n) forces walproposer to use generations. In particular, this also disables implicit timeline creation as timeline will be created by storcon. Add test checking this. Also add missing infra: `--safekeepers-generation` flag to neon_local endpoint start + fix `--start-timeout` flag: it existed but value wasn't used.	2025-03-03 13:20:20 +00:00
Konstantin Knizhnik	8669bfe493	Do not store zero pages in inmem SMGR for walredo (#11043 ) ## Problem See https://neondb.slack.com/archives/C033RQ5SPDH/p1740157873114339 smgrextend for FSM fork is called during page reconstruction by walredo process causing overflow of inmem SMGR (64 pages). ## Summary of changes Do not store zero pages in inmem SMGR because `inmem_read` returns zero page if it is not able to locate specified block. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-03 12:50:07 +00:00
Konstantin Knizhnik	e58f264a05	Increase inmem SMGR size for walredo process to 100 pages (#10937 ) ## Problem We see `Inmem storage overflow` in page server logs: https://neondb.slack.com/archives/C033RQ5SPDH/p1740157873114339 The walredo process is using inmem SMGR with storage size limited to 64 pages with a warning watermark of 32 (based on the assumption that XLR_MAX_BLOCK_ID is 32, so a WAL record cannot access more than 32 pages). Actually this is not true. We can update up to 3 forks for each block (including updates of the FSM and VM forks). ## Summary of changes This PR increases the inmem SMGR size for the walredo process to 100 pages and prints a stack trace in case of overflow. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-27 14:31:05 +00:00
Matthias van de Meent	a283edaccf	PS/Prefetch: Use a timeout for reading data from TCP (#10834 ) This reduces pressure on OS TCP buffers, reducing flush times in other systems like PageServer. ## Problem ## Summary of changes	2025-02-27 14:00:18 +00:00
Arseny Sher	c1a040447d	walproposer: send valid timeline_start_lsn in v2 (#10994 ) ## Problem https://github.com/neondatabase/neon/pull/10647 dropped timeline_start_lsn from protocol messages as it can be taken from term history. In v2 0 was sent in the placeholder. However, until safekeepers are deployed with that PR they still use the value, setting timeline_start_lsn to 0, which confuses WAL reading; problem appears only when compute includes 10647 but safekeepers don't. ref https://neondb.slack.com/archives/C04DGM6SMTM/p1740577649644269?thread_ts=1740572363.541619&cid=C04DGM6SMTM ## Summary of changes Send real value instead of 0 in v2.	2025-02-26 17:38:44 +00:00
Konstantin Knizhnik	dc975d554a	Increment getpage histogram in prefetch_lookup (#10965 ) ## Problem PR https://github.com/neondatabase/neon/pull/10442 added prefetch_lookup function. It changed handling of getpage requests in compute. Before: 1. Lookup in LFC (return if found) 2. Register prefetch buffer 3. Wait prefetch result (increment getpage_hist) Now: 1. Lookup prefetch ring (return if prefetch request is already completed) 2. Lookup in LFC (return if found) 3. Register prefetch buffer 4. Wait prefetch result (increment getpage_hist) So if the prefetch result is already available, the getpage histogram is not incremented. This caused failures of some test_throughtput benchmarks: https://neondb.slack.com/archives/C033RQ5SPDH/p1740425527249499 ## Summary of changes Increment getpage histogram in `prefetch_lookup` Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-25 19:51:38 +00:00
Konstantin Knizhnik	8f82c661d4	Move neon_pgstat_file_size_limit to the extension (#10959 ) ## Problem PG14 uses a separate backend for the stats collector, which has no access to shared memory. Since the AUX mechanism requires access to shared memory, persisting the pgstat.stat file is not supported on PG14, and so there is no definition of the `neon_pgstat_file_size_limit` variable. This makes it impossible to provide the same config for all Postgres versions. ## Summary of changes Move neon_pgstat_file_size_limit to Neon extension. Postgres submodules PR: https://github.com/neondatabase/postgres/pull/587 https://github.com/neondatabase/postgres/pull/588 https://github.com/neondatabase/postgres/pull/589 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Tristan Partin <tristan@neon.tech>	2025-02-25 12:23:04 +00:00
Arseny Sher	758f597280	compute <-> sk protocol v3 (#10647 ) ## Problem As part of https://github.com/neondatabase/neon/issues/8614 we need to pass membership configurations between compute and safekeepers. ## Summary of changes Add version 3 of the protocol carrying membership configurations. Greeting message in both sides gets full conf, and other messages generation number only. Use protocol bump to include other accumulated changes: - stop packing whole structs on the wire as is; - make the tag u8 instead of u64; - send all ints in network order; - drop proposer_uuid, we can pass it in START_WAL_PUSH and it wasn't much useful anyway. Per message changes, apart from mconf: - ProposerGreeting: tenant / timeline id is sent now as hex cstring. Remove proto version, it is passed outside in START_WAL_PUSH. Remove postgres timeline, it is unused. Reorder fields a bit. - AcceptorGreeting: reorder fields - VoteResponse: timeline_start_lsn is removed. It can be taken from first member of term history, and later we won't need it at all when all timelines will be explicitly created. Vote itself is u8 instead of u64. - ProposerElected: timeline_start_lsn is removed for the same reasons. - AppendRequest: epoch_start_lsn removed, it is known from term history in ProposerElected. Both compute and sk are able to talk v2 and v3 to make rollbacks (in case we need them) easier; neon.safekeeper_proto_version GUC sets the client version. v2 code can be dropped later. So far empty conf is passed everywhere, future PRs will handle them. To test, add param to some tests choosing proto version; we want to test both 2 and 3 until we fully migrate. ref https://github.com/neondatabase/neon/issues/10326 --------- Co-authored-by: Arthur Petukhovsky <petuhovskiy@yandex.ru>	2025-02-25 11:56:05 +00:00
Heikki Linnakangas	565a9e62a1	compute: Disconnect if no response to a pageserver request is received (#10882 ) We've seen some cases in production where a compute doesn't get a response to a pageserver request for several minutes, or even more. We haven't found the root cause for that yet, but whatever the reason is, it seems overly optimistic to think that if the pageserver hasn't responded for 2 minutes, we'd get a response if we just wait patiently a little longer. More likely, the pageserver is dead or there's some kind of a network glitch so that the TCP connection is dead, or at least stuck for a long time. Either way, it's better to disconnect and reconnect. I set the default timeout to 2 minutes, which should be enough for any GetPage request under normal circumstances, even if the pageserver has to download several layer files from remote storage. Make the disconnect timeout configurable. Also make the "log interval", after which we print a message to the log configurable, so that if you change the disconnect timeout, you can set the log timeout correspondingly. The default log interval is still 10 s. The new GUCs are called "neon.pageserver_response_log_timeout" and "neon.pageserver_response_disconnect_timeout". Includes a basic test for the log and disconnect timeouts. Implements issue #10857	2025-02-24 20:16:37 +00:00
Heikki Linnakangas	40acb0c06d	Fix usage of WaitEventSetWait() with timeout (#10947 ) WaitEventSetWait() returns the number of "events" that happened, and only that many events in the WaitEvent array are updated. When the timeout is reached, it returns 0 and does not modify the WaitEvent array at all. We were reading 'event.events' without checking the return value, which would be uninitialized when the timeout was hit. No test included, as this is harmless at the moment. But this caused the test I'm including in PR #10882 to fail. That PR changes the logic to loop back to retry the PQgetCopyData() call if WL_SOCKET_READABLE was set. Currently, making an extra call to PQconsumeInput() is harmless, but with that change in logic, it turns into a busy-wait.	2025-02-24 17:15:07 +00:00
Konstantin Knizhnik	b1d8771d5f	Store prefetch results in LFC cache as soon as they are received (#10442 ) ## Problem Prefetch is performed locally, so different backends can request the same pages from PS. Such duplicated requests increase load on the page server and network traffic. Making prefetch global seems to be very difficult and undesirable, because different queries can access chunks at different speeds. Storing prefetch chunks in LFC will not completely eliminate duplicates, but can minimise such requests. The problem with storing prefetch results in LFC is that in this case the page is not protected by a shared buffer lock, so we have to perform extra synchronisation on the LFC side. See: https://neondb.slack.com/archives/C0875PUD0LC/p1736772890602029?thread_ts=1736762541.116949&cid=C0875PUD0LC @MMeent implementation of prewarm: See https://github.com/neondatabase/neon/pull/10312/ ## Summary of changes Use condition variables to synchronize access to LFC entries. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-21 16:56:16 +00:00
Konstantin Knizhnik	0b9b391ea0	Fix calculation of prefetch ring position to fit in-flight request in resized ring buffer (#10899 ) ## Problem Refer to https://github.com/neondatabase/neon/issues/10885 The wait position in the ring buffer, used to restrict the number of in-flight requests, is not correctly calculated. ## Summary of changes Update condition and remove redundant assertion Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-20 20:21:44 +00:00
Tristan Partin	da79cc5eee	Add neon.extension_server_{connect,request}_timeout (#10801 ) Instead of hardcoding the request timeout, let's make it configurable as a PGC_SUSET GUC. Additionally, add a connect timeout GUC. Although the extension server runs on the compute, it is always best to keep operations from hanging. Better to present a timeout error to the user than a stuck backend. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-17 15:40:43 +00:00
Konstantin Knizhnik	8c6d133d31	Fix out-of-boundaries access in addSHLL function (#10840 ) ## Problem See https://github.com/neondatabase/neon/issues/10839 The rho(x,b) function returns values in the range [1,b+1] and addSHLL tries to store it in an array of size b+1. ## Summary of changes Subtract 1 from the value returned by rho --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-17 12:54:17 +00:00
Christian Schwarz	a32e8871ac	compute/pageserver: correlation of logs through backend PID (via `application_name`) (#10810 ) This PR makes compute set the `application_name` field to the PG backend process PID which is also included in each compute log line. This allows correlation of Pageserver connection logs with compute logs in a way that was guesswork before this PR. In future, we can switch for a more unique identifier for a page_service session. Refs - discussion in https://neondb.slack.com/archives/C08DE6Q9C3B/p1739465208296169?thread_ts=1739462628.361019&cid=C08DE6Q9C3B - fixes https://github.com/neondatabase/neon/issues/10808	2025-02-14 20:11:42 +00:00
Tristan Partin	b6f972ed83	Increase the extension server request timeout to 1 minute (#10800 ) pg_search is 46ish MB. All other remote extensions are around hundreds of KB. 3 seconds is not long enough to download the tarball if the S3 gateway cache doesn't already contain a copy. According to our setup, the cache is limited to 10 GB in size and anything that has not been accessed for an hour is purged. This is really bad for scaling to 0, even more so if you're the only project actively using the extension in a production Kubernetes cluster. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-13 17:33:27 +00:00
Alexey Kondratov	8c2f85b209	chore(compute): Postgres 17.3, 16.7, 15.11 and 14.16 (#10771 ) ## Summary of changes Bump all minor versions. The only non-trivial conflict was between - `0350b876b0` - and `bd09a752f4` It seems that just adding this extra argument is enough. I also got conflict with `c1c9df3159` but for some reason only in PG 15. Yet, that was a trivial one around ```c if (XLogCtl) LWLockRelease(ControlFileLock); /* durable_rename already emitted log message */ return false; ``` in `xlog.c` ## Postgres PRs - https://github.com/neondatabase/postgres/pull/580 - https://github.com/neondatabase/postgres/pull/579 - https://github.com/neondatabase/postgres/pull/577 - https://github.com/neondatabase/postgres/pull/578	2025-02-13 13:28:05 +00:00
OBBO67	82cbab7512	Switch reqlsns[0].request_lsn to arrow operator in neon_read_at_lsnv() (#10620 ) (#10687 ) ## Problem Currently the following line below uses array subscript notation which is confusing since `reqlsns` is not an array but just a pointer to a struct. ``` XLogWaitForReplayOf(reqlsns[0].request_lsn); ``` ## Summary of changes Switch from array subscript notation to arrow operator to improve readability of code. Close #10620.	2025-02-06 17:26:26 +00:00
Konstantin Knizhnik	01f0be03b5	Fix bugs in lfc_cache_containsv (#10682 ) ## Problem Incorrect manipulations with iteration index in `lfc_cache_containsv` ## Summary of changes ``` - int this_chunk = Min(nblocks, BLOCKS_PER_CHUNK - chunk_offs); + int this_chunk = Min(nblocks - i, BLOCKS_PER_CHUNK - chunk_offs); ``` ``` - if (i + 1 >= nblocks) + if (i >= nblocks) ``` Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-06 07:00:00 +00:00
Konstantin Knizhnik	81cd30e4d6	Use #ifdef instead of #if USE_ASSERT_CHECKING (#10683 ) ## Problem USE_ASSERT_CHECKING is defined as an empty entity, but it is checked using #if. ## Summary of changes Replace `#if USE_ASSERT_CHECKING` with `#ifdef USE_ASSERT_CHECKING`, as done in other places in Postgres. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-06 05:47:56 +00:00
Christian Schwarz	77f9e74d86	pgxn: include socket send & recv queue size in slow response logs (#10673 ) # Problem When we see an apparent slow request, one possible cause is that the client is failing to consume responses, but we don't have a clear way to see that. # Solution - Log the socket queue depths on slow/stuck connections, so that we have an indication of whether the compute is keeping up with processing the connection's responses. refs - slack https://neondb.slack.com/archives/C036U0GRMRB/p1738652644396329 - refs https://github.com/neondatabase/cloud/issues/23515 - refs https://github.com/neondatabase/cloud/issues/23486	2025-02-06 01:14:29 +00:00
alexanderlaw	4d2328ebe3	Fix C code to satisfy sanitizers (#10473 )	2025-01-29 10:05:43 +00:00
Anastasia Lubennikova	8e8df1b453	Disable logical replication subscribers (#10249 ) Drop logical replication subscribers before compute starts on a non-main branch. Add new compute_ctl spec flag: drop_subscriptions_before_start If it is set, drop all the subscriptions from the compute node before it starts. To avoid race on compute start, use new GUC neon.disable_logical_replication_subscribers to temporarily disable logical replication workers until we drop the subscriptions. Ensure that we drop subscriptions exactly once when endpoint starts on a new branch. It is essential, because otherwise, we may drop not only inherited, but newly created subscriptions. We cannot rely only on spec.drop_subscriptions_before_start flag, because if for some reason compute restarts inside VM, it will start again with the same spec and flag value. To handle this, we save the fact of the operation in the database in the neon.drop_subscriptions_done table. If the table does not exist, we assume that the operation was never performed, so we must do it. If table exists, we check if the operation was performed on the current timeline. fixes: https://github.com/neondatabase/neon/issues/8790	2025-01-23 11:02:15 +00:00
Matthias van de Meent	e781cf6dd8	Compute/LFC: Apply limits consistently (#10449 ) Otherwise we might hit ERRORs in otherwise safe situations (such as user queries), which isn't a great user experience. ## Problem https://github.com/neondatabase/neon/pull/10376 ## Summary of changes Instead of accepting internal errors as acceptable, we ensure we don't exceed our allocated usage.	2025-01-20 18:29:21 +00:00
Christian Schwarz	c47c5f4ace	fix(page_service pipelining): tenant cannot shut down because gate kept open while flushing responses (#10386 ) # Refs - fixes https://github.com/neondatabase/neon/issues/10309 - fixup of batching design, first introduced in https://github.com/neondatabase/neon/pull/9851 - refinement of https://github.com/neondatabase/neon/pull/8339 # Problem `Tenant::shutdown` was occasionally taking many minutes (sometimes up to 20) in staging and prod if the `page_service_pipelining.mode="concurrent-futures"` is enabled. # Symptoms The issue happens during shard migration between pageservers. There is page_service unavailability and hence effectively downtime for customers in the following case: 1. The source (state `AttachedStale`) gets stuck in `Tenant::shutdown`, waiting for the gate to close. 2. Cplane/Storcon decides to transition the target `AttachedMulti` to `AttachedSingle`. 3. That transition comes with a bump of the generation number, causing the `PUT .../location_config` endpoint to do a full `Tenant::shutdown` / `Tenant::attach` cycle for the target location. 4. That `Tenant::shutdown` on the target gets stuck, waiting for the gate to close. 5. Eventually the gate closes (`close completed`), correlating with a `page_service` connection handler logging that it's exiting because of a network error (`Connection reset by peer` or `Broken pipe`). While in (4): - `Tenant::shutdown` is stuck waiting for all `Timeline::shutdown` calls to complete. So, really, this is a `Timeline::shutdown` bug. - retries from Cplane/Storcon to complete above state transitions, fail with errors related to the tenant mgr slot being in state `TenantSlot::InProgress`, the tenant state being `TenantState::Stopping`, and the timelines being in `TimelineState::Stopping`, and the `Timeline::cancel` being cancelled. - Existing (and/or new?) 
page_service connections log errors `error reading relation or page version: Not found: Timed out waiting 30s for tenant active state. Latest state: None` # Root-Cause After a lengthy investigation ([internal write-up](https://www.notion.so/neondatabase/2025-01-09-batching-deadlock-Slow-Log-Analysis-in-Staging-176f189e00478050bc21c1a072157ca4?pvs=4)) I arrived at the following root cause. The `spsc_fold` channel (`batch_tx`/`batch_rx`) that connects the Batcher and Executor stages of the pipelined mode was storing a `Handle` and thus `GateGuard` of the Timeline that was not shutting down. The design assumption with pipelining was that this would always be a short transient state. However, that was incorrect: the Executor was stuck on writing/flushing an earlier response into the connection to the client, i.e., socket write being slow because of TCP backpressure. The probable scenario of how we end up in that case: 1. Compute backend process sends a continuous stream of getpage prefetch requests into the connection, but never reads the responses (why this happens: see Appendix section). 2. Batch N is processed by Batcher and Executor, up to the point where Executor starts flushing the response. 3. Batch N+1 is procssed by Batcher and queued in the `spsc_fold`. 4. Executor is still waiting for batch N flush to finish. 5. Batcher eventually hits the `TimeoutReader` error (10min). From here on it waits on the `spsc_fold.send(Err(QueryError(TimeoutReader_error)))` which will never finish because the batch already inside the `spsc_fold` is not being read by the Executor, because the Executor is still stuck in the flush. (This state is not observable at our default `info` log level) 6. Eventually, Compute backend process is killed (`close()` on the socket) or Compute as a whole gets killed (probably no clean TCP shutdown happening in that case). 7. Eventually, Pageserver TCP stack learns about (6) through RST packets and the Executor's flush() call fails with an error. 8. 
The Executor exits, dropping `cancel_batcher` and its end of the spsc_fold. This wakes Batcher, causing the `spsc_fold.send` to fail. Batcher exits. The pipeline shuts down as intended. We return from `process_query` and log the `Connection reset by peer` or `Broken pipe` error. The following diagram visualizes the wait-for graph at (5) ```mermaid flowchart TD Batcher --spsc_fold.send(TimeoutReader_error)--> Executor Executor --flush batch N responses--> socket.write_end socket.write_end --wait for TCP window to move forward--> Compute ``` # Analysis By holding the GateGuard inside the `spsc_fold` open, the pipelining implementation violated the principle established in (https://github.com/neondatabase/neon/pull/8339). That is, that `Handle`s must only be held across an await point if that await point is sensitive to the `<Handle as Deref<Target=Timeline>>::cancel` token. In this case, we were holding the Handle inside the `spsc_fold` while awaiting the `pgb_writer.flush()` future. One may jump to the conclusion that we should simply peek into the spsc_fold to get that Timeline cancel token and be sensitive to it during flush, then. But that violates another principle of the design from https://github.com/neondatabase/neon/pull/8339. That is, that the page_service connection lifecycle and the Timeline lifecycles must be completely decoupled. Tt must be possible to shut down one shard without shutting down the page_service connection, because on that single connection we might be serving other shards attached to this pageserver. (The current compute client opens separate connections per shard, but, there are plans to change that.) # Solution This PR adds a `handle::WeakHandle` struct that does _not_ hold the timeline gate open. It must be `upgrade()`d to get a `handle::Handle`. That `handle::Handle` _does_ hold the timeline gate open. The batch queued inside the `spsc_fold` only holds a `WeakHandle`. 
We only upgrade it while calling into the various `handle_` methods, i.e., while interacting with the `Timeline` via `<Handle as Deref<Target=Timeline>>`. All that code has always been required to be (and is!) sensitive to `Timeline::cancel`, and therefore we're guaranteed to bail from it quickly when `Timeline::shutdown` starts. We will drop the `Handle` immediately, before we start `pgb_writer.flush()`ing the responses. Thereby letting go of our hold on the `GateGuard`, allowing the timeline shutdown to complete while the page_service handler remains intact. # Code Changes * Reproducer & Regression Test * Developed and proven to reproduce the issue in https://github.com/neondatabase/neon/pull/10399 * Add a `Test` message to the pagestream protocol (`cfg(feature = "testing")`). * Drive-by minimal improvement to the parsing code, we now have a `PagestreamFeMessageTag`. * Refactor `pageserver/client` to allow sending and receiving `page_service` requests independently. * Add a Rust helper binary to produce situation (4) from above * Rationale: (4) and (5) are the same bug class, we're holding a gate open while `flush()`ing. * Add a Python regression test that uses the helper binary to demonstrate the problem. * Fix * Introduce and use `WeakHandle` as explained earlier. * Replace the `shut_down` atomic with two enum states for `HandleInner`, wrapped in a `Mutex`. * To make `WeakHandle::upgrade()` and `Handle::downgrade()` cache-efficient: * Wrap the `Types::Timeline` in an `Arc` * Wrap the `GateGuard` in an `Arc` * The separate `Arc`s enable uncontended cloning of the timeline reference in `upgrade()` and `downgrade()`. If instead we were `Arc<Timeline>::clone`, different connection handlers would be hitting the same cache line on every upgrade()/downgrade(), causing contention. * Please read the udpated module-level comment in `mod handle` module-level comment for details. 
# Testing & Performance The reproducer test that failed before the changes now passes, and other tests are passing as well. We'll do more testing in staging, where the issue happens every ~4h if chaos migrations are enabled in storcon. Existing perf testing will be sufficient, no perf degradation is expected. It's a few more allocations due to the added `Arc`s, but they're low frequency. # Appendix: Why Compute Sometimes Doesn't Read Responses Remember, the whole problem surfaced because flush() was slow because Compute was not reading responses. Why is that? In short, the way the compute works, it only advances the page_service protocol processing when it has an interest in data, i.e., when the pagestore smgr is called to return pages. Thus, if compute issues a bunch of requests as part of prefetch but then it turns out it can service the query without reading those pages, it may very well happen that these messages stay in the TCP connection until the next smgr read happens, either in that session, or possibly in another session. If there are too many unread responses in the TCP connection, the pageserver kernel is going to backpressure into userspace, resulting in our stuck flush(). All of this stems from the way vanilla Postgres does prefetching and "async IO": it issues `fadvise()` to make the kernel do the IO in the background, buffering results in the kernel page cache. It then consumes the results through synchronous `read()` system calls, which hopefully will be fast because of the `fadvise()`. If it turns out that some / all of the prefetch results are not needed, Postgres will not be issuing those `read()` system calls. The kernel will eventually react to that by reusing page cache pages that hold completed prefetched data. Uncompleted prefetch requests may or may not be processed -- it's up to the kernel. In Neon, the smgr + Pageserver together take on the role of the kernel in the above paragraphs. 
In the current implementation, all prefetches are sent as GetPage requests to Pageserver. The responses are only processed in the places where vanilla Postgres would do the synchronous `read()` system call. If we never get to that, the responses are queued inside the TCP connection, which, once buffers run full, will backpressure into Pageserver's sending code, i.e., the `pgb_writer.flush()` that was the root cause of the problems we're fixing in this PR.	2025-01-16 20:34:02 +00:00
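The `WeakHandle` / `Handle` split described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the pageserver code: `Timeline`, `GateGuard`, `ShutDown`, the constructor, and `shut_down()` are hypothetical stand-ins, and the real types are async and considerably more involved. The sketch only shows the idea that a queued `WeakHandle` does not pin the gate by itself, while an upgraded `Handle` keeps its own `Arc<GateGuard>` clone alive until it is dropped.

```rust
use std::ops::Deref;
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for the real pageserver types.
struct Timeline {
    id: u32,
}
struct GateGuard; // the (hypothetical) gate stays open while any clone is alive

// Two states behind a Mutex, as the commit message describes.
enum HandleInner {
    Open {
        timeline: Arc<Timeline>,
        gate_guard: Arc<GateGuard>,
    },
    ShutDown,
}

#[derive(Debug, PartialEq)]
struct ShutDown;

// Cheap to queue: only references the shared state.
#[derive(Clone)]
struct WeakHandle {
    inner: Arc<Mutex<HandleInner>>,
}

// Holds its own clones, so the gate stays open for as long as it lives.
struct Handle {
    timeline: Arc<Timeline>,
    _gate_guard: Arc<GateGuard>,
    inner: Arc<Mutex<HandleInner>>,
}

impl WeakHandle {
    fn new(timeline: Arc<Timeline>, gate_guard: Arc<GateGuard>) -> Self {
        let state = HandleInner::Open { timeline, gate_guard };
        WeakHandle { inner: Arc::new(Mutex::new(state)) }
    }

    // Fails once shutdown has started; otherwise clones both Arcs.
    fn upgrade(&self) -> Result<Handle, ShutDown> {
        let state = self.inner.lock().unwrap();
        match &*state {
            HandleInner::Open { timeline, gate_guard } => Ok(Handle {
                timeline: Arc::clone(timeline),
                _gate_guard: Arc::clone(gate_guard),
                inner: Arc::clone(&self.inner),
            }),
            HandleInner::ShutDown => Err(ShutDown),
        }
    }

    // Hypothetical shutdown hook: drops the cached timeline and gate guard.
    fn shut_down(&self) {
        *self.inner.lock().unwrap() = HandleInner::ShutDown;
    }
}

impl Handle {
    fn downgrade(&self) -> WeakHandle {
        WeakHandle { inner: Arc::clone(&self.inner) }
    }
}

impl Deref for Handle {
    type Target = Timeline;
    fn deref(&self) -> &Timeline {
        &self.timeline
    }
}
```

In this sketch, a caller would `upgrade()` right before touching the timeline, drop the resulting `Handle` before flushing responses, and treat `Err(ShutDown)` as "this shard is going away" without tearing down the connection.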
Matthias van de Meent	2eda484ef6	prefetch: Read more frequently from TCP buffer (#10394 ) This reduces pressure on the OS TCP read buffer by reading data out of the receive buffer more often, and by increasing the number of bytes we can pull from that buffer when we do read. ## Problem A backend may not always consume its prefetch data quickly enough ## Summary of changes We add a new function `prefetch_pump_state` which pulls as many prefetch responses from the OS TCP receive buffer as possible, but without blocking. This reduces pressure on OS-level TCP buffers, thus increasing throughput by limiting throttling caused by full TCP buffers.	2025-01-16 02:43:47 +00:00
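The "pull as much as possible without blocking" idea from the commit above can be illustrated with a small sketch. It is not the C code of `prefetch_pump_state`: here an in-process `std::sync::mpsc` channel stands in for the OS TCP receive buffer, and `pump_available` is a hypothetical name.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Moves every item that is already available from `rx` into `ready`
/// and returns how many were moved. Never blocks: it stops as soon as
/// the channel is empty (or the sender side has gone away).
fn pump_available<T>(rx: &Receiver<T>, ready: &mut Vec<T>) -> usize {
    let mut moved = 0;
    loop {
        match rx.try_recv() {
            Ok(item) => {
                ready.push(item);
                moved += 1;
            }
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => return moved,
        }
    }
}
```

Calling such a function opportunistically keeps the upstream buffer drained even when the consumer does not currently need the data.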
Heikki Linnakangas	846e8fdce4	Remove obsolete hnsw extension (#8008 ) This has been deprecated and disabled for new installations for a long time. Let's remove it for good.	2025-01-11 14:20:50 +00:00
Konstantin Knizhnik	20c40eb733	Add response tag to getpage request in V3 protocol version (#8686 ) ## Problem We have had several serious data corruption incidents caused by mismatched getpage requests: https://neondb.slack.com/archives/C07FJS4QF7V/p1723032720164359 We hope that the problem is fixed now, but it is better to prevent this kind of problem in the future. Part of https://github.com/neondatabase/cloud/issues/16472 ## Summary of changes This PR introduces a new V3 version of the compute<->pageserver protocol, adding a tag to the getpage response. Compute is now able to check that it really got the response for the requested page. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-01-09 13:12:04 +00:00
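The check that a response tag enables can be sketched as below. The field layout is an assumption made for illustration only: `PageTag`, `GetPageRequest`, `GetPageResponse`, `TagMismatch`, and `accept_response` are hypothetical names and do not reflect the actual V3 wire format.

```rust
// Hypothetical identification of a requested page.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PageTag {
    rel: u32,
    block: u32,
}

struct GetPageRequest {
    tag: PageTag,
}

// The response echoes the tag of the request it answers.
struct GetPageResponse {
    tag: PageTag,
    page: Vec<u8>,
}

#[derive(Debug, PartialEq)]
struct TagMismatch {
    expected: PageTag,
    received: PageTag,
}

/// Accepts the page only if the echoed tag matches the request;
/// otherwise reports the mismatch instead of returning the wrong page.
fn accept_response(req: &GetPageRequest, resp: GetPageResponse) -> Result<Vec<u8>, TagMismatch> {
    if resp.tag == req.tag {
        Ok(resp.page)
    } else {
        Err(TagMismatch { expected: req.tag, received: resp.tag })
    }
}
```

Without the echoed tag, a response that arrives out of step with its request would be silently accepted as the wrong page; with it, the mismatch becomes a detectable error.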
Matthias van de Meent	be38123e62	Fix accounting of dropped prefetched GetPage requests (#10276 ) Apparently, we failed to do this bookkeeping in quite a few places... ## Problem Fixes https://github.com/neondatabase/cloud/issues/22364 ## Summary of changes Add accounting of dropped requests. Note that this includes prefetches dropped due to things like "PS connection dropped unexpectedly" or "prefetch queue is already full", but not (yet?) "dropped due to backend shutdown".	2025-01-07 10:41:52 +00:00
Matthias van de Meent	30863c0104	libpagestore: timeout = max(0, difference), not min(0, difference) (#10274 ) Using `min(0, ...)` causes us to fail to wait in most situations, so a lack of data would be a hot wait loop, which is bad. ## Problem We noticed high CPU usage in some situations	2025-01-07 09:07:38 +00:00
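The difference between the two clamps in the commit above is easy to see in a small sketch. The function names and the millisecond units are assumptions for illustration; the original code is C in libpagestore.

```rust
/// Correct clamp: wait for the remaining time, but never a negative amount.
fn remaining_wait_ms(deadline_ms: i64, now_ms: i64) -> i64 {
    (deadline_ms - now_ms).max(0)
}

/// The buggy variant: the result is never positive, so a caller that
/// treats "<= 0" as "do not wait" ends up spinning in a hot loop.
fn buggy_remaining_wait_ms(deadline_ms: i64, now_ms: i64) -> i64 {
    (deadline_ms - now_ms).min(0)
}
```

With a deadline 600 ms in the future, the correct version returns 600 while the buggy one returns 0, which is why a lack of data turned into a busy wait.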
Konstantin Knizhnik	b3cd883f93	Unlock LFC mutex when LFC cache is disabled (#10235 ) ## Problem See https://github.com/neondatabase/neon/issues/10233 `lfc_containsv` returns while still holding the lock when LFC is disabled. This bug was introduced in commit `78938d1b59` ## Summary of changes Release the lock before returning. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-01-02 11:28:15 +00:00
Konstantin Knizhnik	04517c6ff3	Do not reload config file on PS reconnect (#10204 ) ## Problem See https://github.com/neondatabase/neon/issues/10184 and https://neondb.slack.com/archives/C04DGM6SMTM/p1733997259898819 Reloading the config file inside a parallel worker causes its termination ## Summary of changes Remove the call to `HandleMainLoopInterrupts()`. Updates of the page server URL are propagated by the postmaster through shared memory, so we should not reload the config for them. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-12-19 15:22:39 +00:00
Mikhail Kot	cf161e1556	fix(adapter): password not set in role drop (#10130 ) ## Problem When entry was dropped and password wasn't set, new entry had uninitialized memory in controlplane adapter Resolves: https://github.com/neondatabase/cloud/issues/14914 ## Summary of changes Initialize password in all cases, add tests. Minor formatting for less indentation	2024-12-14 17:37:13 +00:00
Tristan Partin	07d1db54b3	Improve comments and log messages in the logical replication monitor (#9974 ) Improved comments will help others when they read the code, and the log messages will help others understand why the logical replication monitor works the way it does. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-12-13 18:10:42 +00:00
Konstantin Knizhnik	eeabecd89f	Correctly update LFC used_pages in case of LFC resize (#10128 ) ## Problem LFC used_pages statistic is not updated in case of LFC resize (shrinking `neon.file_cache_size_limit`) ## Summary of changes Update `lfc_ctl->used_pages` in `lfc_change_limit_hook` Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-12-13 17:40:26 +00:00

314 Commits