rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-06 21:12:55 +00:00

Author	SHA1	Message	Date
Konstantin Knizhnik	60b9fb1baf	Ignore unlogged LSNs in set last written LSN (#11743) ## Problem See https://github.com/neondatabase/neon/issues/11718 and https://neondb.slack.com/archives/C033RQ5SPDH/p1745122797538509 GiST and other indexes performing an "unlogged build" use so-called fake LSNs: not a real LSN, but something like 0/1. When stored in the LwLSN cache, they cause incorrect lookups at the PS. ## Summary of changes Do not store fake LSNs in the LwLSN hash. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-28 12:16:29 +00:00
Konstantin Knizhnik	132b6154bb	Unlogged build debug compare local v2 (#11554 ) ## Problem Init fork is used in DEBUG_COMPARE_LOCAL to determine unlogged relation or unlogged build. But it is created only after the relation is initialized and so can be swapped out, producing `Page is evicted with zero LSN` error. ## Summary of changes Create init fork together with main fork for unlogged relations in DEBUG_COMPARE_LOCAL mode. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-22 18:07:45 +00:00
Anastasia Lubennikova	7747a9619f	compute: fix copy-paste typo for neon GUC parameters check (#11610 ) fix for commit [`5063151`](`5063151271`)	2025-04-16 15:55:11 +00:00
Matthias van de Meent	2a46426157	Update neon GUCs with new default settings (#11595 ) Staging and prod both have these settings configured like this, so let's update this so we can eventually drop the overrides in prod.	2025-04-16 13:42:22 +00:00
Heikki Linnakangas	b4e26a6284	Set last-written LSN as part of smgr_end_unlogged_build() (#11584) This way, the callers don't need to do it, reducing the footprint of changes we've had to make to various index AM's build functions.	2025-04-16 12:34:18 +00:00
Konstantin Knizhnik	35170656fe	Allocate WalProposerConn using TopMemoryAllocator (#11577) ## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1744659631698609 `WalProposerConn` is allocated using the current memory context, whose lifetime is not long enough. ## Summary of changes Allocate `WalProposerConn` using `TopMemoryContext`. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-15 19:13:12 +00:00
Konstantin Knizhnik	307fa2ceb7	Remove unused n_synced variable from HandleSafekeeperResponse (#11553) ## Problem clang produces a warning about the unused variable `n_synced` in HandleSafekeeperResponse ## Summary of changes Remove the local variable. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-14 09:45:13 +00:00
Anastasia Lubennikova	5063151271	compute: Add more neon ids to compute (#11366 ) Pass more neon ids to compute_ctl. Expose them to postgres as neon extension GUCs: neon.project_id, neon.branch_id, neon.endpoint_id. This is the compute side PR, not yet supported by cplane.	2025-04-10 13:04:18 +00:00
Arseny Sher	fae7528adb	walproposer: make it aware of membership (#11407 ) ## Problem Walproposer should get elected and commit WAL on safekeepers specified by the membership configuration. ## Summary of changes - Add to wp `members_safekeepers` and `new_members_safekeepers` arrays mapping configuration members to connection slots. Establish this mapping (by node id) when safekeeper sends greeting, giving its id and when mconf becomes known / changes. - Add to TermsCollected, VotesCollected, GetAcknowledgedByQuorumWALPosition membership aware logic. Currently it partially duplicates existing one, but we'll drop the latter eventually. - In python, rename Configuration to MembershipConfiguration for clarity. - Add test_quorum_sanity testing new logic. ref https://github.com/neondatabase/neon/issues/10851	2025-04-10 09:55:37 +00:00
Heikki Linnakangas	ef8101a9be	refactor: Split "communicator" routines to a separate source file (#11459 ) pagestore_smgr.c had grown pretty large. Split into two parts, such that the smgr routines that PostgreSQL code calls stays in pagestore_smgr.c, and all the prefetching logic and other lower-level routines related to communicating with the pageserver are moved to a new source file, "communicator.c". There are plans to replace communicator parts with a new implementation. See https://github.com/neondatabase/neon/pull/10799. This commit doesn't implement any of the new things yet, but it is good preparation for it. I'm imagining that the new implementation will approximately replace the current "communicator.c" code, exposing roughly the same functions to pagestore_smgr.c. This commit doesn't change any functionality or behavior, or make any other changes to the existing code: It just moves existing code around.	2025-04-09 12:28:59 +00:00
Konstantin Knizhnik	c9ca8b7c4a	One more fix for unlogged build support in DEBUG_COMPARE_LOCAL (#11474) ## Problem Support of unlogged build in DEBUG_COMPARE_LOCAL. Neon SMGR treats the presence of a local file as an indicator of an unlogged relation. But that doesn't work in DEBUG_COMPARE_LOCAL mode. ## Summary of changes Use INIT_FORKNUM as the indicator of an unlogged file and create this file during unlogged index build. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-09 05:14:29 +00:00
Heikki Linnakangas	7ffcbfde9a	refactor: Move LFC function prototypes to separate header file (#11458 ) Also, move the call to the lfc_init() function. It was weird to have it in libpagestore.c, when libpagestore.c otherwise had nothing to do with the LFC. Move it directly into _PG_init()	2025-04-08 09:03:56 +00:00
Konstantin Knizhnik	b2a0b2e9dd	Skip hole tags in local_cache view (#11454 ) ## Problem If the local file cache is shrunk, so that we punch some holes in the underlying file, the local_cache view displays the holes incorrectly. See https://github.com/neondatabase/neon/issues/10770 ## Summary of changes Skip hole tags in the local_cache view. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-08 03:52:50 +00:00
Konstantin Knizhnik	8eb701d706	Save FSM/VM pages on normal shutdown (#11449) ## Problem See https://neondb.slack.com/archives/C03QLRH7PPD/p1743746717119179 We wal-log FSM/VM pages when they are written to disk to persist them in PS. But this does not happen during the shutdown checkpoint, because writing to WAL during a checkpoint causes a Postgres panic. ## Summary of changes Move `CheckPointBuffers` call to `PreCheckPointGuts` Postgres PRs: https://github.com/neondatabase/postgres/pull/615 https://github.com/neondatabase/postgres/pull/614 https://github.com/neondatabase/postgres/pull/613 https://github.com/neondatabase/postgres/pull/612 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-07 13:56:55 +00:00
Heikki Linnakangas	b2a670c765	refactor: Use same prototype for neon_read_at_lsn on all PG versions (#11457 ) The 'neon_read' function needs to have a different prototype on PG < 16, because it's part of the smgr interface. But neon_read_at_lsn doesn't have that restriction.	2025-04-07 11:04:36 +00:00
Heikki Linnakangas	1a87975d95	Misc cleanup of #includes and comments in the neon extension (#11456 ) Remove useless and often wrong IDENTIFICATION comments. PostgreSQL sources have them, mostly for historical reasons, but there's no need for us to copy that style. Remove unnecessary #includes in header files, putting the #includes directly in the .c files that need them. The principle is that a header file should #include other header files if they need definitions from them, such that each header file can be compiled on its own, but not other #includes. (There are tools to enforce that, but this was just a manual clean up of violations that I happened to spot.)	2025-04-06 15:34:13 +00:00
Konstantin Knizhnik	02936b82c5	Fix effective_lsn calculation for prefetch (#11219 ) ## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1741594233757489 Consider the following scenario: 1. Backend A wants to prefetch some block B 2. Backend A checks that block B is not present in shared buffer 3. Backend A registers new prefetch request and calls prefetch_do_request 4. prefetch_do_request calls neon_get_request_lsns 5. neon_get_request_lsns obtains LwLSN for block B 6. Backend B downloads B, updates and wallogs it (let say to Lsn1) 7. Block B is once again thrown from shared buffers, its LwLSN is set to Lsn1 8. Backend A obtains current flush LSN, let's say that it is Lsn1 9. Backend A stores Lsn1 as effective_lsn in prefetch slot. 10. Backend A reads page B with LwLSN=Lsn1 11. Backend A finds in prefetch ring response for prefetch request for block B with effective_lsn=Lsn1, so that it satisfies neon_prefetch_response_usable condition 12. Backend A uses deteriorated version of the page! ## Summary of changes Use `not_modified_since` as `effective_lsn`. It should not cause some degrade of performance because we store LwLSN when it was not found in LwLSN hash, so if page is not changed till prefetch response is arrived, then LwLSN should not be changed. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-01 15:48:02 +00:00
Konstantin Knizhnik	cfe3e6d4e1	Remove loop from pageserver_try_receive (#11387) ## Problem Commit `3da70abfa5` caused a noticeable performance regression (40% in update-with-prefetch in test_bulk_update): https://neondb.slack.com/archives/C04BLQ4LW7K/p1742633167580879 ## Summary of changes Remove the loop from pageserver_try_receive so that it fetches no more than one response. There is still a loop in `pump_prefetch_state` which can fetch as many responses as are available. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-31 19:49:32 +00:00
Matthias van de Meent	e5b95bc9dc	Neon LFC/prefetch: Improve page read handling (#11380 ) Previously we had different meanings for the bitmask of vector IOps. That has now been unified to "bit set = final result, no more scribbling". Furthermore, the LFC read path scribbled on pages that were already read; that's probably not a good thing so that's been fixed too. In passing, the read path of LFC has been updated to read only the requested pages into the provided buffers, thus reducing the IO size of vectorized IOs. ## Problem ## Summary of changes	2025-03-31 17:04:00 +00:00
Konstantin Knizhnik	21a891a06d	Fix IS_LOCAL_REL macro (first class has oid=FirstNormalObjectId) (#11369) ## Problem The IS_LOCAL_REL macro used for DEBUG_COMPARE_LOCAL mode uses a greater-than rather than a greater-or-equal comparison, while the first table really is assigned FirstNormalObjectId. ## Summary of changes Replace the strict greater-than with a greater-or-equal comparison. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-31 11:16:35 +00:00
Folke Behrens	23ad228310	pgxn: Increase the pageserver response timeout a bit (#11339 ) Increase the PS response timeout slightly but noticeably, so it does not coincide with the default TCP_RTO_MAX.	2025-03-21 14:21:53 +00:00
Folke Behrens	d0102a473a	pgxn: Include local port in no-response log messages (#11321 ) ## Problem Now that stuck connections are quickly terminated it's not easy to quickly find the right port from the pid to correlate the connection with the one seen on pageserver side. ## Summary of changes Call getsockname() and include the local port number in the no-response-from-pageserver log messages.	2025-03-20 16:06:00 +00:00
Alex Chi Z.	78502798ae	feat(compute_ctl): pass compute type to pageserver with pg_options (#11287 ) ## Problem second try of https://github.com/neondatabase/neon/pull/11185, part of https://github.com/neondatabase/cloud/issues/24706 ## Summary of changes Tristan reminded me of the `options` field of the pg wire protocol, which can be used to pass configurations. This patch adds the parsing on the pageserver side, and supplies `neon.endpoint_type` as part of the `options`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-20 15:48:40 +00:00
Konstantin Knizhnik	3da70abfa5	Fix pageserver_try_receive (#11096 ) ## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1741176713523469 The problem is that this function is using `PQgetCopyData(shard->conn, &resp_buff.data, 1 /* async = true */)` to try to fetch next message. But this function returns 0 if the whole message is not present in the buffer. And input buffer may contain only part of message so result is not fetched. ## Summary of changes Use `PQisBusy` + `WaitEventSetWait` to check if data is available and `PQgetCopyData(shard->conn, &resp_buff.data, 0)` to read whole message in this case. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-20 15:21:00 +00:00
Suhas Thalanki	5589efb6de	moving LastWrittenLSNCache to Neon Extension (#11031 ) ## Problem We currently have this code duplicated across different PG versions. Moving this to an extension would reduce duplication and simplify maintenance. ## Summary of changes Moving the LastWrittenLSN code from PG versions to the Neon extension and linking it with hooks. Related Postgres PR: https://github.com/neondatabase/postgres/pull/590 Closes: https://github.com/neondatabase/neon/issues/10973 --------- Co-authored-by: Tristan Partin <tristan@neon.tech>	2025-03-19 17:29:40 +00:00
Konstantin Knizhnik	24f41bee5c	Update LFC in case of unlogged build (#11262) ## Problem Unlogged build is used for GIST/SPGIST/GIN/HNSW indexes. In this mode we first change the relation class to `RELPERSISTENCE_UNLOGGED` and save them on local disk. But we do not save unlogged relations in LFC. This may cause fetching an incorrect value from LFC if the relfilenode is reused. ## Summary of changes Save modified pages in LFC in the second stage of unlogged build (when modified pages are wal-logged). There is no need to save pages in LFC in the first phase because they will in any case be overwritten with an assigned LSN in the second phase. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-17 19:06:42 +00:00
Konstantin Knizhnik	15e63afe7d	Support DEBUG_COMPARE_LOCAL mode for unlogged index build (#11257) ## Problem In unlogged index build (used for GIST/SPGIST/GIN indexes) files are created on disk and then removed at the end. This contradicts the logic of DEBUG_COMPARE_LOCAL mode. ## Summary of changes Do not create and unlink files in unlogged build in DEBUG_COMPARE_LOCAL mode. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-17 06:07:24 +00:00
Konstantin Knizhnik	398d2794eb	Handle DEBUG_COMPARE_LOCAL mode in neon_zeroextend (#11220 ) ## Problem DEBUG_COMPARE_LOCAL is not supported in neon_zeroextend added in PG16 ## Summary of changes Add support of DEBUG_COMPARE_LOCAL in neon_zeroextend Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-13 16:30:32 +00:00
Konstantin Knizhnik	f60ffe3021	Rebase compare local debug mode (#11174 ) ## Problem DEBUG_COMPARE_LOCAL mode is broken See https://neondb.slack.com/archives/C03QLRH7PPD/p1732862608323269?thread_ts=1732711054.862919&cid=C03QLRH7PPD ## Summary of changes Fix compile errors and unlogged build issues. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-12 05:52:18 +00:00
Arseny Sher	359c64c779	walproposer: pre generations refactoring (#11060 ) ## Problem https://github.com/neondatabase/neon/issues/10851 ## Summary of changes Do some refactoring before making walproposer generations aware. - Rename SS_VOTING to SS_WAIT_VOTING, SS_IDLE to SS_WAIT_ELECTED - Continue to get rid of epochs: rename GetEpoch to GetLastLogTerm, donorEpoch to donorLastLogTerm - Instead of counting n_votes, n_connected, introduce explicit WalProposerState (collecting terms / voting / elected). Refactor out TermsCollected and VotesCollected; they will determine state transition differently depending whether generations are enabled or not. There is no new logic in this PR and thus no new tests.	2025-03-11 14:01:00 +00:00
Konstantin Knizhnik	fb1957936c	Fix calculation of LFC used_pages (#11095) ## Problem The async prefetch in LFC PR causes incorrect calculation of LFC `used_pages` when a page is overwritten ## Summary of changes Decrement `used_pages` if a page is overwritten. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Matthias van de Meent <matthias@neon.tech>	2025-03-10 18:28:55 +00:00
Matthias van de Meent	bc052fd0fc	Add configuration options to disable prevlink checks (#11138 ) This allows for improved decoding of otherwise broken WAL. ## Problem Currently, if (or when) a WAL record has a wrong prevptr, that breaks decoding. With this, we don't have to break on that if we decide it's OK to proceed after that. ## Summary of changes Use a Neon GUC to allow the system to enable the NEON-specific skip_lsn_checks option in XLogReader.	2025-03-10 17:02:30 +00:00
Konstantin Knizhnik	c87d307e8c	Print state of connection buffer when no response is received from PS for a long time (#11145) ## Problem See https://neondb.slack.com/archives/C08DE6Q9C3B Sometimes compute is not able to receive responses from PS for a long time (minutes). I do not think that the problem is on the compute side, but in order to exclude this possibility I want to see more information about the connection state on the compute side, particularly the amount of data cached in the connection buffer. ## Summary of changes Right now we are dumping the state of the socket buffer. This PR adds printing the state of the connection buffer. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-09 18:36:36 +00:00
Em Sharnoff	1fe23fe8d2	compute/lfc: Add chunk size to neon_lfc_stats (#11100 ) This PR adds a new key to neon.neon_lfc_stats — 'file_cache_chunk_size_pages'. It just returns the value of BLOCKS_PER_CHUNK from the LFC implementation. The new value should (eventually) allow changing the chunk size without breaking any places that rely on LFC stats values measured in number of chunks. See neondatabase/cloud#25170 for more.	2025-03-05 20:35:08 +00:00
Konstantin Knizhnik	438f7bb726	Check response status in prefetch_lookup (#11080) ## Problem New async prefetch introduces the `prefetch_lookup` function, which is called before LFC lookup to check if a prefetch request is already completed. This function does not currently check that the response is actually `T_NeonGetPageResponse` (and not an error). ## Summary of changes Add checks for response tag. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-05 10:03:09 +00:00
Arseny Sher	ef2b50994c	walproposer: basic infra to enable generations (#11002 ) ## Problem Preparation for https://github.com/neondatabase/neon/issues/10851 ## Summary of changes Add walproposer `safekeepers_generations` field which can be set by prefixing `neon.safekeepers` GUC with `g#n:`. Non zero value (n) forces walproposer to use generations. In particular, this also disables implicit timeline creation as timeline will be created by storcon. Add test checking this. Also add missing infra: `--safekeepers-generation` flag to neon_local endpoint start + fix `--start-timeout` flag: it existed but value wasn't used.	2025-03-03 13:20:20 +00:00
Konstantin Knizhnik	8669bfe493	Do not store zero pages in inmem SMGR for walredo (#11043 ) ## Problem See https://neondb.slack.com/archives/C033RQ5SPDH/p1740157873114339 smgrextend for FSM fork is called during page reconstruction by walredo process causing overflow of inmem SMGR (64 pages). ## Summary of changes Do not store zero pages in inmem SMGR because `inmem_read` returns zero page if it is not able to locate specified block. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-03 12:50:07 +00:00
Konstantin Knizhnik	e58f264a05	Increase inmem SMGR size for walredo process to 100 pages (#10937) ## Problem We see `Inmem storage overflow` in page server logs: https://neondb.slack.com/archives/C033RQ5SPDH/p1740157873114339 The walredo process is using inmem SMGR with storage size limited to 64 pages with a warning watermark of 32 (based on the assumption that XLR_MAX_BLOCK_ID is 32, so a WAL record cannot access more than 32 pages). Actually this is not true. We can update up to 3 forks for each block (including updates of the FSM and VM forks). ## Summary of changes This PR increases the inmem SMGR size for the walredo process to 100 pages and prints a stack trace in case of overflow. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-27 14:31:05 +00:00
Matthias van de Meent	a283edaccf	PS/Prefetch: Use a timeout for reading data from TCP (#10834 ) This reduces pressure on OS TCP buffers, reducing flush times in other systems like PageServer. ## Problem ## Summary of changes	2025-02-27 14:00:18 +00:00
Arseny Sher	c1a040447d	walproposer: send valid timeline_start_lsn in v2 (#10994 ) ## Problem https://github.com/neondatabase/neon/pull/10647 dropped timeline_start_lsn from protocol messages as it can be taken from term history. In v2 0 was sent in the placeholder. However, until safekeepers are deployed with that PR they still use the value, setting timeline_start_lsn to 0, which confuses WAL reading; problem appears only when compute includes 10647 but safekeepers don't. ref https://neondb.slack.com/archives/C04DGM6SMTM/p1740577649644269?thread_ts=1740572363.541619&cid=C04DGM6SMTM ## Summary of changes Send real value instead of 0 in v2.	2025-02-26 17:38:44 +00:00
Konstantin Knizhnik	dc975d554a	Increment getpage histogram in prefetch_lookup (#10965) ## Problem PR https://github.com/neondatabase/neon/pull/10442 added prefetch_lookup function. It changed handling of getpage requests in compute. Before: 1. Lookup in LFC (return if found) 2. Register prefetch buffer 3. Wait prefetch result (increment getpage_hist) Now: 1. Lookup prefetch ring (return if prefetch request is already completed) 2. Lookup in LFC (return if found) 3. Register prefetch buffer 4. Wait prefetch result (increment getpage_hist) So if the prefetch result is already available, then the getpage histogram is not incremented. This causes failures of some test_throughtput benchmarks: https://neondb.slack.com/archives/C033RQ5SPDH/p1740425527249499 ## Summary of changes Increment getpage histogram in `prefetch_lookup` Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-25 19:51:38 +00:00
Konstantin Knizhnik	8f82c661d4	Move neon_pgstat_file_size_limit to the extension (#10959) ## Problem PG14 uses a separate backend for the stats collector, which has no access to shared memory. Since the AUX mechanism requires access to shared memory, persisting the pgstat.stat file is not supported on PG14. And so there is no definition of the `neon_pgstat_file_size_limit` variable. This makes it impossible to provide the same config for all Postgres versions. ## Summary of changes Move neon_pgstat_file_size_limit to Neon extension. Postgres submodules PR: https://github.com/neondatabase/postgres/pull/587 https://github.com/neondatabase/postgres/pull/588 https://github.com/neondatabase/postgres/pull/589 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Tristan Partin <tristan@neon.tech>	2025-02-25 12:23:04 +00:00
Arseny Sher	758f597280	compute <-> sk protocol v3 (#10647 ) ## Problem As part of https://github.com/neondatabase/neon/issues/8614 we need to pass membership configurations between compute and safekeepers. ## Summary of changes Add version 3 of the protocol carrying membership configurations. Greeting message in both sides gets full conf, and other messages generation number only. Use protocol bump to include other accumulated changes: - stop packing whole structs on the wire as is; - make the tag u8 instead of u64; - send all ints in network order; - drop proposer_uuid, we can pass it in START_WAL_PUSH and it wasn't much useful anyway. Per message changes, apart from mconf: - ProposerGreeting: tenant / timeline id is sent now as hex cstring. Remove proto version, it is passed outside in START_WAL_PUSH. Remove postgres timeline, it is unused. Reorder fields a bit. - AcceptorGreeting: reorder fields - VoteResponse: timeline_start_lsn is removed. It can be taken from first member of term history, and later we won't need it at all when all timelines will be explicitly created. Vote itself is u8 instead of u64. - ProposerElected: timeline_start_lsn is removed for the same reasons. - AppendRequest: epoch_start_lsn removed, it is known from term history in ProposerElected. Both compute and sk are able to talk v2 and v3 to make rollbacks (in case we need them) easier; neon.safekeeper_proto_version GUC sets the client version. v2 code can be dropped later. So far empty conf is passed everywhere, future PRs will handle them. To test, add param to some tests choosing proto version; we want to test both 2 and 3 until we fully migrate. ref https://github.com/neondatabase/neon/issues/10326 --------- Co-authored-by: Arthur Petukhovsky <petuhovskiy@yandex.ru>	2025-02-25 11:56:05 +00:00
Heikki Linnakangas	565a9e62a1	compute: Disconnect if no response to a pageserver request is received (#10882 ) We've seen some cases in production where a compute doesn't get a response to a pageserver request for several minutes, or even more. We haven't found the root cause for that yet, but whatever the reason is, it seems overly optimistic to think that if the pageserver hasn't responded for 2 minutes, we'd get a response if we just wait patiently a little longer. More likely, the pageserver is dead or there's some kind of a network glitch so that the TCP connection is dead, or at least stuck for a long time. Either way, it's better to disconnect and reconnect. I set the default timeout to 2 minutes, which should be enough for any GetPage request under normal circumstances, even if the pageserver has to download several layer files from remote storage. Make the disconnect timeout configurable. Also make the "log interval", after which we print a message to the log configurable, so that if you change the disconnect timeout, you can set the log timeout correspondingly. The default log interval is still 10 s. The new GUCs are called "neon.pageserver_response_log_timeout" and "neon.pageserver_response_disconnect_timeout". Includes a basic test for the log and disconnect timeouts. Implements issue #10857	2025-02-24 20:16:37 +00:00
Heikki Linnakangas	40acb0c06d	Fix usage of WaitEventSetWait() with timeout (#10947 ) WaitEventSetWait() returns the number of "events" that happened, and only that many events in the WaitEvent array are updated. When the timeout is reached, it returns 0 and does not modify the WaitEvent array at all. We were reading 'event.events' without checking the return value, which would be uninitialized when the timeout was hit. No test included, as this is harmless at the moment. But this caused the test I'm including in PR #10882 to fail. That PR changes the logic to loop back to retry the PQgetCopyData() call if WL_SOCKET_READABLE was set. Currently, making an extra call to PQconsumeInput() is harmless, but with that change in logic, it turns into a busy-wait.	2025-02-24 17:15:07 +00:00
Konstantin Knizhnik	b1d8771d5f	Store prefetch results in LFC cache once as soon as they are received (#10442) ## Problem Prefetch is performed locally, so different backends can request the same pages from PS. Such duplicated requests increase the load on the page server and network traffic. Making prefetch global seems to be very difficult and undesirable, because different queries can access chunks at different speeds. Storing prefetch chunks in LFC will not completely eliminate duplicates, but can minimise such requests. The problem with storing prefetch results in LFC is that in this case the page is not protected by a shared buffer lock. So we will have to perform extra synchronisation on the LFC side. See: https://neondb.slack.com/archives/C0875PUD0LC/p1736772890602029?thread_ts=1736762541.116949&cid=C0875PUD0LC @MMeent implementation of prewarm: See https://github.com/neondatabase/neon/pull/10312/ ## Summary of changes Use condition variables to synchronize access to LFC entry. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-21 16:56:16 +00:00
Konstantin Knizhnik	0b9b391ea0	Fix calculation of prefetch ring position to fit in-flight request in resized ring buffer (#10899) ## Problem Refer to https://github.com/neondatabase/neon/issues/10885 The wait position in the ring buffer, used to restrict the number of in-flight requests, is not correctly calculated. ## Summary of changes Update condition and remove redundant assertion Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-20 20:21:44 +00:00
Tristan Partin	da79cc5eee	Add neon.extension_server_{connect,request}_timeout (#10801 ) Instead of hardcoding the request timeout, let's make it configurable as a PGC_SUSET GUC. Additionally, add a connect timeout GUC. Although the extension server runs on the compute, it is always best to keep operations from hanging. Better to present a timeout error to the user than a stuck backend. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-17 15:40:43 +00:00
Konstantin Knizhnik	8c6d133d31	Fix out-of-boundaries access in addSHLL function (#10840) ## Problem See https://github.com/neondatabase/neon/issues/10839 The rho(x,b) function returns values in the range [1,b+1] and addSHLL tries to store them in an array of size b+1. ## Summary of changes Subtract 1 from the value returned by rho --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-17 12:54:17 +00:00
Christian Schwarz	a32e8871ac	compute/pageserver: correlation of logs through backend PID (via `application_name`) (#10810 ) This PR makes compute set the `application_name` field to the PG backend process PID which is also included in each compute log line. This allows correlation of Pageserver connection logs with compute logs in a way that was guesswork before this PR. In future, we can switch for a more unique identifier for a page_service session. Refs - discussion in https://neondb.slack.com/archives/C08DE6Q9C3B/p1739465208296169?thread_ts=1739462628.361019&cid=C08DE6Q9C3B - fixes https://github.com/neondatabase/neon/issues/10808	2025-02-14 20:11:42 +00:00
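
Note on the addSHLL out-of-bounds fix (#10840) listed above: the off-by-one can be illustrated with a minimal Python sketch. Only the range [1, b+1] and the array size b+1 come from the commit message. The names `B`, `rho`, and `add_value` and the bit-scanning direction are hypothetical stand-ins, and the real code is C in the Neon extension and may differ in every other detail.

```python
B = 8  # hypothetical number of hash bits examined per value


def rho(x, b):
    """Return the 1-based position of the lowest set bit among the low b bits
    of x, or b + 1 if none of them is set. The result is in [1, b + 1]."""
    for pos in range(1, b + 1):
        if x & 1:
            return pos
        x >>= 1
    return b + 1


def add_value(counts, hash_value):
    """Record one hashed value in a table of b + 1 counters (indices 0..b)."""
    r = rho(hash_value, B)
    # Using r directly as the index would address slot b + 1, one past the
    # end, whenever no bit is set. Subtracting 1 maps [1, b + 1] onto [0, b].
    counts[r - 1] += 1


counts = [0] * (B + 1)
for h in (0b1, 0b100, 0):
    add_value(counts, h)
```

With the unshifted index, the all-zero input would raise an `IndexError` in Python. In C the same write would silently land past the end of the array, which is the class of bug the commit describes.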

384 Commits