rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-25 15:19:58 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	d696c41807	Bump default neon extension version to 1.5 (#9188 ) Commit `263dfba6ee` introduced neon extension version 1.5, which included some new functions and views for metrics. It didn't bump the default neon extension number yet, so that we could still safely roll back to the old binary if necessary. This bumps the default version.	2024-09-30 09:20:52 +03:00
Matthias van de Meent	5c5871111a	WalProposer: Read WAL directly from WAL buffers in PG17 (#9171 ) This reduces the overhead of the WalProposer when it is not being throttled by SK WAL acceptance rate	2024-09-27 17:47:05 +02:00
Tristan Partin	8ace9ea25f	Format long single DATA line in pgxn/Makefile This should be a little more readable. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-09-25 16:25:17 -05:00
Alexander Bayandin	523cf71721	Fix compiler warnings on macOS (#9128 ) ## Problem Compilation of neon extension on macOS produces a warning ``` pgxn/neon/neon_perf_counters.c:50:1: error: non-void function does not return a value [-Werror,-Wreturn-type] ``` ## Summary of changes - Change the return type of `NeonPerfCountersShmemInit` to void	2024-09-24 18:11:31 +00:00
Konstantin Knizhnik	1c5d6e59a0	Maintain number of used pages for LFC (#9088 ) ## Problem An LFC cache entry is a chunk (right now the chunk size is 1 MB). LFC statistics show the number of chunks, but not the number of used pages. And the autoscaling team wants to know how sparse the LFC is: https://neondb.slack.com/archives/C04DGM6SMTM/p1726782793595969 It is possible to obtain it from the view with `select count(*) from local_cache`. But that is an expensive operation, enumerating all entries in the LFC under lock. ## Summary of changes This PR adds "file_cache_used_pages" to the `neon_lfc_stats` view: ``` select * from neon_lfc_stats; lfc_key \| lfc_value -----------------------+----------- file_cache_misses \| 3139029 file_cache_hits \| 4098394 file_cache_used \| 1024 file_cache_writes \| 3173728 file_cache_size \| 1024 file_cache_used_pages \| 25689 (6 rows) ``` Please notice that this PR doesn't change the neon extension API, so there is no need to create a new version of the Neon extension. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-09-23 22:05:32 +03:00
Heikki Linnakangas	263dfba6ee	Add views for metrics about pageserver requests (#9008 ) The metrics include a histogram of how long we need to wait for a GetPage request, number of reconnects, and number of requests among other things. The metrics are not yet exported anywhere, but you can query them manually. Note: This does not bump the default version of the 'neon' extension. We will do that later, as a separate PR. The reason is that this allows us to roll back the compute image smoothly, if necessary. Once the image that includes the new extension .so file with the new functions has been rolled out, and we're confident that we don't need to roll back the image anymore, we can change default extension version and actually start using the new functions and views. This is what the view looks like: ``` postgres=# select * from neon_perf_counters ; metric \| bucket_le \| value ---------------------------------------+-----------+---------- getpage_wait_seconds_count \| \| 300 getpage_wait_seconds_sum \| \| 0.048506 getpage_wait_seconds_bucket \| 2e-05 \| 0 getpage_wait_seconds_bucket \| 3e-05 \| 0 getpage_wait_seconds_bucket \| 6e-05 \| 71 getpage_wait_seconds_bucket \| 0.0001 \| 124 getpage_wait_seconds_bucket \| 0.0002 \| 248 getpage_wait_seconds_bucket \| 0.0003 \| 279 getpage_wait_seconds_bucket \| 0.0006 \| 297 getpage_wait_seconds_bucket \| 0.001 \| 298 getpage_wait_seconds_bucket \| 0.002 \| 298 getpage_wait_seconds_bucket \| 0.003 \| 298 getpage_wait_seconds_bucket \| 0.006 \| 300 getpage_wait_seconds_bucket \| 0.01 \| 300 getpage_wait_seconds_bucket \| 0.02 \| 300 getpage_wait_seconds_bucket \| 0.03 \| 300 getpage_wait_seconds_bucket \| 0.06 \| 300 getpage_wait_seconds_bucket \| 0.1 \| 300 getpage_wait_seconds_bucket \| 0.2 \| 300 getpage_wait_seconds_bucket \| 0.3 \| 300 getpage_wait_seconds_bucket \| 0.6 \| 300 getpage_wait_seconds_bucket \| 1 \| 300 getpage_wait_seconds_bucket \| 2 \| 300 getpage_wait_seconds_bucket \| 3 \| 300 getpage_wait_seconds_bucket \| 6 \| 300 getpage_wait_seconds_bucket \| 10 \| 300 getpage_wait_seconds_bucket \| 20 \| 300 getpage_wait_seconds_bucket \| 30 \| 300 getpage_wait_seconds_bucket \| 60 \| 300 getpage_wait_seconds_bucket \| 100 \| 300 getpage_wait_seconds_bucket \| Infinity \| 300 getpage_prefetch_requests_total \| \| 69 getpage_sync_requests_total \| \| 231 getpage_prefetch_misses_total \| \| 0 getpage_prefetch_discards_total \| \| 0 pageserver_requests_sent_total \| \| 323 pageserver_requests_disconnects_total \| \| 0 pageserver_send_flushes_total \| \| 323 file_cache_hits_total \| \| 0 (39 rows) ```	2024-09-23 21:28:50 +03:00
Christian Schwarz	59b4c2eaf9	walredo: add a ping method (#8952 ) Not used in production, but in benchmarks, to demonstrate minimal RTT. (It would be nice to not have to copy the 8KiB of zeroes, but, that would require larger protocol changes). Found this useful in investigation https://github.com/neondatabase/neon/pull/8952.	2024-09-23 10:19:37 +00:00
Matthias van de Meent	78938d1b59	[compute/postgres] feature: PostgreSQL 17 (#8573 ) This adds preliminary PG17 support to Neon, based on RC1 / 2024-09-04 `07b828e9d4` NOTICE: The data produced by the included version of the PostgreSQL fork may not be compatible with the future full release of PostgreSQL 17 due to expected or unexpected future changes in magic numbers and internals. DO NOT EXPECT DATA IN V17-TENANTS TO BE COMPATIBLE WITH THE 17.0 RELEASE! Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-09-12 23:18:41 +01:00
Arseny Sher	11cf16e3f3	safekeeper: add term_bump endpoint. When walproposer observes a higher term, it now restarts instead of crashing the whole compute with PANIC; this avoids a compute crash after a term_bump call. After a successful election we're still checking last_log_term of the highest given vote to ensure basebackup is good, and PANIC otherwise. It will be used for migration per 035-safekeeper-dynamic-membership-change.md and https://github.com/neondatabase/docs/pull/21 ref https://github.com/neondatabase/neon/issues/8700	2024-09-06 19:13:50 +03:00
Heikki Linnakangas	2d10306f7a	Remove support for pageserver <-> compute protocol version 1 (#8774 ) Protocol version 2 has been the default for a while now, and we no longer have any computes running in production that used protocol version 1. This completes the migration by removing support for v1 in both the pageserver and the compute. See issue #6211.	2024-08-27 18:36:33 +03:00
Alexey Kondratov	9b9f90c562	fix(walproposer): Do not restart on safekeepers reordering (#8840 ) ## Problem Currently, we compare `neon.safekeepers` values as is, so we unnecessarily restart walproposer even if safekeepers set didn't change. This leads to errors like: ```log FATAL: [WP] restarting walproposer to change safekeeper list from safekeeper-8.us-east-2.aws.neon.tech:6401,safekeeper-11.us-east-2.aws.neon.tech:6401,safekeeper-10.us-east-2.aws.neon.tech:6401 to safekeeper-11.us-east-2.aws.neon.tech:6401,safekeeper-8.us-east-2.aws.neon.tech:6401,safekeeper-10.us-east-2.aws.neon.tech:6401 ``` ## Summary of changes Split the GUC into the list of individual safekeepers and properly compare. We could've done that somewhere on the upper level, e.g., control plane, but I think it's still better when the actual config consumer is smarter and doesn't rely on upper levels.	2024-08-27 15:49:47 +02:00
Konstantin Knizhnik	7a485b599b	Fix race condition in LRU list update in get_cached_relsize (#8807 ) ## Problem See https://neondb.slack.com/archives/C07J14D8GTX/p1724347552023709 Manipulations with LRU list in relation size cache are performed under shared lock ## Summary of changes Take exclusive lock Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-08-22 23:53:37 +03:00
Konstantin Knizhnik	2be69af6c3	Track holes to be able to reuse them once LFC limit is increased (#8575 ) ## Problem Repeatedly increasing and decreasing the LFC limit may cause unlimited growth of the LFC file, because holes punched while shrinking the LFC are not reused when the LFC is extended. ## Summary of changes Keep track of holes and reuse them when the LFC size is increased. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-08-16 22:19:44 +03:00
Konstantin Knizhnik	f087423a01	Handle reload config file request in LR monitor (#8732 ) ## Problem Logical replication BGW checking replication lag is not reloading config ## Summary of changes Add handling of reload config request Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-08-15 16:28:25 +03:00
Konstantin Knizhnik	7a1736ddcf	Preserve HEAP_COMBOCID when restoring t_cid from WAL (#8503 ) ## Problem See https://github.com/neondatabase/neon/issues/8499 ## Summary of changes Save HEAP_COMBOCID flag in WAL and do not clear it in redo handlers. Related Postgres PRs: https://github.com/neondatabase/postgres/pull/457 https://github.com/neondatabase/postgres/pull/458 https://github.com/neondatabase/postgres/pull/459 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-08-14 08:13:20 +03:00
Konstantin Knizhnik	afb68b0e7e	Report search_path to make it possible to use it in pgbouncer track_extra_parameters (#8303 ) ## Problem When pooled connections are used, session semantics are not preserved, including GUC settings. Many customers have a particular problem with setting search_path. But pgbouncer 1.20 has a `track_extra_parameters` setting which allows tracking parameters included in the startup packet which are reported by Postgres. Postgres has [an official list of parameters that it reports to the client](https://www.postgresql.org/docs/15/protocol-flow.html#PROTOCOL-ASYNC). This PR makes Postgres also report `search_path` and so allows including it in `track_extra_parameters`. ## Summary of changes Set GUC_REPORT flag for `search_path`. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-08-13 15:07:24 +03:00
Arseny Sher	d24f1b6c04	Allow logical_replication_max_snap_files = -1 which disables the mechanism.	2024-08-13 09:42:16 +03:00
Sasha Krassovsky	32aa1fc681	Add on-demand WAL download to slot funcs (#8705 ) ## Problem Currently we can have an issue where if someone does `pg_logical_slot_advance`, it could fail because it doesn't have the WAL locally. ## Summary of changes Adds on-demand WAL download and a test to these slot funcs. Before adding these, the test fails with ``` requested WAL segment pg_wal/000000010000000000000001 has already been removed ``` After the changes, the test passes Relies on: - https://github.com/neondatabase/postgres/pull/466 - https://github.com/neondatabase/postgres/pull/467 - https://github.com/neondatabase/postgres/pull/468	2024-08-12 20:54:42 -08:00
Shinya Kato	41b5ee491e	Fix a comment in walproposer_pg.c (#8583 ) ## Problem Perhaps there is an error in the source code comment. ## Summary of changes Fix "walsender" to "walproposer"	2024-08-12 13:24:25 +01:00
Arseny Sher	a4eea5025c	Fix logical apply worker reporting of flush_lsn wrt sync replication. It should take syncrep flush_lsn into account because WAL before it on endpoint restart is lost, which makes replication miss some data if slot had already been advanced too far. This commit adds test reproducing the issue and bumps vendor/postgres to commit with the actual fix.	2024-08-12 13:14:02 +03:00
Alex Chi Z.	a155914c1c	fix(neon): disable create tablespace stmt (#8657 ) part of https://github.com/neondatabase/neon/issues/8653 Disable create tablespace stmt. It turns out it requires much less effort to do the regress test mode flag than patching the test cases, and given that we might need to support tablespaces in the future, I decided to add a new flag `regress_test_mode` to change the behavior of create tablespace. Tested manually that without setting regress_test_mode, create tablespace will be rejected. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-08-09 09:18:55 +01:00
Konstantin Knizhnik	925c5ad1e8	Make async connect work on MacOS: it is necessary to call WaitLatchOrSocket before PQconnectPoll (#8472 ) ## Problem While investigating the problem with test_subscriber_restart flakiness, I found out that this test does not pass at all for PG 14/15 on MacOS (while it works for PG16). ## Summary of changes Rewrite the async connect state machine in exactly the same way as in Vanilla: call `WaitLatchOrSocket` with `WL_SOCKET_WRITEABLE` before calling `PQconnectPoll`. Please notice that most likely it will not fix the flakiness of test_subscriber_restart. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-24 09:59:18 +03:00
Shinya Kato	d47c94b336	Fix to use a tab instead of spaces (#8394 ) ## Problem There were spaces instead of a tab in the C source file. ## Summary of changes I fixed to use a tab instead of spaces.	2024-07-23 17:46:05 +02:00
Konstantin Knizhnik	563d73d923	Use smgrexists() instead of access() to enforce uniqueness of generated relfilenumber (#7992 ) ## Problem Postgres is using `access()` function in `GetNewRelFileNumber` to check if assigned relfilenumber is not used for any other relation. This check will not work in Neon, because we do not have all files in local storage. ## Summary of changes Use smgrexists() instead which will check at page server if such relfilenode is used. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-23 18:41:55 +03:00
Konstantin Knizhnik	a868e342d4	Change default version of Neon extension to 1.4	2024-07-22 17:58:07 +01:00
Konstantin Knizhnik	8a8b83df27	Add neon.running_xacts_overflow_policy to make it possible for RO replica to startup without primary even in case running xacts overflow (#8323 ) ## Problem Right now if there are too many running xacts to be restored from CLOG at replica startup, then the replica does not try to restore them and waits for a non-overflown running-xacts WAL record from the primary. But if the primary is not active, then the replica will not start at all. Too many running xacts can be caused by transactions with a large number of subtransactions. But right now it can also be caused by two other reasons: - Lack of shutdown checkpoint which updates `oldestRunningXid` (because of immediate shutdown) - nextXid alignment on 1024 boundary (which causes losing ~1k XIDs on each restart) Both problems are somehow addressed now. But we have existing customers with "sparse" CLOG and lack of checkpoints. To be able to start RO replicas for such customers I suggest adding a GUC which allows a replica to start even in case of subxacts overflow. ## Summary of changes Add `neon.running_xacts_overflow_policy` with the following values: - ignore: restore from CLOG last N XIDs and accept connections - skip: do not restore any XIDs from CLOG but still accept connections - wait: wait non-overflown running xacts record from primary node --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-15 15:52:00 +03:00
Arseny Sher	cd29156927	Fix memory context of NeonWALReader allocation. Allocating it in short living context is wrong because it is reused during backend lifetime.	2024-07-11 20:31:15 +03:00
Alexander Bayandin	c9fd8d7693	SELECT 💣(); (#8270 ) ## Problem We want to be able to test how our infrastructure reacts to segfaults in Postgres (for example, we collect cores, and get some required logs/metrics, etc) ## Summary of changes - Add `trigger_segfault` function to `neon_test_utils` to trigger a segfault in Postgres - Add `trigger_panic` function to `neon_test_utils` to trigger SIGABRT (by using `elog(PANIC, ...)`) - Fix cleanup logic in regression tests in endpoint crashed	2024-07-05 15:12:01 +01:00
Konstantin Knizhnik	88b13d4552	implement rolling hyper-log-log algorithm (#8068 ) ## Problem See #7466 ## Summary of changes Implement the algorithm described in https://hal.science/hal-00465313/document A new GUC is added: `neon.wss_max_duration`, which specifies the size of the sliding window (in seconds). Default value is 1 hour. It is possible to request an estimation of working set size (within this window) using the new function `approximate_working_set_size_seconds`. The old function `approximate_working_set_size` is preserved for backward compatibility. But its scope is also limited by `neon.wss_max_duration`. Version of Neon extension is changed to 1.4 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Matthias van de Meent <matthias@neon.tech>	2024-07-04 22:03:58 +03:00
Konstantin Knizhnik	4a0c2aebe0	Add test for proper handling of connection failure to avoid 'cannot wait on socket event without a socket' error (#8231 ) ## Problem See https://github.com/neondatabase/cloud/issues/14289 and PR #8210 ## Summary of changes Add test for problems fixed in #8210 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-02 21:45:42 +03:00
Konstantin Knizhnik	0497b99f3a	Check status of connection after PQconnectStartParams (#8210 ) ## Problem See https://github.com/neondatabase/cloud/issues/14289 ## Summary of changes Check connection status after calling PQconnectStartParams --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-02 06:56:10 +03:00
Heikki Linnakangas	0789160ffa	tests: Make neon_xlogflush() flush all WAL, if you omit the LSN arg (#8215 ) This makes it much more convenient to use in the common case that you want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the argument doesn't work for the same reasons as explained in the comments: we need to back off to the beginning of a page if the previous record ended at page boundary.) I plan to use this to fix the issue that Arseny Sher called out at https://github.com/neondatabase/neon/pull/7288#discussion_r1660063852	2024-07-01 10:55:18 -05:00
Heikki Linnakangas	9ce193082a	Restore running xacts from CLOG on replica startup (#7288 ) We have one pretty serious MVCC visibility bug with hot standby replicas. We incorrectly treat any transactions that are in progress in the primary, when the standby is started, as aborted. That can break MVCC for queries running concurrently in the standby. It can also lead to hint bits being set incorrectly, and that damage can last until the replica is restarted. The fundamental bug was that we treated any replica start as starting from a shut down server. The fix for that is straightforward: we need to set 'wasShutdown = false' in InitWalRecovery() (see changes in the postgres repo). However, that introduces a new problem: with wasShutdown = false, the standby will not open up for queries until it receives a running-xacts WAL record from the primary. That's correct, and that's how Postgres hot standby always works. But it's a problem for Neon, because: * It changes the historical behavior for existing users. Currently, the standby immediately opens up for queries, so if they now need to wait, we can break existing use cases that were working fine (assuming you don't hit the MVCC issues). * The problem is much worse for Neon than it is for standalone PostgreSQL, because in Neon, we can start a replica from an arbitrary LSN. In standalone PostgreSQL, the replica always starts WAL replay from a checkpoint record, and the primary arranges things so that there is always a running-xacts record soon after each checkpoint record. You can still hit this issue with PostgreSQL if you have a transaction with lots of subtransactions running in the primary, but it's pretty rare in practice. To mitigate that, we introduce another way to collect the running-xacts information at startup, without waiting for the running-xacts WAL record: We can scan the CLOG for XIDs that haven't been marked as committed or aborted. It has limitations with subtransactions too, but should mitigate the problem for most users. See https://github.com/neondatabase/neon/issues/7236. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-01 12:58:12 +03:00
Heikki Linnakangas	75c84c846a	tests: Make neon_xlogflush() flush all WAL, if you omit the LSN arg This makes it much more convenient to use in the common case that you want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the argument doesn't work for the same reasons as explained in the comments: we need to back off to the beginning of a page if the previous record ended at page boundary.) I plan to use this to fix the issue that Arseny Sher called out at https://github.com/neondatabase/neon/pull/7288#discussion_r1660063852	2024-07-01 12:58:08 +03:00
Arseny Sher	6f20a18e8e	Allow to change compute safekeeper list without restart. - Add --safekeepers option to neon_local reconfigure - Add it to python Endpoint reconfigure - Implement config reload in walproposer by restarting the whole bgw when safekeeper list changes. ref https://github.com/neondatabase/neon/issues/6341	2024-06-27 15:08:35 +03:00
Heikki Linnakangas	24ce73ffaf	Silence compiler warning (#8153 ) I saw this compiler warning on my laptop: pgxn/neon_walredo/walredoproc.c:178:10: warning: using the result of an assignment as a condition without parentheses [-Wparentheses] if (err = close_range_syscall(3, ~0U, 0)) { ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ pgxn/neon_walredo/walredoproc.c:178:10: note: place parentheses around the assignment to silence this warning if (err = close_range_syscall(3, ~0U, 0)) { ^ ( ) pgxn/neon_walredo/walredoproc.c:178:10: note: use '==' to turn this assignment into an equality comparison if (err = close_range_syscall(3, ~0U, 0)) { ^ == 1 warning generated. I'm not sure what compiler version or options cause that, but it's a good warning. Write the call a little differently, to avoid the warning and to make it a little more clear anyway. (The 'err' variable wasn't used for anything, so I'm surprised we were not seeing a compiler warning on the unused value, too.)	2024-06-26 19:19:27 +03:00
Arthur Petukhovsky	47e5bf3bbb	Improve term reject message in walproposer (#8164 ) Co-authored-by: Tristan Partin <tristan@neon.tech>	2024-06-26 15:26:52 +01:00
Heikki Linnakangas	fdadd6a152	Remove primary_is_running (#8162 ) This was a half-finished mechanism to allow a replica to enter hot standby mode sooner, without waiting for a running-xacts record. It had issues, and we are working on a better mechanism to replace it. The control plane might still set the flag in the spec file, but compute_ctl will simply ignore it.	2024-06-26 15:13:03 +03:00
MMeent	fd0b22f5cd	Make sure we can handle temporarily offline PS when we first connect (#8094 ) Fixes https://github.com/neondatabase/neon/issues/7897 ## Problem `shard->delay_us` was potentially uninitialized when we connect to PS, as it wasn't set to a non-0 value until we've first connected to the shard's pageserver. That caused the exponential backoff to use an initial value (multiplier) of 0 for the first connection attempt to that pageserver, thus causing a hot retry loop with connection attempts to the pageserver without significant delay. That in turn caused attempts to reconnect to quickly fail, rather than showing the expected 'wait until pageserver is available' behaviour. ## Summary of changes We initialize shard->delay_us before connection initialization if we notice it is not initialized yet.	2024-06-19 15:05:31 +02:00
Arseny Sher	6bb8b1d7c2	Remove dead code from walproposer_pg.c Now that logical walsenders fetch WAL from safekeepers recovery in walproposer is not needed. Fixes warnings.	2024-06-18 21:12:02 +03:00
Heikki Linnakangas	dc2ab4407f	Fix on-demand SLRU download on standby starting at WAL segment boundary (#8031 ) If a standby is started right after switching to a new WAL segment, the request in the SLRU download request would point to the beginning of the segment (e.g. 0/5000000), while the not-modified-since LSN would point to just after the page header (e.g. 0/5000028). It's effectively the same position, as there cannot be any WAL records in between, but the pageserver rightly errors out on any request where the request LSN < not-modified since LSN. To fix, round down the not-modified since LSN to the beginning of the page like the request LSN. Fixes issue #8030	2024-06-13 00:31:31 +03:00
Sasha Krassovsky	b7a0c2b614	Add On-demand WAL Download to logicalfuncs (#7960 ) We implemented on-demand WAL download for walsender, but other things that may want to read the WAL from safekeepers don't do that yet. This PR makes it do that by adding the same set of hooks to logicalfuncs. Addresses https://github.com/neondatabase/neon/issues/7959 Also relies on: https://github.com/neondatabase/postgres/pull/438 https://github.com/neondatabase/postgres/pull/437 https://github.com/neondatabase/postgres/pull/436	2024-06-11 17:59:32 -07:00
Heikki Linnakangas	78a59b94f5	Copy editor config for the neon extension from PostgreSQL (#8009 ) This makes IDEs and github diff format the code the same way as PostgreSQL sources, which is the style we try to maintain.	2024-06-11 23:19:18 +03:00
Anastasia Lubennikova	66c6b270f1	Downgrade No response from reading prefetch entry WARNING to LOG	2024-06-06 20:56:19 +01:00
Arseny Sher	e6db8069b0	neon_walreader: check after local read that the segment still exists. Otherwise read might receive zeros/garbage if the file is recycled (renamed) as a future segment.	2024-05-31 12:57:56 +03:00
Konstantin Knizhnik	d61e924103	Fix connect to PS on MacOS/X (#7885 ) ## Problem After [`0e4f182680`], which introduced async connect, Neon is not able to connect to page server. ## Summary of changes Perform sync commit at MacOS/X --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-05-27 15:57:57 +03:00
MMeent	0e4f182680	Rework PageStream connection state handling: (#7611 ) * Make PS connection startup use async APIs This allows for improved query cancellation when we start connections * Make PS connections have per-shard connection retry state. Previously they shared global backoff state, which is bad for quickly getting all connections started and/or back online. * Make sure we clean up most connection state on failed connections. Previously, we could technically leak some resources that we'd otherwise clean up. Now, the resources are correctly cleaned up. * pagestore_smgr.c now PANICs on unexpected response message types. Unexpected responses are likely a symptom of having a desynchronized view of the connection state. As a desynchronized connection state can cause corruption, we PANIC, as we don't know what data may have been written to buffers: the only solution is to fail fast & hope we didn't write wrong data. * Catch errors in sync pagestream request handling. Previously, if a query was cancelled after a message was sent to the pageserver, but before the data was received, the backend could forget that it sent the synchronous request, and let others deal with the repercussions. This could then lead to incorrect responses, or errors such as "unexpected response from page server with tag 0x68"	2024-05-23 23:26:42 +02:00
Heikki Linnakangas	37f81289c2	Make 'neon.protocol_version = 2' the default, take two (#7819 ) Once all the computes in production have restarted, we can remove protocol version 1 altogether. See issue #6211. This was done earlier already in commit `0115fe6cb2`, but reverted before it was released to production in commit `bbe730d7ca` because of issue https://github.com/neondatabase/neon/issues/7692. That issue was fixed in commit `22afaea6e1`, so we are ready to change the default again.	2024-05-22 18:24:52 +03:00
Heikki Linnakangas	9217564026	Fix issues with determining request LSN in read replica (#7795 ) Don't set last-written LSN of a page when the record is replayed, only when the page is evicted from cache. For comparison, we don't update the last-written LSN on every page modification on the primary either, only when the page is evicted. Do update the last-written LSN when the page update is skipped in WAL redo, however. In neon_get_request_lsns(), don't be surprised if the last-written LSN is equal to the record being replayed. Use the LSN of the record being replayed as the request LSN in that case. Add a long comment explaining how that can happen. In neon_wallog_page, update last-written LSN also when Shutdown has been requested. We might still fetch and evict pages for a while, after shutdown has been requested, so we better continue to do that correctly. Enable the check that we don't evict a page with zero LSN also in standby, but make it a LOG message instead of PANIC Fixes issue https://github.com/neondatabase/neon/issues/7791	2024-05-22 18:24:21 +03:00
Heikki Linnakangas	3404e76a51	Fix confusion between 1-based Buffer and 0-based index (#7825 ) The code was working correctly, but was incorrectly using Buffer for a 0-based index into the BufferDesc array.	2024-05-22 18:24:21 +03:00
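
The confusion cleaned up in commit `3404e76a51` comes from PostgreSQL numbering shared buffers from 1 (the value 0 is reserved as `InvalidBuffer`), while the `BufferDesc` array is indexed from 0. Below is a minimal sketch of keeping the two numbering schemes apart, written in Python for brevity rather than in the extension's C. The helper names `buffer_to_index` and `index_to_buffer` and the `descriptors` list are hypothetical illustrations, not taken from the Neon sources.

```python
INVALID_BUFFER = 0  # PostgreSQL reserves Buffer 0 to mean "no buffer"


def buffer_to_index(buffer: int) -> int:
    """Convert a 1-based shared Buffer number to a 0-based descriptor index."""
    if buffer <= INVALID_BUFFER:
        # Negative values denote local buffers in PostgreSQL; not handled here.
        raise ValueError(f"not a valid shared buffer: {buffer}")
    return buffer - 1


def index_to_buffer(index: int) -> int:
    """Convert a 0-based descriptor index back to a 1-based Buffer number."""
    if index < 0:
        raise ValueError(f"negative descriptor index: {index}")
    return index + 1


# Hypothetical descriptor array with four slots, indexed from 0.
descriptors = ["desc0", "desc1", "desc2", "desc3"]

first = descriptors[buffer_to_index(1)]  # Buffer 1 maps to slot 0
last = descriptors[buffer_to_index(4)]   # Buffer 4 maps to slot 3
```

Routing every conversion through one pair of helpers means a 0-based index is never passed where a `Buffer` is expected, even in code that happens to produce correct results today.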

371 Commits