rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 05:22:56 +00:00

Author	SHA1	Message	Date
Alex Chi Z.	b22a84a7bf	feat(pageserver): support key range for manual compaction trigger (#9723 ) part of https://github.com/neondatabase/neon/issues/9114, we want to be able to run partial gc-compaction in tests. In the future, we can also expand this functionality to legacy compaction, so that we can trigger compaction for a specific key range. ## Summary of changes * Support passing compaction key range through pageserver routes. * Refactor input parameters of compact related function to take the new `CompactOptions`. * Add tests for partial compaction. Note that the test may or may not trigger compaction based on GC horizon. We need to improve the test case to ensure things always get below the gc_horizon and the gc-compaction can be triggered. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-11-19 19:38:41 +00:00
Arpad Müller	b092126c94	scrubber: fix parsing issue with Azure (#9797 ) Apparently Azure returns timelines ending with `/` which confuses the parsing. So remove all trailing `/`s before attempting to parse. Part of https://github.com/neondatabase/cloud/issues/19963	2024-11-19 20:10:53 +01:00
Alex Chi Z.	5e3fbef721	fix(pageserver): queue stopped error should be ignored during create timeline (#9767 ) close https://github.com/neondatabase/neon/issues/9730 The test case tests if anything goes wrong during pageserver restart + during timeline creation not complete. Therefore, queue is stopped error is normal in this case, except that it should be categorized as a shutdown error instead of a real error. ## Summary of changes * More comments for the test case. * Queue stopped error will now be forwarded as CreateTimelineError::ShuttingDown. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-11-19 14:10:09 -05:00
dependabot[bot]	15468cd23c	build(deps): bump aiohttp from 3.10.2 to 3.10.11 (#9794 )	2024-11-19 19:08:00 +00:00
Peter Bendel	a8ac895b83	re-acquire S3 OIDC token after long running tests for report upload to S3 (#9799 ) ## Problem If a benchmark or test case runs longer than the AWS OIDC token lifetime, subsequent uploads of test reports to S3 fail. Example: https://github.com/neondatabase/neon/actions/runs/11905529176/job/33176168174#step:9:243 ## Summary of changes In actions that require access to S3 and are invoked after a long-running Python test case, we re-acquire the OIDC token explicitly. Note that we need to pass down the aws_oicd_role_arn from the workflow to the action because actions have no access to GitHub vars for security reasons. Sample run https://github.com/neondatabase/neon/actions/runs/11912328276/job/33195676867	2024-11-19 18:22:51 +01:00
Heikki Linnakangas	ada84400b7	PostgreSQL minor version updates (17.2, 16.6, 15.10, 14.15) (#9795 ) The community decided to make a new off-schedule release due to ABI breakage in last week's release. We're not affected by the ABI breakage because we rebuild all extensions in our docker images, but let's stay up-to-date. There were a few other fixes in the release too.	2024-11-19 17:01:05 +02:00
Conrad Ludgate	191f745c81	fix(proxy/auth_broker): ignore -pooler suffix (#9800 ) Fixes https://github.com/neondatabase/cloud/issues/20400 We cannot mix local_proxy and pgbouncer, so we are filtering out the `-pooler` suffix prior to calling wake_compute.	2024-11-19 13:58:26 +00:00
Conrad Ludgate	37b97b3a68	chore(local_proxy): reduce some startup logging (#9798 ) Currently, local_proxy will write an error log if it doesn't find the config file. This is expected for startup, so it's just noise. It is an error if we do receive an explicit SIGHUP though. I've also demoted the build info logs to be debug level. We don't need them in the compute image since we have other ways to determine what code is running. Lastly, I've demoted SIGHUP signal handling from warn to info, since it's not really a warning event. See https://github.com/neondatabase/cloud/issues/10880 for more details	2024-11-19 13:58:11 +00:00
Konstantin Knizhnik	c9acd214ae	Do not create DSM segment for wal_redo_postgres (#9793 ) ## Problem See https://github.com/neondatabase/neon/issues/9738 ## Summary of changes Do not create DSM segment for wal_redo Postgres --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-11-19 11:56:40 +02:00
Peter Bendel	982cb1c15d	Move logic for ingest benchmark from GitHub workflow into python testcase (#9762 ) ## Problem The first version of the ingest benchmark had some parsing and reporting logic in shell script inside GitHub workflow. it is better to move that logic into a python testcase so that we can also run it locally. ## Summary of changes - Create new python testcase - invoke pgcopydb inside python test case - move the following logic into python testcase - determine backpressure - invoke pgcopydb and report its progress - parse pgcopydb log and extract metrics - insert metrics into perf test database - add additional column to perf test database that can receive endpoint ID used for pgcopydb run to have it available in grafana dashboard when retrieving other metrics for an endpoint ## Example run https://github.com/neondatabase/neon/actions/runs/11860622170/job/33056264386	2024-11-19 09:46:46 +00:00
Arpad Müller	9b6af2bcad	Add the ability to configure GenericRemoteStorage for the scrubber (#9652 ) Earlier work (#7547) has made the scrubber internally generic, but one could only configure it to use S3 storage. This is the final piece needed to make most of the scrubber (snapshotting still requires S3) configurable via GenericRemoteStorage. I.e. you can now set an env var like: ``` REMOTE_STORAGE_CONFIG='remote_storage = { bucket_name = "neon-dev-safekeeper-us-east-2d", bucket_region = "us-east-2" }' ``` and the scrubber will read it instead.	2024-11-18 21:01:48 +00:00
Arpad Müller	4fc3af15dd	Remove at most one retain_lsn entry from (possibly offloaded) timeline's parent (#9791 ) There is a potential data corruption issue, not one I've encountered, but it's still not hard to hit with some correct-looking code given our current architecture. It has to do with the timeline's memory object storage via reference counted `Arc`s, and the removal of `retain_lsn` entries at the drop of the last `Arc` reference. The corruption steps are as follows: 1. timeline gets offloaded. timeline object A doesn't get dropped though, because some long-running task accesses it 2. the same timeline gets unoffloaded again. timeline object B gets created for it, timeline object A still referenced. both point to the same timeline. 3. the task keeping the reference to timeline object A exits. destructor for object A runs, removing `retain_lsn` in the timeline's parent. 4. the timeline's parent runs gc without the `retain_lsn` of the still extant timeline's child, leading to data corruption. In general we are susceptible each time when we recreate a `Timeline` object in the same process, which happens both during a timeline offload/unoffload cycle, as well as during an ancestor detach operation. The solution this PR implements is to make the destructor for a timeline as well as an offloaded timeline remove at most one `retain_lsn`. PR #9760 has added a log line to print the refcounts at timeline offload, but this only detects one of the places where we do such a recycle operation. Plus it doesn't prevent the actual issue. I doubt that this occurs in practice. It is more a defense in depth measure. Usually I'd assume that the timeline gets dropped immediately in step 1, as there are no background tasks referencing it after its shutdown. But one never knows, and reducing the stakes of step 1 actually occurring is a really good idea, from potential data corruption to waste of CPU time. Part of #8088	2024-11-18 21:42:19 +01:00
Vlad Lazar	d7662fdc7b	feat(page_service): timeout-based batching of requests (#9321 ) ## Problem We don't take advantage of queue depth generated by the compute on the pageserver. We can process getpage requests more efficiently by batching them. ## Summary of changes Batch up incoming getpage requests that arrive within a configurable time window (`server_side_batch_timeout`). Then process the entire batch via one `get_vectored` timeline operation. By default, no merging takes place. ## Testing * Functional: https://github.com/neondatabase/neon/pull/9792 * Performance: will be done in staging/pre-prod # Refs * https://github.com/neondatabase/neon/issues/9377 * https://github.com/neondatabase/neon/issues/9376 Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-11-18 20:24:03 +00:00
Alex Chi Z.	e5c89f3da3	feat(pageserver): drop disposable keys during gc-compaction (#9765 ) close https://github.com/neondatabase/neon/issues/9552, close https://github.com/neondatabase/neon/issues/8920, part of https://github.com/neondatabase/neon/issues/9114 ## Summary of changes * Drop keys not belonging to this shard during gc-compaction to avoid constructing history that might have been truncated during shard compaction. * Run gc-compaction at the end of shard compaction test. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-11-18 19:27:52 +00:00
Alexey Kondratov	5f0e9c9a94	feat(compute/tests): Report successful replication test runs as well (#9787 ) It should increase the visibility of whether they run and pass.	2024-11-18 16:05:09 +00:00
Alexander Bayandin	44f33b2bd6	Bump default Postgres version for tests to v17 (#9777 ) ## Problem Tests that are marked with `run_only_on_default_postgres` do not run on debug builds on CI because we run debug builds only for the latest Postgres version (which is 17) ## Summary of changes - Bump `PgVersion.DEFAULT` to `v17` - Skip `test_timeline_archival_chaos` in debug builds	2024-11-18 15:06:24 +00:00
Alexander Bayandin	913b5b7027	CI: remove separate check-build-tools-image workflow (#9708 ) ## Problem We call `check-build-tools-image` twice for each workflow whenever we use it, along with `build-build-tools-image`, once as a workflow itself, and the second time from `build-build-tools-image`. This is not necessary. ## Summary of changes - Inline `check-build-tools-image` into `build-build-tools-image` - Remove separate `check-build-tools-image` workflow	2024-11-18 13:14:28 +00:00
John Spray	3f401a328f	tests: mitigate bug to stabilize test_storage_controller_many_tenants (#9771 ) ## Problem Due to #9471 , the scale test occasionally gets 404s while trying to modify the config of a timeline that belongs to a tenant being migrated. We rarely see this narrow race in the field, but the test is quite good at reproducing it. ## Summary of changes - Ignore 404 errors in this test.	2024-11-18 11:33:27 +00:00
Peter Bendel	c3eecf6763	adapt pgvector bench to minor version upgrades of PostgreSql (#9784 ) ## Problem pgvector benchmark is failing because after PostgreSQL minor version upgrade previous version packages are no longer available in deb repository [example failure](https://github.com/neondatabase/neon/actions/runs/11875503070/job/33092787149#step:4:40) ## Summary of changes Update postgres minor version of packages to current version [Example run after this change](https://github.com/neondatabase/neon/actions/runs/11888978279/job/33124614605)	2024-11-18 10:47:43 +00:00
Konstantin Knizhnik	6fa9b0cd8c	Use DATA_DIR instead of current working directory in restore_from_wal script (#9729 ) ## Problem See https://github.com/neondatabase/neon/issues/7750 test_wal_restore.sh copies a file to the current working directory, which can cause interference between test_wal_restore.py tests spawned for different configurations. ## Summary of changes Copy file to $DATA_DIR Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-11-18 11:55:38 +02:00
a-masterov	10bc1903e1	Fix the regression test running against the staging instance (#9773 ) ## Problem The Postgres version was updated. The patch has to be updated accordingly. ## Summary of changes The patch of the regression test was updated.	2024-11-18 10:30:50 +01:00
John Spray	261d065e6f	pageserver: respect no_sync in `VirtualFile` (#9772 ) ## Problem `no_sync` initially just skipped syncfs on startup (#9677). I'm also interested in flaky tests that time out during pageserver shutdown while flushing l0s, so to eliminate disk throughput as a source of issues there, ## Summary of changes - Drive-by change for test timeouts: add a couple more ::info logs during pageserver startup so it's obvious which part got stuck. - Add a SyncMode enum to configure VirtualFile and respect it in sync_all and sync_data functions - During pageserver startup, set SyncMode according to `no_sync`	2024-11-18 08:59:05 +00:00
Christian Schwarz	b6154b03f4	build(deps): bump smallvec to 1.13.2 to get UB fix (#9781 ) Smallvec 1.13.2 contains [an UB fix](https://github.com/servo/rust-smallvec/pull/345). Upstream opened [a request](https://github.com/rustsec/advisory-db/issues/1960) for this in the advisory-db but it never got acted upon. Found while working on https://github.com/neondatabase/neon/pull/9321.	2024-11-17 21:25:16 +01:00
Erik Grinaker	8880134171	Cargo.toml: upgrade tikv-jemallocator to 0.6.0 (#9779 )	2024-11-17 19:52:05 +01:00
Erik Grinaker	de7e4a34ca	safekeeper: send `AppendResponse` on segment flush (#9692 ) ## Problem When processing pipelined `AppendRequest`s, we explicitly flush the WAL every second and return an `AppendResponse`. However, the WAL is also implicitly flushed on segment bounds, but this does not result in an `AppendResponse`. Because of this, concurrent transactions may take up to 1 second to commit and writes may take up to 1 second before sending to the pageserver. ## Summary of changes Advance `flush_lsn` when a WAL segment is closed and flushed, and emit an `AppendResponse`. To accommodate this, track the `flush_lsn` in addition to the `flush_record_lsn`.	2024-11-17 18:19:14 +01:00
Vlad Lazar	ac689ab014	wal_decoder: rename end_lsn to next_record_lsn (#9776 ) ## Problem It turns out that `WalStreamDecoder::poll_decode` returns the start LSN of the next record and not the end LSN of the current record. They are not always equal. For example, they're not equal when the record in question is an XLOG SWITCH record. ## Summary of changes Rename things to reflect that.	2024-11-15 21:53:11 +00:00
Tristan Partin	23eabb9919	Fix PG_MAJORVERSION_NUM typo In `ea32f1d0a3`, Matthias added a feature to our extension to expose more granular wait events. However, due to the typo, those wait events were never registered, so we used the more generic wait events instead. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-15 15:17:23 -06:00
Vlad Lazar	2af791ba83	wal_decoder: make InterpretedWalRecord serde (#9775 ) ## Problem We want to serialize interpreted records to send them over the wire from safekeeper to pageserver. ## Summary of changes Make `InterpretedWalRecord` ser/de. This is a temporary change to get the bulk of the lift merged in https://github.com/neondatabase/neon/pull/9746. For going to prod, we don't want to use bincode since we can't evolve the schema. Questions on serialization will be tackled separately.	2024-11-15 20:34:48 +00:00
Mikhail Kot	e12628fe93	Collect max_connections metric (#9770 ) This will further allow us to expose this metric to users	2024-11-15 17:42:41 +00:00
Arpad Müller	7880c246f1	Correct mistakes in offloaded timeline retain_lsn management (#9760 ) PR #9308 has modified tenant activation code to take offloaded child timelines into account for populating the list of `retain_lsn` values. However, there are more places than just tenant activation where one needs to update the `retain_lsn`s. This PR fixes some bugs of the current code that could lead to corruption in the worst case: 1. Deleting of an offloaded timeline would not get its `retain_lsn` purged from its parent. With the patch we now do it, but as the parent can be offloaded as well, the situation is a bit trickier than for non-offloaded timelines which can just keep a pointer to their parent. Here we can't keep a pointer because the parent might get offloaded, then unoffloaded again, creating a dangling pointer situation. Keeping a pointer to the tenant is not good either, because we might drop the offloaded timeline in a context where an `offloaded_timelines` lock is already held: so we don't want to acquire a lock in the drop code of OffloadedTimeline. 2. Unoffloading a timeline would not get its `retain_lsn` values populated, leading to it maybe garbage collecting values that its children might need. We now call `initialize_gc_info` on the parent. 3. Offloading of a timeline would not get its `retain_lsn` values registered as offloaded at the parent. So if we drop the `Timeline` object, and its registration is removed, the parent would not have any of the child's `retain_lsn`s around. Also, before, the `Timeline` object would delete anything related to its timeline ID, now it only deletes `retain_lsn`s that have `MaybeOffloaded::No` set. Incorporates Chi's reproducer from #9753. cc https://github.com/neondatabase/cloud/issues/20199 The `test_timeline_retain_lsn` test is extended: 1. it gains a new dimension, duplicating each mode, to either have the "main" branch be the direct parent of the timeline we archive, or the "test_archived_parent" branch intermediary, creating a three timeline structure. This doesn't test anything fixed by this PR in particular, just explores the vast space of possible configurations a little bit more. 2. it gains two new modes, `offload-parent`, which tests the second point, and `offload-no-restart` which tests the third point. It's easy to verify the test actually is "sharp" by removing one of the respective `self.initialize_gc_info()`, `gc_info.insert_child()` or `ancestor_children.push()`. Part of #8088 --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Alex Chi Z <chi@neon.tech>	2024-11-15 14:22:29 +01:00
John Spray	04938d9d55	tests: tolerate pageserver 500s in test_timeline_archival_chaos (#9769 ) ## Problem Test exposes cases where pageserver gives 500 responses, causing failures like https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9766/11844529470/index.html#suites/d1acc79950edeb0563fc86236c620898/3546be2ffed99ba6 ## Summary of changes - Tolerate such messages, and link an issue for cleaning up the pageserver not to return such 500s.	2024-11-15 13:22:05 +00:00
Erik Grinaker	19f7d40c1d	deny.toml: allow CDDL-1.0 license (#9766 ) #9764, which adds profiling support to Safekeeper, pulls in the dependency [`inferno`](https://crates.io/crates/inferno) via [`pprof-rs`](https://crates.io/crates/pprof). This is licenced under the [Common Development and Distribution License 1.0](https://spdx.org/licenses/CDDL-1.0.html), which is not allowed by `cargo-deny`. This patch allows the CDDL-1.0 license. It is a derivative of the Mozilla Public License, which we already allow, but avoids some issues around European copyright law that the MPL had. As such, I don't expect this to be problematic.	2024-11-15 10:41:43 +00:00
John Spray	38563de7dd	storcon: exclude non-Active tenants from shard autosplitting (#9743 ) ## Problem We didn't have a neat way to prevent auto-splitting of tenants. This could be useful during incidents or for testing. Closes https://github.com/neondatabase/neon/issues/9332 ## Summary of changes - Filter splitting candidates by scheduling policy	2024-11-14 19:41:10 +00:00
John Spray	93939f123f	tests: add test_timeline_archival_chaos (#9609 ) ## Problem - We lack test coverage of cases where multiple timelines fight for updates to the same manifest (https://github.com/neondatabase/neon/pull/9557), and in timeline archival changes while dual-attached (https://github.com/neondatabase/neon/pull/9555) ## Summary of changes - Add a chaos test for timeline creation->archival->offload->deletion	2024-11-14 17:31:35 +00:00
Tristan Partin	49b599c113	Remove the replication slot in test_snap_files at the end of the test Analysis of the LR benchmarking tests indicates that during test_subscriber_lag, a leftover 'slotter' replication slot can lead to retained WAL growing on the publisher. This replication slot is not used by any subscriber. The only purpose of the slot is to generate snapshot files for the purpose of test_snap_files. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-14 10:59:15 -06:00
Yuchen Liang	8cde37bc0b	test: disable test_readonly_node_gc until proper fix (#9755 ) ## Problem After investigation, we think to make `test_readonly_node_gc` less flaky, we need to make a proper fix (likely involving persisting part of the lease state). See https://github.com/neondatabase/neon/issues/9754 for details. ## Summary of changes - skip the test until proper fix. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-11-14 15:26:58 +00:00
Konstantin Knizhnik	f70611c8df	Correctly truncate VM (#9342 ) ## Problem https://github.com/neondatabase/neon/issues/9240 ## Summary of changes Correctly truncate the VM page instead of just replacing it with a zero page. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-11-14 17:19:13 +02:00
Vlad Lazar	21282aa113	cargo: use neon branch of rust-postgres (#9757 ) ## Problem We are pinning our fork of rust-postgres to a commit hash and that prevents us from making further changes to it. The latest commit in rust-postgres requires https://github.com/neondatabase/neon/pull/8747, but that seems to have gone stale. I reverted rust-postgres `neon` branch to the pinned commit in https://github.com/neondatabase/rust-postgres/pull/31. ## Summary of changes Switch back to using the `neon` branch of the rust-postgres fork.	2024-11-14 15:16:43 +00:00
Arseny Sher	d06bf4b0fe	safekeeper: fix atomicity of WAL truncation (#9685 ) If WAL truncation fails in the middle it might leave some data on disk above the write/flush LSN. In theory, concatenated with previous records it might form bogus WAL (though very unlikely in practice because CRC would protect from that). To protect from that, set a pending_wal_truncation flag, meaning that before any WAL writes, truncation must be retried until it succeeds. We already did that in case of safekeeper restart, now extend this mechanism for failures without restart. Also, importantly, reset LSNs at the beginning of the operation, not at the end, because once on-disk deletion starts, the previous pointers are wrong. All this most likely hasn't created any problems in practice because CRC protects from the consequences. Tests for this are hard; simulation infrastructure might be useful here in the future, but not yet.	2024-11-14 13:06:42 +03:00
Tristan Partin	1280b708f1	Improve error handling for NeonAPI fixture Move error handling to the common request function and add a debug log. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-13 20:35:48 -06:00
John Spray	b4e00b8b22	pageserver: refuse to load tenants with suspiciously old indices in old generations (#9719 ) ## Problem Historically, if a control component passed a pageserver "generation: 1" this could be a quick way to corrupt a tenant by loading a historic index. Follows https://github.com/neondatabase/neon/pull/9383 Closes #6951 ## Summary of changes - Introduce a Fatal variant to DownloadError, to enable index downloads to signal when they have encountered a scary enough situation that we shouldn't proceed to load the tenant. - Handle this variant by putting the tenant into a broken state (no matter which timeline within the tenant reported it) - Add a test for this case In the event that this behavior fires when we don't want it to, we have ways to intervene: - "Touch" an affected index to update its mtime (download+upload S3 object) - If this behavior is triggered, it indicates we're attaching in some old generation, so we should be able to fix that by manually bumping generation numbers in the storage controller database (this should never happen, but it's an option if it does)	2024-11-13 18:07:39 +00:00
Heikki Linnakangas	10aaa3677d	PostgreSQL minor version updates (17.1, 16.5, 15.9, 14.14) (#9727 ) This includes a patch to temporarily disable one test in the pg_anon test suite. It is an upstream issue, the test started failing with the new PostgreSQL minor versions because of a change in the default timezone used in tests. We don't want to block the release for this, so just disable the test for now. See `199f0a392b (note_2148017485)` Corresponding postgres repository PRs: https://github.com/neondatabase/postgres/pull/524 https://github.com/neondatabase/postgres/pull/525 https://github.com/neondatabase/postgres/pull/526 https://github.com/neondatabase/postgres/pull/527	2024-11-13 15:08:58 +02:00
Heikki Linnakangas	d5435b1a81	tests: Increase timeout in test_create_churn_during_restart (#9736 ) This test was seen to be flaky, e.g. at: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9457/11804246485/index.html#suites/ec4311502db344eee91f1354e9dc839b/982bd121ea698414/. If I _reduce_ the timeout from 10s to 8s on my laptop, it reliably hits that timeout and fails. That suggests that the test is pretty close to the edge even when it passes. Let's bump up the timeout to 30 s to make it more robust. See also https://github.com/neondatabase/neon/issues/9730, although the error message is different there.	2024-11-13 12:20:32 +02:00
Anastasia Lubennikova	080d585b22	Add installed_extensions prometheus metric (#9608 ) and add /metrics endpoint to compute_ctl to expose such metrics metric format example for extension pg_rag with versions 1.2.3 and 1.4.2 installed in 3 and 1 databases respectively: neon_extensions_installed{extension="pg_rag", version="1.2.3"} = 3 neon_extensions_installed{extension="pg_rag", version="1.4.2"} = 1 ------ infra part: https://github.com/neondatabase/flux-fleet/pull/251 --------- Co-authored-by: Tristan Partin <tristan@neon.tech>	2024-11-13 09:36:48 +00:00
John Spray	7595d3afe6	pageserver: add `no_sync` for use in regression tests (2/2) (#9678 ) ## Problem Followup to https://github.com/neondatabase/neon/pull/9677 which enables `no_sync` in tests. This can be merged once the next release has happened. ## Summary of changes - Always run pageserver with `no_sync = true` in tests.	2024-11-13 09:17:26 +00:00
Konstantin Knizhnik	1ff5333a1b	Do not wallog AUX files at replica (#9457 ) ## Problem Attempting to persist LR stuff at a replica causes a `cannot make new WAL entries during recovery` error. See https://neondb.slack.com/archives/C07S7RBFVRA/p1729280401283389 ## Summary of changes Do not wallog AUX files at replica. Related Postgres PRs: https://github.com/neondatabase/postgres/pull/517 https://github.com/neondatabase/postgres/pull/516 https://github.com/neondatabase/postgres/pull/515 https://github.com/neondatabase/postgres/pull/514 ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-11-13 08:50:01 +02:00
Tristan Partin	d8f5d43549	Fix autocommit footguns in performance tests psycopg2 has the following warning related to autocommit: > By default, any query execution, including a simple SELECT will start > a transaction: for long-running programs, if no further action is > taken, the session will remain “idle in transaction”, an undesirable > condition for several reasons (locks are held by the session, tables > bloat…). For long lived scripts, either ensure to terminate a > transaction as soon as possible or use an autocommit connection. In the 2.9 release notes, psycopg2 also made the following change: > `with connection` starts a transaction on autocommit transactions too Some of these connections are indeed long-lived, so we were retaining tons of WAL on the endpoints because we had a transaction pinned in the past. Link: https://www.psycopg.org/docs/news.html#what-s-new-in-psycopg-2-9 Link: https://github.com/psycopg/psycopg2/issues/941 Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-12 15:48:19 -06:00
Erik Grinaker	2256a5727a	safekeeper: use `WAL_SEGMENT_SIZE` for empty timeline state (#9734 ) ## Problem `TimelinePersistentState::empty()`, used for tests and benchmarks, had a hardcoded 16 MB WAL segment size. This caused confusion when attempting to change the global segment size. ## Summary of changes Inherit from `WAL_SEGMENT_SIZE` in `TimelinePersistentState::empty()`.	2024-11-12 20:35:44 +00:00
Tristan Partin	3f80af8b1d	Add neon.logical_replication_max_logicalsnapdir_size This GUC will drop replication slots if the size of the pg_logical/snapshots directory (not including temp snapshot files) becomes larger than the specified size. Keeping the size of this directory smaller will help with basebackup size from the pageserver. Part-of: https://github.com/neondatabase/neon/issues/8619 Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-12 13:13:28 -06:00
Tristan Partin	a61d81bbc7	Calculate compute_backpressure_throttling_seconds correctly The original value that we get is measured in microseconds. It comes from a calculation using Postgres' GetCurrentTimestamp(), which is implemented in terms of gettimeofday(2). Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-12 13:12:08 -06:00
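
Commit 4fc3af15dd in the list above changes the timeline destructor so that it removes at most one `retain_lsn` entry from the parent, where it previously deleted everything related to its timeline ID. The following is a minimal Python sketch of why that difference matters. It uses a simplified toy model, not the pageserver's actual data structures, and the names `ParentGcInfo`, `register_child`, `remove_all_for`, `remove_one_for`, and `entries_left_after_stale_drop` are hypothetical.

```python
class ParentGcInfo:
    """Toy stand-in for a parent timeline's retain_lsn entries (hypothetical)."""

    def __init__(self):
        # Each entry is (lsn, child_id). In this model, every in-memory
        # object for a child timeline registers its own entry.
        self.retain_lsns = []

    def register_child(self, lsn, child_id):
        self.retain_lsns.append((lsn, child_id))

    def remove_all_for(self, child_id):
        # Cleanup that drops every entry belonging to the child.
        self.retain_lsns = [e for e in self.retain_lsns if e[1] != child_id]

    def remove_one_for(self, lsn, child_id):
        # Cleanup in the spirit of the fix: drop at most one matching entry.
        try:
            self.retain_lsns.remove((lsn, child_id))
        except ValueError:
            pass  # nothing registered, so nothing to remove


def entries_left_after_stale_drop(remove_one):
    parent = ParentGcInfo()
    parent.register_child(100, "child")  # object A, created before offload
    parent.register_child(100, "child")  # object B, created on unoffload
    # The long-running task holding object A exits and A's destructor runs.
    if remove_one:
        parent.remove_one_for(100, "child")
    else:
        parent.remove_all_for("child")
    return parent.retain_lsns


print(entries_left_after_stale_drop(remove_one=False))
print(entries_left_after_stale_drop(remove_one=True))
```

In this model, the remove-all cleanup leaves the parent with no entry for a child that still exists, so a later GC pass would not know to retain data at that LSN. With the at-most-one cleanup, the entry registered for object B survives the drop of object A.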

6572 Commits