# Problem
We leave too few observability breadcrumbs in the case where wait_lsn is
exceptionally slow.
# Changes
- refactor: extract the monitoring logic out of `log_slow` into
`monitor_slow_future`
- add global + per-timeline counter for time spent waiting for wait_lsn
- It is updated while we're still waiting, similar to what we do for
page_service response flush.
- add a per-timeline counter pair for started & finished wait_lsn counts
- add slow-logging to leave breadcrumbs in logs, not just metrics
For the slow-logging, we need to avoid flooding the logs during a
broker or network outage/blip.
The solution is a "log-streak-level" concurrency limit per timeline.
At any given time, there is at most one slow wait_lsn that is logging
the "still running" and "completed" sequence of logs.
Other concurrent slow wait_lsn's don't log at all.
This leaves at least one breadcrumb in each timeline's logs if some
wait_lsn was exceptionally slow during a given period.
The full degree of slowness can then be determined by looking at the
per-timeline metric.
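For illustration, the concurrency limit boils down to a single-permit
semaphore per timeline that is only consulted once a wait has already been
deemed slow. A minimal sketch (assuming a tokio `Semaphore` and an
illustrative 10s threshold; the names are not the actual ones):
```rust
use std::{future::Future, pin::pin, sync::Arc, time::Duration};

use tokio::sync::Semaphore;

/// One permit per timeline: at most one slow wait_lsn logs its streak.
struct WaitLsnLogGate {
    permit: Arc<Semaphore>, // constructed with Semaphore::new(1)
}

async fn wait_lsn_instrumented<F: Future>(gate: &WaitLsnLogGate, fut: F) -> F::Output {
    let threshold = Duration::from_secs(10); // illustrative slowness threshold
    let mut fut = pin!(fut);

    // Fast path: the semaphore is never touched unless the wait is already slow.
    if let Ok(output) = tokio::time::timeout(threshold, &mut fut).await {
        return output;
    }

    // Slow path: whoever wins the single permit logs the streak; the rest stay silent.
    let logging = gate.permit.clone().try_acquire_owned().ok();
    if logging.is_some() {
        tracing::warn!("wait_lsn still running after {threshold:?}");
    }
    let output = fut.await;
    if logging.is_some() {
        tracing::warn!("slow wait_lsn completed");
    }
    output // the permit, if held, is released here
}
```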
# Performance
Reran the `bench_log_slow` benchmark: no difference, so existing call
sites are fine.
We do use a Semaphore, but we only `try_acquire` it _after_ things have
already been determined to be slow. So, no baseline overhead is
anticipated.
# Refs
-
https://github.com/neondatabase/cloud/issues/23486#issuecomment-2711587222
# Fix metric_unit length in test_compute_ctl_api.py
## Description
This PR changes the metric_unit from "microseconds" to "μs" in
test_compute_ctl_api.py to fix the issue where perf test results were
not being stored in the database due to the string exceeding the 10
character limit of the metric_unit column in the perf_test_results
table.
## Problem
As reported in Slack, the perf test results were not being uploaded to
the database because the "microseconds" string (12 characters) exceeds
the 10 character limit of the metric_unit column in the
perf_test_results table.
## Solution
Replace "microseconds" with "μs" in all metric_unit parameters in the
test_compute_ctl_api.py file.
## Testing
The changes have been committed and pushed. The PR is ready for review.
Link to Devin run:
https://app.devin.ai/sessions/e29edd672bd34114b059915820e8a853
Requested by: Peter Bendel
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: peterbendel@neon.tech <peterbendel@neon.tech>
## Problem
close https://github.com/neondatabase/neon/issues/10310
## Summary of changes
This patch adds a new behavior to the detach_ancestor API: detach with a
multi-level ancestor and no reparenting. Though we could potentially
support multi-level + reparenting / single-level + no-reparenting in the
future, that's not required for the recovery/snapshot epic, so I'd prefer
to keep things simple for now and only handle the old behavior and the
new one instead of supporting the full feature matrix.
I only added a test case for successful detaching instead of testing
failures. I'd like to get this into staging and add more tests in the
future.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
When a node becomes active, we query its locations and update the
observed state in-place.
This can race with the observed state updates done when processing
reconcile results.
## Summary of changes
The argument for this reconciliation step is that it reduces the need
for background reconciliations.
I don't think that's actually true anymore. There are two cases.
1. Restart of a node after drain. Usually the node does not go through
the offline state here, so observed locations were not marked as none.
In any case, there should be at most a handful of shards on the node
since we've just drained it.
2. Node comes back online after failure or network partition. When the
node is marked offline, we reschedule everything away from it. When it
later becomes active, the previous observed location is extraneous and
requires a reconciliation anyway.
Closes https://github.com/neondatabase/neon/issues/11148
## Problem
Test `test_timeline_archive` is flaky because it makes requests that are
intended to fail. This sometimes leads to warnings in the pageserver's
logs. More details are in the issue.
- Closes: https://github.com/neondatabase/neon/issues/11177
## Summary of changes
- Suppress such errors.
Previously, remote extensions were not fetched unless they were used in
some other manner. For instance, loading a BM25 index in pg_search
fetches the pg_search extension. However, if, on a fresh compute with
pg_search 0.15.5 installed, the user ran `ALTER EXTENSION pg_search
UPDATE TO '0.15.6'` without first using the pg_search extension, we
would not fetch the extension and would fail to find an update path.
Signed-off-by: Tristan Partin <tristan@neon.tech>
We were previously only revoking privileges granted by neon_superuser.
However, we need to do it for all grantors.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
This is part of the disaster recovery tooling, and partly for
https://github.com/neondatabase/neon/issues/9114.
## Summary of changes
* Add a new pageserver API to force patch the fields in index_part and
modify the timeline internal structures.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The storage controller uses plain HTTP for requests to the safekeeper
management API.
Closes: https://github.com/neondatabase/cloud/issues/24835
## Summary of changes
- Add a `use_https_safekeeper_api` option to storcon to use the https API
- Use https for requests to the safekeeper management API if this option
is enabled
- Add an `ssl_ca_file` option to storcon to allow specifying a custom
root CA certificate
## Problem
The current migration API does a live migration, but if the destination
doesn't already have a secondary, that live migration is unlikely to be
able to warm up a tenant properly within its timeout (full warmup of a
big tenant can take tens of minutes).
Background optimisation code knows how to do this gracefully by creating
a secondary first, but we don't currently give a human a way to trigger
that.
Closes: https://github.com/neondatabase/neon/issues/10540
## Summary of changes
- Add a `preferred_node` parameter to TenantShard, which is respected by
optimize_attachment
- Modify the migration API to have an optional prewarm=true mode, in
which we set preferred_node and call optimize_attachment, rather than
directly modifying the intent state
- Require an override_scheduler=true flag if migrating somewhere that is
a less-than-optimal scheduling location (e.g. wrong AZ)
- Add `origin_node_id` to the migration API so that callers can ensure
they're moving from where they think they're moving from
- Add tests for the above
The storcon_cli wrapper for this has a 'watch' mode that waits for
eventual cutover. It doesn't show the secondary's warmth evolving,
because we don't currently have an API for that in the controller: the
passthrough API only targets attached locations, not secondaries. It
would be straightforward to add later as a dedicated endpoint for
getting secondary status, then extend storcon_cli to consume that and
print a nice progress indicator.
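For a rough picture of the decision flow, here is a sketch; all type,
field, and method names are assumptions made for the sketch, not the
controller's actual code:
```rust
// Illustrative sketch of the prewarm-vs-direct migration decision described above.
struct TenantShard {
    attached_node_id: Option<u64>,
    preferred_node: Option<u64>,
}

impl TenantShard {
    fn is_optimal_location(&self, _node: u64) -> bool { true } // stand-in for scheduler logic
    fn schedule_optimize_attachment(&mut self) {}              // secondary-first graceful path
    fn set_attached_intent(&mut self, node: u64) { self.attached_node_id = Some(node); }
}

struct MigrateRequest {
    dest_node_id: u64,
    origin_node_id: Option<u64>, // caller asserts where they think the shard is attached
    prewarm: bool,               // graceful mode: create a secondary, cut over once warm
    override_scheduler: bool,    // required when the destination is sub-optimal (e.g. wrong AZ)
}

fn migrate(shard: &mut TenantShard, req: &MigrateRequest) -> Result<(), String> {
    if let Some(origin) = req.origin_node_id {
        if shard.attached_node_id != Some(origin) {
            return Err("shard is not attached where the caller expected".into());
        }
    }
    if !shard.is_optimal_location(req.dest_node_id) && !req.override_scheduler {
        return Err("sub-optimal destination; pass override_scheduler=true".into());
    }
    if req.prewarm {
        // Graceful path: record the preference and let optimize_attachment do the rest.
        shard.preferred_node = Some(req.dest_node_id);
        shard.schedule_optimize_attachment();
    } else {
        // Immediate path: modify the intent state directly.
        shard.set_attached_intent(req.dest_node_id);
    }
    Ok(())
}
```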
## Problem
We just had a regression reported at
https://neondb.slack.com/archives/C08EXUJF554/p1741102467515599, which
clearly came with one of the releases. It's not a huge problem yet, but
it's annoying that we cannot quickly attribute it to a specific commit.
## Summary of changes
Add a very simple `compute_ctl` HTTP API benchmark that does 10k
requests to `/status` and `metrics.json` and reports p50 and p99.
---------
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
## Problem
In f37eeb56, I properly escaped the identifier, but I hadn't noticed
that the resulting string is used inside `format('...')`, so it needs
additional escaping. After looking at it more closely, and with Heikki's
and Tristan's help, it turned out to be a full can of worms: we have
problems all over the code in places where we use PL/pgSQL blocks.
## Summary of changes
Add a new `pg_quote_dollar()` helper to deal with it, as dollar-quoting
seems to be the only robust way to escape strings in dynamic PL/pgSQL
blocks. We mimic Postgres' `pg_get_functiondef` logic here [1].
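For illustration, the dollar-quoting idea boils down to picking a
delimiter tag that doesn't occur in the payload. A minimal sketch (not
the actual `pg_quote_dollar()` implementation):
```rust
/// Minimal sketch of the idea: wrap `payload` in a $tag$ ... $tag$ pair, growing
/// the tag until it does not occur anywhere in the payload.
fn quote_dollar(payload: &str) -> String {
    let mut tag = String::from("$q$");
    while payload.contains(&tag) {
        // $q$ -> $qx$ -> $qxx$ -> ... until the delimiter is safe to use.
        tag.insert(tag.len() - 1, 'x');
    }
    format!("{tag}{payload}{tag}")
}

fn main() {
    // The payload may contain quotes, backslashes, or other dollar-quotes;
    // none of them need escaping inside the chosen delimiter pair.
    let body = "SELECT 'it''s $q$ tricky'";
    assert_eq!(quote_dollar(body), "$qx$SELECT 'it''s $q$ tricky'$qx$");
}
```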
While at it, I added more tests and caught a couple more bugs with
string escaping:
1. `get_existing_dbs_async()` was wrapping `owner` in additional
double-quotes if it contained special characters
2. `construct_superuser_query()` was flawed in even more ways than the
rest of the code. It wasn't realistic to fix it quickly, but after
thinking about it more, I realized that we could drop most of it
altogether. IIUC, it was added as some sort of migration, probably back
when we didn't have migrations yet, so all the complicated code was
needed to properly update existing roles and DBs. In the current Neon,
this code only runs before we create the very first DB and role. When we
create roles and DBs, all `neon_superuser` grants are added in different
places. So the worst thing that could happen is that there is an ancient
branch somewhere, and when users poke it, they will notice that not all
Neon features work as expected. Yet, the fix is simple and self-serve --
just create a new role via the UI or API, and it will get a proper
`neon_superuser` grant.
[1]:
8b49392b27/src/backend/utils/adt/ruleutils.c (L3153)

Closes neondatabase/cloud#25048
## Problem
On unarchival, we update the previous heatmap with all visible layers.
When the primary generates a new heatmap it includes all those layers,
so the secondary will download them. Since they're not actually resident
on the primary (we didn't call the warm-up API), they'll never be
evicted, so they remain in the heatmap.
We want these layers in the heatmap, since we might wish to warm-up an
unarchived timeline after a shard migration. However, we don't want them
to be downloaded on the secondary until we've warmed up the primary.
## Summary of Changes
Include these layers in the heatmap and mark them as cold. All heatmap
operations act on non-cold layers, apart from the attached-location
warm-up API, which will download the cold layers. Once the cold layers
are downloaded on the primary, they'll be included in the next heatmap
as hot, and the secondary starts fetching them too.
## Problem
For the pg_regress test, we run both v1 and v2; for all the rest, we
default to v2.
part of https://github.com/neondatabase/neon/issues/9516
## Summary of changes
Use reldir v2 across test cases by default.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Our benchmarking workflows contain links to grafana dashboards to
troubleshoot problems.
This works fine for non-pooled endpoints.
For pooled endpoints we need to remove the `-pooler` suffix from the
endpoint's hostname to get a valid endpoint ID.
Example link that doesn't work in this run
https://github.com/neondatabase/neon/actions/runs/13678933253/job/38246028316#step:8:311
## Summary of changes
Check whether the connection string is a `-pooler` connection string
and, if so, remove this suffix from the endpoint ID.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
We realized that we may use this metric for more 'live' info about
extension installations, compared to the installed-extensions metric,
which is currently only updated at compute start.
## Summary of changes
Add `filename` label to `compute_ctl_remote_ext_requests_total`. Note
that it contains the raw archive name with `.tar.zst` at the end, so the
consumer may need to strip this suffix.
Closes https://github.com/neondatabase/cloud/issues/24694
## Problem
`test_check_visibility_map` is the slowest test in CI, and can cause
timeouts under particularly slow configurations (`debug` and
`without-lfc`).
## Summary of changes
* Reduce the `pgbench` scale factor from 10 to 8.
* Omit a redundant vacuum during `pgbench` init.
* Remove a final `vacuum freeze` + `pg_check_visible` pass, which has
questionable value (we've already done a vacuum freeze previously, and
we don't flush the compute cache before checking anyway).
## Problem
On unarchival, we update the previous heatmap with all visible layers.
When the primary generates a new heatmap it includes all those layers,
so the secondary will download them. Since they're not actually resident
on the primary (we didn't call the warm up API), they'll never be
evicted, so they remain in the heatmap.
This leads to oversized secondary locations like we saw in pre-prod.
## Summary of changes
Gate the loading of the previous heatmaps and the heatmap generation on
unarchival behind configuration flags. They are disabled by default, but
enabled in tests.
## Problem
part of https://github.com/neondatabase/neon/issues/11067
My observation is that with the current settings, x86-v1 usually takes
30s, arm-v1 1m30s, x86-v2 1m, arm-v2 3m. But sometimes the system can
run too slowly and cause the test to time out on arm with reldir v2.
While I investigate what's going on and further improve the performance,
I'd like to set both of them to use the same test input, so that the
test doesn't time out and we don't abuse this test case as a performance
test.
## Summary of changes
Use the same settings for both test cases.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Incoming requests often take the service lock, and sometimes even do
database transactions. That creates a risk that a rogue client can
starve the controller of the ability to do its primary job of
reconciling tenants to an available state.
## Summary of changes
* Use the `governor` crate to rate limit tenant requests at 10 requests
per second. This is ~10-100x lower than the worst "attack" we've seen
from a client bug. Admin APIs are not rate limited.
* Add a `storage_controller_http_request_rate_limited` histogram for
rate limited requests.
* Log a warning every 10 seconds for rate limited tenants.
The rate limiter is parametrized on TenantId, because the kinds of
client bug we're protecting against generally happen within tenant
scope, and the rates should be somewhat stable: we expect the global
rate of requests to increase as we do more work, but we do not expect
the rate of requests to one tenant to increase.
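A minimal sketch of the keyed limiter (using the `governor` crate's keyed
API; the key type and how rejected requests are handled here are
placeholders, not the controller's actual code):
```rust
use std::num::NonZeroU32;

use governor::{Quota, RateLimiter};

// Stand-in key type; the real limiter is keyed on TenantId (anything Hash + Eq + Clone).
type TenantId = String;

fn main() {
    // 10 requests per second per tenant, as described above.
    let quota = Quota::per_second(NonZeroU32::new(10).unwrap());
    // Keyed limiter: rate-limit state is tracked independently per tenant.
    let limiter = RateLimiter::keyed(quota);

    let tenant: TenantId = "3aa8fcc61f6a".into();
    for i in 0..12 {
        // `check_key` consumes one cell for this tenant, or reports that the
        // request should be treated as rate limited (reject it, or make it wait).
        let allowed = limiter.check_key(&tenant).is_ok();
        println!("request {i}: allowed = {allowed}");
    }
}
```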
---------
Co-authored-by: John Spray <john@neon.tech>
## Problem
part of https://github.com/neondatabase/neon/issues/9516
## Summary of changes
Similar to the aux v2 migration, we persist the relv2 migration status
into index_part, so that even if the config item is set to false, we
will still read from the v2 storage to avoid data loss.
Note that only the two variants `None` and
`Some(RelSizeMigration::Migrating)` are used for now. We don't have the
full migration implemented, so it will never be set to
`RelSizeMigration::Migrated`.
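Roughly, the read-path decision is as sketched below (the enum name
follows the PR; the helper and its arguments are illustrative):
```rust
/// Persisted in index_part alongside the rest of the timeline metadata.
#[derive(Clone, Copy, PartialEq)]
enum RelSizeMigration {
    Migrating, // the only variant written for now
    Migrated,  // reserved for when the full migration is implemented
}

/// Illustrative helper: once the persisted status says the migration has started
/// (or finished), keep reading from v2 storage even if the config item is later
/// flipped back to false, so no data is lost.
fn read_from_v2(config_enables_v2: bool, persisted: Option<RelSizeMigration>) -> bool {
    match persisted {
        Some(RelSizeMigration::Migrating) | Some(RelSizeMigration::Migrated) => true,
        None => config_enables_v2,
    }
}
```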
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
If a caller times out on safekeeper timeline deletion of a large
timeline, and waits a while before retrying, the deletion will not
progress while the caller is waiting to retry. The net effect is very,
very slow deletion, as it only proceeds in 30-second bursts across
5-minute idle periods.
Related: https://github.com/neondatabase/neon/issues/10265
## Summary of changes
- Run remote deletion in a background task
- Carry a watch::Receiver on the Timeline for other callers to join the
wait
- Restart deletion if the API is called again and the previous attempt
failed
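Together, these changes amount to roughly the following shape (a minimal
sketch using `tokio::sync::watch`; the names, states, and the absence of
locking around the restart check are simplifications, not the actual
safekeeper code):
```rust
use std::sync::Arc;

use tokio::sync::watch;

#[derive(Clone, Debug, PartialEq)]
enum DeletionState {
    NotStarted,
    InProgress,
    Failed(String),
    Done,
}

struct Timeline {
    // Kept on the Timeline so every deletion caller can join the same wait.
    deletion_tx: watch::Sender<DeletionState>,
    deletion_rx: watch::Receiver<DeletionState>,
}

async fn delete_timeline(tl: Arc<Timeline>) -> DeletionState {
    // (Re)start the background work if it isn't already running; this also covers
    // the "previous attempt failed, API called again" case. Real code would guard
    // this check-and-start with a lock.
    let restart = matches!(
        *tl.deletion_rx.borrow(),
        DeletionState::NotStarted | DeletionState::Failed(_)
    );
    if restart {
        tl.deletion_tx.send_replace(DeletionState::InProgress);
        let tl2 = Arc::clone(&tl);
        tokio::spawn(async move {
            // Stand-in for the long-running remote deletion.
            let new_state = match remote_delete().await {
                Ok(()) => DeletionState::Done,
                Err(e) => DeletionState::Failed(e),
            };
            let _ = tl2.deletion_tx.send(new_state);
        });
    }

    // Every caller (including retries) just joins the wait on the shared channel;
    // the background task keeps making progress even if a caller gives up.
    let mut rx = tl.deletion_rx.clone();
    loop {
        let state = (*rx.borrow_and_update()).clone();
        if matches!(state, DeletionState::Done | DeletionState::Failed(_)) {
            return state;
        }
        if rx.changed().await.is_err() {
            return (*rx.borrow()).clone();
        }
    }
}

async fn remote_delete() -> Result<(), String> {
    Ok(()) // placeholder
}
```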
## Problem
We want to support larger tenants (in terms of logical database size,
number of transactions per second, etc.) and should increase our test
coverage of OLTP transactions at larger scale.
## Summary of changes
Start a new benchmark that over time will add more OLTP tests at larger
scale.
This PR covers the first version and will be extended in further PRs.
Also fix some infrastructure:
- the default for new connections and large tenants is to use the
pgbouncer connection pooler; however, our fixture always added
`statement_timeout=120`, which is not compatible with the pooler
[see](https://neon.tech/docs/connect/connection-errors#unsupported-startup-parameter)
- the action to create a branch timed out after 10 seconds and 10
retries, but for large tenants it can take longer, so use increasing
back-off for retries
## Test run
https://github.com/neondatabase/neon/actions/runs/13593446706
## Problem
Preparation for https://github.com/neondatabase/neon/issues/10851
## Summary of changes
Add a walproposer `safekeepers_generations` field, which can be set by
prefixing the `neon.safekeepers` GUC with `g#n:`. A non-zero value (n)
forces walproposer to use generations. In particular, this also disables
implicit timeline creation, as the timeline will be created by storcon.
Add a test checking this. Also add missing infra: a
`--safekeepers-generation` flag to `neon_local endpoint start`, plus a
fix for the `--start-timeout` flag: it existed but its value wasn't
used.
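For illustration only, parsing the prefix could look like the sketch
below (the real parsing lives in the walproposer C code, and the
safekeeper list format shown is just an example):
```rust
/// Sketch: the `neon.safekeepers` value may be prefixed with `g#<n>:` to force
/// generation-aware mode, e.g. "g#3:<safekeeper list>". Returns the generation
/// (if non-zero) and the remaining safekeeper list.
fn parse_safekeepers_guc(value: &str) -> (Option<u32>, &str) {
    if let Some(rest) = value.strip_prefix("g#") {
        if let Some((generation, safekeepers)) = rest.split_once(':') {
            if let Ok(generation) = generation.parse::<u32>() {
                // A non-zero generation forces walproposer to use generations.
                return (Some(generation).filter(|g| *g != 0), safekeepers);
            }
        }
    }
    (None, value)
}

fn main() {
    assert_eq!(parse_safekeepers_guc("sk1:6500,sk2:6500"), (None, "sk1:6500,sk2:6500"));
    assert_eq!(parse_safekeepers_guc("g#3:sk1:6500,sk2:6500"), (Some(3), "sk1:6500,sk2:6500"));
}
```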
## Problem
For multi-character keys, the GIN index creates a CRC hash of the first
3 bytes of the key. The hash can have its first bit set or unset, so a
consistent representation of `char` across architectures is needed for
consistent results. GIN stores these keys by their hashes, which
determines the order in which the keys are returned from the GIN index.
By default, `char` is signed on x86 and unsigned on arm, leading to
inconsistent behavior across platform architectures. Adding the
`-fsigned-char` flag to the GCC compiler forces `char` to be treated as
signed on all platforms, ensuring that the ordering in which the keys
are returned is consistent.
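To make the signedness issue concrete (using Rust's explicit `i8`/`u8`
as stand-ins for signed/unsigned `char`; the actual problem is in the C
code GIN is compiled from):
```rust
fn main() {
    // A multi-byte UTF-8 key starts with bytes >= 0x80, e.g. 'é' = 0xC3 0xA9.
    let byte: u8 = 0xC3;

    // Promoting a signed char sign-extends; an unsigned char zero-extends.
    let as_signed_char = byte as i8 as i32; // -61  (signed char, x86 default)
    let as_unsigned_char = byte as i32;     // 195  (unsigned char, arm default)

    // The CRC hash input (and hence the key ordering in the GIN index) differs.
    assert_ne!(as_signed_char, as_unsigned_char);
    println!("signed: {as_signed_char}, unsigned: {as_unsigned_char}");
}
```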
## Summary of changes
Added `-fsigned-char` to the `CFLAGS` to force GCC to use signed chars
across platforms.
Added a test to check this across platforms.
Fixes: https://github.com/neondatabase/cloud/issues/23199
## Problem
JWT tokens aren't in place yet, so all SK heartbeats fail. This is
effectively a wait before applying the PS heartbeats, and it makes
things more flaky.
## Summary of Changes
Add a flag that skips loading SKs from the db on start-up and at
runtime.
https://github.com/neondatabase/cloud/issues/23008
For TLS between proxy and compute, we are using an internally
provisioned CA to sign the compute certificates. This change ensures
that proxy will load them from a supplied env var pointing to the
correct file - this file and env var will be configured later, using a
kubernetes secret.
Control plane responds with a `server_name` field if and only if the
compute uses TLS. This server name is the name we use to validate the
certificate. Control plane still sends us the IP to connect to as well
(to support overlay IP).
To support this change, I had to split `host` and `host_addr` into
separate fields. We use `host_addr` and bypass `lookup_addr` if possible
(which is what happens in production); `host` is then only used for the
TLS connection.
There's no blocker to merging this. The code paths will not be triggered
until the new control plane is deployed and the `enableTLS` compute flag
is enabled on a project.
## Problem
We intend for cplane to use the heatmap layer download API to warm up
timelines after unarchival. It's tricky for them to recurse into the
ancestors, and the current implementation doesn't work well when
unarchiving a chain of branches and warming them up.
## Summary of changes
* Add a `recurse` flag to the API. When the flag is set, the operation
recurses into the parent
timeline after the current one is done.
* Be resilient to warming up a chain of unarchived branches. Let's say
we unarchived `B` and `C` from the `A -> B -> C` branch hierarchy. `B`
got unarchived first. We generated the unarchival heatmaps and stashed
them in `A` and `B`. When `C` was unarchived, it dropped its unarchival
heatmap since `A` and `B` already had one. If `C` needed layers from `A`
and `B`, it was out of luck. Now, when choosing whether to keep an
unarchival heatmap, we look at its end LSN. If it's more inclusive than
what we currently have, keep it.
Before this PR, re-attach and validate would log the same warning
```
calling control plane generation validation API failed
```
on retry errors.
This can be confusing.
This PR makes the message generically valid for any upcall and adds
additional tracing spans to capture context.
Along the way, clean up some copy-pasta variable naming.
refs
-
https://github.com/neondatabase/neon/issues/10381#issuecomment-2684755827
---------
Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>
## Problem
So far, cumulative statistics have not been persisted when Neon scales
to zero (suspends the endpoint).
With PR https://github.com/neondatabase/neon/pull/6560 the cumulative
statistics should now survive endpoint restarts and correctly trigger
auto-vacuum and auto-analyze maintenance.
So far we did not have a test case that validates this improvement in
our dev cloud environment with a real project.
## Summary of changes
Introduce test case `test_cumulative_statistics_persistence` in the
benchmarking workflow running daily. It verifies that the cumulative
statistics are correctly persisted across restarts.
Cumulative statistics are important to persist across restarts because
they are used to determine when auto-vacuum and auto-analyze trigger
conditions are met.
The test performs the following steps:
- seed a new project using pgbench
- insert tuples that by themselves are not enough to trigger auto-vacuum
- suspend the endpoint
- resume the endpoint
- insert additional tuples that by themselves are not enough to trigger
auto-vacuum, but in combination with the previous tuples are
- verify that auto-vacuum is triggered by the combination of tuples
inserted before and after endpoint suspension
## Test run
https://github.com/neondatabase/neon/actions/runs/13546879714/job/37860609089#step:6:282
## Problem
Yet another source of flakiness for
https://github.com/neondatabase/neon/issues/10517
## Summary of changes
The test scenario we want to create is that we have an image layer in
index_part and then overwrite it, so we have to ensure it gets persisted
in index_part by doing a force checkpoint.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
https://github.com/neondatabase/neon/pull/10241 added a configuration
switch endpoint, but it didn't delete the timeline if the node was
excluded.
## Summary of changes
Add a separate /exclude API endpoint, which similarly accepts a
membership configuration where the sk is supposed to be excluded. The
implementation deletes the timeline locally.
Some more small related tweaks:
- make the mconf switch API PUT instead of POST, as it is idempotent;
- return 409 if the switch was refused instead of 200 with requested &
current;
- remove the unused was_active flag from the delete response;
- remove the meaningless _force suffix from delete function names;
- reuse the timeline.rs delete_dir function in timelines_global_map
instead of its own copy.
part of https://github.com/neondatabase/neon/issues/9965
## Problem
I see a lot of timeout errors, which indicates that this test is too
slow. It seems that creating relations is fast, but the subsequent
truncation step is slow.
## Summary of changes
Reduce the number of relations for now, and investigate later.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
json_ctrl.rs is an obsolete attempt at tests with fine-grained control
over feeding messages into the safekeeper; it has been superseded by the
desim framework.
## Summary of changes
Drop it.
## Problem
part of https://github.com/neondatabase/neon/issues/9114
## Summary of changes
Add the auto trigger for gc-compaction. It computes two values: L1 size
and L2 size. When L1 size >= initial trigger threshold, we will trigger
an initial gc-compaction. When l1_size / l2_size >=
gc_compaction_ratio_percent, we will trigger the "tiered" gc-compaction.
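Conceptually, the trigger decision looks like the sketch below
(simplified; treating "no L2 yet" as the condition for the initial
trigger is an assumption of the sketch):
```rust
enum GcCompactionKind {
    Initial,
    Tiered,
}

/// Simplified sketch of the auto-trigger described above; the two knobs are
/// the initial trigger threshold and `gc_compaction_ratio_percent`.
fn should_trigger_gc_compaction(
    l1_size: u64,
    l2_size: u64,
    initial_trigger_threshold: u64,
    gc_compaction_ratio_percent: u64,
) -> Option<GcCompactionKind> {
    if l2_size == 0 {
        // No L2 yet: fire the initial gc-compaction once L1 alone is big enough.
        (l1_size >= initial_trigger_threshold).then_some(GcCompactionKind::Initial)
    } else {
        // Afterwards, fire the "tiered" gc-compaction whenever L1 has grown
        // large enough relative to L2 (l1_size / l2_size >= ratio%).
        (l1_size * 100 >= l2_size * gc_compaction_ratio_percent)
            .then_some(GcCompactionKind::Tiered)
    }
}
```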
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We have `test_perf_many_relations` but it only runs on remote clusters,
and we cannot directly modify tenant config. Therefore, I patched one of
the current tests to benchmark relv2 performance.
close https://github.com/neondatabase/neon/issues/9986
## Summary of changes
* Add `v1/v2` selector to `test_tx_abort_with_many_relations`.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
As part of https://github.com/neondatabase/neon/issues/8614 we need to
pass membership configurations between compute and safekeepers.
## Summary of changes
Add version 3 of the protocol carrying membership configurations.
The greeting message on both sides gets the full conf, and other
messages carry only the generation number. Use the protocol bump to
include other accumulated changes:
- stop packing whole structs on the wire as-is;
- make the tag u8 instead of u64;
- send all ints in network order;
- drop proposer_uuid; we can pass it in START_WAL_PUSH and it wasn't
very useful anyway.
Per message changes, apart from mconf:
- ProposerGreeting: tenant / timeline id is now sent as a hex cstring.
Remove the proto version, it is passed outside in START_WAL_PUSH. Remove
the postgres timeline, it is unused. Reorder fields a bit.
- AcceptorGreeting: reorder fields
- VoteResponse: timeline_start_lsn is removed. It can be taken from the
first member of the term history, and later we won't need it at all once
all timelines are explicitly created. The vote itself is u8 instead of
u64.
- ProposerElected: timeline_start_lsn is removed for the same reasons.
- AppendRequest: epoch_start_lsn is removed, it is known from the term
history in ProposerElected.
Both compute and sk are able to talk v2 and v3 to make rollbacks (in
case we need them) easier; the neon.safekeeper_proto_version GUC sets
the client version. The v2 code can be dropped later.
So far an empty conf is passed everywhere; future PRs will handle it.
For testing, add a parameter to some tests choosing the proto version;
we want to test both 2 and 3 until we fully migrate.
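To illustrate just the encoding conventions above (u8 tag, network-order
integers, hex cstring ids), here is a toy serializer; the message layout
itself is invented for the sketch and is not the actual v3 wire format:
```rust
/// Toy example of the v3 encoding conventions only: a one-byte tag, integers in
/// network (big-endian) byte order, and tenant/timeline ids as NUL-terminated
/// hex strings. The field layout is invented for the sketch.
fn encode_toy_greeting(tag: u8, generation: u32, tenant_id_hex: &str, timeline_id_hex: &str) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.push(tag);                                    // tag is u8 now, not u64
    buf.extend_from_slice(&generation.to_be_bytes()); // all ints in network order
    for id in [tenant_id_hex, timeline_id_hex] {
        buf.extend_from_slice(id.as_bytes());         // hex cstring...
        buf.push(0);                                  // ...NUL-terminated
    }
    buf
}

fn main() {
    let msg = encode_toy_greeting(
        1,
        42,
        "0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f",
        "1a1a1a1a1a1a1a1a1a1a1a1a1a1a1a1a",
    );
    assert_eq!(msg[0], 1);
    assert_eq!(&msg[1..5], &42u32.to_be_bytes());
}
```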
ref https://github.com/neondatabase/neon/issues/10326
---------
Co-authored-by: Arthur Petukhovsky <petuhovskiy@yandex.ru>
We've seen some cases in production where a compute doesn't get a
response to a pageserver request for several minutes, or even more. We
haven't found the root cause for that yet, but whatever the reason is,
it seems overly optimistic to think that if the pageserver hasn't
responded for 2 minutes, we'd get a response if we just wait patiently a
little longer. More likely, the pageserver is dead or there's some kind
of a network glitch so that the TCP connection is dead, or at least
stuck for a long time. Either way, it's better to disconnect and
reconnect. I set the default timeout to 2 minutes, which should be
enough for any GetPage request under normal circumstances, even if the
pageserver has to download several layer files from remote storage.
Make the disconnect timeout configurable. Also make the "log interval",
after which we print a message to the log, configurable, so that if you
change the disconnect timeout, you can set the log timeout
correspondingly. The default log interval is still 10 s. The new GUCs
are called "neon.pageserver_response_log_timeout" and
"neon.pageserver_response_disconnect_timeout".
Includes a basic test for the log and disconnect timeouts.
Implements issue #10857
## Problem
https://github.com/neondatabase/neon/pull/10788 introduced an API for
warming up attached locations
by downloading all layers in the heatmap. We intend to use it for
warming up timelines after unarchival too,
but it doesn't work. Any heatmap generated after the unarchival will not
include our timeline, so we've lost
all those layers.
## Summary of changes
Generate a cheeky heatmap on unarchival. It includes all the visible
layers. Use that as the `PreviousHeatmap`, which feeds into actual
heatmap generation.
Closes: https://github.com/neondatabase/neon/issues/10541
## Problem
We failed to detect https://github.com/neondatabase/neon/pull/10845
before merging, because the tests we run with a matrix of component
versions didn't include the ones that did live migrations.
## Summary of changes
- Do a live migration during the storage controller smoke test, since
this is a pretty core piece of functionality
- Apply a compat version matrix to the graceful cluster restart test,
since this is the functionality that we most urgently need to work
across versions to make deploys work.
I expect the first CI run of this to fail, because
https://github.com/neondatabase/neon/pull/10845 isn't merged yet.
There was a race condition between compute_ctl and the collection of a
metric that depends on whether the neon extension had been updated or
not. compute_ctl runs `ALTER EXTENSION neon UPDATE` on compute start in
the postgres database.
Fixes: https://github.com/neondatabase/neon/issues/10932
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Prefetch is performed locally, so different backends can request the
same pages from the PS.
Such duplicated requests increase pageserver load and network traffic.
Making prefetch global seems to be very difficult and undesirable,
because different queries can access chunks at different speeds. Storing
prefetched chunks in the LFC will not completely eliminate duplicates,
but it can minimise such requests.
The problem with storing prefetch results in the LFC is that, in this
case, the page is not protected by a shared buffer lock.
So we will have to perform extra synchronisation on the LFC side.
See:
https://neondb.slack.com/archives/C0875PUD0LC/p1736772890602029?thread_ts=1736762541.116949&cid=C0875PUD0LC
@MMeent's implementation of prewarm:
See https://github.com/neondatabase/neon/pull/10312/
## Summary of changes
Use condition variables to synchronize access to the LFC entry.
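Conceptually, the synchronization looks like the following sketch (in
Rust; the real code is C inside the extension, and the entry layout is
illustrative):
```rust
use std::sync::{Condvar, Mutex};

/// A prefetched page in the LFC is not protected by a shared buffer lock, so the
/// entry itself carries a busy flag plus a condition variable: readers wait until
/// a concurrent prefetch write into the entry has finished.
struct LfcEntry {
    busy: Mutex<bool>, // true while a prefetch result is being written into the entry
    ready: Condvar,
}

impl LfcEntry {
    fn begin_write(&self) {
        let mut busy = self.busy.lock().unwrap();
        while *busy {
            busy = self.ready.wait(busy).unwrap();
        }
        *busy = true;
    }

    fn end_write(&self) {
        *self.busy.lock().unwrap() = false;
        self.ready.notify_all();
    }

    fn read_page(&self) {
        // Wait until no prefetch write holds the entry, then copy the page out.
        let mut busy = self.busy.lock().unwrap();
        while *busy {
            busy = self.ready.wait(busy).unwrap();
        }
        // ... read the cached page while holding the entry lock ...
        drop(busy);
    }
}
```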
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>