rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-15 09:22:55 +00:00

Author	SHA1	Message	Date
Vlad Lazar	d0aaec2abb	storage_controller: create imported timelines on safekeepers (#11801 ) ## Problem SK timeline creations were skipped for imported timelines since we didn't know the correct start LSN of the timeline at that point. ## Summary of changes Created imported timelines on the SK as part of the import finalize step. We use the last record LSN of shard 0 as the start LSN for the safekeeper timeline. Closes https://github.com/neondatabase/neon/issues/11569	2025-05-09 10:55:26 +00:00
Conrad Ludgate	5cd7f936f9	fix(neon-rls): optimistically assume role grants are already assigned for replicas (#11811 ) ## Problem Read replicas cannot grant permissions for roles for Neon RLS. Usually the permission is already granted, so we can optimistically check. See INC-509 ## Summary of changes Perform a permission lookup prior to actually executing any grants.	2025-05-09 07:48:30 +00:00
Mark Novikov	d22377c754	Skip event triggers in dump-restore (#11794 ) ## Problem Data import fails if the src db has any event triggers, because those can only be restored by a superuser. Specifically imports from Heroku and Supabase are guaranteed to fail. Closes https://github.com/neondatabase/cloud/issues/27353 ## Summary of changes Depends on `pg_dump` patches per each supported PostgreSQL version: - https://github.com/neondatabase/postgres/pull/630 - https://github.com/neondatabase/postgres/pull/629 - https://github.com/neondatabase/postgres/pull/627 - https://github.com/neondatabase/postgres/pull/628	2025-05-08 11:04:28 +00:00
Dmitrii Kovalkov	7e55497e13	tests: flush wal before waiting for last record lsn (#11726 ) ## Problem Compute may flush WAL on page boundaries, leaving some records partially flushed for a long time. It leads to `wait_for_last_flush_lsn` stuck waiting for this partial LSN. - Closes: https://github.com/neondatabase/cloud/issues/27876 ## Summary of changes - Flush WAL via CHECKPOINT after requesting current_wal_lsn to make sure that the record we point to is flushed in full - Use proper endpoint in `test_timeline_detach_with_aux_files_with_detach_v1`	2025-05-08 10:00:45 +00:00
Shockingly Good	4d2e4b19c3	fix(compute) Correct the PGXN s3 gateway URL. (#11796 ) Corrects the postgres extension s3 gateway address to be not just a domain name but a full base URL. To make the code more readable, the option is renamed to "remote_ext_base_url", while keeping the old name also accessible by providing a clap argument alias. Also provides a very simple and, perhaps, even redundant unit test to confirm the logic behind parsing of the corresponding CLI argument. ## Problem As it is clearly stated in https://github.com/neondatabase/cloud/issues/26005, using of the short version of the domain name might work for now, but in the future, we should get rid of using the `default` namespace and this is where it will, most likely, break down. ## Summary of changes The changes adjust the domain name of the extension s3 gateway to use the proper base url format instead of the just domain name assuming the "default" namespace and add a new CLI argument name for to reflect the change and the expectance.	2025-05-07 16:34:08 +00:00
Alexey Kondratov	0691b73f53	fix(compute): Enforce cloud_admin role in compute_ctl connections (#11827 ) ## Problem Users can override some configuration parameters on the DB level with `ALTER DATABASE ... SET ...`. Some of these overrides, like `role` or `default_transaction_read_only`, affect `compute_ctl`'s ability to configure the DB schema properly. ## Summary of changes Enforce `role=cloud_admin`, `statement_timeout=0`, and move `default_transaction_read_only=off` override from control plane [1] to `compute_ctl`. Also, enforce `search_path=public` just in case, although we do not call any functions in user databases. [1]: `133dd8c4db/goapp/controlplane/internal/pkg/compute/provisioner/provisioner_common.go (L70)` Fixes https://github.com/neondatabase/cloud/issues/28532	2025-05-07 12:14:24 +00:00
Tristan Partin	0ef6851219	Make the audience claim in compute JWTs a vector (#11845 ) According to RFC 7519, `aud` is generally an array of StringOrURI, but in special cases may be a single StringOrURI value. To accomodate future control plane work where a single token may work for multiple services, make the claim a vector. Link: https://www.rfc-editor.org/rfc/rfc7519#section-4.1.3 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-05-06 22:19:15 +00:00
Mikhail	5c356c63eb	endpoint_storage compute_ctl integration (#11550 ) Add `/lfc/(prewarm\|offload)` routes to `compute_ctl` which interact with endpoint storage. Add `prewarm_lfc_on_startup` spec option which, if enabled, downloads LFC prewarm data on compute startup. Resolves: https://github.com/neondatabase/cloud/issues/26343	2025-05-06 22:02:12 +00:00
Tristan Partin	f9b3a2e059	Add scoping to compute_ctl JWT claims (#11639 ) Currently we only have an admin scope which allows a user to bypass the compute_id check. When the admin scope is provided, validate the audience of the JWT to be "compute". Closes: https://github.com/neondatabase/cloud/issues/27614 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-05-06 19:51:10 +00:00
Erik Grinaker	0e0ad073bf	storcon: fix split aborts removing other tenants (#11837 ) ## Problem When aborting a split, the code accidentally removes all other tenant shards from the in-memory map that have the same shard count as the aborted split, causing "tenant not found" errors. It will recover on a storcon restart, when it loads the persisted state. This issue has been present for at least a year. Resolves https://github.com/neondatabase/cloud/issues/28589. ## Summary of changes Only remove shards belonging to the relevant tenant when aborting a split. Also adds a regression test.	2025-05-06 13:57:34 +00:00
Konstantin Knizhnik	f0e7b3e0ef	Use unlogged build for gist_indexsortbuild_flush_ready_pages (#11753 ) ## Problem See https://github.com/neondatabase/neon/issues/11718 GIST index can be constructed in two ways: GIST_SORTED_BUILD and GIST_BUFFERING. We used unlogged build in the second case but not in the first. ## Summary of changes Use unlogged build in `gist_indexsortbuild_flush_ready_pages` Correspondent Postgres PRsL: https://github.com/neondatabase/postgres/pull/624 https://github.com/neondatabase/postgres/pull/625 https://github.com/neondatabase/postgres/pull/626 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-05-06 07:24:27 +00:00
Alex Chi Z.	0b243242df	fix(test): allow flush error in gc-compaction tests (#11822 ) ## Problem Part of https://github.com/neondatabase/neon/issues/11762 ## Summary of changes While #11762 needs some work to refactor the error propagating thing, we can do a hacky fix for the gc-compaction tests to allow flush error during shutdown. It does not affect correctness. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-05 12:15:22 +00:00
Alex Chi Z.	bbc35e10b8	fix(test): increase timeouts for some tests (#11781 ) ## Problem Those tests are timing out more frequently after https://github.com/neondatabase/neon/pull/11585 ## Summary of changes Increase timeout for `test_pageserver_gc_compaction_smoke` Increase rollback wait timeout for `test_tx_abort_with_many_relations` Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-01 18:36:26 +00:00
Alex Chi Z.	ae2c3ac12f	test: revert relsizev2 config (#11759 ) ## Problem part of https://github.com/neondatabase/neon/issues/9516 One thing I realized in the past few months is that "no-way-back" things like this are scary to roll out without a fine-grained rollout infra. The plan was to flip the flag in the repo and roll it out soon, but I don't think rolling out would happen in the near future. So I'd rather revert the flag to avoid creating a discrepancy between staging and the regress tests. ## Summary of changes Not using rel_size_v2 by default in unit tests; we still have a few tests to explicitly test the new format so we still get some test coverages. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-01 17:51:10 +00:00
devin-ai-integration[bot]	1d06172d59	pageserver: remove resident size from billing metrics (#11699 ) This is a rebase of PR #10739 by @henryliu2014 on the current main branch. ## Problem pageserver: remove resident size from billing metrics Fixes #10388 ## Summary of changes The following changes have been made to remove resident size from billing metrics: * removed the metric "resident_size" and related codes in consumption_metrics/metrics.rs * removed the item of the description of metric "resident_size" in consumption_metrics.md * refactored the metric "resident_size" related test case Requested by: John Spray (john@neon.tech) --------- Co-authored-by: liuheqing <hq.liu@qq.com> Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: John Spray <john@neon.tech>	2025-04-29 18:34:56 +00:00
Konstantin Knizhnik	3593356c10	Prewarm sql api (#11742 ) ## Problem Continue work on prewarm, see https://github.com/neondatabase/neon/pull/11740 https://github.com/neondatabase/neon/pull/11741 ## Summary of changes Add SQL API to prewarm --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-29 06:44:28 +00:00
Tristan Partin	9e8ab2ab4f	Skip remote extensions WITH_LIB test when sanitizers are enabled (#11758 ) In order for the test to work when sanitizers are enabled, we would need to compile the dummy Postgres extension with the same sanitizer flags that we compile Postgres and the neon extension with. Doing this work would be a little more than trivial, so skipping is the best option, at least for now. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-28 19:13:35 +00:00
John Spray	0482690534	pageserver: make control_plane_api & generations fully mandatory (#10715 ) ## Problem We had retained the ability to run in a generation-less mode to support test_generations_upgrade, which was replaced with a cleaner backward compat test in https://github.com/neondatabase/neon/pull/10701 ## Summary of changes - Remove all the special cases for "if no generation" or "if no control plane api" - Make control_plane_api config mandatory --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-04-28 17:24:55 +00:00
Dmitrii Kovalkov	ef53a76434	storage_broker: https handler (#11603 ) ## Problem Broker supports only HTTP, no HTTPS - Closes: https://github.com/neondatabase/cloud/issues/27492 ## Summary of changes - Add `listen_https_addr`, `ssl_key_file`, `ssl_cert_file`, `ssl_cert_reload_period` arguments to storage broker - Make `listen_addr` argument optional - Listen https in storage broker - Support https for storage broker request in neon_local - Add `use_https_storage_broker_api` option to NeonEnvBuilder	2025-04-25 14:28:56 +00:00
Tristan Partin	2526f6aea1	Add remote extension test with library component (#11301 ) The current test was just SQL files only, but we also want to test a remote extension which includes a loadable library. With both extensions we should cover a larger portion of compute_ctl's remote extension code paths. Fixes: https://github.com/neondatabase/neon/issues/11146 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-24 22:33:46 +00:00
Vlad Lazar	5ba7315c84	storage_controller: reconcile completed imports at start-up (#11614 ) ## Problem In https://github.com/neondatabase/neon/pull/11345 coordination of imports moved to the storage controller. It involves notifying cplane when the import has been completed by calling an idempotent endpoint. If the storage controller shuts down in the middle of finalizing an import, it would never be retried. ## Summary of changes Reconcile imports at start-up by fetching the complete imports from the database and spawning a background task which notifies cplane. Closes: https://github.com/neondatabase/neon/issues/11570	2025-04-24 18:39:19 +00:00
Vlad Lazar	6f7e3c18e4	storage_controller: make leadership protocol more robust (#11703 ) ## Problem We saw the following scenario in staging: 1. Pod A starts up. Becomes leader and steps down the previous pod cleanly. 2. Pod B starts up (deployment). 3. Step down request from pod B to pod A times out. Pod A did not manage to stop its reconciliations within 10 seconds and exited with return code 1 ([code](`7ba8519b43/storage_controller/src/service.rs (L8686-L8702)`)). 4. Pod B marks itself as the leader and finishes start-up 5. k8s restarts pod A 6. k8s marks pod B as ready 7. pod A sends step down request to pod A - this succeeds => pod A is now the leader 8. k8s kills pod A because it thinks pod B is healthy and pod A is part of the old replica set We end up in a situation where the only pod we have (B) is stepped down and attempts to forward requests to a leader that doesn't exist. k8s can't detect that pod B is in a bad state since the /status endpoint simply returns 200 hundred if the pod is running. ## Summary of changes This PR includes a number of robustness improvements to the leadership protocol: * use a single step down task per controller * add a new endpoint to be used as k8s liveness probe and check leadership status there * handle restarts explicitly (i.e. don't step yourself down) * increase the step down retry count * don't kill the process on long step down since k8s will just restart it	2025-04-24 16:59:56 +00:00
Alexey Kondratov	985056be37	feat(compute): Introduce Postgres downtime metrics (#11346 ) ## Problem Currently, we only report the timestamp of the last moment we think Postgres was active. The problem is that if Postgres gets completely unresponsive, we still report some old timestamp, and it's impossible to distinguish situations 'Postgres is effectively down' and 'Postgres is running, but no client activity'. ## Summary of changes Refactor the `compute_ctl`'s compute monitor so that it was easier to track the connection errors and failed activity checks, and report - `now() - last_successful_check` as current downtime on any failure - cumulative Postgres downtime during the whole compute lifetime After adding a test, I also noticed that the compute monitor may not reconnect even though queries fail with `connection closed` or `error communicating with the server: Connection reset by peer (os error 54)`, but for some reason we do not catch it with `client.is_closed()`, so I added an explicit reconnect in case of any failures. Discussion: https://neondb.slack.com/archives/C03TN5G758R/p1742489426966639	2025-04-24 13:51:09 +00:00
Vlad Lazar	3a50d95b6d	storage_controller: coordinate imports across shards in the storage controller (#11345 ) ## Problem Pageservers notify control plane directly when a shard import has completed. Control plane has to download the status of each shard from S3 and figure out if everything is truly done, before proceeding with branch activation. Issues with this approach are: * We can't control shard split behaviour on the storage controller side. It's unsafe to split during import. * Control plane needs to know about shards and implement logic to check all timelines are indeed ready. ## Summary of changes In short, storage controller coordinates imports, and, only when everything is done, notifies control plane. Big rocks: 1. Store timeline imports in the storage controller database. Each import stores the status of its shards in the database. We hook into the timeline creation call as our entry point for this. 2. Pageservers get a new upcall endpoint to notify the storage controller of shard import updates. 3. Storage controller handles these updates by updating persisted state. If an update finalizes the import, then poll pageservers until timeline activation, and, then, notify the control plane that the import is complete. Cplane side change with new endpoint is in https://github.com/neondatabase/cloud/pull/26166 Closes https://github.com/neondatabase/neon/issues/11566	2025-04-24 11:26:06 +00:00
Mikhail Kot	c3534cea39	Rename object_storage->endpoint_storage (#11678 ) 1. Rename service to avoid ambiguity as discussed in Slack 2. Ignore endpoint_id in read paths as requested in https://github.com/neondatabase/cloud/issues/26346#issuecomment-2806758224	2025-04-23 14:03:19 +00:00
Tristan Partin	b00db536bb	Add CPU architecture to the remote extensions object key (#11590 ) ARM computes are incoming and we need to account for that in remote extensions. Previously, we just blindly assumed that all computes were x86_64. Note that we use the Go architecture naming convention instead of the Rust one directly to do our best and be consistent across the stack. Part-of: https://github.com/neondatabase/cloud/issues/23148 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-22 22:47:22 +00:00
Arpad Müller	149cbd1e0a	Support single and two safekeeper scenarios (#11483 ) In tests and when one safekeeper is down in small regions, we need to contend with one or two safekeepers. Before, we gave an error in `safekeepers_for_new_timeline`. Now we just silently allow the timeline to be created on one or two safekeepers. Part of #9011	2025-04-22 21:27:01 +00:00
Alexander Lakhin	7b949daf13	fix(test): allow reconcile errors in test_storage_controller_heartbeats (#11665 ) ## Problem test_storage_controller_heartbeats is flaky because of unallowed reconciler errors (#11625) ## Summary of changes Allow reconcile errors as in other tests in test_storage_controller.py.	2025-04-22 18:13:16 +00:00
Arpad Müller	c1e4befd56	Additional fixes and improvements to storcon safekeeper timelines (#11477 ) This delivers some additional fixes and improvements to storcon managed safekeeper timelines: * use `i32::MAX` for the generation number of timeline deletion * start the generation for new timelines at 1 instead of 0: this ensures that the other components actually are generation enabled * fix database operations we use for metrics * use join in list_pending_ops to prevent the classical ORM issue where one does many db queries * use enums in `test_storcon_create_delete_sk_down`. we are adding a second parameter, and having two bool parameters is weird. * extend `test_storcon_create_delete_sk_down` with a test of whole tenant deletion. this hasn't been tested before. * remove some redundant logging contexts * Don't require mutable access to the service lock for scheduling pending ops in memory. In order to pull this off, create reconcilers eagerly. The advantage is that we don't need mutable access to the service lock that way any more. Part of #9011 --------- Co-authored-by: Arseny Sher <sher-ars@yandex.ru>	2025-04-17 20:25:30 +00:00
Alex Chi Z.	ad0c5fdae7	fix(test): allow stale generation warnings in storcon (#11624 ) ## Problem https://github.com/neondatabase/neon/pull/11531 did not fully fix the problem because the warning is part of the storcon instead of pageserver. ## Summary of changes Allow stale generation error in storcon. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-17 16:12:24 +00:00
Christian Schwarz	2b041964b3	cover direct IO + concurrent IO in unit, regression & perf tests (#11585 ) This mirrors the production config. Thread that discusses the merits of this: - https://neondb.slack.com/archives/C033RQ5SPDH/p1744742010740569 # Refs - context https://neondb.slack.com/archives/C04BLQ4LW7K/p1744724844844589?thread_ts=1744705831.014169&cid=C04BLQ4LW7K - prep for https://github.com/neondatabase/neon/pull/11558 which adds new io mode `direct-rw` # Impact on CI turnaround time Spot-checking impact on CI timings - Baseline: [some recent main commit](https://github.com/neondatabase/neon/actions/runs/14471549758/job/40587837475) - Comparison: [this commit](https://github.com/neondatabase/neon/actions/runs/14471945087/job/40589613274) in this PR here Impact on CI turnaround time - Regression tests: - x64: very minor, sometimes better; likely in the noise - arm64: substantial 30min => 40min - Benchmarks (x86 only I think): very minor; noise seems higher than regress tests --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Alex Chi Z. <4198311+skyzh@users.noreply.github.com> Co-authored-by: Peter Bendel <peterbendel@neon.tech> Co-authored-by: Alex Chi Z <chi@neon.tech>	2025-04-17 15:53:10 +00:00
Alexander Bayandin	07c2411f6b	tests: remove mentions of ALLOW_*_COMPATIBILITY_BREAKAGE (#11618 ) ## Problem There are mentions of `ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` and `ALLOW_FORWARD_COMPATIBILITY_BREAKAGE`, but in reality, this mechanism doesn't work, so let's remove it to avoid confusion. The idea behind it was to allow some breaking changes by adding a special label to a PR that would `xfail` the test. However, in practice, this means we would need to carry this label through all subsequent PRs until the release (and artifact regeneration). This approach isn't really viable, as it increases the risk of missing a compatibility break in another PR. ## Summary of changes - Remove mentions and handling of `ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` / `ALLOW_FORWARD_COMPATIBILITY_BREAKAGE`	2025-04-17 10:03:21 +00:00
Konstantin Knizhnik	b7548de814	Disable autovacuum and increase limit for WS approximation (#11583 ) ## Problem Test lfc working set approximation becomes flaky after recent changes in prefetch. May be it is caused by updating HLL in `lfc_write`, may be by some other reasons. ## Summary of changes 1. Disable autovacuum in this test (as possible source of extra page accesses). 2. Increase upper boundary for WS approximation from 12 to 20. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-17 05:07:45 +00:00
Tristan Partin	9794f386f4	Make Postgres 17 the default version (#11619 ) This is mostly a documentation update, but a few updates with regard to neon_local, pageserver, and tests. 17 is our default for users in production, so dropping references to 16 makes sense. Signed-off-by: Tristan Partin <tristan@neon.tech> Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-16 23:23:37 +00:00
Vlad Lazar	0e00faf528	tests: stability fixes for `test_migration_to_cold_secondary` (#11606 ) 1. Compute may generate WAL on shutdown. The test assumes that after shutdown, no further ingest happens. Tweak the compute shutdown to make the assumption true. 2. Assertion of local layer count post cold migration is not right since we may have downloaded layers due to ingest. Remove it. Closes https://github.com/neondatabase/neon/issues/11587	2025-04-16 16:31:23 +00:00
Erik Grinaker	00eeff9b8d	pageserver: add `compaction_shard_ancestor` to disable shard ancestor compaction (#11608 ) ## Problem Splits of large tenants (several TB) can cause a huge amount of shard ancestor compaction work, which can overload Pageservers. Touches https://github.com/neondatabase/cloud/issues/22532. ## Summary of changes Add a setting `compaction_shard_ancestor` (default `true`) to disable shard ancestor compaction on a per-tenant basis.	2025-04-16 14:41:02 +00:00
Alex Chi Z.	aa19f10e7e	fix(test): allow shutdown warning in preempt tests (#11600 ) ## Problem test_gc_compaction_preempt is still flaky ## Summary of changes - allow shutdown warning logs Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-15 21:50:28 +00:00
Alex Chi Z.	057ce115de	fix(test): allow stale generation errors (1/2) (#11531 ) ## Problem Part of https://github.com/neondatabase/neon/issues/11486 ## Summary of changes 50% of the test instability of `test_create_churn_during_restart` are due to error message gets changed. Allow the new error message. Still need to fix other errors due to failure to acquire semaphore in this or the next patch. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-14 14:51:17 +00:00
Vlad Lazar	a338984dc7	pageserver: support keys at different LSNs in one get page batch (#11494 ) ## Problem Get page batching stops when we encounter requests at different LSNs. We are leaving batching factor on the table. ## Summary of changes The goal is to support keys with different LSNs in a single batch and still serve them with a single vectored get. Important restriction: the same key at different LSNs is not supported in one batch. Returning different key versions is a much more intrusive change. Firstly, the read path is changed to support "scattered" queries. This is a conceptually simple step from https://github.com/neondatabase/neon/pull/11463. Instead of initializing the fringe for one keyspace, we do it for multiple at different LSNs and let the logic already present into the fringe handle selection. Secondly, page service code is updated to support batching at different LSNs. Eeach request parsed from the wire determines its effective request LSN and keeps it in mem for the batcher toinspect. The batcher allows keys at different LSNs in one batch as long one key is not requested at different LSNs. I'd suggest doing the first pass commit by commit to get a feel for the changes. ## Results I used the batching test from [Christian's PR](https://github.com/neondatabase/neon/pull/11391) which increases the change of batch breaks. Looking at the logs I think the new code is at the max batching factor for the workload (we only break batches due to them being oversized or because the executor is idle). ``` Main: Reasons for stopping batching: {'LSN changed': 22843, 'of batch size': 33417} test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 14.6662 My branch: Reasons for stopping batching: {'of batch size': 37024} test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 19.8333 ``` Related: https://github.com/neondatabase/neon/issues/10765	2025-04-14 09:05:29 +00:00
Konstantin Knizhnik	8936a7abd8	Increase limit for worker processes for isolation test (#11504 ) ## Problem See https://github.com/neondatabase/neon/issues/10652 Neon extension launches 2 BGW which reduce limit for parallel workers and so affecting parallel_deadlock isolation test. ## Summary of changes Increase `max_worker_processes` from default 8 to 16 for isolation test. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-12 18:09:12 +00:00
Alex Chi Z.	4f7b2cdd4f	feat(pageserver): gc-compaction result verification (#11515 ) ## Problem Part of #9114 There was a debug-mode verification mode that verifies at every retain_lsn. However, the code was tangled within the actual history generation itself and it's hard to reason about correctness. This patch adds a separate post-verification of the gc-compaction result that redos logs at every retain_lsn and every record above the GC horizon. This ensures that all key history we produce with gc-compaction is readable, and if there're read errors after gc-compaction, it can only be read-path errors instead of gc-compaction bugs. ## Summary of changes * Add gc_compaction_verification flag, default to true. * Implement a post-verification process. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-11 15:50:29 +00:00
Alex Chi Z.	66f56ddaec	fix(pageserver): allow shutdown errors for gc compaction tests (#11530 ) ## Problem `test_pageserver_compaction_preempt` is flaky. ## Summary of changes Allow the shutdown errors. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-11 15:20:51 +00:00
Tristan Partin	ff5a527167	Consolidate compute_ctl configuration structures (#11514 ) Previously, the structure of the spec file was just the compute spec. However, the response from the control plane get spec request included the compute spec and the compute_ctl config. This divergence was hindering other work such as adding regression tests for compute_ctl HTTP authorization. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-11 15:06:29 +00:00
Christian Schwarz	8884865bca	tests: make `test_pageserver_getpage_throttle` less flaky (#11482 ) # Refs - fixes https://github.com/neondatabase/neon/issues/11395 # Problem Since 2025-03-10, we have observed increased flakiness of `test_pageserver_getpage_throttle`. The test is timing-dependent by nature, and was hitting the ``` assert duration_secs >= 10 * actual_smgr_query_seconds, ( "smgr metrics should not include throttle wait time" ) ``` quite frequently. # Analysis These failures are not reproducible. In this PR's history is a commit that reran the test 100 times without requiring a single retry. In https://github.com/neondatabase/neon/issues/11395 there is a link to a query to the test results database. It shows that the flakiness was not constant, but rather episodic: 2025-03-{10,11,12,13} 2025-03-{19,20,21} 2025-03-31 and 2025-04-01. To me, this suggests variability in available CPU. # Solution The point of the offending assertion is to ensure that most of the request latency is spent on throttling, because testing of the throttling mechanism is the point of the test. The `10` magic number means at most 10% of mean latency may be spent on request processing. Ideally we would control the passage of time (virtual clock source) to make this test deterministic. But I don't see that happening in our regression test setup. So, this PR de-flakes the test as follows: - allot up to 66% of mean latency for request processing - increase duration from 10s to 20s, hoping to get better protection from momentary CPU spikes in noisy neighbor tests or VMs on the runner host As a drive-by, switch to `pytest.approx` and remove one self-test assertion I can't make sense of anymore.	2025-04-11 09:38:05 +00:00
Dmitrii Kovalkov	4c4e33bc2e	storage: add http/https server and cert resover metrics (#11450 ) ## Problem We need to export some metrics about certs/connections to configure alerts and make sure that all HTTP requests are gone before turning https-only mode on. - Closes: https://github.com/neondatabase/cloud/issues/25526 ## Summary of changes - Add started connection and connection error metrics to http/https Server. - Add certificate expiration time and reload metrics to ReloadingCertificateResolver.	2025-04-11 06:11:35 +00:00
John Spray	52dee408dc	storage controller: improve safety of shard splits coinciding with controller restarts (#11412 ) ## Problem The graceful leadership transfer process involves calling step_down on the old controller, but this was not waiting for shard splits to complete, and the new controller could therefore end up trying to abort a shard split while it was still going on. We mitigated this already in #11256 by avoiding the case where shard split completion would update the database incorrectly, but this was a fragile fix because it assumes that is the only problematic part of the split running concurrently. Precursors: - #11290 - #11256 Closes: #11254 ## Summary of changes - Hold the reconciler gate from shard splits, so that step_down will wait for them. Splits should always be fairly prompt, so it is okay to wait here. - Defense in depth: if step_down times out (hardcoded 10 second limit), then fully terminate the controller process rather than letting it continue running, potentially doing split-brainy things. This makes sense because the new controller will always declare itself leader unilaterally if step_down fails, so leaving an old controller running is not beneficial. - Tests: extend `test_storage_controller_leadership_transfer_during_split` to separately exercise the case of a split holding up step_down, and the case where the overall timeout on step_down is hit and the controller terminates.	2025-04-10 16:55:37 +00:00
Christian Schwarz	2e35f23085	tests: remove ignored `fair` field (#11521 ) Pageserver has been ignoring field `tenant_config.timeline_get_throttle.fair` for many monhts, since we removed it from the config struct in neondatabase/neon#8539. Refs - epic https://github.com/neondatabase/cloud/issues/27320	2025-04-10 14:24:30 +00:00
Erik Grinaker	0122d97f95	test_runner: only use last gen in `test_location_conf_churn` (#11511 ) ## Problem `test_location_conf_churn` performs random location updates on Pageservers. While doing this, it could instruct the compute to connect to a stale generation and execute queries. This is invalid, and will fail if a newer generation has removed layer files used by the stale generation. Resolves #11348. ## Summary of changes Only connect to the latest generation when executing queries.	2025-04-10 10:07:16 +00:00
Arseny Sher	fae7528adb	walproposer: make it aware of membership (#11407 ) ## Problem Walproposer should get elected and commit WAL on safekeepers specified by the membership configuration. ## Summary of changes - Add to wp `members_safekeepers` and `new_members_safekeepers` arrays mapping configuration members to connection slots. Establish this mapping (by node id) when safekeeper sends greeting, giving its id and when mconf becomes known / changes. - Add to TermsCollected, VotesCollected, GetAcknowledgedByQuorumWALPosition membership aware logic. Currently it partially duplicates existing one, but we'll drop the latter eventually. - In python, rename Configuration to MembershipConfiguration for clarity. - Add test_quorum_sanity testing new logic. ref https://github.com/neondatabase/neon/issues/10851	2025-04-10 09:55:37 +00:00
Alex Chi Z.	405a17bf0b	fix(pageserver): ensure gc-compaction gets preempted by L0 (#11512 ) ## Problem Part of #9114 ## Summary of changes Gc-compaction flag was not correctly set, causing it not getting preempted by L0. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-09 20:57:50 +00:00

1 2 3 4 5 ...

1295 Commits