rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-06 21:12:55 +00:00

Author	SHA1	Message	Date
Konstantin Knizhnik	d759fcb8bd	Increase wait LFC prewarm timeout (#12174 ) ## Problem See https://github.com/neondatabase/neon/issues/12171 ## Summary of changes Increase LFC prewarm wait timeout to 1 minute Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-06-09 18:01:30 +00:00
Alexander Sarantcev	590301df08	storcon: Introduce deletion tombstones to support flaky node scenario (#12096 ) ## Problem Removed nodes can re-add themselves on restart if not properly tombstoned. We need a mechanism (e.g. soft-delete flag) to prevent this, especially in cases where the node is unreachable. More details there: #12036 ## Summary of changes - Introduced `NodeLifecycle` enum to represent node lifecycle states. - Added a string representation of `NodeLifecycle` to the `nodes` table. - Implemented node removal using a tombstone mechanism. - Introduced `/debug/v1/tombstone*` handlers to manage the tombstone state.	2025-06-06 10:16:55 +00:00
Vlad Lazar	868f194a3b	pageserver: remove handling of vanilla protocol (#12126 ) ## Problem We support two ingest protocols on the pageserver: vanilla and interpreted. Interpreted has been the only protocol in use for a long time. ## Summary of changes * Remove the ingest handling of the vanilla protocol * Remove tenant and pageserver configuration for it * Update all tests that tweaked the ingest protocol ## Compatibility Backward compatibility: * The new pageserver version can read the existing pageserver configuration and it will ignore the unknown field. * When the tenant config is read from the storcon db or from the pageserver disk, the extra field will be ignored. Forward compatiblity: * Both the pageserver config and the tenant config map missing fields to their default value. I'm not aware of any tenant level override that was made for this knob.	2025-06-05 11:43:04 +00:00
Konstantin Knizhnik	9c6c780201	Replica promote (#12090 ) ## Problem This PR is part of larger computes support activity: https://www.notion.so/neondatabase/Larger-computes-114f189e00478080ba01e8651ab7da90 Epic: https://github.com/neondatabase/cloud/issues/19010 In case of planned node restart, we are going to 1. create new read-only replica 2. capture LFC state at primary 3. use this state to prewarm replica 4. stop old primary 5. promote replica to primary Steps 1-3 are currently implemented and support from compute side. This PR provides compute level implementation of replica promotion. Support replica promotion ## Summary of changes Right now replica promotion is done in three steps: 1. Set safekeepers list (now it is empty for replica) 2. Call `pg_promote()` top promote replica 3. Update endpoint setting to that it ids not more treated as replica. May be all this three steps should be done by some function in compute_ctl. But right now this logic is only implement5ed in test. Postgres submodules PRs: https://github.com/neondatabase/postgres/pull/648 https://github.com/neondatabase/postgres/pull/649 https://github.com/neondatabase/postgres/pull/650 https://github.com/neondatabase/postgres/pull/651 --------- Co-authored-by: Matthias van de Meent <matthias@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-06-05 11:27:14 +00:00
Vlad Lazar	b69d103b90	pageserver: make import job max byte range size configurable (#12117 ) ## Problem We want to repro an OOM situation, but large partial reads are required. ## Summary of Changes Make the max partial read size configurable for import jobs.	2025-06-04 10:44:23 +00:00
Vlad Lazar	8a6fc6fd8c	pageserver: hook importing timelines up into disk usage eviction (#12038 ) ## Problem Disk usage eviction isn't sensitive to layers of imported timelines. ## Summary of changes Hook importing timelines up into eviction and add a test for it. I don't think we need any special eviction logic for this. These layers will all be visible and their access time will be their creation time. Hence, we'll remove covered layers first and get to the imported layers if there's still disk pressure.	2025-05-29 13:01:10 +00:00
Vlad Lazar	51639cd6af	pageserver: allow for deletion of importing timelines (#12033 ) ## Problem Importing timelines can't currently be deleted. This is problematic because: 1. Cplane cannot delete failed imports and we leave the timeline behind. 2. The flow does not support user driven cancellation of the import ## Summary of changes On the pageserver: I've taken the path of least resistance, extended `TimelineOrOffloaded` with a new variant and added handling in the right places. I'm open to thoughts here, but I think it turned out better than I was envisioning. On the storage controller: Again, fairly simple business: when a DELETE timeline request is received, we remove the import from the DB and stop any finalization tasks/futures. In order to stop finalizations, we track them in-memory. For each finalizing import, we associate a gate and a cancellation token. Note that we delete the entry from the database before cancelling any finalizations. This is such that a concurrent request can't progress the import into finalize state and race with the deletion. This concern about deleting an import with on-going finalization is theoretical in the near future. We are only going to delete importing timelines after the storage controller reports the failure to cplane. Alas, the design works for user driven cancellation too. Closes https://github.com/neondatabase/neon/issues/11897	2025-05-29 11:13:52 +00:00
Vlad Lazar	9657fbc194	pageserver: add and stabilize import chaos test (#11982 ) ## Problem Test coverage of timeline imports is lacking. ## Summary of changes This PR adds a chaos import test. It runs an import while injecting various chaos events in the environment. All the commits that follow the test fix various issues that were surfaced by it. Closes https://github.com/neondatabase/neon/issues/10191	2025-05-27 09:52:59 +00:00
Erik Grinaker	ec991877f4	pageserver: add gRPC server (#11972 ) ## Problem We want to expose the page service over gRPC, for use with the communicator. Requires #11995. Touches #11728. ## Summary of changes This patch wires up a gRPC server in the Pageserver, using Tonic. It does not yet implement the actual page service. * Adds `listen_grpc_addr` and `grpc_auth_type` config options (disabled by default). * Enables gRPC by default with `neon_local`. * Stub implementation of `page_api.PageService`, returning unimplemented errors. * gRPC reflection service for use with e.g. `grpcurl`. Subsequent PRs will implement the actual page service, including authentication and observability. Notably, TLS support is not yet implemented. Certificate reloading requires us to reimplement the entire Tonic gRPC server.	2025-05-26 08:27:48 +00:00
Peter Bendel	87fc0a0374	periodic pagebench on hetzner runners (#11963 ) ## Problem - Benchmark periodic pagebench had inconsistent benchmarking results even when run with the same commit hash. Hypothesis is this was due to running on dedicated but virtualized EC instance with varying CPU frequency. - the dedicated instance type used for the benchmark is quite "old" and we increasingly get `An error occurred (InsufficientInstanceCapacity) when calling the StartInstances operation (reached max retries: 2): Insufficient capacity.` - periodic pagebench uses a snapshot of pageserver timelines to have the same layer structure in each run and get consistent performance. Re-creating the snapshot was a painful manual process (see https://github.com/neondatabase/cloud/issues/27051 and https://github.com/neondatabase/cloud/issues/27653) ## Summary of changes - Run the periodic pagebench on a custom hetzner GitHub runner with large nvme disk and governor set to defined perf profile - provide a manual dispatch option for the workflow that allows to create a new snapshot - keep the manual dispatch option to specify a commit hash useful for bi-secting regressions - always use the newest created snapshot (S3 bucket uses date suffix in S3 key, example `s3://neon-github-public-dev/performance/pagebench/shared-snapshots-2025-05-17/` - `--ignore` `test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py` in regular benchmarks run for each commit - improve perf copying snapshot by using `cp` subprocess instead of traversing tree in python ## Example runs with code in this PR: - run which creates new snapshot https://github.com/neondatabase/neon/actions/runs/15083408849/job/42402986376#step:19:55 - run which uses latest snapshot - https://github.com/neondatabase/neon/actions/runs/15084907676/job/42406240745#step:11:65	2025-05-23 09:37:19 +00:00
Arpad Müller	136cf1979b	Add metric for number of offloaded timelines (#11976 ) We want to keep track of the number of offloaded timelines. It's a per-tenant shard metric because each shard makes offloading decisions on its own.	2025-05-21 11:28:22 +00:00
Emmanuel Ferdman	81c6a5a796	Migrate to correct logger interface (#11956 ) ## Problem Currently the `logger` library throws annoying deprecation warnings: ```python DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead ``` ## Summary of changes This small PR resolves the annoying deprecation warnings by migrating to `.warning` as suggested. Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>	2025-05-17 21:12:01 +00:00
Christian Schwarz	a7ce323949	benchmarking: extend `test_page_service_batching.py` to cover concurrent IO + batching under random reads (#10466 ) This PR commits the benchmarks I ran to qualify concurrent IO before we released it. Changes: - Add `l0stack` fixture; a reusable abstraction for creating a stack of L0 deltas each of which has 1 Value::Delta per page. - Such a stack of L0 deltas is a good and understandable demo for concurrent IO because to reconstruct any page, $layer_stack_height` Values need to be read. Before concurrent IO, the reads were sequential. With concurrent IO, they are executed concurrently. - So, switch `test_latency` to use the l0stack. - Teach `pagebench`, which is used by `test_latency`, to limit itself to the blocks of the relation created by the l0stack abstraction. - Additional parametrization of `test_latency` over dimensions `ps_io_concurrency,l0_stack_height,queue_depth` - Use better names for the tests to reflect what they do, leave interpretation of the (now quite high-dimensional) results to the reader - `test_{throughput => postgres_seqscan}` - `test_{latency => random_reads}` - Cut down on permutations to those we use in production. Runtime is about 2min. Refs - concurrent IO epic https://github.com/neondatabase/neon/issues/9378 - batching task: fixes https://github.com/neondatabase/neon/issues/9837 --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech>	2025-05-15 17:48:13 +00:00
Vlad Lazar	2621ce2daf	pageserver: checkpoint import progress in the storage controller (#11862 ) ## Problem Timeline imports do not have progress checkpointing. Any time that the tenant is shut-down, all progress is lost and the import restarts from the beginning when the tenant is re-attached. ## Summary of changes This PR adds progress checkpointing. ### Preliminaries The unit of work is a `ChunkProcessingJob`. Each `ChunkProcessingJob` deals with the import for a set of key ranges. The job split is done by using an estimation of how many pages each job will produce. The planning stage must be pure: given a fixed set of contents in the import bucket, it will always yield the same plan. This property is enforced by checking that the hash of the plan is identical when resuming from a checkpoint. The storage controller tracks the progress of each shard in the import in the database in the form of the latest job that has has completed. ### Flow This is the high level flow for the happy path: 1. On the first run of the import task, the import task queries storcon for the progress and sees that none is recorded. 2. Execute the preparatory stage of the import 3. Import jobs start running concurrently in a `FuturesOrdered`. Every time the checkpointing threshold of jobs has been reached, notify the storage controller. 4. Tenant is detached and re-attached 5. Import task starts up again and gets the latest progress checkpoint from the storage controller in the form of a job index. 6. The plan is computed again and we check that the hash matches with the original plan. 7. Jobs are spawned from where the previous import task left off. Note that we will not report progress after the completion of each job, so some jobs might run twice. Closes https://github.com/neondatabase/neon/issues/11568 Closes https://github.com/neondatabase/neon/issues/11664	2025-05-15 13:18:22 +00:00
Conrad Ludgate	a77919f4b2	merge pg-sni-router into proxy (#11882 ) ## Problem We realised that pg-sni-router doesn't need to be separate from proxy. just a separate port. ## Summary of changes Add pg-sni-router config to proxy and expose the service.	2025-05-12 15:48:48 +00:00
Christian Schwarz	79ddc803af	feat(direct IO): runtime alignment validation; support config flag on macOS; default to `DirectRw` (#11868 ) This PR adds a runtime validation mode to check adherence to alignment and size-multiple requirements at the VirtualFile level. This can help prevent alignment bugs from slipping into production because test systems may have more lax requirements than production. (This is not the case today, but it could change in the future). It also allows catching O_DIRECT bugs on systems that don't have O_DIRECT (macOS). Consequently, we can now accept `virtual_file_io_mode={direct,direct-rw}` on macOS now. This has the side benefit of removing some annoying conditional compilation around `IoMode`. A third benefit is that it helped weed out size-multiple requirement violation bugs in how the VirtualFile unit tests exercise read and write APIs. I seized the opportunity to trim these tests down to what actually matters, i.e., exercising of the `OpenFiles` file descriptor cache. Lastly, this PR flips the binary-built-in default to `DirectRw` so that when running Python regress tests and benchmarks without specifying `PAGESERVER_VIRTUAL_FILE_IO_MODE`, one gets the production behavior. Refs - fixes https://github.com/neondatabase/neon/issues/11676	2025-05-10 14:19:52 +00:00
Christian Schwarz	f5070f6aa4	fixup(direct IO): PR #11864 broke test suite parametrization (#11887 ) PR - github.com/neondatabase/neon/pull/11864 committed yesterday rendered the `PAGESERVER_VIRTUAL_FILE_IO_MODE` env-var-based parametrization ineffective. As a consequence, the tests and benchmarks in `test_runner/` were using the binary built-in-default, i.e., `buffered`.	2025-05-09 18:13:35 +00:00
Conrad Ludgate	5cd7f936f9	fix(neon-rls): optimistically assume role grants are already assigned for replicas (#11811 ) ## Problem Read replicas cannot grant permissions for roles for Neon RLS. Usually the permission is already granted, so we can optimistically check. See INC-509 ## Summary of changes Perform a permission lookup prior to actually executing any grants.	2025-05-09 07:48:30 +00:00
Conrad Ludgate	bef5954fd7	feat(proxy): track SNI usage by protocol, including for http (#11863 ) ## Problem We want to see how many users of the legacy serverless driver are still using the old URL for SQL-over-HTTP traffic. ## Summary of changes Adds a protocol field to the connections_by_sni metric. Ensures it's incremented for sql-over-http.	2025-05-08 16:46:57 +00:00
Christian Schwarz	8477d15f95	feat(direct IO): remove special case in test suite for compat tests (#11864 ) PR - https://github.com/neondatabase/neon/pull/11558 adds special treatment for compat snapshot binaries which don't understand the `direct-rw` mode. A new compat snapshot has been published since, so, we can remove the special case. refs: - fixes https://github.com/neondatabase/neon/issues/11598	2025-05-08 16:11:45 +00:00
Arpad Müller	622b3b2993	Fixes for enabling --timelines-onto-safekeepers in tests (#11854 ) Second PR with fixes extracted from #11712, relating to `--timelines-onto-safekeepers`. Does the following: * Moves safekeeper registration to `neon_local` instead of the test fixtures * Pass safekeeper JWT token if `--timelines-onto-safekeepers` is enabled * Allow some warnings related to offline safekeepers (similarly to how we allow them for offline pageservers) * Enable generations on the compute's config if `--timelines-onto-safekeepers` is enabled * fix parallel `pull_timeline` race condition (the one that #11786 put for later) Fixes #11424 Part of #11670	2025-05-08 15:13:11 +00:00
Dmitrii Kovalkov	7e55497e13	tests: flush wal before waiting for last record lsn (#11726 ) ## Problem Compute may flush WAL on page boundaries, leaving some records partially flushed for a long time. It leads to `wait_for_last_flush_lsn` stuck waiting for this partial LSN. - Closes: https://github.com/neondatabase/cloud/issues/27876 ## Summary of changes - Flush WAL via CHECKPOINT after requesting current_wal_lsn to make sure that the record we point to is flushed in full - Use proper endpoint in `test_timeline_detach_with_aux_files_with_detach_v1`	2025-05-08 10:00:45 +00:00
Shockingly Good	4d2e4b19c3	fix(compute) Correct the PGXN s3 gateway URL. (#11796 ) Corrects the postgres extension s3 gateway address to be not just a domain name but a full base URL. To make the code more readable, the option is renamed to "remote_ext_base_url", while keeping the old name also accessible by providing a clap argument alias. Also provides a very simple and, perhaps, even redundant unit test to confirm the logic behind parsing of the corresponding CLI argument. ## Problem As it is clearly stated in https://github.com/neondatabase/cloud/issues/26005, using of the short version of the domain name might work for now, but in the future, we should get rid of using the `default` namespace and this is where it will, most likely, break down. ## Summary of changes The changes adjust the domain name of the extension s3 gateway to use the proper base url format instead of the just domain name assuming the "default" namespace and add a new CLI argument name for to reflect the change and the expectance.	2025-05-07 16:34:08 +00:00
Mikhail	5c356c63eb	endpoint_storage compute_ctl integration (#11550 ) Add `/lfc/(prewarm\|offload)` routes to `compute_ctl` which interact with endpoint storage. Add `prewarm_lfc_on_startup` spec option which, if enabled, downloads LFC prewarm data on compute startup. Resolves: https://github.com/neondatabase/cloud/issues/26343	2025-05-06 22:02:12 +00:00
Tristan Partin	f9b3a2e059	Add scoping to compute_ctl JWT claims (#11639 ) Currently we only have an admin scope which allows a user to bypass the compute_id check. When the admin scope is provided, validate the audience of the JWT to be "compute". Closes: https://github.com/neondatabase/cloud/issues/27614 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-05-06 19:51:10 +00:00
Alex Chi Z.	ae2c3ac12f	test: revert relsizev2 config (#11759 ) ## Problem part of https://github.com/neondatabase/neon/issues/9516 One thing I realized in the past few months is that "no-way-back" things like this are scary to roll out without a fine-grained rollout infra. The plan was to flip the flag in the repo and roll it out soon, but I don't think rolling out would happen in the near future. So I'd rather revert the flag to avoid creating a discrepancy between staging and the regress tests. ## Summary of changes Not using rel_size_v2 by default in unit tests; we still have a few tests to explicitly test the new format so we still get some test coverages. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-01 17:51:10 +00:00
John Spray	0482690534	pageserver: make control_plane_api & generations fully mandatory (#10715 ) ## Problem We had retained the ability to run in a generation-less mode to support test_generations_upgrade, which was replaced with a cleaner backward compat test in https://github.com/neondatabase/neon/pull/10701 ## Summary of changes - Remove all the special cases for "if no generation" or "if no control plane api" - Make control_plane_api config mandatory --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-04-28 17:24:55 +00:00
Dmitrii Kovalkov	ef53a76434	storage_broker: https handler (#11603 ) ## Problem Broker supports only HTTP, no HTTPS - Closes: https://github.com/neondatabase/cloud/issues/27492 ## Summary of changes - Add `listen_https_addr`, `ssl_key_file`, `ssl_cert_file`, `ssl_cert_reload_period` arguments to storage broker - Make `listen_addr` argument optional - Listen https in storage broker - Support https for storage broker request in neon_local - Add `use_https_storage_broker_api` option to NeonEnvBuilder	2025-04-25 14:28:56 +00:00
Tristan Partin	2526f6aea1	Add remote extension test with library component (#11301 ) The current test was just SQL files only, but we also want to test a remote extension which includes a loadable library. With both extensions we should cover a larger portion of compute_ctl's remote extension code paths. Fixes: https://github.com/neondatabase/neon/issues/11146 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-24 22:33:46 +00:00
Vlad Lazar	5ba7315c84	storage_controller: reconcile completed imports at start-up (#11614 ) ## Problem In https://github.com/neondatabase/neon/pull/11345 coordination of imports moved to the storage controller. It involves notifying cplane when the import has been completed by calling an idempotent endpoint. If the storage controller shuts down in the middle of finalizing an import, it would never be retried. ## Summary of changes Reconcile imports at start-up by fetching the complete imports from the database and spawning a background task which notifies cplane. Closes: https://github.com/neondatabase/neon/issues/11570	2025-04-24 18:39:19 +00:00
Christian Schwarz	8afb783708	feat: Direct IO for the pageserver write path (#11558 ) # Problem The Pageserver read path exclusively uses direct IO if `virtual_file_io_mode=direct`. The write path is half-finished. Here is what the various writing components use: \|what\|buffering\|flags on <br/>`v_f_io_mode`<br/>=`buffered`\|flags on <br/>`virtual_file_io_mode`<br/>=`direct`\| \|-\|-\|-\|-\| \|`DeltaLayerWriter`\| BlobWriter<BUFFERED=true> \| () \| () \| \|`ImageLayerWriter`\| BlobWriter<BUFFERED=false> \| () \| () \| \|`download_layer_file`\|BufferedWriter\|()\|()\| \|`InMemoryLayer`\|BufferedWriter\|()\|O_DIRECT\| The vehicle towards direct IO support is `BufferedWriter` which - largely takes care of O_DIRECT alignment & size-multiple requirements - double-buffering to mask latency `DeltaLayerWriter`, `ImageLayerWriter` use `blob_io::BlobWriter` , which has neither of these. # Changes ## High-Level At a high-level this PR makes the following primary changes: - switch the two layer writer types to use `BufferedWriter` & make sensitive to `virtual_file_io_mode` (via open_with_options_v2) - make `download_layer_file` sensitive to `virtual_file_io_mode` (also via open_with_options_v2) - add `virtual_file_io_mode=direct-rw` as a feature gate - we're hackish-ly piggybacking on OpenOptions's ask for write access here - this means with just `=direct` InMemoryLayer reads and writes no longer uses O_DIRECT - this is transitory and we'll remove the `direct-rw` variant once the rollout is complete (The `_v2` APIs for opening / creating VirtualFile are those that are sensitive to `virtual_file_io_mode`) The result is: \|what\|uses <br/>`BufferedWriter`\|flags on <br/>`v_f_io_mode`<br/>=`buffered`\|flags on <br/>`v_f_io_mode`<br/>=`direct`\|flags on <br/>`v_f_io_mode`<br/>=`direct-rw`\| \|-\|-\|-\|-\|-\| \|`DeltaLayerWriter`\| ~~Blob~~BufferedWriter \| () \| () \| O_DIRECT \| \|`ImageLayerWriter`\| ~~Blob~~BufferedWriter \| () \| () \| O_DIRECT \| \|`download_layer_file`\|BufferedWriter\|()\|()\|O_DIRECT\| \|`InMemoryLayer`\|BufferedWriter\|()\|~~O_DIRECT~~()\|O_DIRECT\| ## Code-Level The main change is: - Switch `blob_io::BlobWriter` away from its own buffering method to use `BufferedWriter`. Additional prep for upholding `O_DIRECT` requirements: - Layer writer `finish()` methods switched to use IoBufferMut for guaranteed buffer address alignment. The size of the buffers is PAGE_SZ and thereby implicitly assumed to fulfill O_DIRECT requirements. For the hacky feature-gating via `=direct-rw`: - Track `OpenOptions::write(true\|false)` in a field; bunch of mechanical churn. - Consolidate the APIs in which we "open" or "create" VirtualFile for better overview over which parts of the code use the `_v2` APIs. Necessary refactorings & infra work: - Add doc comments explaining how BufferedWriter ensures that writes are compliant with O_DIRECT alignment & size constraints. This isn't new, but should be spelled out. - Add the concept of shutdown modes to `BufferedWriter::shutdown` to make writer shutdown adhere to these constraints. - The `PadThenTruncate` mode might not be necessary in practice because I believe all layer files ever written are sized in multiples `PAGE_SZ` and since `PAGE_SZ` is larger than the current alignment requirements (512/4k depending on platform), it won't be necesary to pad. - Some test (I believe `round_trip_test_compressed`?) required it though - [ ] TODO: decide if we want to accept that complexity; if we do then address TODO in the code to separate alignment requirement from buffer capacity - Add `set_len` (=`ftruncate`) VirtualFile operation to support the above. - Allow `BufferedWriter` to start at a non-zero offset (to make room for the summary block). Cleanups unlocked by this change: - Remove non-positional APIs from VirtualFile (e.g. seek, write_full, read_full) Drive-by fixes: - PR https://github.com/neondatabase/neon/pull/11585 aimed to run unit tests for all `virtual_file_io_mode` combinations but didn't because of a missing `_` in the env var. # Performance This section assesses this PR's impact on deployments with current production setting (`=direct`) and anticipated impact of switching to (`=direct-rw`). For `DeltaLayerWriter`, `=direct` should remain unchanged to slightly improved on throughput because the `BlobWriter`'s buffer had the same size as the `BufferedWriter`'s buffer, but it didn't have the double-buffering that `BufferedWriter` has. The `=direct-rw` enables direct IO; throughput should not be suffering because of double-buffering; benchmarks will show if this is true. The `ImageLayerWriter` was previously not doing any buffering (`BUFFERED=false`). It went straight to issuing the IO operation to the underlying VirtualFile and the buffering was done by the kernel. The switch to `BufferedWriter` under `=direct` adds an additional memcpy into the BufferedWriter's buffer. We will win back that memcpy when enabling direct IO via `=direct-rw`. A nice win from the switch to `BufferedWriter` is that ImageLayerWriter performs >=16x fewer write operations to VirtualFile (the BlobWriter performs one write per len field and one write per image value). This should save low tens of microseconds of CPU overhead from doing all these syscalls/io_uring operations, regardless of `=direct` or `=direct-rw`. Aside from problems with alignment, this write frequency without double-buffering is prohibitive if we actually have to wait for the disk, which is what will happen when we enable direct IO via (`=direct-rw`). Throughput should not be suffering because of BufferedWrite's double-buffering; benchmarks will show if this is true. `InMemoryLayer` at `=direct` will flip back to using buffered IO but remain on BufferedWriter. The buffered IO adds back one memcpy of CPU overhead. Throughput should not suffer and will might improve on not-memory-pressured Pageservers but let's remember that we're doing the whole direct IO thing to eliminate global memory pressure as a source of perf variability. ## bench_ingest I reran `bench_ingest` on `im4gn.2xlarge` and `Hetzner AX102`. Use `git diff` with `--word-diff` or similar to see the change. General guidance on interpretation: - immediate production impact of this PR without production config change can be gauged by comparing the same `io_mode=Direct` - end state of production switched over to `io_mode=DirectRw` can be gauged by comparing old results' `io_mode=Direct` to new results' `io_mode=DirectRw` Given above guidance, on `im4gn.2xlarge` - immediate impact is a significant improvement in all cases - end state after switching has same significant improvements in all cases - ... except `ingest/io_mode=DirectRw volume_mib=128 key_size_bytes=8192 key_layout=Sequential write_delta=Yes` which only achieves `238 MiB/s` instead of `253.43 MiB/s` - this is a 6% degradation - this workload is typical for image layer creation # Refs - epic https://github.com/neondatabase/neon/issues/9868 - stacked atop - preliminary refactor https://github.com/neondatabase/neon/pull/11549 - bench_ingest overhaul https://github.com/neondatabase/neon/pull/11667 - derived from https://github.com/neondatabase/neon/pull/10063 Co-authored-by: Yuchen Liang <yuchen@neon.tech>	2025-04-24 14:57:36 +00:00
Mikhail Kot	c3534cea39	Rename object_storage->endpoint_storage (#11678 ) 1. Rename service to avoid ambiguity as discussed in Slack 2. Ignore endpoint_id in read paths as requested in https://github.com/neondatabase/cloud/issues/26346#issuecomment-2806758224	2025-04-23 14:03:19 +00:00
a-masterov	6c2e5c044c	random operations test (#10986 ) ## Problem We need to test the stability of Neon. ## Summary of changes The test runs random operations on a Neon project. It performs via the Public API calls the following operations: `create a branch`, `delete a branch`, `add a read-only endpoint`, `delete a read-only endpoint`, `restore a branch to a random position in the past`. All the branches and endpoints are loaded with `pgbench`. --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-04-17 19:59:35 +00:00
Tristan Partin	c002236145	Remove compute_ctl authorization bypass if testing feature was enable (#11596 ) We want to exercise the authorization middleware in our regression tests. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-16 17:54:51 +00:00
Vlad Lazar	96b46365e4	tests: attach final metrics to allure report (#11604 ) ## Problem Metrics are saved in https://github.com/neondatabase/neon/pull/11559, but the file is not matched by the attachment regex. ## Summary of changes Make attachment regex match the metrics file.	2025-04-16 10:26:47 +00:00
Fedor Dikarev	9a6ace9bde	introduce new runners: unit-perf and use them for benchmark jobs (#11409 ) ## Problem Benchmarks results are inconsistent on existing small-metal runners ## Summary of changes Introduce new `unit-perf` runners, and lets run benchmark on them. The new hardware has slower, but consistent, CPU frequency - if run with default governor schedutil. Thus we needed to adjust some testcases' timeouts and add some retry steps where hard-coded timeouts couldn't be increased without changing the system under test. - [wait_for_last_record_lsn](`6592d69a67/test_runner/fixtures/pageserver/utils.py (L193)`) 1000s -> 2000s - [test_branch_creation_many](https://github.com/neondatabase/neon/pull/11409/files#diff-2ebfe76f89004d563c7e53e3ca82462e1d85e92e6d5588e8e8f598bbe119e927) 1000s - [test_ingest_insert_bulk](https://github.com/neondatabase/neon/pull/11409/files#diff-e90e685be4a87053bc264a68740969e6a8872c8897b8b748d0e8c5f683a68d9f) - with back throttling disabled compute becomes unresponsive for more than 60 seconds (PG hard-coded client authentication connection timeout) - [test_sharded_ingest](https://github.com/neondatabase/neon/pull/11409/files#diff-e8d870165bd44acb9a6d8350f8640b301c1385a4108430b8d6d659b697e4a3f1) 600s -> 1200s Right now there are only 2 runners of that class, and if we decide to go with them, we have to check how much that type of runners we need, so jobs not stuck with waiting for that type of runners available. However we now decided to run those runners with governor performance instead of schedutil. This achieves almost same performance as previous runners but still achieves consistent results for same commit Related issue to activate performance governor on these runners https://github.com/neondatabase/runner/pull/138 ## Verification that it helps ### analyze runtimes on new runner for same commit Table of runtimes for the same commit on different runners in [run](https://github.com/neondatabase/neon/actions/runs/14417589789) \| Run \| Benchmarks (1) \| Benchmarks (2) \|Benchmarks (3) \|Benchmarks (4) \| Benchmarks (5) \| \|--------\|--------\|---------\|---------\|---------\|---------\| \| 1 \| 1950.37s \| 6374.55s \| 3646.15s \| 4149.48s \| 2330.22s \| \| 2 \| - \| 6369.27s \| 3666.65s \| 4162.42s \| 2329.23s \| \| Delta % \| - \| 0,07 % \| 0,5 % \| 0,3 % \| 0,04 % \| \| with governor performance \| 1519.57s \| 4131.62s \| - \| - \| - \| \| second run gov. perf. \| 1513.62s \| 4134.67s \| - \| - \| - \| \| Delta % \| 0,3 % \| 0,07 % \| - \| - \| - \| \| speedup gov. performance \| 22 % \| 35 % \| - \| - \| - \| \| current desktop class hetzner runners (main) \| 1487.10s \| 3699.67s \| - \| - \| - \| \| slower than desktop class \| 2 % \| 12 % \| - \| - \| - \| In summary, the runtimes for the same commit on this hardware varies less than 1 %. --------- Co-authored-by: BodoBolero <peterbendel@neon.tech>	2025-04-15 08:21:44 +00:00
Vlad Lazar	90b706cd96	tests: save pageserver metrics at the end of the test (#11559 ) ## Problem Sometimes it's useful to see the pageserver metrics after a test in order to debug stuff. For example, for https://github.com/neondatabase/neon/issues/11465 I'd like to know what the remote storage latencies are from the client. ## Summary of changes When stopping the env, record the pageserver metrics into a file in the pageserver's workdir.	2025-04-14 15:13:20 +00:00
Vlad Lazar	e85607eed8	tests: remove config tweak allowing old versions to start with a batching config (#11560 ) ## Problem Pageservers now ignore unknown config fields, so this config tweaking is no longer needed. ## Summary of changes Get rid of the hack. Closes https://github.com/neondatabase/neon/issues/11524	2025-04-14 14:42:35 +00:00
Vlad Lazar	148b3701cf	pageserver: add metrics for get page batch breaking reasons (#11545 ) ## Problem https://github.com/neondatabase/neon/pull/11494 changes the batching logic, but we don't have a way to evaluate it. ## Summary of changes This PR introduces a global and per timeline metric which tracks the reason for which a batch was broken.	2025-04-14 13:24:47 +00:00
Vlad Lazar	a338984dc7	pageserver: support keys at different LSNs in one get page batch (#11494 ) ## Problem Get page batching stops when we encounter requests at different LSNs. We are leaving batching factor on the table. ## Summary of changes The goal is to support keys with different LSNs in a single batch and still serve them with a single vectored get. Important restriction: the same key at different LSNs is not supported in one batch. Returning different key versions is a much more intrusive change. Firstly, the read path is changed to support "scattered" queries. This is a conceptually simple step from https://github.com/neondatabase/neon/pull/11463. Instead of initializing the fringe for one keyspace, we do it for multiple at different LSNs and let the logic already present into the fringe handle selection. Secondly, page service code is updated to support batching at different LSNs. Eeach request parsed from the wire determines its effective request LSN and keeps it in mem for the batcher toinspect. The batcher allows keys at different LSNs in one batch as long one key is not requested at different LSNs. I'd suggest doing the first pass commit by commit to get a feel for the changes. ## Results I used the batching test from [Christian's PR](https://github.com/neondatabase/neon/pull/11391) which increases the change of batch breaks. Looking at the logs I think the new code is at the max batching factor for the workload (we only break batches due to them being oversized or because the executor is idle). ``` Main: Reasons for stopping batching: {'LSN changed': 22843, 'of batch size': 33417} test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 14.6662 My branch: Reasons for stopping batching: {'of batch size': 37024} test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 19.8333 ``` Related: https://github.com/neondatabase/neon/issues/10765	2025-04-14 09:05:29 +00:00
Tristan Partin	ff5a527167	Consolidate compute_ctl configuration structures (#11514 ) Previously, the structure of the spec file was just the compute spec. However, the response from the control plane get spec request included the compute spec and the compute_ctl config. This divergence was hindering other work such as adding regression tests for compute_ctl HTTP authorization. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-11 15:06:29 +00:00
Erik Grinaker	3c8565a194	test_runner: propagate config via `attach_hook` for test fix (#11529 ) ## Problem The `pagebench` benchmarks set up an initial dataset by creating a template tenant, copying the remote storage to a bunch of new tenants, and attaching them to Pageservers. In #11420, we found that `test_pageserver_characterize_throughput_with_n_tenants` had degraded performance because it set a custom tenant config in Pageservers that was then replaced with the default tenant config by the storage controller. The initial fix was to register the tenants directly in the storage controller, but this created the tenants with generation 1. This broke `test_basebackup_with_high_slru_count`, where the template tenant was at generation 2, leading to all layer files at generation 2 being ignored. Resolves #11485. Touches #11381. ## Summary of changes This patch addresses both test issues by modifying `attach_hook` to also take a custom tenant config. This allows attaching tenants to Pageservers from pre-existing remote storage, specifying both the generation and tenant config when registering them in the storage controller.	2025-04-11 11:31:12 +00:00
Christian Schwarz	979fa0682b	tests: update batching perf test workload to include scattered LSNs (#11391 ) The batching perf test workload is currently read-only sequential scans. However, realistic workloads have concurrent writes (to other pages) going on. This PR simulates concurrent writes to other pages by emitting logical replication messages. These degrade the achieved batching factor, for the reason see - https://github.com/neondatabase/neon/issues/10765 PR - https://github.com/neondatabase/neon/pull/11494 will fix this problem and get batching factor back up. --------- Co-authored-by: Vlad Lazar <vlad@neon.tech>	2025-04-11 09:55:49 +00:00
Tristan Partin	342607473a	Make Endpoint::respec_deep() infinitely deep (#11527 ) Because it wasn't recursive, there was a limit to the depth of updates. This work is necessary because as we teach neon_local and compute_ctl that the content in --spec-path should match a similar structure we get from the control plane, the spec object itself will no longer be toplevel. It will be under the "spec" key. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-10 19:55:51 +00:00
Arseny Sher	fae7528adb	walproposer: make it aware of membership (#11407 ) ## Problem Walproposer should get elected and commit WAL on safekeepers specified by the membership configuration. ## Summary of changes - Add to wp `members_safekeepers` and `new_members_safekeepers` arrays mapping configuration members to connection slots. Establish this mapping (by node id) when safekeeper sends greeting, giving its id and when mconf becomes known / changes. - Add to TermsCollected, VotesCollected, GetAcknowledgedByQuorumWALPosition membership aware logic. Currently it partially duplicates existing one, but we'll drop the latter eventually. - In python, rename Configuration to MembershipConfiguration for clarity. - Add test_quorum_sanity testing new logic. ref https://github.com/neondatabase/neon/issues/10851	2025-04-10 09:55:37 +00:00
Erik Grinaker	63ee8e2181	test_runner: ignore `.___temp` files in `evict_random_layers` (#11509 ) ## Problem `test_location_conf_churn` often fails with `neither image nor delta layer`, but doesn't say what the file actually is. However, past local failures have indicated that it might be `.___temp` files. Touches https://github.com/neondatabase/neon/issues/11348. ## Summary of changes Ignore `.___temp` files when evicting local layers, and include the file name in the error message.	2025-04-09 19:03:49 +00:00
Mikhail Kot	6138d61592	Object storage proxy (#11357 ) Service targeted for storing and retrieving LFC prewarm data. Can be used for proxying S3 access for Postgres extensions like pg_mooncake as well. Requests must include a Bearer JWT token. Token is validated using a pemfile (should be passed in infra/). Note: app is not tolerant to extra trailing slashes, see app.rs `delete_prefix` test for comments. Resolves: https://github.com/neondatabase/cloud/issues/26342 Unrelated changes: gate a `rename_noreplace` feature and disable it in `remote_storage` so as `object_storage` can be built with musl	2025-04-08 14:54:53 +00:00
Christian Schwarz	aad410c8f1	improve ondemand-download latency observability (#11421 ) ## Problem We don't have metrics to exactly quantify the end user impact of on-demand downloads. Perf tracing is underway (#11140) to supply us with high-resolution samples. But it will also be useful to have some aggregate per-timeline and per-instance metrics that definitively contain all observations. ## Summary of changes This PR consists of independent commits that should be reviewed independently. However, for convenience, we're going to merge them together. - refactor(metrics): measure_remote_op can use async traits - impr(pageserver metrics): task_kind dimension for remote_timeline_client latency histo - implements https://github.com/neondatabase/cloud/issues/26800 - refs https://github.com/neondatabase/cloud/issues/26193#issuecomment-2769705793 - use the opportunity to rename the metric and add a _global suffix; checked grafana export, it's only used in two personal dashboards, one of them mine, the other by Heikki - log on-demand download latency for expensive-to-query but precise ground truth - metric for wall clock time spent waiting for on-demand downloads ## Refs - refs https://github.com/neondatabase/cloud/issues/26800 - a bunch of minor investigations / incidents into latency outliers	2025-04-04 18:04:39 +00:00
Christian Schwarz	4f94751b75	pageserver config: ignore+warn about unknown fields (instead of `deny_unknown_fields`) (#11275 ) # Refs - refs https://github.com/neondatabase/neon/issues/8915 - discussion thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1742406381132599 - stacked atop https://github.com/neondatabase/neon/pull/11298 - corresponding internal docs update that illustrates how this PR removes friction: https://github.com/neondatabase/docs/pull/404 # Problem Rejecting `pageserver.toml`s with unknown fields adds friction, especially when using `pageserver.toml` fields as feature flags that need to be decommissioned. See the added paragraphs on `pageserver_api::models::ConfigToml` for details on what kind of friction it causes. Also read the corresponding internal docs update linked above to see a more imperative guide for using `pageserver.toml` flags as feature flags. # Solution ## Ignoring unknown fields Ignoring is the serde default behavior. So, just remove `serde(deny_unknown_fields)` from all structs in `pageserver_api::config::ConfigToml` `pageserver_api::config::TenantConfigToml`. I went through all the child fields and verified they don't use `deny_unknown_fields` either, including those shared with `pageserver_api::models`. ## Warning about unknown fields We still want to warn about unknown fields to - be informed about typos in the config template - be reminded about feature-flag style configs that have been cleaned up in code but not yet in config templates We tried `serde_ignore` (cf draft #11319) but it doesn't work with `serde(flatten)`. The solution we arrived at is to compare the on-disk TOML with the TOML that we produce if we serialize the `ConfigToml` again. Any key specified in the on-disk TOML but not present in the serialized TOML is flagged as an ignored key. The mechanism to do it is a tiny recursive decent visitor on the `toml_edit::DocumentMut`. # Future Work Invalid config _values_ in known fields will continue to fail pageserver startup. See - https://github.com/neondatabase/cloud/issues/24349 for current worst case impact to deployments & ideas to improve.	2025-04-04 17:30:58 +00:00
Dmitrii Kovalkov	181af302b5	storcon + safekeeper + scrubber: propagate root CA certs everywhere (#11418 ) ## Problem There are some places in the code where we create `reqwest::Client` without providing SSL CA certs from `ssl_ca_file`. These will break after we enable TLS everywhere. - Part of https://github.com/neondatabase/cloud/issues/22686 ## Summary of changes - Support `ssl_ca_file` in storage scrubber. - Add `use_https_safekeeper_api` option to safekeeper to use https for peer requests. - Propagate SSL CA certs to storage_controller/client, storcon's ComputeHook, PeerClient and maybe_forward.	2025-04-04 06:30:48 +00:00

1 2 3 4 5 ...

1026 Commits