rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 05:00:37 +00:00

Author	SHA1	Message	Date
Andrew Rudenko	acc075071d	feat(compute_ctl): add periodic `lease lsn` requests for static computes (#7994 ) Part of #7497 ## Problem Static computes pinned at some fix LSN could be created initially within PITR interval but eventually go out it. To make sure that Static computes are not affected by GC, we need to start using the LSN lease API (introduced in #8084) in compute_ctl. ## Summary of changes compute_ctl - Spawn a thread for when a static compute starts to periodically ping pageserver(s) to make LSN lease requests. - Add `test_readonly_node_gc` to test if static compute can read all pages without error. - (test will fail on main without the code change here) page_service - `wait_or_get_last_lsn` will now allow `request_lsn` less than `latest_gc_cutoff_lsn` to proceed if there is a lease on `request_lsn`. Signed-off-by: Yuchen Liang <yuchen@neon.tech> Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>	2024-08-28 19:09:26 +00:00
Christian Schwarz	9627747d35	bypass `PageCache` for `InMemoryLayer` + avoid `Value::deser` on L0 flush (#8537 ) Part of [Epic: Bypass PageCache for user data blocks](https://github.com/neondatabase/neon/issues/7386). # Problem `InMemoryLayer` still uses the `PageCache` for all data stored in the `VirtualFile` that underlies the `EphemeralFile`. # Background Before this PR, `EphemeralFile` is a fancy and (code-bloated) buffered writer around a `VirtualFile` that supports `blob_io`. The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`. At those offset, we find a varint length followed by the serialized `Value`. Vectored reads (`get_values_reconstruct_data`) are not in fact vectored - each `Value` that needs to be read is read sequentially. The `will_init` bit of information which we use to early-exit the `get_values_reconstruct_data` for a given key is stored in the serialized `Value`, meaning we have to read & deserialize the `Value` from the `EphemeralFile`. The L0 flushing also needs to re-determine the `will_init` bit of information, by deserializing each value during L0 flush. # Changes 1. Store the value length and `will_init` information in the `InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the values. 2. For `get_values_reconstruct_data`: - Use the in-memory `index` figures out which values need to be read. Having the `will_init` stored in the index enables us to do that. - View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes in size (adjustable constant). A "DIO chunk" is the minimal unit that we can read under direct IO. - Figure out which chunks need to be read to retrieve the serialized bytes for thes values we need to read. - Coalesce chunk reads such that each DIO chunk is only read once to serve all value reads that need data from that chunk. - Merge adjacent chunk reads into larger `EphemeralFile::read_exact_at_eof_ok` of up to 128k (adjustable constant). 3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer from the underlying VirtualFile and/or its in-memory buffer. 4. The L0 flush code is changed to use the `index` directly, `blob_io` 5. We can remove the `ephemeral_file::page_caching` construct now. The `get_values_reconstruct_data` changes seem like a bit overkill but they are necessary so we issue the equivalent amount of read system calls compared to before this PR where it was highly likely that even if the first PageCache access was a miss, remaining reads within the same `get_values_reconstruct_data` call from the same `EphemeralFile` page were a hit. The "DIO chunk" stuff is truly unnecessary for page cache bypass, but, since we're working on [direct IO](https://github.com/neondatabase/neon/issues/8130) and https://github.com/neondatabase/neon/issues/8719 specifically, we need to do _something_ like this anyways in the near future. # Alternative Design The original plan was to use the `vectored_blob_io` code it relies on the invariant of Delta&Image layers that `index order == values order`. Further, `vectored_blob_io` code's strategy for merging IOs is limited to adjacent reads. However, with direct IO, there is another level of merging that should be done, specifically, if multiple reads map to the same "DIO chunk" (=alignment-requirement-sized and -aligned region of the file), then it's "free" to read the chunk into an IO buffer and serve the two reads from that buffer. => https://github.com/neondatabase/neon/issues/8719 # Testing / Performance Correctness of the IO merging code is ensured by unit tests. Additionally, minimal tests are added for the `EphemeralFile` implementation and the bit-packed `InMemoryLayerIndexValue`. Performance testing results are presented below. All pref testing done on my M2 MacBook Pro, running a Linux VM. It's a release build without `--features testing`. We see definitive improvement in ingest performance microbenchmark and an ad-hoc microbenchmark for getpage against InMemoryLayer. ``` baseline: commit `7c74112b2a` origin/main HEAD: `ef1c55c52e` ``` <details> ``` cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta' baseline ingest-small-values/ingest 128MB/100b seq, no delta time: [483.50 ms 498.73 ms 522.53 ms] thrpt: [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s] HEAD ingest-small-values/ingest 128MB/100b seq, no delta time: [479.22 ms 482.92 ms 487.35 ms] thrpt: [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s] ``` </details> We don't have a micro-benchmark for InMemoryLayer and it's quite cumbersome to add one. So, I did manual testing in `neon_local`. <details> ``` ./target/release/neon_local stop rm -rf .neon ./target/release/neon_local init ./target/release/neon_local start ./target/release/neon_local tenant create --set-default ./target/release/neon_local endpoint create foo ./target/release/neon_local endpoint start foo psql 'postgresql://cloud_admin@127.0.0.1:55432/postgres' psql (13.16 (Debian 13.16-0+deb11u1), server 15.7) CREATE TABLE wal_test ( id SERIAL PRIMARY KEY, data TEXT ); DO $$ DECLARE i INTEGER := 1; BEGIN WHILE i <= 500000 LOOP INSERT INTO wal_test (data) VALUES ('data'); i := i + 1; END LOOP; END $$; -- => result is one L0 from initdb and one 137M-sized ephemeral-2 DO $$ DECLARE i INTEGER := 1; random_id INTEGER; random_record wal_test%ROWTYPE; start_time TIMESTAMP := clock_timestamp(); selects_completed INTEGER := 0; min_id INTEGER := 1; -- Minimum ID value max_id INTEGER := 100000; -- Maximum ID value, based on your insert range iters INTEGER := 100000000; -- Number of iterations to run BEGIN WHILE i <= iters LOOP -- Generate a random ID within the known range random_id := min_id + floor(random() * (max_id - min_id + 1))::int; -- Select the row with the generated random ID SELECT * INTO random_record FROM wal_test WHERE id = random_id; -- Increment the select counter selects_completed := selects_completed + 1; -- Check if a second has passed IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN -- Print the number of selects completed in the last second RAISE NOTICE 'Selects completed in last second: %', selects_completed; -- Reset counters for the next second selects_completed := 0; start_time := clock_timestamp(); END IF; -- Increment the loop counter i := i + 1; END LOOP; END $$; ./target/release/neon_local stop baseline: commit `7c74112b2a` origin/main NOTICE: Selects completed in last second: 1864 NOTICE: Selects completed in last second: 1850 NOTICE: Selects completed in last second: 1851 NOTICE: Selects completed in last second: 1918 NOTICE: Selects completed in last second: 1911 NOTICE: Selects completed in last second: 1879 NOTICE: Selects completed in last second: 1858 NOTICE: Selects completed in last second: 1827 NOTICE: Selects completed in last second: 1933 ours NOTICE: Selects completed in last second: 1915 NOTICE: Selects completed in last second: 1928 NOTICE: Selects completed in last second: 1913 NOTICE: Selects completed in last second: 1932 NOTICE: Selects completed in last second: 1846 NOTICE: Selects completed in last second: 1955 NOTICE: Selects completed in last second: 1991 NOTICE: Selects completed in last second: 1973 ``` NB: the ephemeral file sizes differ by ca 1MiB, ours being 1MiB smaller. </details> # Rollout This PR changes the code in-place and is not gated by a feature flag.	2024-08-28 18:31:41 +00:00
Vlad Lazar	793b5061ec	storcon: track pageserver availability zone (#8852 ) ## Problem In order to build AZ aware scheduling, the storage controller needs to know what AZ pageservers are in. Related https://github.com/neondatabase/neon/issues/8848 ## Summary of changes This patch set adds a new nullable column to the `nodes` table: `availability_zone_id`. The node registration request is extended to include the AZ id (pageservers already have this in their `metadata.json` file). If the node is already registered, then we update the persistent and in-memory state with the provided AZ. Otherwise, we add the node with the AZ to begin with. A couple assumptions are made here: 1. Pageserver AZ ids are stable 2. AZ ids do not change over time Once all pageservers have a configured AZ, we can remove the optionals in the code and make the database column not nullable.	2024-08-28 18:23:55 +01:00
Yuchen Liang	a889a49e06	pageserver: do vectored read on each dio-aligned section once (#8763 ) Part of #8130, closes #8719. ## Problem Currently, vectored blob io only coalesce blocks if they are immediately adjacent to each other. When we switch to Direct IO, we need a way to coalesce blobs that are within the dio-aligned boundary but has gap between them. ## Summary of changes - Introduces a `VectoredReadCoalesceMode` for `VectoredReadPlanner` and `StreamingVectoredReadPlanner` which has two modes: - `AdjacentOnly` (current implementation) - `Chunked(<alignment requirement>)` - New `ChunkedVectorBuilder` that considers batching `dio-align`-sized read, the start and end of the vectored read will respect `stx_dio_offset_align` / `stx_dio_mem_align` (`vectored_read.start` and `vectored_read.blobs_at.first().start_offset` will be two different value). - Since we break the assumption that blobs within single `VectoredRead` are next to each other (implicit end offset), we start to store blob end offsets in the `VectoredRead`. - Adapted existing tests to run in both `VectoredReadCoalesceMode`. - The io alignment can also be live configured at runtime. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-08-28 15:54:42 +01:00
Heikki Linnakangas	2d10306f7a	Remove support for pageserver <-> compute protocol version 1 (#8774 ) Protocol version 2 has been the default for a while now, and we no longer have any computes running in production that used protocol version 1. This completes the migration by removing support for v1 in both the pageserver and the compute. See issue #6211.	2024-08-27 18:36:33 +03:00
Alex Chi Z.	0f65684263	feat(pageserver): use split layer writer in gc-compaction (#8608 ) Part of #8002, the final big PR in the batch. ## Summary of changes This pull request uses the new split layer writer in the gc-compaction. * It changes how layers are split. Previously, we split layers based on the original split point, but this creates too many layers (test_gc_feedback has one key per layer). * Therefore, we first verify if the layer map can be processed by the current algorithm (See https://github.com/neondatabase/neon/pull/8191, it's basically the same check) * On that, we proceed with the compaction. This way, it creates a large enough layer close to the target layer size. * Added a new set of functions `with_discard` in the split layer writer. This helps us skip layers if we are going to produce the same persistent key. * The delta writer will keep the updates of the same key in a single file. This might create a super large layer, but we can optimize it later. * The split layer writer is used in the gc-compaction algorithm, and it will split layers based on size. * Fix the image layer summary block encoded the wrong key range. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-08-26 14:19:47 -04:00
Christian Schwarz	97241776aa	pageserver: startup: ensure local disk state is durable (#8835 ) refs https://github.com/neondatabase/neon/issues/6989 Problem ------- After unclean shutdown, we get restarted, start reading the local filesystem, and make decisions based on those reads. However, some of the data might have not yet been fsynced when the unclean shutdown completed. Durability matters even though Pageservers are conceptually just a cache of state in S3. For example: - the cloud control plane is no control loop => pageserver responses to tenant attachmentm, etc, needs to be durable. - the storage controller does not rely on this (as much?) - we don't have layer file checksumming, so, downloaded+renamed but not fsynced layer files are technically not to be trusted - https://github.com/neondatabase/neon/issues/2683 Solution -------- `syncfs` the tenants directory during startup, before we start reading from it. This is a bit overkill because we do remove some temp files (InMemoryLayer!) later during startup. Further, these temp files are particularly likely to be dirty in the kernel page cache. However, we don't want to refactor that cleanup code right now, and the dirty data on pageservers is generally not that high. Last, with [direct IO](https://github.com/neondatabase/neon/issues/8130) we're going to have near-zero kernel page cache anyway quite soon.	2024-08-26 18:07:55 +02:00
Arpad Müller	2dd53e7ae0	Timeline archival test (#8824 ) This PR: * Implements the rule that archived timelines require all of their children to be archived as well, as specified in the RFC. There is no fancy locking mechanism though, so the precondition can still be broken. As a TODO for later, we still allow unarchiving timelines with archived parents. * Adds an `is_archived` flag to `TimelineInfo` * Adds timeline_archival_config to `PageserverHttpClient` * Adds a new `test_timeline_archive` test, loosely based on `test_timeline_delete` Part of #8088	2024-08-26 17:30:19 +02:00
John Spray	b65a95f12e	controller: use PageserverUtilization for scheduling (#8711 ) ## Problem Previously, the controller only used the shard counts for scheduling. This works well when hosting only many-sharded tenants, but works much less well when hosting single-sharded tenants that have a greater deviation in size-per-shard. Closes: https://github.com/neondatabase/neon/issues/7798 ## Summary of changes - Instead of UtilizationScore, carry the full PageserverUtilization through into the Scheduler. - Use the PageserverUtilization::score() instead of shard count when ordering nodes in scheduling. Q: Why did test_sharding_split_smoke need updating in this PR? A: There's an interesting side effect during shard splits: because we do not decrement the shard count in the utilization when we de-schedule the shards from before the split, the controller will now prefer to pick _different_ nodes for shards compared with which ones held secondaries before the split. We could use our knowledge of splitting to fix up the utilizations more actively in this situation, but I'm leaning toward leaving the code simpler, as in practical systems the impact of one shard on the utilization of a node should be fairly low (single digit %).	2024-08-23 18:32:56 +01:00
Alex Chi Z.	f4cac1f30f	impr(pageserver): error if keys are unordered in merge iter (#8818 ) In case of corrupted delta layers, we can detect the corruption and bail out the compaction. ## Summary of changes * Detect wrong delta desc of key range * Detect unordered deltas Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-23 16:38:42 +00:00
Alex Chi Z.	bc8cfe1b55	fix(pageserver): l0 check criteria (#8797 ) close https://github.com/neondatabase/neon/issues/8579 ## Summary of changes The `is_l0` check now takes both layer key range and the layer type. This allows us to have image layers covering the full key range in btm-most compaction (upcoming PR). However, we still don't allow delta layers to cover the full key range, and I will make btm-most compaction to generate delta layers with the key range of the keys existing in the layer instead of `Key::MIN..Key::HACK_MAX` (upcoming PR). Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-23 09:42:45 -04:00
Alex Chi Z.	6a74bcadec	feat(pageserver): remove features=testing restriction for compact (#8815 ) A small PR to make it possible to run force compaction in staging for btm-gc compaction testing. Part of https://github.com/neondatabase/neon/issues/8002 Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-23 14:32:00 +01:00
Alex Chi Z.	6eb638f4b3	feat(pageserver): warn on aux v1 tenants + default to v2 (#8625 ) part of https://github.com/neondatabase/neon/issues/8623 We want to discover potential aux v1 customers that we might have missed from the migrations. ## Summary of changes Log warnings on basebackup, load timeline, and the first put_file. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-22 22:31:38 +01:00
John Spray	7c74112b2a	pageserver: batch InMemoryLayer `put`s, remove need to sort items by LSN during ingest (#8591 ) ## Problem/Solution TimelineWriter::put_batch is simply a loop over individual puts. Each put acquires and releases locks, and checks for potentially starting a new layer. Batching these is more efficient, but more importantly unlocks future changes where we can pre-build serialized buffers much earlier in the ingest process, potentially even on the safekeeper (imagine a future model where some variant of DatadirModification lives on the safekeeper). Ensuring that the values in put_batch are written to one layer also enables a simplification upstream, where we no longer need to write values in LSN-order. This saves us a sort, but also simplifies follow-on refactors to DatadirModification: we can store metadata keys and data keys separately at that level without needing to zip them together in LSN order later. ## Why? In this PR, these changes are simplify optimizations, but they are motivated by evolving the ingest path in the direction of disentangling extracting DatadirModification from Timeline. It may not obvious how right now, but the general idea is that we'll end up with three phases of ingest: - A) Decode walrecords and build a datadirmodification with all the simple data contents already in a big serialized buffer ready to write to an ephemeral layer <-- this part can be pipelined and parallelized, and done on a safekeeper! - B) Let that datadirmodification see a Timeline, so that it can also generate all the metadata updates that require a read-modify-write of existing pages - C) Dump the results of B into an ephemeral layer. Related: https://github.com/neondatabase/neon/issues/8452 ## Caveats Doing a big monolithic buffer of values to write to disk is ordinarily an anti-pattern: we prefer nice streaming I/O. However: - In future, when we do this first decode stage on the safekeeper, it would be inefficient to serialize a Vec of Value, and then later deserialize it just to add blob size headers while writing into the ephemeral layer format. The idea is that for bulk write data, we will serialize exactly once. - The monolithic buffer is a stepping stone to pipelining more of this: by seriailizing earlier (rather than at the final put_value), we will be able to parallelize the wal decoding and bulk serialization of data page writes. - The ephemeral layer's buffered writer already stalls writes while it waits to flush: so while yes we'll stall for a couple milliseconds to write a couple megabytes, we already have stalls like this, just distributed across smaller writes. ## Benchmarks This PR is primarily a stepping stone to safekeeper ingest filtering, but also provides a modest efficiency improvement to the `wal_recovery` part of `test_bulk_ingest`. test_bulk_ingest: ``` test_bulk_insert[neon-release-pg16].insert: 23.659 s test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB test_bulk_insert[neon-release-pg16].peak_mem: 626 MB test_bulk_insert[neon-release-pg16].size: 0 MB test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB test_bulk_insert[neon-release-pg16].wal_recovery: 18.981 s test_bulk_insert[neon-release-pg16].compaction: 0.055 s vs. tip of main: test_bulk_insert[neon-release-pg16].insert: 24.001 s test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB test_bulk_insert[neon-release-pg16].peak_mem: 604 MB test_bulk_insert[neon-release-pg16].size: 0 MB test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB test_bulk_insert[neon-release-pg16].wal_recovery: 23.586 s test_bulk_insert[neon-release-pg16].compaction: 0.054 s ```	2024-08-22 10:04:42 +00:00
Alex Chi Z.	a968554a8c	fix(pageserver): unify initdb optimization for sparse keyspaces; fix force img generation (#8776 ) close https://github.com/neondatabase/neon/issues/8558 * Directly generate image layers for sparse keyspaces during initdb optimization. * Support force image layer generation for sparse keyspaces. * Fix a bug of incorrect image layer key range in case of duplicated keys. (The added line: `start = img_range.end;`) This can cause overlapping image layers and keys to disappear. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-21 21:25:21 +01:00
Joonas Koivunen	07b7c63975	test: avoid some too long shutdowns by flushing before shutdown (#8772 ) After #8655, we needed to mark some tests to shut down immediately. To aid these tests, try the new pattern of `flush_ep_to_pageserver` followed by a non-compacting checkpoint. This moves the general graceful shutdown problem of having too much to flush at shutdown into the test. Also, add logging for how long the graceful shutdown took, if we got to complete it for faster log eyeballing. Fixes: #8712 Cc: #8715, #8708	2024-08-21 14:26:27 -04:00
Christian Schwarz	21b684718e	pageserver: add counter for wait time on background loop semaphore (#8769 ) ## Problem Compaction jobs and other background loops are concurrency-limited through a global semaphore. The current counters allow quantifying how _many_ tasks are waiting. But there is no way to tell how _much_ delay is added by the semaphore. So, add a counter that aggregates the wall clock time seconds spent acquiring the semaphore. The metrics can be used as follows: * retroactively calculate average acquisition time in a given time range * compare the degree of background loop backlog among pageservers The metric is insufficient to calculate * run-up of ongoing acquisitions that haven't finished acquiring yet * Not easily feasible because ["Cancelling a call to acquire makes you lose your place in the queue"](https://docs.rs/tokio/latest/tokio/sync/struct.Semaphore.html#method.acquire) ## Summary of changes * Refactor the metrics to follow the current best practice for typed metrics in `metrics.rs`. * Add the new counter.	2024-08-21 10:55:01 +00:00
Alex Chi Z.	c8b9116a97	impr(pageserver): abort on fatal I/O writer error (#8777 ) part of https://github.com/neondatabase/neon/issues/8140 The blob writer path now uses `maybe_fatal_err` Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-20 20:05:33 +01:00
John Spray	beefc7a810	pageserver: add metric pageserver_secondary_heatmap_total_size (#8768 ) ## Problem We don't have a convenient way for a human to ask "how far are secondary downloads along for this tenant". This is useful when driving migrations of tenants to the storage controller, as we first create a secondary location and want to see it warm up before we cut over. That can already be done via storcon_cli, but we would like a way that doesn't require direct API access. ## Summary of changes Add a metric that reports to total size of layers in the heatmap: this may be used in conjunction with the existing `pageserver_secondary_resident_physical_size` to estimate "warmth" of the secondary location.	2024-08-20 19:47:42 +01:00
Alexander Bayandin	c96593b473	Make Postgres 16 default version (#8745 ) ## Problem The default Postgres version is set to 15 in code, while we use 16 in most of the other places (and Postgres 17 is coming) ## Summary of changes - Run `benchmarks` job with Postgres 16 (instead of Postgres 14) - Set `DEFAULT_PG_VERSION` to 16 in all places - Remove deprecated `--pg-version` pytest argument - Update `test_metadata_bincode_serde_ensure_roundtrip` for Postgres 16	2024-08-20 10:46:58 +01:00
Christian Schwarz	ef57e73fbf	task_mgr::spawn: require a `TenantId` (#8462 ) … to dis-incentivize global tasks via task_mgr in the future (As of https://github.com/neondatabase/neon/pull/8339 all remaining task_mgr usage is tenant or timeline scoped.)	2024-08-20 08:26:44 +00:00
Christian Schwarz	eb7241c798	l0_flush: remove support for mode `page-cached` (#8739 ) It's been rolled out everywhere, no configs are referencing it. All code that's made dead by the removal of the config option is removed as part of this PR. The `page_caching::PreWarmingWriter` in `::No` mode is equivalent to a `size_tracking_writer`, so, use that. part of https://github.com/neondatabase/neon/issues/7418	2024-08-19 16:35:34 +02:00
Arpad Müller	188bde7f07	Default image compression to zstd at level 1 (#8677 ) After the rollout has succeeded, we now set the default image compression to be enabled. We also remove its explicit mention from `neon_fixtures.py` added in #8368 as it is now the default (and we switch to `zstd(1)` which is a bit nicer on CPU time). Part of https://github.com/neondatabase/neon/issues/5431	2024-08-18 18:32:10 +01:00
John Spray	e2d89f7991	pageserver: prioritize secondary downloads to get most recent layers first, except l0s (#8729 ) ## Problem When a secondary location is trying to catch up while a tenant is receiving new writes, it can become quite wasteful: - Downloading L0s which are soon destroyed by compaction to L1s - Downloading older layer files which are soon made irrelevant when covered by image layers. ## Summary of changes Sort the layer files in the heatmap: - L0 layers are the lowest priority - Other layers are sorted to download the highest LSNs first.	2024-08-16 14:35:02 +02:00
Joonas Koivunen	4763a960d1	chore: log if we have an open layer or any frozen on shutdown (#8740 ) Some benchmarks are failing with a "long" flushing, which might be because there is a queue of in-memory layers, or something else. Add logging to narrow it down. Private slack DM ref: https://neondb.slack.com/archives/D049K7HJ9JM/p1723727305238099	2024-08-16 06:10:05 +01:00
Alexander Bayandin	a9c28be7d0	fix(pageserver): allow unused_imports in download.rs on macOS (#8733 ) ## Problem On macOS, clippy fails with the following error: ``` error: unused import: `crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt` --> pageserver/src/tenant/remote_timeline_client/download.rs:26:5 \| 26 \| use crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt; \| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ \| = note: `-D unused-imports` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(unused_imports)]` ``` Introduced in https://github.com/neondatabase/neon/pull/8717 ## Summary of changes - allow `unused_imports` for `crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt` on macOS in download.rs	2024-08-15 10:06:28 +01:00
Christian Schwarz	168913bdf0	refactor(write path): newtype to enforce use of fully initialized slices (#8717 ) The `tokio_epoll_uring::Slice` / `tokio_uring::Slice` type is weird. The new `FullSlice` newtype is better. See the doc comment for details. The naming is not ideal, but we'll clean that up in a future refactoring where we move the `FullSlice` into `tokio_epoll_uring`. Then, we'll do the following: * tokio_epoll_uring::Slice is removed * `FullSlice` becomes `tokio_epoll_uring::IoBufView` * new type `tokio_epoll_uring::IoBufMutView` for the current `tokio_epoll_uring::Slice<IoBufMut>` Context ------- I did this work in preparation for https://github.com/neondatabase/neon/pull/8537. There, I'm changing the type that the `inmemory_layer.rs` passes to `DeltaLayerWriter::put_value_bytes` and thus it seemed like a good opportunity to make this cleanup first.	2024-08-14 21:57:17 +02:00
Joonas Koivunen	60fc1e8cc8	chore: even more responsive compaction cancellation (#8725 ) Some benchmarks and tests might still fail because of #8655 (tracked in #8708) because we are not fast enough to shut down ([one evidence]). Partially this is explained by the current validation mode of streaming k-merge, but otherwise because that is where we use a lot of time in compaction. Outside of L0 => L1 compaction, the image layer generation is already guarded by vectored reads doing cancellation checks. 32768 is a wild guess based on looking how many keys we put in each layer in a bench (1-2 million), but I assume it will be good enough divisor. Doing checks more often will start showing up as contention which we cannot currently measure. Doing checks less often might be reasonable. [one evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10384136483/index.html#suites/9681106e61a1222669b9d22ab136d07b/96e6d53af234924/ Earlier PR: #8706.	2024-08-14 14:48:15 +01:00
Joonas Koivunen	6c9e3c9551	refactor: error/anyhow::Error wrapping (#8697 ) We can get CompactionError::Other(Cancelled) via the error handling with a few ways. [evidence](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8655/10301613380/index.html#suites/cae012a1e6acdd9fdd8b81541972b6ce/653a33de17802bb1/). Hopefully fix it by: 1. replace the `map_err` which hid the `GetReadyAncestorError::Cancelled` with `From<GetReadyAncestorError> for GetVectoredError` conversion 2. simplifying the code in pgdatadir_mapping to eliminate the token anyhow wrapping for deserialization errors 3. stop wrapping GetVectoredError as anyhow errors 4. stop wrapping PageReconstructError as anyhow errors Additionally, produce warnings if we treat any other error (as was legal before this PR) as missing key. Cc: #8708.	2024-08-14 12:45:56 +01:00
John Spray	19d69d515c	pageserver: evict covered layers earlier (#8679 ) ## Problem When pageservers do compaction, they frequently create image layers that make earlier layers un-needed for reads, but then keep those earlier layers around for 24 hours waiting for time-based eviction to expire them. Now that we track layer visibility, we can use it as an input to eviction, and avoid the 24 hour "disk bump" that happens around pageserver restarts. ## Summary of changes - During time-based eviction, if a layer is marked Covered, use the eviction period as the threshold: i.e. these layers get to remain resident for at least one iteration of the eviction loop, but then get evicted. With current settings this means they get evicted after 1h instead of 24h. - During disk usage eviction, prioritized evicting covered layers above all other layers. Caveats: - Using the period as the threshold for time based eviction in this case is a bit of a hack, but it avoids adding yet another configuration property, and in any case the value of a new property would be somewhat arbitrary: there's no "right" length of time to keep covered layers around just in case. - We had previously planned on removing time-based eviction: this change would motivate us to keep it around, but we can still simplify the code later to just do the eviction of covered layers, rather than applying a TTL policy to all layers.	2024-08-14 12:10:15 +01:00
Joonas Koivunen	485d76ac62	timeline_detach_ancestor: adjust error handling (#8528 ) With additional phases from #8430 the `detach_ancestor::Error` became untenable. Split it up into phases, and introduce laundering for remaining `anyhow::Error` to propagate them as most often `Error::ShuttingDown`. Additionally, complete FIXMEs. Cc: #6994	2024-08-14 10:16:18 +01:00
John Spray	4049d2b7e1	scrubber: fix spurious "Missed some shards" errors (#8661 ) ## Problem The storage scrubber was reporting warnings for lots of timelines like: ``` WARN Missed some shards at count ShardCount(0) tenant_id=25eb7a83d9a2f90ac0b765b6ca84cf4c ``` These were spurious: these tenants are fine. There was a bug in accumulating the ShardIndex for each tenant, whereby multiple timelines would lead us to add the same ShardIndex more than one. Closes: #8646 ## Summary of changes - Accumulate ShardIndex in a BTreeSet instead of a Vec - Extend the test to reproduce the issue	2024-08-14 09:29:06 +01:00
Joonas Koivunen	6d6e2c6a39	feat(detach_ancestor): better retries with persistent gc blocking (#8430 ) With the persistent gc blocking, we can now retry reparenting timelines which had failed for whatever reason on the previous attempt(s). Restructure the detach_ancestor into three phases: - prepare (insert persistent gc blocking, copy lsn prefix, layers) - detach and reparent - reparenting can fail, so we might need to retry this portion - complete (remove persistent gc blocking) Cc: #6994	2024-08-13 18:51:51 +01:00
Joonas Koivunen	8f170c5105	fix: make compaction more sensitive to cancellation (#8706 ) A few of the benchmarks have started failing after #8655 where they are waiting for compactor task. Reads done by image layer creation should already be cancellation sensitive because vectored get does a check each time, but try sprinkling additional cancellation points to: - each partition - after each vectored read batch	2024-08-13 18:00:54 +01:00
John Spray	ecb01834d6	pageserver: implement utilization score (#8703 ) ## Problem When the utilization API was added, it was just a stub with disk space information. Disk space information isn't a very good metric for assigning tenants to pageservers, because pageservers making full use of their disks would always just have 85% utilization, irrespective of how much pressure they had for disk space. ## Summary of changes - Use the new layer visibiilty metric to calculate a "wanted size" per tenant, and sum these to get a total local disk space wanted per pageserver. This acts as the primary signal for utilization. - Also use the shard count to calculate a utilization score, and take the max of this and the disk-driven utilization. The shard count limit is currently set as a constant 20,000, which matches contemporary operational practices when loading pageservers. The shard count limit means that for tiny/empty tenants, on a machine with 3.84TB disk, each tiny tenant influences the utilization score as if it had size 160MB.	2024-08-13 15:15:55 +01:00
Vlad Lazar	b9d2c7bdd5	pageserver: remove vectored get related configs (#8695 ) ## Problem Pageserver exposes some vectored get related configs which are not in use. ## Summary of changes Remove the following pageserver configs: get_impl, get_vectored_impl, and `validate_get_vectored`. They are not used in the pageserver since https://github.com/neondatabase/neon/pull/8601. Manual overrides have been removed from the aws repo in https://github.com/neondatabase/aws/pull/1664.	2024-08-13 12:45:54 +01:00
John Spray	3379cbcaa4	pageserver: add CompactKey, use it in InMemoryLayer (#8652 ) ## Problem This follows a PR that insists all input keys are representable in 16 bytes: - https://github.com/neondatabase/neon/pull/8648 & a PR that prevents postgres from sending us keys that use the high bits of field2: - https://github.com/neondatabase/neon/pull/8657 Motivation for this change: 1. Ingest is bottlenecked on CPU 2. InMemoryLayer can create huge (~1M value) BTreeMap<Key,_> for its index. 3. Maps over i128 are much faster than maps over an arbitrary 18 byte struct. It may still be worthwhile to make the index two-tier to optimize for the case where only the last 4 bytes (blkno) of the key vary frequently, but simply using the i128 representation of keys has a big impact for very little effort. Related: #8452 ## Summary of changes - Introduce `CompactKey` type which contains an i128 - Use this instead of Key in InMemoryLayer's index, converting back and forth as needed. ## Performance All the small-value `bench_ingest` cases show improved throughput. The one that exercises this index most directly shows a 35% throughput increase: ``` ingest-small-values/ingest 128MB/100b seq, no delta time: [374.29 ms 378.56 ms 383.38 ms] thrpt: [333.88 MiB/s 338.13 MiB/s 341.98 MiB/s] change: time: [-26.993% -26.117% -25.111%] (p = 0.00 < 0.05) thrpt: [+33.531% +35.349% +36.974%] Performance has improved. ```	2024-08-13 11:48:23 +01:00
Christian Schwarz	ce0d0a204c	fix(walredo): shutdown can complete too early (#8701 ) Problem ------- The following race is possible today: ``` walredo_extraordinary_shutdown_thread: shutdown gets until Poll::Pending of self.launched_processes.close().await call other thread: drops the last Arc<Process> = 1. drop(_launched_processes_guard) runs, this ... walredo_extraordinary_shutdown_thread: ... wakes self.launched_processes.close().await walredo_extraordinary_shutdown_thread: logs `done` other thread: = 2. drop(process): this kill & waits ``` Solution -------- Change drop order so that `process` gets dropped first. Context ------- https://neondb.slack.com/archives/C06Q661FA4C/p1723478188785719?thread_ts=1723456706.465789&cid=C06Q661FA4C refs https://github.com/neondatabase/neon/pull/8572 refs https://github.com/neondatabase/cloud/issues/11387	2024-08-12 18:15:48 +01:00
Joonas Koivunen	9dc9a9b2e9	test: do graceful shutdown by default (#8655 ) It should give us all possible allowed_errors more consistently. While getting the workflows to pass on https://github.com/neondatabase/neon/pull/8632 it was noticed that allowed_errors are rarely hit (1/4). This made me realize that we always do an immediate stop by default. Doing a graceful shutdown would had made the draining more apparent and likely we would not have needed the #8632 hotfix. Downside of doing this is that we will see more timeouts if tests are randomly leaving pause failpoints which fail the shutdown. The net outcome should however be positive, we could even detect too slow shutdowns caused by a bug or deadlock.	2024-08-12 15:37:15 +03:00
Vlad Lazar	f5cef7bf7f	storcon: skip draining shard if it's secondary is lagging too much (#8644 ) ## Problem Migrations of tenant shards with cold secondaries are holding up drains in during production deployments. ## Summary of changes If a secondary locations is lagging by more than 256MiB (configurable, but that's the default), then skip cutting it over to the secondary as part of the node drain.	2024-08-09 15:45:07 +01:00
John Spray	e6770d79fd	pageserver: don't treat NotInitialized::Stopped as unexpected (#8675 ) ## Problem This type of error can happen during shutdown & was triggering a circuit breaker alert. ## Summary of changes - Map NotIntialized::Stopped to CompactionError::ShuttingDown, so that we may handle it cleanly	2024-08-09 14:01:56 +01:00
John Spray	953b7d4f7e	pageserver: remove paranoia double-calculation of retain_lsns (#8617 ) ## Problem This code was to mitigate risk in https://github.com/neondatabase/neon/pull/8427 As expected, we did not hit this code path - the new continuous updates of gc_info are working fine, we can remove this code now. ## Summary of changes - Remove block that double-checks retain_lsns	2024-08-08 12:57:48 +01:00
Joonas Koivunen	8561b2c628	fix: stop leaking BackgroundPurges (#8650 ) avoid "leaking" the completions of BackgroundPurges by: 1. switching it to TaskTracker for provided close+wait 2. stop using tokio::fs::remove_dir_all which will consume two units of memory instead of one blocking task Additionally, use more graceful shutdown in tests which do actually some background cleanup.	2024-08-08 12:02:53 +01:00
Joonas Koivunen	21638ee96c	fix(test): do not fail test for filesystem race (#8643 ) evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8632/10287641784/index.html#suites/0e58fb04d9998963e98e45fe1880af7d/c7a46335515142b/	2024-08-08 10:34:47 +01:00
John Spray	cf3eac785b	pageserver: make bench_ingest build (but panic) on macOS (#8641 ) ## Problem Some developers build on MacOS, which doesn't have io_uring. ## Summary of changes - Add `io_engine_for_bench`, which on linux will give io_uring or panic if it's unavailable, and on MacOS will always panic. We do not want to run such benchmarks with StdFs: the results aren't interesting, and will actively waste the time of any developers who start investigating performance before they realize they're using a known-slow I/O backend. Why not just conditionally compile this benchmark on linux only? Because even on linux, I still want it to refuse to run if it can't get io_uring.	2024-08-07 21:17:08 +01:00
Yuchen Liang	542385e364	feat(pageserver): add direct io pageserver config (#8622 ) Part of #8130, [RFC: Direct IO For Pageserver](https://github.com/neondatabase/neon/blob/problame/direct-io-rfc/docs/rfcs/034-direct-io-for-pageserver.md) ## Description Add pageserver config for evaluating/enabling direct I/O. - Disabled: current default, uses buffered io as is. - Evaluate: still uses buffered io, but could do alignment checking and perf simulation (pad latency by direct io RW to a fake file). - Enabled: uses direct io, behavior on alignment error is configurable. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-08-07 21:04:19 +01:00
Joonas Koivunen	05dd1ae9e0	fix: drain completed page_service connections (#8632 ) We've noticed increased memory usage with the latest release. Drain the joinset of `page_service` connection handlers to avoid leaking them until shutdown. An alternative would be to use a TaskTracker. TaskTracker was not discussed in original PR #8339 review, so not hot fixing it in here either.	2024-08-07 17:14:45 +00:00
Joonas Koivunen	a81fab4826	refactor(timeline_detach_ancestor): replace ordered reparented with a hashset (#8629 ) Earlier I was thinking we'd need a (ancestor_lsn, timeline_id) ordered list of reparented. Turns out we did not need it at all. Replace it with an unordered hashset. Additionally refactor the reparented direct children query out, it will later be used from more places. Split off from #8430. Cc: #6994	2024-08-07 18:19:00 +02:00
Joonas Koivunen	fc78774f39	fix: EphemeralFiles can outlive their Timeline via `enum LayerManager` (#8229 ) Ephemeral files cleanup on drop but did not delay shutdown, leading to problems with restarting the tenant. The solution is as proposed: - make ephemeral files carry the gate guard to delay `Timeline::gate` closing - flush in-memory layers and strong references to those on `Timeline::shutdown` The above are realized by making LayerManager an `enum` with `Open` and `Closed` variants, and fail requests to modify `LayerMap`. Additionally: - fix too eager anyhow conversions in compaction - unify how we freeze layers and handle errors - optimize likely_resident_layers to read LayerFileManager hashmap values instead of bouncing through LayerMap Fixes: #7830	2024-08-07 17:50:09 +03:00
Arpad Müller	4d7c0dac93	Add missing colon to ArchivalConfigRequest specification (#8627 ) Add a missing colon to the API specification of `ArchivalConfigRequest`. The `state` field is required. Pointed out by Gleb.	2024-08-07 14:53:52 +02:00

... 3 4 5 6 7 ...

2583 Commits