rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-16 18:02:56 +00:00

Author	SHA1	Message	Date
Christian Schwarz	850421ec06	refactor(pageserver): rely on serde derive for toml deserialization (#7656 ) This PR simplifies the pageserver configuration parsing as follows: * introduce the `pageserver_api::config::ConfigToml` type * implement `Default` for `ConfigToml` * use serde derive to do the brain-dead leg-work of processing the toml document * use `serde(default)` to fill in default values * in `pageserver` crate: * use `toml_edit` to deserialize the pageserver.toml string into a `ConfigToml` * `PageServerConfig::parse_and_validate` then * consumes the `ConfigToml` * destructures it exhaustively into its constituent fields * constructs the `PageServerConfig` The rules are: * in `ConfigToml`, use `deny_unknown_fields` everywhere * static default values go in `pageserver_api` * if there cannot be a static default value (e.g. which default IO engine to use, because it depends on the runtime), make the field in `ConfigToml` an `Option` * if runtime-augmentation of a value is needed, do that in `parse_and_validate` * a good example is `virtual_file_io_engine` or `l0_flush`, both of which need to execute code to determine the effective value in `PageServerConf` The benefits: * massive amount of brain-dead repetitive code can be deleted * "unused variable" compile-time errors when removing a config value, due to the exhaustive destructuring in `parse_and_validate` * compile-time errors guide you when adding a new config field Drawbacks: * serde derive is sometimes a bit too magical * `deny_unknown_fields` is easy to miss Future Work / Benefits: * make `neon_local` use `pageserver_api` to construct `ConfigToml` and write it to `pageserver.toml` * This provides more type safety / coompile-time errors than the current approach. ### Refs Fixes #3682 ### Future Work * `remote_storage` deser doesn't reject unknown fields https://github.com/neondatabase/neon/issues/8915 * clean up `libs/pageserver_api/src/config.rs` further * break up into multiple files, at least for tenant config * move `models` as appropriate / refine distinction between config and API models / be explicit about when it's the same * use `pub(crate)` visibility on `mod defaults` to detect stale values	2024-09-05 14:59:49 +02:00
Christian Schwarz	9627747d35	bypass `PageCache` for `InMemoryLayer` + avoid `Value::deser` on L0 flush (#8537 ) Part of [Epic: Bypass PageCache for user data blocks](https://github.com/neondatabase/neon/issues/7386). # Problem `InMemoryLayer` still uses the `PageCache` for all data stored in the `VirtualFile` that underlies the `EphemeralFile`. # Background Before this PR, `EphemeralFile` is a fancy and (code-bloated) buffered writer around a `VirtualFile` that supports `blob_io`. The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`. At those offset, we find a varint length followed by the serialized `Value`. Vectored reads (`get_values_reconstruct_data`) are not in fact vectored - each `Value` that needs to be read is read sequentially. The `will_init` bit of information which we use to early-exit the `get_values_reconstruct_data` for a given key is stored in the serialized `Value`, meaning we have to read & deserialize the `Value` from the `EphemeralFile`. The L0 flushing also needs to re-determine the `will_init` bit of information, by deserializing each value during L0 flush. # Changes 1. Store the value length and `will_init` information in the `InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the values. 2. For `get_values_reconstruct_data`: - Use the in-memory `index` figures out which values need to be read. Having the `will_init` stored in the index enables us to do that. - View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes in size (adjustable constant). A "DIO chunk" is the minimal unit that we can read under direct IO. - Figure out which chunks need to be read to retrieve the serialized bytes for thes values we need to read. - Coalesce chunk reads such that each DIO chunk is only read once to serve all value reads that need data from that chunk. - Merge adjacent chunk reads into larger `EphemeralFile::read_exact_at_eof_ok` of up to 128k (adjustable constant). 3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer from the underlying VirtualFile and/or its in-memory buffer. 4. The L0 flush code is changed to use the `index` directly, `blob_io` 5. We can remove the `ephemeral_file::page_caching` construct now. The `get_values_reconstruct_data` changes seem like a bit overkill but they are necessary so we issue the equivalent amount of read system calls compared to before this PR where it was highly likely that even if the first PageCache access was a miss, remaining reads within the same `get_values_reconstruct_data` call from the same `EphemeralFile` page were a hit. The "DIO chunk" stuff is truly unnecessary for page cache bypass, but, since we're working on [direct IO](https://github.com/neondatabase/neon/issues/8130) and https://github.com/neondatabase/neon/issues/8719 specifically, we need to do _something_ like this anyways in the near future. # Alternative Design The original plan was to use the `vectored_blob_io` code it relies on the invariant of Delta&Image layers that `index order == values order`. Further, `vectored_blob_io` code's strategy for merging IOs is limited to adjacent reads. However, with direct IO, there is another level of merging that should be done, specifically, if multiple reads map to the same "DIO chunk" (=alignment-requirement-sized and -aligned region of the file), then it's "free" to read the chunk into an IO buffer and serve the two reads from that buffer. => https://github.com/neondatabase/neon/issues/8719 # Testing / Performance Correctness of the IO merging code is ensured by unit tests. Additionally, minimal tests are added for the `EphemeralFile` implementation and the bit-packed `InMemoryLayerIndexValue`. Performance testing results are presented below. All pref testing done on my M2 MacBook Pro, running a Linux VM. It's a release build without `--features testing`. We see definitive improvement in ingest performance microbenchmark and an ad-hoc microbenchmark for getpage against InMemoryLayer. ``` baseline: commit `7c74112b2a` origin/main HEAD: `ef1c55c52e` ``` <details> ``` cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta' baseline ingest-small-values/ingest 128MB/100b seq, no delta time: [483.50 ms 498.73 ms 522.53 ms] thrpt: [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s] HEAD ingest-small-values/ingest 128MB/100b seq, no delta time: [479.22 ms 482.92 ms 487.35 ms] thrpt: [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s] ``` </details> We don't have a micro-benchmark for InMemoryLayer and it's quite cumbersome to add one. So, I did manual testing in `neon_local`. <details> ``` ./target/release/neon_local stop rm -rf .neon ./target/release/neon_local init ./target/release/neon_local start ./target/release/neon_local tenant create --set-default ./target/release/neon_local endpoint create foo ./target/release/neon_local endpoint start foo psql 'postgresql://cloud_admin@127.0.0.1:55432/postgres' psql (13.16 (Debian 13.16-0+deb11u1), server 15.7) CREATE TABLE wal_test ( id SERIAL PRIMARY KEY, data TEXT ); DO $$ DECLARE i INTEGER := 1; BEGIN WHILE i <= 500000 LOOP INSERT INTO wal_test (data) VALUES ('data'); i := i + 1; END LOOP; END $$; -- => result is one L0 from initdb and one 137M-sized ephemeral-2 DO $$ DECLARE i INTEGER := 1; random_id INTEGER; random_record wal_test%ROWTYPE; start_time TIMESTAMP := clock_timestamp(); selects_completed INTEGER := 0; min_id INTEGER := 1; -- Minimum ID value max_id INTEGER := 100000; -- Maximum ID value, based on your insert range iters INTEGER := 100000000; -- Number of iterations to run BEGIN WHILE i <= iters LOOP -- Generate a random ID within the known range random_id := min_id + floor(random() * (max_id - min_id + 1))::int; -- Select the row with the generated random ID SELECT * INTO random_record FROM wal_test WHERE id = random_id; -- Increment the select counter selects_completed := selects_completed + 1; -- Check if a second has passed IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN -- Print the number of selects completed in the last second RAISE NOTICE 'Selects completed in last second: %', selects_completed; -- Reset counters for the next second selects_completed := 0; start_time := clock_timestamp(); END IF; -- Increment the loop counter i := i + 1; END LOOP; END $$; ./target/release/neon_local stop baseline: commit `7c74112b2a` origin/main NOTICE: Selects completed in last second: 1864 NOTICE: Selects completed in last second: 1850 NOTICE: Selects completed in last second: 1851 NOTICE: Selects completed in last second: 1918 NOTICE: Selects completed in last second: 1911 NOTICE: Selects completed in last second: 1879 NOTICE: Selects completed in last second: 1858 NOTICE: Selects completed in last second: 1827 NOTICE: Selects completed in last second: 1933 ours NOTICE: Selects completed in last second: 1915 NOTICE: Selects completed in last second: 1928 NOTICE: Selects completed in last second: 1913 NOTICE: Selects completed in last second: 1932 NOTICE: Selects completed in last second: 1846 NOTICE: Selects completed in last second: 1955 NOTICE: Selects completed in last second: 1991 NOTICE: Selects completed in last second: 1973 ``` NB: the ephemeral file sizes differ by ca 1MiB, ours being 1MiB smaller. </details> # Rollout This PR changes the code in-place and is not gated by a feature flag.	2024-08-28 18:31:41 +00:00
Yuchen Liang	a889a49e06	pageserver: do vectored read on each dio-aligned section once (#8763 ) Part of #8130, closes #8719. ## Problem Currently, vectored blob io only coalesce blocks if they are immediately adjacent to each other. When we switch to Direct IO, we need a way to coalesce blobs that are within the dio-aligned boundary but has gap between them. ## Summary of changes - Introduces a `VectoredReadCoalesceMode` for `VectoredReadPlanner` and `StreamingVectoredReadPlanner` which has two modes: - `AdjacentOnly` (current implementation) - `Chunked(<alignment requirement>)` - New `ChunkedVectorBuilder` that considers batching `dio-align`-sized read, the start and end of the vectored read will respect `stx_dio_offset_align` / `stx_dio_mem_align` (`vectored_read.start` and `vectored_read.blobs_at.first().start_offset` will be two different value). - Since we break the assumption that blobs within single `VectoredRead` are next to each other (implicit end offset), we start to store blob end offsets in the `VectoredRead`. - Adapted existing tests to run in both `VectoredReadCoalesceMode`. - The io alignment can also be live configured at runtime. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-08-28 15:54:42 +01:00
John Spray	7c74112b2a	pageserver: batch InMemoryLayer `put`s, remove need to sort items by LSN during ingest (#8591 ) ## Problem/Solution TimelineWriter::put_batch is simply a loop over individual puts. Each put acquires and releases locks, and checks for potentially starting a new layer. Batching these is more efficient, but more importantly unlocks future changes where we can pre-build serialized buffers much earlier in the ingest process, potentially even on the safekeeper (imagine a future model where some variant of DatadirModification lives on the safekeeper). Ensuring that the values in put_batch are written to one layer also enables a simplification upstream, where we no longer need to write values in LSN-order. This saves us a sort, but also simplifies follow-on refactors to DatadirModification: we can store metadata keys and data keys separately at that level without needing to zip them together in LSN order later. ## Why? In this PR, these changes are simplify optimizations, but they are motivated by evolving the ingest path in the direction of disentangling extracting DatadirModification from Timeline. It may not obvious how right now, but the general idea is that we'll end up with three phases of ingest: - A) Decode walrecords and build a datadirmodification with all the simple data contents already in a big serialized buffer ready to write to an ephemeral layer <-- this part can be pipelined and parallelized, and done on a safekeeper! - B) Let that datadirmodification see a Timeline, so that it can also generate all the metadata updates that require a read-modify-write of existing pages - C) Dump the results of B into an ephemeral layer. Related: https://github.com/neondatabase/neon/issues/8452 ## Caveats Doing a big monolithic buffer of values to write to disk is ordinarily an anti-pattern: we prefer nice streaming I/O. However: - In future, when we do this first decode stage on the safekeeper, it would be inefficient to serialize a Vec of Value, and then later deserialize it just to add blob size headers while writing into the ephemeral layer format. The idea is that for bulk write data, we will serialize exactly once. - The monolithic buffer is a stepping stone to pipelining more of this: by seriailizing earlier (rather than at the final put_value), we will be able to parallelize the wal decoding and bulk serialization of data page writes. - The ephemeral layer's buffered writer already stalls writes while it waits to flush: so while yes we'll stall for a couple milliseconds to write a couple megabytes, we already have stalls like this, just distributed across smaller writes. ## Benchmarks This PR is primarily a stepping stone to safekeeper ingest filtering, but also provides a modest efficiency improvement to the `wal_recovery` part of `test_bulk_ingest`. test_bulk_ingest: ``` test_bulk_insert[neon-release-pg16].insert: 23.659 s test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB test_bulk_insert[neon-release-pg16].peak_mem: 626 MB test_bulk_insert[neon-release-pg16].size: 0 MB test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB test_bulk_insert[neon-release-pg16].wal_recovery: 18.981 s test_bulk_insert[neon-release-pg16].compaction: 0.055 s vs. tip of main: test_bulk_insert[neon-release-pg16].insert: 24.001 s test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB test_bulk_insert[neon-release-pg16].peak_mem: 604 MB test_bulk_insert[neon-release-pg16].size: 0 MB test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB test_bulk_insert[neon-release-pg16].wal_recovery: 23.586 s test_bulk_insert[neon-release-pg16].compaction: 0.054 s ```	2024-08-22 10:04:42 +00:00
John Spray	3379cbcaa4	pageserver: add CompactKey, use it in InMemoryLayer (#8652 ) ## Problem This follows a PR that insists all input keys are representable in 16 bytes: - https://github.com/neondatabase/neon/pull/8648 & a PR that prevents postgres from sending us keys that use the high bits of field2: - https://github.com/neondatabase/neon/pull/8657 Motivation for this change: 1. Ingest is bottlenecked on CPU 2. InMemoryLayer can create huge (~1M value) BTreeMap<Key,_> for its index. 3. Maps over i128 are much faster than maps over an arbitrary 18 byte struct. It may still be worthwhile to make the index two-tier to optimize for the case where only the last 4 bytes (blkno) of the key vary frequently, but simply using the i128 representation of keys has a big impact for very little effort. Related: #8452 ## Summary of changes - Introduce `CompactKey` type which contains an i128 - Use this instead of Key in InMemoryLayer's index, converting back and forth as needed. ## Performance All the small-value `bench_ingest` cases show improved throughput. The one that exercises this index most directly shows a 35% throughput increase: ``` ingest-small-values/ingest 128MB/100b seq, no delta time: [374.29 ms 378.56 ms 383.38 ms] thrpt: [333.88 MiB/s 338.13 MiB/s 341.98 MiB/s] change: time: [-26.993% -26.117% -25.111%] (p = 0.00 < 0.05) thrpt: [+33.531% +35.349% +36.974%] Performance has improved. ```	2024-08-13 11:48:23 +01:00
John Spray	cf3eac785b	pageserver: make bench_ingest build (but panic) on macOS (#8641 ) ## Problem Some developers build on MacOS, which doesn't have io_uring. ## Summary of changes - Add `io_engine_for_bench`, which on linux will give io_uring or panic if it's unavailable, and on MacOS will always panic. We do not want to run such benchmarks with StdFs: the results aren't interesting, and will actively waste the time of any developers who start investigating performance before they realize they're using a known-slow I/O backend. Why not just conditionally compile this benchmark on linux only? Because even on linux, I still want it to refuse to run if it can't get io_uring.	2024-08-07 21:17:08 +01:00
Joonas Koivunen	fc78774f39	fix: EphemeralFiles can outlive their Timeline via `enum LayerManager` (#8229 ) Ephemeral files cleanup on drop but did not delay shutdown, leading to problems with restarting the tenant. The solution is as proposed: - make ephemeral files carry the gate guard to delay `Timeline::gate` closing - flush in-memory layers and strong references to those on `Timeline::shutdown` The above are realized by making LayerManager an `enum` with `Open` and `Closed` variants, and fail requests to modify `LayerMap`. Additionally: - fix too eager anyhow conversions in compaction - unify how we freeze layers and handle errors - optimize likely_resident_layers to read LayerFileManager hashmap values instead of bouncing through LayerMap Fixes: #7830	2024-08-07 17:50:09 +03:00
John Spray	ca5390a89d	pageserver: add `bench_ingest` (#7409 ) ## Problem We lack a rust bench for the inmemory layer and delta layer write paths: it is useful to benchmark these components independent of postgres & WAL decoding. Related: https://github.com/neondatabase/neon/issues/8452 ## Summary of changes - Refactor DeltaLayerWriter to avoid carrying a Timeline, so that it can be cleanly tested + benched without a Tenant/Timeline test harness. It only needed the Timeline for building `Layer`, so this can be done in a separate step. - Add `bench_ingest`, which exercises a variety of workload "shapes" (big values, small values, sequential keys, random keys) - Include a small uncontroversial optimization: in `freeze`, only exhaustively walk values to assert ordering relative to end_lsn in debug mode. These benches are limited by drive performance on a lot of machines, but still useful as a local tool for iterating on CPU/memory improvements around this code path. Anecdotal measurements on Hetzner AX102 (Ryzen 7950xd): ``` ingest-small-values/ingest 128MB/100b seq time: [1.1160 s 1.1230 s 1.1289 s] thrpt: [113.38 MiB/s 113.98 MiB/s 114.70 MiB/s] Found 1 outliers among 10 measurements (10.00%) 1 (10.00%) low mild Benchmarking ingest-small-values/ingest 128MB/100b rand: Warming up for 3.0000 s Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 18.9s. ingest-small-values/ingest 128MB/100b rand time: [1.9001 s 1.9056 s 1.9110 s] thrpt: [66.982 MiB/s 67.171 MiB/s 67.365 MiB/s] Benchmarking ingest-small-values/ingest 128MB/100b rand-1024keys: Warming up for 3.0000 s Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 11.0s. ingest-small-values/ingest 128MB/100b rand-1024keys time: [1.0715 s 1.0828 s 1.0937 s] thrpt: [117.04 MiB/s 118.21 MiB/s 119.46 MiB/s] ingest-small-values/ingest 128MB/100b seq, no delta time: [425.49 ms 429.07 ms 432.04 ms] thrpt: [296.27 MiB/s 298.32 MiB/s 300.83 MiB/s] Found 1 outliers among 10 measurements (10.00%) 1 (10.00%) low mild ingest-big-values/ingest 128MB/8k seq time: [373.03 ms 375.84 ms 379.17 ms] thrpt: [337.58 MiB/s 340.57 MiB/s 343.13 MiB/s] Found 1 outliers among 10 measurements (10.00%) 1 (10.00%) high mild ingest-big-values/ingest 128MB/8k seq, no delta time: [81.534 ms 82.811 ms 83.364 ms] thrpt: [1.4994 GiB/s 1.5095 GiB/s 1.5331 GiB/s] Found 1 outliers among 10 measurements (10.00%) ```	2024-08-06 16:39:40 +00:00

8 Commits