rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-22 21:59:59 +00:00

Author	SHA1	Message	Date
Folke Behrens	9c0efba91e	Bump rand crate to 0.9 (#12674 )	2025-07-22 09:31:39 +00:00
Heikki Linnakangas	1950ccfe33	Eliminate dependency from pageserver_api to postgres_ffi (#12273 ) Introduce a separate `postgres_ffi_types` crate which contains a few types and functions that were used in the API. `postgres_ffi_types` is a much smaller crate than `postgres_ffi`, and it doesn't depend on bindgen or the Postgres C headers. Move NeonWalRecord and Value types to wal_decoder crate. They are only used in the pageserver-safekeeper "ingest" API. The rest of the ingest API types are defined in wal_decoder, so move these there as well.	2025-06-19 10:31:27 +00:00
Christian Schwarz	a7ce323949	benchmarking: extend `test_page_service_batching.py` to cover concurrent IO + batching under random reads (#10466 ) This PR commits the benchmarks I ran to qualify concurrent IO before we released it. Changes: - Add `l0stack` fixture; a reusable abstraction for creating a stack of L0 deltas each of which has 1 Value::Delta per page. - Such a stack of L0 deltas is a good and understandable demo for concurrent IO because to reconstruct any page, `$layer_stack_height` Values need to be read. Before concurrent IO, the reads were sequential. With concurrent IO, they are executed concurrently. - So, switch `test_latency` to use the l0stack. - Teach `pagebench`, which is used by `test_latency`, to limit itself to the blocks of the relation created by the l0stack abstraction. - Additional parametrization of `test_latency` over dimensions `ps_io_concurrency,l0_stack_height,queue_depth` - Use better names for the tests to reflect what they do, leave interpretation of the (now quite high-dimensional) results to the reader - `test_{throughput => postgres_seqscan}` - `test_{latency => random_reads}` - Cut down on permutations to those we use in production. Runtime is about 2min. Refs - concurrent IO epic https://github.com/neondatabase/neon/issues/9378 - batching task: fixes https://github.com/neondatabase/neon/issues/9837 --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech>	2025-05-15 17:48:13 +00:00
Arpad Müller	88f01c1ca1	Introduce WalIngestError (#11506 ) Introduces a `WalIngestError` struct together with a `WalIngestErrorKind` enum, to be used for walingest related failures and errors. * the enum captures backtraces, so we don't regress in comparison to `anyhow::Error`s (backtraces might be a bit shorter if we use one of the `anyhow::Error` wrappers) * it explicitly lists most/all of the potential cases that can occur. I've originally been inspired to do this in #11496, but it's a longer-term TODO.	2025-04-11 14:08:46 +00:00
Arpad Müller	920040e402	Update storage components to edition 2024 (#10919 ) Updates storage components to edition 2024. We like to stay on the latest edition if possible. There are no functional changes, however some code changes had to be done to accommodate the edition's breaking changes. The PR has two commits: * the first commit updates storage crates to edition 2024 and appeases `cargo clippy` by changing code. I accidentally ran the formatter on some files that had other edits. * the second commit performs a `cargo fmt` I would recommend a closer review of the first commit and a less close review of the second one (as it just runs `cargo fmt`). part of https://github.com/neondatabase/neon/issues/10918	2025-02-25 23:51:37 +00:00
Alex Chi Z.	ae091c6913	feat(pageserver): store reldir in sparse keyspace (#10593 ) ## Problem Part of https://github.com/neondatabase/neon/issues/9516 ## Summary of changes This patch adds the support for storing reldir in the sparse keyspace. All logic is guarded with the `rel_size_v2_enabled` flag, so if it's set to false, the code path is exactly the same as what's currently in prod. Note that we did not persist the `rel_size_v2_enabled` flag and the logic around it will be implemented in the next patch. (i.e., what if we enabled it, restart the pageserver, and then it gets set to false? we should still read from v2 using the rel_size_v2_migration_status in the index_part). The persistence logic I'll implement in the next patch will disallow switching from v2->v1 via config item. I also refactored the metrics so that it can work with the new reldir store. However, this metric is not correctly computed for reldirs (see the comments) before. With the refactor, the value will be computed only when we have an initial value for the reldir size. The refactor keeps the incorrectness of the computation when there is more than one database. For the tests, we currently run all the tests with v2, and I'll set it to false and add some v2-specific tests before merging, probably also v1->v2 migration tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-14 20:31:54 +00:00
John Spray	fb0e2acb2f	pageserver: add `page_trace` API for debugging (#10293 ) ## Problem When a pageserver is receiving high rates of requests, we don't have a good way to efficiently discover what the client's access pattern is. Closes: https://github.com/neondatabase/neon/issues/10275 ## Summary of changes - Add `/v1/tenant/x/timeline/y/page_trace?size_limit_bytes=...&time_limit_secs=...` API, which returns a binary buffer. - Add `pagectl page-trace` tool to decode and analyze the output. --------- Co-authored-by: Erik Grinaker <erik@neon.tech>	2025-01-15 19:07:22 +00:00
Alex Chi Z.	e9ed53b14f	feat(pageserver): support inherited sparse keyspace (#10313 ) ## Problem In preparation to https://github.com/neondatabase/neon/issues/9516. We need to store rel size and directory data in the sparse keyspace, but it does not support inheritance yet. ## Summary of changes Add a new type of keyspace "sparse but inherited" into the system. On the read path: we don't remove the key range when we descend into the ancestor. The search will stop when (1) the full key range is covered by image layers (which has already been implemented before), or (2) we reach the end of the ancestor chain. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-13 15:43:01 +00:00
John Spray	ebcbc1a482	pageserver: tighten up code around SLRU dir key handling (#10082 ) ## Problem Changes in #9786 were functionally complete but missed some edges that made testing less robust than it should have been: - `is_key_disposable` didn't consider SLRU dir keys disposable - Timeline `init_empty` was always creating SLRU dir keys on all shards The result was that when we had a bug (https://github.com/neondatabase/neon/pull/10080), it wasn't apparent in tests, because one would only encounter the issue if running on a long-lived timeline with enough compaction to drop the initially created empty SLRU dir keys, _and_ some CLog truncation going on. Closes: https://github.com/neondatabase/cloud/issues/21516 ## Summary of changes - Update is_key_global and init_empty to handle SLRU dir keys properly -- the only functional impact is that we avoid writing some spurious keys in shards >0, but this makes testing much more robust. - Make `test_clog_truncate` explicitly use a sharded tenant The net result is that if one reverts #10080, then tests fail (i.e. this PR is a reproducer for the issue)	2024-12-16 10:06:08 +00:00
Vlad Lazar	665369c439	wal_decoder: fix compact key protobuf encoding (#10074 ) ## Problem Protobuf doesn't support 128 bit integers, so we encode the keys as two 64 bit integers. Issue is that when we split the 128 bit compact key we use signed 64 bit integers to represent the two halves. This may result in a negative lower half when relnode is larger than `0x00800000`. When we convert the lower half to an i128 we get a negative `CompactKey`. ## Summary of Changes Use unsigned integers when encoding into Protobuf. ## Deployment * Prod: We disabled the interpreted proto, so no compat concerns. * Staging: Disable the interpreted proto, do one release, and then release the fixed version. We do this because a negative int32 will convert to a large uint32 value and could give a key in the actual pageserver space. In production we would work around this by adding new fields to the proto and deprecating the old ones, but we can make our lives easy here. * Pre-prod: Same as staging	2024-12-11 12:35:02 +00:00
John Spray	dcb629532b	pageserver: only store SLRUs & aux files on shard zero (#9786 ) ## Problem Since https://github.com/neondatabase/neon/pull/9423 the non-zero shards no longer need SLRU content in order to do GC. This data is now redundant on shards >0. One release cycle after merging that PR, we may merge this one, which also stops writing those pages to shards > 0, reaping the efficiency benefit. Closes: https://github.com/neondatabase/neon/issues/7512 Closes: https://github.com/neondatabase/neon/issues/9641 ## Summary of changes - Avoid storing SLRUs on non-zero shards - Bonus: avoid storing aux files on non-zero shards	2024-12-03 17:22:49 +00:00
Vlad Lazar	9e0148de11	safekeeper: use protobuf for sending compressed records to pageserver (#9821 ) ## Problem https://github.com/neondatabase/neon/pull/9746 lifted decoding and interpretation of WAL to the safekeeper. This reduced the ingested amount on the pageservers by around 10x for a tenant with 8 shards, but doubled the ingested amount for single sharded tenants. Also, https://github.com/neondatabase/neon/pull/9746 uses bincode which doesn't support schema evolution. Technically the schema can be evolved, but it's very cumbersome. ## Summary of changes This patch set addresses both problems by adding protobuf support for the interpreted wal records and adding compression support. Compressed protobuf reduced the ingested amount by 100x on the 32 shards `test_sharded_ingest` case (compared to non-interpreted proto). For the 1 shard case the reduction is 5x. Sister change to `rust-postgres` is [here](https://github.com/neondatabase/rust-postgres/pull/33). ## Links Related: https://github.com/neondatabase/neon/issues/9336 Epic: https://github.com/neondatabase/neon/issues/9329	2024-11-27 12:12:21 +00:00
Vlad Lazar	2af791ba83	wal_decoder: make InterpretedWalRecord serde (#9775 ) ## Problem We want to serialize interpreted records to send them over the wire from safekeeper to pageserver. ## Summary of changes Make `InterpretedWalRecord` ser/de. This is a temporary change to get the bulk of the lift merged in https://github.com/neondatabase/neon/pull/9746. For going to prod, we don't want to use bincode since we can't evolve the schema. Questions on serialization will be tackled separately.	2024-11-15 20:34:48 +00:00
Vlad Lazar	552fa2b972	pageserver: tweak oversized key read path warning (#9221 ) ## Problem `Oversized vectored read [...]` logs are spewing in prod because we have a few keys that are unexpectedly large: * reldir/relblock - these are unbounded, so it's known technical debt * slru block - they can be a bit bigger than 128KiB due to storage format overhead ## Summary of changes * Bump threshold to 130KiB * Don't warn on oversized reldir and dbdir keys Closes https://github.com/neondatabase/neon/issues/8967	2024-10-03 16:40:35 +01:00
Matthias van de Meent	78938d1b59	[compute/postgres] feature: PostgreSQL 17 (#8573 ) This adds preliminary PG17 support to Neon, based on RC1 / 2024-09-04 `07b828e9d4` NOTICE: The data produced by the included version of the PostgreSQL fork may not be compatible with the future full release of PostgreSQL 17 due to expected or unexpected future changes in magic numbers and internals. DO NOT EXPECT DATA IN V17-TENANTS TO BE COMPATIBLE WITH THE 17.0 RELEASE! Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-09-12 23:18:41 +01:00
Alex Chi Z.	e158df4e86	feat(pageserver): split delta writer automatically determines key range (#8850 ) close https://github.com/neondatabase/neon/issues/8838 ## Summary of changes This patch modifies the split delta layer writer to avoid taking start_key and end_key when creating/finishing the layer writer. The start_key for the delta layers will be the first key provided to the layer writer, and the end_key would be the `last_key.next()`. This simplifies the delta layer writer API. On that, the layer key hack is removed. Image layers now use the full key range, and delta layers use the first/last key provided by the user. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-09-09 22:03:27 +01:00
Alex Chi Z.	653a6532a2	fix(pageserver): reject non-i128 key on the write path (#8648 ) It's better to reject invalid keys on the write path than storing it and panic-ing the pageserver. https://github.com/neondatabase/neon/issues/8636 ## Summary of changes If a key cannot be represented using i128, we don't allow writing that key into the pageserver. There are two versions of the check valid function: the normal one that simply rejects i128 keys, and the stronger one that rejects all keys that we don't support. The current behavior when a key gets rejected is that safekeeper will keep retrying streaming that key to the pageserver. And once such key gets written, no new computes can be started. Therefore, there could be a large amount of pageserver warnings if a key cannot be ingested. To validate this behavior by yourself, the reviewer can (1) use the stronger version of the valid check (2) run the following SQL. ``` set neon.regress_test_mode = true; CREATE TABLESPACE regress_tblspace LOCATION '/Users/skyzh/Work/neon-test/tablespace'; CREATE SCHEMA testschema; CREATE TABLE testschema.foo (i int) TABLESPACE regress_tblspace; insert into testschema.foo values (1), (2), (3); ``` For now, I'd like to merge the patch with only rejecting non-i128 keys. It's still unknown whether the stronger version covers all the cases that basebackup doesn't support. Furthermore, the behavior of rejecting a key will produce large amounts of warnings due to safekeeper retry. Therefore, I'd like to reject the minimum set of keys that we don't support (i128 ones) for now. (well, erroring out is better than panic on `to_compact_key`) The next step is to fix the safekeeper behavior (i.e., on such key rejections, stop streaming WAL), so that we can properly stop writing. An alternative solution is to simply drop these keys on the write path. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-29 10:07:05 -04:00
Alex Chi Z.	0f65684263	feat(pageserver): use split layer writer in gc-compaction (#8608 ) Part of #8002, the final big PR in the batch. ## Summary of changes This pull request uses the new split layer writer in the gc-compaction. * It changes how layers are split. Previously, we split layers based on the original split point, but this creates too many layers (test_gc_feedback has one key per layer). * Therefore, we first verify if the layer map can be processed by the current algorithm (See https://github.com/neondatabase/neon/pull/8191, it's basically the same check) * On that, we proceed with the compaction. This way, it creates a large enough layer close to the target layer size. * Added a new set of functions `with_discard` in the split layer writer. This helps us skip layers if we are going to produce the same persistent key. * The delta writer will keep the updates of the same key in a single file. This might create a super large layer, but we can optimize it later. * The split layer writer is used in the gc-compaction algorithm, and it will split layers based on size. * Fix the image layer summary block encoded the wrong key range. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-08-26 14:19:47 -04:00
John Spray	3379cbcaa4	pageserver: add CompactKey, use it in InMemoryLayer (#8652 ) ## Problem This follows a PR that insists all input keys are representable in 16 bytes: - https://github.com/neondatabase/neon/pull/8648 & a PR that prevents postgres from sending us keys that use the high bits of field2: - https://github.com/neondatabase/neon/pull/8657 Motivation for this change: 1. Ingest is bottlenecked on CPU 2. InMemoryLayer can create huge (~1M value) BTreeMap<Key,_> for its index. 3. Maps over i128 are much faster than maps over an arbitrary 18 byte struct. It may still be worthwhile to make the index two-tier to optimize for the case where only the last 4 bytes (blkno) of the key vary frequently, but simply using the i128 representation of keys has a big impact for very little effort. Related: #8452 ## Summary of changes - Introduce `CompactKey` type which contains an i128 - Use this instead of Key in InMemoryLayer's index, converting back and forth as needed. ## Performance All the small-value `bench_ingest` cases show improved throughput. The one that exercises this index most directly shows a 35% throughput increase: ``` ingest-small-values/ingest 128MB/100b seq, no delta time: [374.29 ms 378.56 ms 383.38 ms] thrpt: [333.88 MiB/s 338.13 MiB/s 341.98 MiB/s] change: time: [-26.993% -26.117% -25.111%] (p = 0.00 < 0.05) thrpt: [+33.531% +35.349% +36.974%] Performance has improved. ```	2024-08-13 11:48:23 +01:00
Alex Chi Z.	b3eea45277	fix(pageserver): dump the key when it's invalid (#8633 ) We see an assertion error in staging. Dump the key to guess where it was from, and then we can fix it. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-07 16:37:46 +01:00
Alex Chi Z	891cb5a9a8	fix(pageserver): comments about metadata key range (#8236 ) Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-07-02 16:54:32 +00:00
Alex Chi Z	76864e6a2a	feat(pageserver): add image layer iterator (#8006 ) part of https://github.com/neondatabase/neon/issues/8002 ## Summary of changes This pull request adds the image layer iterator. It buffers a fixed amount of key-value pairs in memory, and gives developers an iterator abstraction over the image layer. Once the buffer is exhausted, it will issue 1 I/O to fetch the next batch. Due to the Rust lifetime mysteries, the `get_stream_from` API has been refactored to `into_stream` and consumes `self`. Delta layer iterator implementation will be similar, therefore I'll add it after this pull request gets merged. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-25 20:49:29 +00:00
Konstantin Knizhnik	7006caf3a1	Store logical replication origin in KV storage (#7099 ) Store logical replication origin in KV storage ## Problem See #6977 ## Summary of changes * Extract origin_lsn from commit WAL record * Add ReplOrigin key to KV storage and store origin_lsn * In basebackup replace snapshot origin_lsn with last committed origin_lsn at basebackup LSN ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Alex Chi Z <chi@neon.tech>	2024-06-03 19:37:33 +03:00
Arpad Müller	acf0a11fea	Move keyspace utils to inherent impls (#7929 ) The keyspace utils like `is_rel_size_key` or `is_rel_fsm_block_key` and many others are free functions and have to be either imported separately or specified with the full path starting in `pageserver_api::key::`. This is less convenient than if these functions were just inherent impls. Follow-up of #7890 Fixes #6438	2024-06-03 16:18:07 +02:00
Joonas Koivunen	ef83f31e77	pagectl: key command for dumping what we know about the key (#7890 ) What we know about the key via added `pagectl key $key` command: - debug formatting - shard placement when `--shard-count` is specified - different boolean queries in `key.rs` - aux files v2 Example: ``` $ cargo run -qp pagectl -- key 000000063F00004005000060270000100E2C parsed from hex: 000000063F00004005000060270000100E2C: Key { field1: 0, field2: 1599, field3: 16389, field4: 24615, field5: 0, field6: 1052204 } rel_block: true rel_vm_block: false rel_fsm_block: false slru_block: false inherited: true rel_size: false slru_segment_size: false recognized kind: None ```	2024-05-31 18:19:41 +00:00
Alex Chi Z	1eca8b8a6b	fix(pageserver): ensure to_i128 works for metadata keys (#7895 ) field2 of metadata keys can be 0xFFFF because of the mapping. Allow 0xFFFF for `to_i128`. An alternative is to encode 0xFFFF as 0xFFFFFFFF (which is allowed in the original `to_i128`). But checking the places where field2 is referenced, the rest part of the system does not seem to depend on this assertion. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-30 10:03:17 -04:00
Alex Chi Z	a3fe12b6d8	feat(pageserver): add scan interface (#7468 ) This pull request adds the scan interface. Scan operates on a sparse keyspace and retrieves all the key-value pairs from the keyspaces. Currently, scan only supports the metadata keyspace, and by default does not retrieve anything from the ancestor branch. This should be fixed in the future if we need to have some keyspaces that inherit from the parent. The scan interface reuses the vectored get code path by disabling the missing key errors. This pull request also changes the behavior of vectored get on aux file v1/v2 key/keyspace: if the key is not found, it is simply not included in the result, instead of throwing a missing key error. TODOs in future pull requests: limit memory consumption, ensure the search stops when all keys are covered by the image layer, remove `#[allow(dead_code)]` once the code path is used in basebackups / aux files, remove unnecessary fine-grained keyspace tracking in vectored get (or have another code path for scan) to improve performance. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-03 10:43:30 -04:00
Alex Chi Z	ee3437cbd8	chore(pageserver): shrink aux keyspace to 0x60-0x7F (#7502 ) extracted from https://github.com/neondatabase/neon/pull/7468, part of https://github.com/neondatabase/neon/issues/7462. In the page server, we use i128 (instead of u128) to do the integer representation of the key, which indicates that the highest bit of the key should not be 1. This constrains our keyspace to <= 0x7F. Also fix the bug of `to_i128` that dropped the highest 4b. Now we keep 3b of them, dropping the sign bit. And on that, we shrink the metadata keyspace to 0x60-0x7F for now, and once we add support for u128, we can have a larger metadata keyspace. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-04-26 13:35:01 -04:00
Alex Chi Z	89f023e6b0	feat(pageserver): add metadata key range and aux key encoding (#7401 ) Extracted from https://github.com/neondatabase/neon/pull/7375. We assume everything >= 0x80 are metadata keys. AUX file keys are part of the metadata keys, and we use `0x90` as the prefix for AUX file keys. The AUX file encoding is described in the code comment. We use xxhash128 as the hash algorithm. It seems to be portable according to the introduction, > xxHash is an Extremely fast Hash algorithm, processing at RAM speed limits. Code is highly portable, and produces hashes identical across all platforms (little / big endian). ...though whether the Rust version follows the same convention is unknown and might need manual review of the library. Anyways, we can always change the hash algorithm before rolling it out in staging/end-user, and I made a quick decision to use xxhash here because it generates 128b hash + portable. We can save the discussion of which hash algorithm to use later. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-04-23 15:16:04 +00:00
Vlad Lazar	a9fda8c832	pageserver: fix vectored read aux key handling (#7404 ) ## Problem Vectored get would descend into ancestor timelines for aux files. This is not the behaviour of the legacy read path and blocks cutting over to the vectored read path. Fixes https://github.com/neondatabase/neon/issues/7379 ## Summary of Changes Treat non inherited keys specially in vectored get. At the point when we want to descend into the ancestor mark all pending non inherited keys as errored out at the key level. Note that this diverges from the standard vectored get behaviour for missing keys which is a top level error. This divergence is required to avoid blocking compaction in case such an error is encountered when compacting aux file keys. I'm pretty sure the bug I just described predates the vectored get implementation, but it's still worth fixing.	2024-04-23 14:03:33 +01:00
Vlad Lazar	f1901833a6	pageserver_api: migrate keyspace related functions from `pgdatadir_mapping` (#6406 ) The idea is to achieve separation between keyspace layout definition and operating on said keyspace. I've inlined all these functions since they're small and we don't use LTO in the storage release builds at the moment. Closes https://github.com/neondatabase/neon/issues/6347	2024-01-22 19:16:38 +00:00
Vlad Lazar	02c6abadf0	pageserver: remove dependency of pagebench on pageserver (#6334 ) To achieve this I had to lift the BlockNumber and key_to_rel_block definitions to pageserver_api (similar to a change in #5980). Closes #6299	2024-01-12 17:11:19 +00:00
Christian Schwarz	4e1b0b84eb	pagebench: fixup after is_rel_block_key changes in #6266 (#6303 ) PR #6266 broke the getpage_latest_lsn benchmark. Before this patch, we'd fail with ``` not implemented: split up range ``` because `r.start = rel size key` and `r.end = rel size key + 1`. The filtering of the key ranges in that loop is a bit ugly, but, I measured: * setup with 180k layer files (20k tenants * 9 layers). * total physical size is 463GiB * 5k tenants, the range filtering takes `0.6 seconds` on an i3en.3xlarge. That's a tiny fraction of the overall time it takes for pagebench to get ready to send requests. So, this is good enough for now / there are other bottlenecks that are bigger.	2024-01-09 19:00:37 +01:00
Christian Schwarz	d260426a14	is_rel_block_key: exclude the relsize key (#6266 ) Before this PR, `is_rel_block_key` returns true for the blknum `0xffffffff`, which is a blknum that's actually never written by Postgres, but used by Neon Pageserver to store the relsize. Quoting @MMeent: > PostgreSQL can't extend the relation beyond size of 0xFFFFFFFF blocks, > so block number 0xFFFFFFFE is the last valid block number. This PR changes the definition of the function to exclude blknum 0xffffffff. My motivation for doing this change is to fix the `pagebench` getpage benchmark, which uses `is_rel_block_key` to filter the keyspace for valid pages to request from page_service. fixes https://github.com/neondatabase/neon/issues/6210 I checked other users of the function. The first one is `key_is_shard0`, which already had added an exemption for 0xffffffff. So, there's no functional change with this PR. The second one is `DatadirModification::flush`[^1]. With this PR, `.flush()` will skip the relsize key, whereas it didn't before. This means we will pile up all the relsize key-value pairs `(Key,u32)` in `DatadirModification::pending_updates` until `.commit()` is called. The only place I can think of where that would be a problem is if we import from a full basebackup, and don't `.commit()` regularly, like we currently don't do in `import_basebackup_from_tar`. It exposes us to input-controlled allocations. However, that was already the case for the other keys that are skipped, so, one can argue that this change is not making the situation much worse. [^1]: That type's `flush()` and `commit()` methods are terribly named, but, that's for another time	2024-01-05 11:48:06 +01:00
Christian Schwarz	47873470db	pageserver: add method to dump keyspace in mgmt api client (#6145 ) Part of getpage@lsn benchmark epic: https://github.com/neondatabase/neon/issues/5771	2023-12-16 10:52:48 +00:00
John Spray	61fe9d360d	pageserver: add Key->Shard mapping logic & use it in page service (#5980 ) ## Problem When a pageserver receives a page service request identified by TenantId, it must decide which `Tenant` object to route it to. As in earlier PRs, this stuff is all a no-op for tenants with a single shard: calls to `is_key_local` always return true without doing any hashing on a single-shard ShardIdentity. Closes: https://github.com/neondatabase/neon/issues/6026 ## Summary of changes - Carry immutable `ShardIdentity` objects in Tenant and Timeline. These provide the information that Tenants/Timelines need to figure out which shard is responsible for which Key. - Augment `get_active_tenant_with_timeout` to take a `ShardSelector` specifying how the shard should be resolved for this tenant. This mode depends on the kind of request (e.g. basebackups always go to shard zero). - In `handle_get_page_at_lsn_request`, handle the case where the Timeline we looked up at connection time is not the correct shard for the page being requested. This can happen whenever one node holds multiple shards for the same tenant. This is currently written as a "slow path" with the optimistic expectation that usually we'll run with one shard per pageserver, and the Timeline resolved at connection time will be the one serving page requests. There is scope for optimization here later, to avoid doing the full shard lookup for each page. - Omit consumption metrics from nonzero shards: only the 0th shard is responsible for tracing accurate relation sizes. Note to reviewers: - Testing of these changes is happening separately on the `jcsp/sharding-pt1` branch, where we have hacked neon_local etc needed to run a test_pg_regress. - The main caveat to this implementation is that page service connections still look up one Timeline when the connection is opened, before they know which pages are going to be read. If there is one shard per pageserver then this will always also be the Timeline that serves page requests. However, if multiple shards are on one pageserver then get page requests will incur the cost of looking up the correct Timeline on each getpage request. We may look to improve this in future with a "sticky" timeline per connection handler so that subsequent requests for the same Timeline don't have to look up again, and/or by having postgres pass a shard hint when connecting. This is tracked in the "Loose ends" section of https://github.com/neondatabase/neon/issues/5507	2023-12-05 12:01:55 +00:00
John Spray	ab631e6792	pageserver: make TenantsMap shard-aware (#5819 ) ## Problem When using TenantId as the key, we are unable to handle multiple tenant shards attached to the same pageserver for the same tenant ID. This is an expected scenario if we have e.g. 8 shards and 5 pageservers. ## Summary of changes - TenantsMap is now a BTreeMap instead of a HashMap: this enables looking up by range. In future, we will need this for page_service, as incoming requests will just specify the Key, and we'll have to figure out which shard to route it to. - A new key type TenantShardId is introduced, to act as the key in TenantsMap, and as the id type in external APIs. Its human readable serialization is backward compatible with TenantId, and also forward-compatible as long as sharding is not actually used (when we construct a TenantShardId with ShardCount(0), it serializes to an old-fashioned TenantId). - Essential tenant APIs are updated to accept TenantShardIds: tenant/timeline create, tenant delete, and /location_conf. These are the APIs that will enable driving sharded tenants. Other apis like /attach /detach /load /ignore will not work with sharding: those will soon be deprecated and replaced with /location_conf as part of the live migration work. Closes: #5787	2023-11-15 23:20:21 +02:00
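
Note on ef83f31e77 (`pagectl key`): the example output in that commit suggests the 36-digit hex form of a key is 18 bytes laid out as field1 (1 byte), field2, field3, field4 (4 bytes each, big-endian), field5 (1 byte), field6 (4 bytes). The following is a minimal sketch under that assumption. The `Key` struct and `parse_key_hex` function here are hypothetical stand-ins written for illustration, with field widths inferred from the commit's example rather than copied from `pageserver_api`.

```rust
/// Hypothetical stand-in for the pageserver key type. Field widths are
/// inferred from the example output in commit ef83f31e77.
#[derive(Debug, PartialEq, Eq)]
struct Key {
    field1: u8,
    field2: u32,
    field3: u32,
    field4: u32,
    field5: u8,
    field6: u32,
}

/// Parse a 36-digit hex string into a `Key`, assuming the layout
/// 1 + 4 + 4 + 4 + 1 + 4 bytes with big-endian integers.
fn parse_key_hex(s: &str) -> Option<Key> {
    // Reject anything that is not exactly 36 ASCII hex digits.
    if s.len() != 36 || !s.bytes().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    // Decode two hex digits per byte.
    let mut b = [0u8; 18];
    for (i, byte) in b.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16).ok()?;
    }
    // Read a big-endian u32 starting at the given byte offset.
    let be32 = |off: usize| u32::from_be_bytes([b[off], b[off + 1], b[off + 2], b[off + 3]]);
    Some(Key {
        field1: b[0],
        field2: be32(1),
        field3: be32(5),
        field4: be32(9),
        field5: b[13],
        field6: be32(14),
    })
}

fn main() {
    // The key used in the commit message example.
    let key = parse_key_hex("000000063F00004005000060270000100E2C").expect("valid key");
    println!("{key:?}");
}
```

Under this layout the key from the commit message decodes to the same field values the commit shows (field2: 1599, field3: 16389, field4: 24615, field6: 1052204), which is what motivates the assumed widths.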

37 Commits
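
Note on 665369c439 (compact key protobuf encoding): that commit describes splitting a 128-bit key into two 64-bit halves, and how a signed lower half can turn the reassembled key negative. The sketch below models that failure mode in isolation. All four helper functions are hypothetical and are not the `wal_decoder` API; the test key is arbitrary and only chosen so that bit 63 of the lower half is set.

```rust
// Illustrative helpers only: these names are hypothetical and do not come
// from wal_decoder. They model splitting an i128 into two 64-bit halves.

/// Problematic variant: both halves are signed.
fn split_signed(key: i128) -> (i64, i64) {
    ((key >> 64) as i64, key as i64)
}

fn join_signed(hi: i64, lo: i64) -> i128 {
    // `lo as i128` sign-extends, so a lower half with bit 63 set fills the
    // upper 64 bits with ones and the result becomes negative.
    ((hi as i128) << 64) | (lo as i128)
}

/// Fixed variant: both halves are unsigned.
fn split_unsigned(key: i128) -> (u64, u64) {
    ((key >> 64) as u64, key as u64)
}

fn join_unsigned(hi: u64, lo: u64) -> i128 {
    // `lo as i128` zero-extends, so the upper half is left untouched.
    ((hi as i128) << 64) | (lo as i128)
}

fn main() {
    // A non-negative key whose lower half has its top bit (bit 63) set.
    let key: i128 = (7i128 << 64) | (1i128 << 63) | 42;

    let (shi, slo) = split_signed(key);
    let (uhi, ulo) = split_unsigned(key);

    println!("original:            {key}");
    println!("signed round trip:   {}", join_signed(shi, slo));
    println!("unsigned round trip: {}", join_unsigned(uhi, ulo));

    // The unsigned encoding round-trips; the signed one does not.
    assert_eq!(join_unsigned(uhi, ulo), key);
    assert!(join_signed(shi, slo) < 0);
}
```

The difference comes entirely from how Rust widens the lower half: casting `i64` to `i128` sign-extends, while casting `u64` to `i128` zero-extends.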