rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 05:52:55 +00:00

Author	SHA1	Message	Date
Arpad Müller	3a6fa76828	Tiered compaction: cut deltas along lsn as well if needed (#7671 ) In general, tiered compaction is splitting delta layers along the key dimension, but this can only continue until a single key is reached: if the changes from a single key don't fit into one layer file, we used to create layer files of unbounded sizes. This patch implements the method listed as TODO/FIXME in the source code. It does the following things: * Make `accum_key_values` take the target size and if one key's modifications exceed it, make it fill `partition_lsns`, a vector of lsns to use for partitioning. * Have `retile_deltas` use that `partition_lsns` to create delta layers separated by lsn. * Adjust the `test_many_updates_for_single_key` to allow layer files below 0.5 the target size. This situation can create arbitarily small layer files: The amount of data is arbitrary that sits between having just cut a new delta, and then stumbling upon the key that needs to be split along lsn. This data will end up in a dedicated layer and it can be arbitrarily small. * Ignore single-key delta layers for depth calculation: in theory we might have only single-key delta layers in a tier, and this might confuse depth calculation as well, but this should be unlikely. Fixes #7243 Part of #7554 --------- Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-05-14 01:13:25 +02:00
John Spray	972470b174	pageserver: use adaptive concurrency in secondary layer downloads (#7675 ) ## Problem Secondary downloads are a low priority task, and intentionally do not try to max out download speeds. This is almost always fine when they are used through the life of a tenant shard as a continuous "trickle" of background downloads. However, there are sometimes circumstances where we would like to populate a secondary location as fast as we can, within the constraint that we don't want to impact the activity of attached tenants: - During node removal, where we will need to create replacements for secondary locations on the node being removed - After a shard split, we need new secondary locations for the new shards to populate before the shards can be migrated to their final location. ## Summary of changes - Add an activity() function to the remote storage interface, enabling callers to query how busy the remote storage backend is - In the secondary download code, use a very modest amount of concurrency, driven by the remote storage's state: we only use concurrency if the remote storage semaphore is 75% free, and scale the amount of concurrency used within that range. This is not a super clever form of prioritization, but it should accomplish the key goals: - Enable secondary downloads to happen faster when the system is idle - Make secondary downloads a much lower priority than attached tenants when the remote storage is busy. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-05-13 17:38:30 +00:00
Vlad Lazar	1412e9b3e8	pagectl: fix diagrams generation for paths containing generations (#7739 ) ## Problem When layer paths include generations, the lsn parsing does not work and `pagectl` errors out. ## Summary of changes If the last "word" of the layer path contains 8 characters, discard it for the purpose of lsn parsing.	2024-05-13 18:24:12 +01:00
John Spray	be0c73f8e7	pageserver: improve API for invoking GC (#7655 ) ## Problem In https://github.com/neondatabase/neon/pull/7531, I had a test flaky because the GC API endpoint fails if the tenant happens not to be active yet. ## Summary of changes While adding that wait for the tenant to be active, I noticed that this endpoint is kind of strange (spawns a TaskManager task) and has a comment `// TODO: spawning is redundant now, need to hold the gate`, so this PR cleans it up to just run the GC inline while holding a gate. The GC code is updated to avoid assuming it runs inside a task manager task. Avoiding checking the task_mgr cancellation token is safe, because our timeline shutdown always cancels Timeline::cancel.	2024-05-13 17:59:59 +01:00
Alex Chi Z	7f51764001	feat(pageserver): add metrics for aux file size (#7623 ) ref https://github.com/neondatabase/neon/issues/7443 ## Summary of changes This pull request adds a size estimator for aux files. Each timeline stores a cached `isize` for the estimated total size of aux files. It gets reset on basebackup, and gets updated for each aux file modification. TODO: print a warning when it exceeds the size. The size metrics is not accurate. Race between `on_basebackup` and other functions could create a negative basebackup size, but the chance is rare. Anyways, this does not impose any extra I/Os to the storage as everything is computed in-memory. The aux files are only stored on shard 0. As basebackups are only generated on shard 0, only shard 0 will report this metrics. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-13 15:33:41 +00:00
Joonas Koivunen	4d8a10af1c	fix: do not create metrics contention from background task permit (#7730 ) The background task loop permit metrics do two of `with_label_values` very often. Change the codepath to cache the counters on first access into a `Lazy` with `enum_map::EnumMap`. The expectation is that this should not fix for metric collection failures under load, but it doesn't hurt. Cc: #7161	2024-05-13 17:49:50 +03:00
Christian Schwarz	6ff74295b5	chore(pageserver): plumb through RequestContext to VirtualFile open methods (#7725 ) This PR introduces no functional changes. The `open()` path will be done separately. refs https://github.com/neondatabase/neon/issues/6107 refs https://github.com/neondatabase/neon/issues/7386 Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-05-13 14:52:06 +02:00
John Spray	f50ff14560	pageserver: refuse to run without remote storage (#7722 ) ## Problem Since https://github.com/neondatabase/neon/pull/6769, the pageserver is intentionally not usable without remote storage: it's purpose is to act as a cache to an object store, rather than as a source of truth in its own right. ## Summary of changes - Make remote storage configuration mandatory: the pageserver will refuse to start if it is not provided. This is a precursor that will make it safe to subsequently remove all the internal Option<>s	2024-05-13 13:05:46 +01:00
Christian Schwarz	b58a615197	chore(pageserver): plumb through RequestContext to VirtualFile read methods (#7720 ) This PR introduces no functional changes. The `open()` path will be done separately. refs https://github.com/neondatabase/neon/issues/6107 refs https://github.com/neondatabase/neon/issues/7386	2024-05-13 09:22:10 +00:00
Joonas Koivunen	6351313ae9	feat: allow detaching from ancestor for timelines without writes (#7639 ) The first implementation #7456 did not include `index_part.json` changes in an attempt to keep amount of changes down. Tracks the historic reparentings and earlier detach in `index_part.json`. - `index_part.json` receives a new field `lineage: Lineage` - `Lineage` is queried through RemoteTimelineClient during basebackup, creating `PREV LSN: none` for the invalid prev record lsn just as it would had been created for a newly created timeline - as `struct IndexPart` grew, it is now boxed in places Cc: #6994	2024-05-10 22:30:05 +03:00
Arpad Müller	d7c68dc981	Tiered compaction: fix early exit check in main loop (#7702 ) The old test based on the immutable `target_file_size` that was a parameter to the function. It makes no sense to go further once `current_level_target_height` has reached `u64::MAX`, as lsn's are u64 typed. In practice, we should only run into this if there is a bug, as the practical lsn range usually ends much earlier. Testing on `target_file_size` makes less sense, it basically implements an invocation mode that turns off the looping and only runs one iteration of it. @hlinnaka agrees that `current_level_target_height` is better here. Part of #7554	2024-05-10 18:50:47 +03:00
Joonas Koivunen	d7f34bc339	draw_timeline_dir: draw branch points and gc cutoff lines (#7657 ) in addition to layer names, expand the input vocabulary to recognize lines in the form of: ${kind}:${lsn} where: - kind in `gc_cutoff` or `branch` - lsn is accepted in Lsn display format (x/y) or hex (as used in layer names) gc_cutoff and branch have different colors.	2024-05-10 17:41:34 +03:00
Joonas Koivunen	86905c1322	openapi: resolve the synthetic_size duplication (#7651 ) We had accidentally left two endpoints for `tenant`: `/synthetic_size` and `/size`. Size had the more extensive description but has returned 404 since renaming. Remove the `/size` in favor of the working one and describe the `text/html` output.	2024-05-10 17:15:11 +03:00
John Spray	13d9589c35	pageserver: don't call get_vectored with empty keyspace (#7686 ) ## Problem This caused a variation of the stats bug fixed by https://github.com/neondatabase/neon/pull/7662. That PR also fixed this case, but we still shouldn't make redundant get calls. ## Summary of changes - Only call get in the create image layers loop at the end of a range if some keys have been accumulated	2024-05-10 11:01:39 +00:00
Arpad Müller	41fb838799	Fix tiered compaction k-merge bug and use in-memory alternative (#7661 ) This PR does two things: First, it fixes a bug with tiered compaction's k-merge implementation. It ignored the lsn of a key during ordering, so multiple updates of the same key could be read in arbitrary order, say from different layers. For example there is layers `[(a, 2),(b, 3)]` and `[(a, 1),(c, 2)]` in the heap, they might return `(a,2)` and `(a,1)`. Ultimately, this change wasn't enough to fix the ordering issues in #7296, in other words there is likely still bugs in the k-merge. So as the second thing, we switch away from the k-merge to an in-memory based one, similar to #4839, but leave the code around to be improved and maybe switched to later on. Part of #7296	2024-05-09 16:01:16 +02:00
Vlad Lazar	d5399b729b	pageserver: fix division by zero in layer counting metric (#7662 ) For aux file keys (v1 or v2) the vectored read path does not return an error when they're missing. Instead they are omitted from the resulting btree (this is a requirement, not a bug). Skip updating the metric in these cases to avoid infinite results.	2024-05-08 18:29:16 +00:00
John Spray	ca154d9cd8	pageserver: local layer path followups (#7640 ) - Rename "filename" types which no longer map directly to a filename (LayerFileName -> LayerName) - Add a -v1- part to local layer paths to smooth the path to future updates (we anticipate a -v2- that uses checksums later) - Rename methods that refer to the string-ized version of a LayerName to no longer be called "filename" - Refactor reconcile() function to use a LocalLayerFileMetadata type that includes the local path, rather than carrying local path separately in a tuple and unwrap()'ing it later.	2024-05-08 16:50:21 +00:00
Arpad Müller	870786bd82	Improve tiered compaction tests (#7643 ) Improves the tiered compaction tests: * Adds a new test that is a simpler version of the ignored `test_many_updates_for_single_key` test. * Reduces the amount of data that `test_many_updates_for_single_key` processes to make it execute more quickly. * Adds logging support.	2024-05-08 13:22:55 +02:00
Arpad Müller	b6d547cf92	Tiered compaction: add order asserts after delta key k-merge (#7648 ) Adds ordering asserts to the output of the delta key iterator `MergeDeltaKeys` that implements a k-merge. Part of #7296 : the asserts added by this PR get hit in the reproducers of #7296 as well, but they are earlier in the pipeline.	2024-05-08 13:22:27 +02:00
John Spray	0af66a6003	pageserver: include generation number in local layer paths (#7609 ) ## Problem In https://github.com/neondatabase/neon/pull/7531, we would like to be able to rewrite layers safely. One option is to make `Layer` able to rewrite files in place safely (e.g. by blocking evictions/deletions for an old Layer while a new one is created), but that's relatively fragile. It's more robust in general if we simply never overwrite the same local file: we can do that by putting the generation number in the filename. ## Summary of changes - Add `local_layer_path` (counterpart to `remote_layer_path`) and convert all locations that manually constructed a local layer path by joining LayerFileName to timeline path - In the layer upload path, construct remote paths with `remote_layer_path` rather than trying to build them out of a local path. - During startup, carry the full path to layer files through `init::reconcile`, and pass it into `Layer::for_resident` - Add a test to make sure we handle upgrades properly. - Comment out the generation part of `local_layer_path`, since we need to maintain forward compatibility for one release. A tiny followup PR will enable it afterwards. We could make this a bit simpler if we bulk renamed existing layers on startup instead of carrying literal paths through init, but that is operationally risky on existing servers with millions of layer files. We can always do a renaming change in future if it becomes annoying, but for the moment it's kind of nice to have a structure that enables us to change local path names again in future quite easily. We should rename `LayerFileName` to `LayerName` or somesuch, to make it more obvious that it's not a literal filename: this was already a bit confusing where that type is used in remote paths. That will be a followup, to avoid polluting this PR's diff.	2024-05-07 18:03:12 +01:00
Alex Chi Z	017c34b773	feat(pageserver): generate basebackup from aux file v2 storage (#7517 ) This pull request adds the new basebackup read path + aux file write path. In the regression test, all logical replication tests are run with matrix aux_file_v2=false/true. Also fixed the vectored get code path to correctly return missing key error when being called from the unified sequential get code path. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-07 16:30:18 +00:00
Joonas Koivunen	d041f9a887	refactor(rtc): remove excess cloning (#7635 ) RemoteTimelineClient has a lot of mandatory cloning. By using a single way of creating IndexPart out of UploadQueueInitialized we can simplify things and also avoid cloning the latest files for each `index_part.json` upload (the contents will still be cloned).	2024-05-07 19:22:29 +03:00
Joonas Koivunen	3c9b484c4d	feat: Timeline detach ancestor (#7456 ) ## Problem Timelines cannot be deleted if they have children. In many production cases, a branch or a timeline has been created off the main branch for various reasons to the effect of having now a "new main" branch. This feature will make it possible to detach a timeline from its ancestor by inheriting all of the data before the branchpoint to the detached timeline and by also reparenting all of the ancestor's earlier branches to the detached timeline. ## Summary of changes - Earlier added copy_lsn_prefix functionality is used - RemoteTimelineClient learns to adopt layers by copying them from another timeline - LayerManager adds support for adding adopted layers - `timeline::Timeline::{prepare_to_detach,complete_detaching}_from_ancestor` and `timeline::detach_ancestor` are added - HTTP PUT handler Cc: #6994 Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-05-07 13:47:57 +03:00
John Spray	af849a1f61	pageserver: post-shard-split layer trimming (1/2) (#7572 ) ## Problem After a shard split of a large existing tenant, child tenants can end up with oversized historic layers indefinitely, if those layers are prevented from being GC'd by branchpoints. This PR is followed by https://github.com/neondatabase/neon/pull/7531 Related issue: https://github.com/neondatabase/neon/issues/7504 ## Summary of changes - Add a new compaction phase `compact_shard_ancestors`, which identifies layers that are no longer needed after a shard split. - Add a Timeline->LayerMap code path called `rewrite_layers` , which is currently only used to drop layers, but will later be used to rewrite them as well in https://github.com/neondatabase/neon/pull/7531 - Add a new test that compacts after a split, and checks that something is deleted. Note that this doesn't have much impact on a tenant's resident size (since unused layers would end up evicted anyway), but it: - Makes index_part.json much smaller - Makes the system easier to reason about: avoid having tenants which are like "my physical size is 4TiB but don't worry I'll never actually download it", instead have tenants report the real physical size of what they might download. Why do we remove these layers in compaction rather than during the split? Because we have existing split tenants that need cleaning up. We can add it to the split operation in future as an optimization.	2024-05-07 11:15:58 +01:00
Christian Schwarz	ac7dc82103	use less `neon_local --pageserver-config-override` / `pageserver -c` (#7613 )	2024-05-06 22:31:26 +02:00
Christian Schwarz	df1def7018	refactor(pageserver): remove --update-init flag (#7612 ) We don't actually use it. refs https://github.com/neondatabase/neon/issues/7555	2024-05-06 16:40:44 +02:00
John Spray	67a2215163	pageserver: label tenant_slots metric by slot type (#7603 ) ## Problem The current `tenant_slots` metric becomes less useful once we have lots of secondaries, because we can't tell how many tenants are really attached (without doing a sum() on some other metric). ## Summary of changes - Add a `mode` label to this metric - Update the metric with `slot_added` and `slot_removed` helpers that are called at all the places we mutate the tenants map. - Add a debug assertion at shutdown that checks the metrics add up to the right number, as a cheap way of validating that we're calling the metric hooks in all the right places.	2024-05-06 14:07:15 +01:00
John Spray	3764dd2e84	pageserver: call maybe_freeze_ephemeral_layer from a dedicated task (#7594 ) ## Problem In testing of the earlier fix for OOMs under heavy write load (https://github.com/neondatabase/neon/pull/7218), we saw that the limit on ephemeral layer size wasn't being reliably enforced. That was diagnosed as being due to overwhelmed compaction loops: most tenants were waiting on the semaphore for background tasks, and thereby not running the function that proactively rolls layers frequently enough. Related: https://github.com/neondatabase/neon/issues/6939 ## Summary of changes - Create a new per-tenant background loop for "ingest housekeeping", which invokes maybe_freeze_ephemeral_layer() without taking the background task semaphore. - Downgrade to DEBUG a log line in maybe_freeze_ephemeral_layer that had been INFO, but turns out to be pretty common in the field. There's some discussion on the issue (https://github.com/neondatabase/neon/issues/6939#issuecomment-2083554275) about alternatives for calling this maybe_freeze_epemeral_layer periodically without it getting stuck behind compaction. A whole task just for this feels like kind of a big hammer, but we may in future find that there are other pieces of lightweight housekeeping that we want to do here too. Why is it okay to call maybe_freeze_ephemeral_layer outside of the background tasks semaphore? - this is the same work we would do anyway if we receive writes from the safekeeper, just done a bit sooner. - The period of the new task is generously jittered (+/- 5%), so when the ephemeral layer size tips over the threshold, we shouldn't see an excessively aggressive thundering herd of layer freezes (and only layers larger than the mean layer size will be frozen) - All that said, this is an imperfect approach that relies on having a generous amount of RAM to dip into when we need to freeze somewhat urgently. It would be nice in future to also block compaction/GC when we recognize resource stress and need to do other work (like layer freezing) to reduce memory footprint.	2024-05-06 14:07:07 +01:00
Christian Schwarz	1e7cd6ac9f	refactor: move `NodeMetadata` to `pageserver_api`; use it from `neon_local` (#7606 ) This is the first step towards representing all of Pageserver configuration as clean `serde::Serialize`able Rust structs in `pageserver_api`. The `neon_local` code will then use those structs instead of the crude `toml_edit` / string concatenation that it does today. refs https://github.com/neondatabase/neon/issues/7555 --------- Co-authored-by: Alex Chi Z <iskyzh@gmail.com>	2024-05-03 13:15:38 -04:00
Alex Chi Z	ef03b38e52	fix(pageserver): remove update_gc_info calls in tests (#7608 ) introduced by https://github.com/neondatabase/neon/pull/7468 conflicting with https://github.com/neondatabase/neon/pull/7584 Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-03 16:01:33 +00:00
Alex Chi Z	a3fe12b6d8	feat(pageserver): add scan interface (#7468 ) This pull request adds the scan interface. Scan operates on a sparse keyspace and retrieves all the key-value pairs from the keyspaces. Currently, scan only supports the metadata keyspace, and by default do not retrieve anything from the ancestor branch. This should be fixed in the future if we need to have some keyspaces that inherits from the parent. The scan interface reuses the vectored get code path by disabling the missing key errors. This pull request also changes the behavior of vectored get on aux file v1/v2 key/keyspace: if the key is not found, it is simply not included in the result, instead of throwing a missing key error. TODOs in future pull requests: limit memory consumption, ensure the search stops when all keys are covered by the image layer, remove `#[allow(dead_code)]` once the code path is used in basebackups / aux files, remove unnecessary fine-grained keyspace tracking in vectored get (or have another code path for scan) to improve performance. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-03 10:43:30 -04:00
John Spray	b5a6e68e68	storage controller: check warmth of secondary before doing proactive migration (#7583 ) ## Problem The logic in Service::optimize_all would sometimes choose to migrate a tenant to a secondary location that was only recently created, resulting in Reconciler::live_migrate hitting its 5 minute timeout warming up the location, and proceeding to attach a tenant to a location that doesn't have a warm enough local set of layer files for good performance. Closes: #7532 ## Summary of changes - Add a pageserver API for checking download progress of a secondary location - During `optimize_all`, connect to pageservers of candidate optimization secondary locations, and check they are warm. - During shard split, do heatmap uploads and start secondary downloads, so that the new shards' secondary locations start downloading ASAP, rather than waiting minutes for background downloads to kick in. I have intentionally not implemented this by continuously reading the status of locations, to avoid dealing with the scale challenge of efficiently polling & updating 10k-100k locations status. If we implement that in the future, then this code can be simplified to act based on latest state of a location rather than fetching it inline during optimize_all.	2024-05-03 14:28:23 +00:00
Arpad Müller	426598cf76	Update rust to 1.78.0 (#7598 ) We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. Release notes: https://blog.rust-lang.org/2024/05/02/Rust-1.78.0.html Prior update was in #7198	2024-05-03 15:59:28 +02:00
John Spray	8b4dd5dc27	pageserver: jitter secondary periods (#7544 ) ## Problem After some time the load from heatmap uploads gets rather spiky. They're unintentionally synchronising. Chart (does this make a _boing_ sound in anyone else's head?): ![image](https://github.com/neondatabase/neon/assets/944640/18829fc8-c5b7-4739-9a9b-491b5d6fcade) ## Summary of changes - Add a helper `period_jitter` and apply a 5% jitter from downloader and heatmap_uploader when updating the next runtime at the end of an interation. - Refactor existing places that we pick a startup interval into `period_warmup`, so that the intent is obvious.	2024-05-03 12:31:25 +00:00
Joonas Koivunen	ed9a114bde	fix: find gc cutoff points without holding Tenant::gc_cs (#7585 ) The current implementation of finding timeline gc cutoff Lsn(s) is done while holding `Tenant::gc_cs`. In recent incidents long create branch times were caused by holding the `Tenant::gc_cs` over extremely long `Timeline::find_lsn_by_timestamp`. The fix is to find the GC cutoff values before taking the `Tenant::gc_cs` lock. This change is safe to do because the GC cutoff values and the branch points have no dependencies on each other. In the case of `Timeline::find_gc_cutoff` taking a long time with this change, we should no longer see `Tenant::gc_cs` interfering with branch creation. Additionally, the `Tenant::refresh_gc_info` is now tolerant of timeline deletions (or any other failures to find the pitr_cutoff). This helps with the synthetic size calculation being constantly completed instead of having a break for a timely timeline deletion. Fixes: #7560 Fixes: #7587	2024-05-03 14:57:26 +03:00
Joonas Koivunen	60f570c70d	refactor(update_gc_info): split GcInfo to compose out of GcCutoffs (#7584 ) Split `GcInfo` and replace `Timeline::update_gc_info` with a method that simply finds gc cutoffs `Timeline::find_gc_cutoffs` to be combined as `Timeline::gc_info` at the caller. This change will be followed up with a change that finds the GC cutoff values before taking the `Tenant::gc_cs` lock. Cc: #7560	2024-05-03 13:11:51 +03:00
Alex Chi Z	3582a95c87	fix(pageserver): compile warning of download_object.ctx on macos (#7596 ) fix macOS compile warning introduced in `45ec8688ea` Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-03 10:55:48 +02:00
Arpad Müller	7a49e5d5c2	Remove tenant_id from TenantLocationConfigRequest (#7469 ) Follow-up of #7055 and #7476 to remove `tenant_id` from `TenantLocationConfigRequest` completely. All components of our system should now not specify the `tenant_id`. cc https://github.com/neondatabase/cloud/pull/11791	2024-05-02 20:18:13 +02:00
Christian Schwarz	45ec8688ea	chore(pageserver): plumb through RequestContext to VirtualFile write methods (#7566 ) This PR introduces no functional changes. The read path will be done separately. refs https://github.com/neondatabase/neon/issues/6107 refs https://github.com/neondatabase/neon/issues/7386	2024-05-02 18:58:10 +02:00
Alex Chi Z	f656db09a4	fix(pageserver): properly propagate missing key error for vectored get (#7569 ) Some part of the code requires missing key error to be propagated to the code path correctly (i.e., aux key range scan). Currently, it's an anyhow error. * remove `stuck_lsn` from the missing key error. * as a result, when matching missing key, we do not distinguish the case `stuck_lsn = false/true`. * vectored get now use the unified missing key error. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-02 09:19:45 -04:00
Conrad Ludgate	cb4b4750ba	update to reqwest 0.12 (#7561 ) ## Problem #7557 ## Summary of changes	2024-05-02 11:16:04 +02:00
Alex Chi Z	5558457c84	chore(pageserver): categorize basebackup errors (#7523 ) close https://github.com/neondatabase/neon/issues/7391 ## Summary of changes Categorize basebackup error into two types: server error and client error. This makes it easier to set up alerts. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-01 16:31:59 +00:00
Alex Chi Z	26e6ff8ba6	chore(pageserver): concise error message for layer traversal (#7565 ) Instead of showing the full path of layer traversal, we now only show tenant (in tracing context)+timeline+filename. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-01 11:44:42 -04:00
Arthur Petukhovsky	50a45e67dc	Discover safekeepers via broker request (#7279 ) We had an incident where pageserver requests timed out because pageserver couldn't fetch WAL from safekeepers. This incident was caused by a bug in safekeeper logic for timeline activation, which prevented pageserver from finding safekeepers. This bug was since fixed, but there is still a chance of a similar bug in the future due to overall complexity. We add a new broker message to "signal interest" for timeline. This signal will be sent by pageservers `wait_lsn`, and safekeepers will receive this signal to start broadcasting broker messages. Then every broker subscriber will be able to find the safekeepers and connect to them (to start fetching WAL). This feature is not limited to pageservers and any service that wants to download WAL from safekeepers will be able to use this discovery request. This commit changes pageserver's connection_manager (walreceiver) to send a SafekeeperDiscoveryRequest when there is no information about safekeepers present in memory. Current implementation will send these requests only if there is an active wait_lsn() call and no more often than once per 10 seconds. Add `test_broker_discovery` to test this: safekeepers started with `--disable-periodic-broker-push` will not push info to broker so that pageserver must use a discovery to start fetching WAL. Add task_stats in safekeepers broker module to log a warning if there is no message received from the broker for the last 10 seconds. Closes #5471 --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-04-30 18:50:03 +00:00
Arpad Müller	010f0a310a	Make test_random_updates and test_read_at_max_lsn compatible with new compaction (#7551 ) Makes two of the tests work with the tiered compaction that I had to ignore in #7283. The issue was that tiered compaction actually created image layers, but the keys didn't appear in them as `collect_keyspace` didn't include them. Not a compaction problem, but due to how the test is structured. Fixes #7287	2024-04-30 16:52:54 +02:00
John Spray	eb53345d48	pageserver: reduce runtime of init_tenant_mgr (#7553 ) ## Problem `init_tenant_mgr` blocks the rest of pageserver startup, including starting the admin API. This was noticeable in #7475 , where the init_tenant_mgr runtime could be long enough to trip the controller's 30 second heartbeat timeout. ## Summary of changes - When detaching tenants during startup, spawn the background deletes as background tasks instead of doing them inline - Write all configs before spawning any tenants, so that the config writes aren't fighting tenants for system resources - Write configs with some concurrency (16) rather than writing them all sequentially.	2024-04-30 15:16:15 +01:00
Alex Chi Z	45c625fb34	feat(pageserver): separate sparse and dense keyspace (#7503 ) extracted (and tested) from https://github.com/neondatabase/neon/pull/7468, part of https://github.com/neondatabase/neon/issues/7462. The current codebase assumes the keyspace is dense -- which means that if we have a keyspace of 0x00-0x100, we assume every key (e.g., 0x00, 0x01, 0x02, ...) exists in the storage engine. However, the assumption does not hold any more in metadata keyspace. The metadata keyspace is sparse. It is impossible to do per-key check. Ideally, we should not have the assumption of dense keyspace at all, but this would incur a lot of refactors. Therefore, we split the keyspaces we have to dense/sparse and handle them differently in the code for now. At some point in the future, we should assume all keyspaces are sparse. ## Summary of changes * Split collect_keyspace to return dense+sparse keyspace. * Do not allow generating image layers for sparse keyspace (for now -- will fix this next week, we need image layers anyways). * Generate delta layers for sparse keyspace. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-04-30 09:39:10 -04:00
John Spray	577982b778	pageserver: remove workarounds from #7454 (#7550 ) PR #7454 included a workaround that let any existing bugged databases start up. Having used that already, we may now Closes: https://github.com/neondatabase/neon/issues/7480	2024-04-30 11:04:54 +01:00
John Spray	574645412b	pageserver: shard-aware keyspace partitioning (#6778 ) ## Problem Followup to https://github.com/neondatabase/neon/pull/6776 While #6776 makes compaction safe on sharded tenants, the logic for keyspace partitioning remains inefficient: it assumes that the size of data on a pageserver can be calculated simply as the range between start and end of a Range -- this is not the case in sharded tenants, where data within a range belongs to a variety of shards. Closes: https://github.com/neondatabase/neon/issues/6774 ## Summary of changes I experimented with using a sharding-aware range type in KeySpace to replace all the Range<Key> uses, but the impact on other code was quite large (many places use the ranges), and not all of them need this property of being able to approximate the physical size of data within a key range. So I compromised on expressing this as a ShardedRange type, but only using that type selctively: during keyspace repartition, and in tiered compaction when accumulating key ranges. - keyspace partitioning methods take sharding parameters as an input - new `ShardedRange` type wraps a Range<Key> and a shard identity - ShardedRange::page_count is the shard-aware replacement for key_range_size - Callers that don't need to be shard-aware (e.g. vectored get code that just wants to count the number of keys in a keyspace) can use ShardedRange::raw_size to get the faster, shard-naive code (same as old `key_range_size`) - Compaction code is updated to carry a shard identity so that it can use shard aware calculations - Unit tests for the new fragmentation logic. - Add a test for compaction on sharded tenants, that validates that we generate appropriately sized image layers (this fails before fixing keyspace partitioning)	2024-04-29 17:46:46 +00:00
Alex Chi Z	11945e64ec	chore(pageserver): improve in-memory layer vectored get (#7467 ) previously in https://github.com/neondatabase/neon/pull/7375, we observed that for in-memory layers, we will need to iterate every key in the key space in order to get the result. The operation can be more efficient if we use BTreeMap as the in-memory layer representation, even if we are doing vectored get in a dense keyspace. Imagine a case that the in-memory layer covers a very little part of the keyspace, and most of the keys need to be found in lower layers. Using a BTreeMap can significantly reduce probes for nonexistent keys. ## Summary of changes * Use BTreeMap as in-memory layer representation. * Optimize the vectored get flow to utilize the range scan functionality of BTreeMap. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-04-29 17:16:42 +00:00

1 2 3 4 5 ...

2110 Commits