rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 21:20:37 +00:00

Author	SHA1	Message	Date
Arpad Müller	acf0a11fea	Move keyspace utils to inherent impls (#7929 ) The keyspace utils like `is_rel_size_key` or `is_rel_fsm_block_key` and many others are free functions and have to be either imported separately or specified with the full path starting in `pageserver_api::key::`. This is less convenient than if these functions were just inherent impls. Follow-up of #7890 Fixes #6438	2024-06-03 16:18:07 +02:00
Alex Chi Z	c1f55c1525	feat(pageserver): collect aux file tombstones (#7900 ) close https://github.com/neondatabase/neon/issues/7800 This is a small change to enable the tombstone -> exclude from image layer path. Most of the pull request is unit tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-03 09:56:36 -04:00
John Spray	7e60563910	pageserver: add GcError type (#7917 ) ## Problem - Because GC exposes all errors as an anyhow::Error, we have intermittent issues with spurious log errors during shutdown, e.g. in this failure of a performance test https://neon-github-public-dev.s3.amazonaws.com/reports/main/9300804302/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/214a2154f6f0217a/ ``` Gc failed 1 times, retrying in 2s: shutting down ``` GC really doesn't do a lot of complicated IO: it doesn't benefit from the backtrace capabilities of anyhow::Error, and can be expressed more robustly as an enum. ## Summary of changes - Add GcError type and use it instead of anyhow::Error in GC functions - In `gc_iteration_internal`, return GcError::Cancelled on shutdown rather than Ok(()) (we only used Ok before because we didn't have a clear cancellation error variant to use). - In `gc_iteration_internal`, skip past timelines that are shutting down, to avoid having to go through another GC iteration if we happen to see a deleting timeline during a GC run. - In `refresh_gc_info_internal`, avoid an error case where a timeline might not be found after being looked up, by carrying an Arc<Timeline> instead of a TimelineId between the first loop and second loop in the function. - In HTTP request handler, handle Cancelled variants as 503 instead of turning all GC errors into 500s.	2024-05-31 22:20:06 +01:00
Joonas Koivunen	ef83f31e77	pagectl: key command for dumping what we know about the key (#7890 ) What we know about the key via added `pagectl key $key` command: - debug formatting - shard placement when `--shard-count` is specified - different boolean queries in `key.rs` - aux files v2 Example: ``` $ cargo run -qp pagectl -- key 000000063F00004005000060270000100E2C parsed from hex: 000000063F00004005000060270000100E2C: Key { field1: 0, field2: 1599, field3: 16389, field4: 24615, field5: 0, field6: 1052204 } rel_block: true rel_vm_block: false rel_fsm_block: false slru_block: false inherited: true rel_size: false slru_segment_size: false recognized kind: None ```	2024-05-31 18:19:41 +00:00
John Spray	9fda85b486	pageserver: remove AncestorStopping error variants (#7916 ) ## Problem In all cases, AncestorStopping is equivalent to Cancelled. This became more obvious in https://github.com/neondatabase/neon/pull/7912#discussion_r1620582309 when updating these error types. ## Summary of changes - Remove AncestorStopping, always use Cancelled instead	2024-05-31 17:02:10 +01:00
Alex Chi Z	87afbf6b24	test(pageserver): add test interface to create artificial layers (#7899 ) This pull request adds necessary interfaces to deterministically create scenarios we want to test. Simplify some test cases to use this interface to make it stable + reproducible. Compaction test will be able to use this interface. Also the upcoming delete tombstone tests will use this interface to make test reproducible. ## Summary of changes * `force_create_image_layer` * `force_create_delta_layer` * `force_advance_lsn` * `create_test_timeline_with_states` * `branch_timeline_test_with_states` --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-31 12:00:40 -04:00
Arthur Petukhovsky	16b2e74037	Add FullAccessTimeline guard in safekeepers (#7887 ) This is a preparation for https://github.com/neondatabase/neon/issues/6337. The idea is to add FullAccessTimeline, which will act as a guard for tasks requiring access to WAL files. Eviction will be blocked on these tasks and WAL won't be deleted from disk until there is at least one active FullAccessTimeline. To get FullAccessTimeline, tasks call `tli.full_access_guard().await?`. After eviction is implemented, this function will be responsible for downloading missing WAL file and waiting until the download finishes. This commit also contains other small refactorings: - Separate `get_tenant_dir` and `get_timeline_dir` functions for building a local path. This is useful for looking at usages and finding tasks requiring access to local filesystem. - `timeline_manager` is now responsible for spawning all background tasks - WAL removal task is now spawned instantly after horizon is updated	2024-05-31 13:19:45 +00:00
John Spray	5a394fde56	pageserver: avoid spurious "bad state" logs/errors during shutdown (#7912 ) ## Problem - Initial size calculations tend to fail with `Bad state (not active)` Closes: https://github.com/neondatabase/neon/issues/7911 ## Summary of changes - In `wait_lsn`, return WaitLsnError::Cancelled rather than BadState when the state is Stopping - Replace PageReconstructError's `Other` variant with a specific `BadState` variant - Avoid returning anyhow::Error from get_ready_ancestor_timeline -- this was only used for the case where there was no ancestor. All callers of this function had implicitly checked that the ancestor timeline exists before calling it, so they can pass in the ancestor instead of handling an error.	2024-05-31 13:31:42 +01:00
John Spray	98dadf8543	pageserver: quieten some shutdown logs around logical size and flush (#7907 ) ## Problem Looking at several noisy shutdown logs: - In https://github.com/neondatabase/neon/issues/7861 we're hitting a log error with `InternalServerError(timeline shutting down\n'` on the checkpoint API handler. - In the field, we see initial_logical_size_calculation errors on shutdown, via DownloadError - In the field, we see errors logged from layer download code (independent of the error propagated) during shutdown Closes: https://github.com/neondatabase/neon/issues/7861 ## Summary of changes The theme of these changes is to avoid propagating anyhow::Errors for cases that aren't really unexpected error cases that we might want a stacktrace for, and avoid "Other" error variants unless we really do have unexpected error cases to propagate. - On the flush_frozen_layers path, use the `FlushLayerError` type throughout, rather than munging it into an anyhow::Error. Give FlushLayerError an explicit from_anyhow helper that checks for timeline cancellation, and uses it to give a Cancelled error instead of an Other error when the timeline is shutting down. - In logical size calculation, remove BackgroundCalculationError (this type was just a Cancelled variant and an Other variant), and instead use CalculateLogicalSizeError throughout. This can express a PageReconstructError, and has a From impl that translates cancel-like page reconstruct errors to Cancelled. - Replace CalculateLogicalSizeError's Other(anyhow::Error) variant case with a Decode(DeserializeError) variant, as this was the only kind of error we actually used in the Other case. - During layer download, drop out early if the timeline is shutting down, so that we don't do an `error!()` log of the shutdown error in this case.	2024-05-31 09:18:58 +01:00
Alex Chi Z	f20a9e760f	chore(pageserver): warn on delete non-existing file (#7847 ) Consider the following sequence of migration: ``` 1. user starts compute 2. force migrate to v2 3. user continues to write data ``` At the time of (3), the compute node is not aware that the page server does not contain replication states any more, and might continue to ingest neon-file records into the safekeeper. This will leave the pageserver store a partial replication state and cause some errors. For example, the compute could issue a deletion of some aux files in v1, but this file does not exist in v2. Therefore, we should ignore all these errors until everyone is migrated to v2. Also note that if we see this warning in prod, it is likely because we did not fully suspend users' compute when flipping the v1/v2 flag. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-30 14:45:34 +00:00
Alex Chi Z	33395dcf4e	perf(pageserver): postpone vectored get fringe keyspace construction (#7904 ) Perf shows a significant amount of time is spent on `Keyspace::merge`. This pull request postpones merging keyspace until retrieving the layer, which contributes to a 30x improvement in aux keyspace basebackup time. ``` --- old 10000 files found in 0.580569459s --- new 10000 files found in 0.02995075s ``` Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-30 10:31:57 -04:00
YukiSeino	167394a073	refactor: VirtualFile::open uses AsRef (#7908 ) ## Problem #7371 ## Summary of changes * The VirtualFile::open, open_with_options, and create methods use AsRef, similar to the standard library's std::fs APIs.	2024-05-30 15:58:20 +02:00
John Spray	352b08d0be	pageserver: fix a warning on secondary mode downloads after evictions (#7877 ) ## Problem In `4ce6e2d2fc` we added a warning when progress stats don't look right at the end of a secondary download pass. This `Correcting drift in progress stats` warning fired in staging on a pageserver that had been doing some disk usage eviction. The impact is low because in the same place we log the warning, we also fix up the progress values. ## Summary of changes - When we skip downloading a layer because it was recently evicted, update the progress stats to ensure they still reach a clean complete state at the end of a download pass. - Also add a log for evicting secondary location layers, for symmetry with attached locations, so that we can clearly see when eviction has happened for a particular tenant's layers when investigating issues. This is a point fix -- the code would also benefit from being refactored so that there is some "download result" type with a Skip variant, to ensure that we are updating the progress stats uniformly for those cases.	2024-05-28 16:06:47 +01:00
Arseny Sher	3797566c36	safekeeper: test pull_timeline with WAL gc. Do pull_timeline while WAL is being removed. To this end - extract pausable_failpoint to utils, sprinkle pull_timeline with it - add 'checkpoint' sk http endpoint to force WAL removal. After fixing checking for pull file status code test fails so far which is expected.	2024-05-25 06:06:32 +03:00
John Spray	1455f5a261	pageserver: revert concurrent secondary downloads, make DownloadStream always yield Err after cancel (#7866 ) ## Problem Ongoing hunt for secondary location shutdown hang issues. ## Summary of changes - Revert the functional changes from #7675 - Tweak a log in secondary downloads to make it more apparent when we drop out on cancellation - Modify DownloadStream's behavior to always return an Err after it has been cancelled. This _should_ not impact anything, but it makes the behavior simpler to reason about (e.g. even if the poll function somehow got called again, it could never end up in an un-cancellable state) Related: https://github.com/neondatabase/cloud/issues/13576	2024-05-24 11:45:34 +03:00
John Spray	3860bc9c6c	pageserver: post-shard-split layer rewrites (2/2) (#7531 ) ## Problem - After a shard split of a large existing tenant, child tenants can end up with oversized historic layers indefinitely, if those layers are prevented from being GC'd by branchpoints. This PR follows https://github.com/neondatabase/neon/pull/7531, and adds rewriting of layers that contain a mixture of needed & un-needed contents, in addition to dropping un-needed layers. Closes: https://github.com/neondatabase/neon/issues/7504 ## Summary of changes - Add methods to ImageLayer for reading back existing layers - Extend `compact_shard_ancestors` to rewrite layer files that contain a mixture of keys that we want and keys we do not, if unwanted keys are the majority of those in the file. - Amend initialization code to handle multiple layers with the same LayerName properly - Get rid of renaming bad layer files to `.old` since that's now expected on restarts during rewrites.	2024-05-24 08:33:19 +00:00
MMeent	0e4f182680	Rework PageStream connection state handling: (#7611 ) * Make PS connection startup use async APIs This allows for improved query cancellation when we start connections * Make PS connections have per-shard connection retry state. Previously they shared global backoff state, which is bad for quickly getting all connections started and/or back online. * Make sure we clean up most connection state on failed connections. Previously, we could technically leak some resources that we'd otherwise clean up. Now, the resources are correctly cleaned up. * pagestore_smgr.c now PANICs on unexpected response message types. Unexpected responses are likely a symptom of having a desynchronized view of the connection state. As a desynchronized connection state can cause corruption, we PANIC, as we don't know what data may have been written to buffers: the only solution is to fail fast & hope we didn't write wrong data. * Catch errors in sync pagestream request handling. Previously, if a query was cancelled after a message was sent to the pageserver, but before the data was received, the backend could forget that it sent the synchronous request, and let others deal with the repercussions. This could then lead to incorrect responses, or errors such as "unexpected response from page server with tag 0x68"	2024-05-23 23:26:42 +02:00
Joonas Koivunen	7cf726e36e	refactor(rtc): remove the duplicate IndexLayerMetadata (#7860 ) Once upon a time, we used to have duplicated types for runtime IndexPart and whatever we stored. Because of the serde fixes in #5335 we have no need for duplicated IndexPart type anymore, but the `IndexLayerMetadata` stayed. - remove the type - remove LayerFileMetadata::file_size() in favor of direct field access Split off from #7833. Cc: #3072.	2024-05-23 23:24:31 +03:00
Alex Chi Z	6b3164269c	chore(pageserver): reduce logging related to image layers (#7864 ) * Reduce the logging level for create image layers of metadata keys. (question: is it possible to adjust logging levels at runtime?) * Do a info logging of image layers only after the layer is created. Now there are a lot of cases where we create the image layer writer but then discarding that image layer because it does not contain any key. Therefore, I changed the new image layer logging to trace, and create image layer logging to info. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-23 15:30:43 +00:00
Arpad Müller	75a52ac7fd	Use Timeline::create_image_layer_for_rel_blocks in tiered compaction (#7850 ) Reduces duplication between tiered and legacy compaction by using the `Timeline::create_image_layer_for_rel_blocks` function. This way, we also use vectored get in tiered compaction, so the change has two benefits in one. fixes #7659 --------- Co-authored-by: Alex Chi Z. <iskyzh@gmail.com>	2024-05-23 15:10:24 +00:00
Alex Chi Z	e28e46f20b	fix(pageserver): make wal connstr a connstr (#7846 ) The list timeline API gives something like `"wal_source_connstr":"PgConnectionConfig { host: Domain(\"safekeeper-5.us-east-2.aws.neon.build\"), port: 6500, password: Some(REDACTED-STRING) }"`, which is weird. This pull request makes it somehow like a connection string. This field is not used at least in the neon database, so I assume no one is reading or parsing it. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-23 09:45:29 -04:00
Arpad Müller	d5d15eb6eb	Warn if a blob in an image is larger than 256 MiB (#7852 ) We'd like to get some bits reserved in the length field of image layers for future usage (compression). This PR bases on the assumption that we don't have any blobs that require more than 28 bits (3 bytes + 4 bits) to store the length, but as a preparation, before erroring, we want to first emit warnings as if the assumption is wrong, such warnings are less disruptive than errors. A metric would be even less disruptive (log messages are more slow, if we have a LOT of such large blobs then it would take a lot of time to print them). At the same time, likely such 256 MiB blobs will occupy an entire layer file, as they are larger than our target size. For layer files we already log something, so there shouldn't be a large increase in overhead. Part of #5431	2024-05-23 14:28:05 +02:00
John Spray	a43a1ad1df	pageserver: fix API-driven secondary downloads possibly colliding with background downloads (#7848 ) ## Problem We've seen some strange behaviors when doing lots of migrations involving secondary locations. One of these was where a tenant was apparently stuck in the `Scheduler::running` list, but didn't appear to be making any progress. Another was a shutdown hang (https://github.com/neondatabase/cloud/issues/13576). ## Summary of changes - Fix one issue (probably not the only one) where a tenant in the `pending` list could proceed to `spawn` even if the same tenant already had a running task via `handle_command` (this could have resulted in a weird value of SecondaryProgress) - Add various extra logging: - log before as well as after layer downloads so that it would be obvious if we were stuck in remote storage code (we shouldn't be, it has built in timeouts) - log the number of running + pending jobs from the scheduler every time it wakes up to do a scheduling iteration (~10s) -- this is quite chatty, but not compared with the volume of logs on a busy pageserver. It should give us confidence that the scheduler loop is still alive, and visibility of how many tasks the scheduler thinks are running.	2024-05-23 09:13:55 +01:00
Alex Chi Z	ff560a1113	chore(pageserver): use kebab case for compaction algorithms (#7845 ) Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-22 21:28:47 +00:00
Alex Chi Z	4a278cce7c	chore(pageserver): add force aux file policy switch handler (#7842 ) For existing users, we want to allow doing a force switch for their aux file policy. Part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-22 19:05:26 +00:00
Alex Chi Z	64577cfddc	feat(pageserver): auto-detect previous aux file policy (#7841 ) ## Problem If an existing user already has some aux v1 files, we don't want to switch them to the global tenant-level config. Part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-22 12:41:13 -04:00
Joonas Koivunen	62aac6c8ad	fix(Layer): carry gate until eviction is complete (#7838 ) the gate was accidentally being dropped before the final blocking phase, possibly explaining the resident physical size global problems during deletions. it could have caused more harm as well, but the path is not actively being tested because cplane no longer puts locationconfigs with higher generation number during normal operation which prompted the last wave of fixes. Cc: #7341.	2024-05-22 18:13:45 +03:00
Joonas Koivunen	ce44dfe353	openapi: document timeline ancestor detach (#7650 ) The openapi description with the error descriptions: - 200 is used for "detached or has been detached previously" - 400 is used for "cannot be detached right now" -- it's an odd thing, but good enough - 404 is used for tenant or timeline not found - 409 is used for "can never be detached" (root timeline) - 500 is used for transient errors (basically ill-defined shutdown errors) - 503 is used for busy (other tenant ancestor detach underway, pageserver shutdown) Cc: #6994	2024-05-22 13:55:34 +00:00
Arpad Müller	664f92dc6e	Refactor PageServerHandler::process_query parsing (#7835 ) In the process_query function in page_service.rs there was some redundant duplication. Remove it and create a vector of whitespace separated parts at the start and then use `slice::strip_prefix`. Only use `starts_with` in the places with multiple whitespace separated parameters: here we want to preserve grep/rg ability. Followup of #7815, requested in https://github.com/neondatabase/neon/pull/7815#pullrequestreview-2068835674	2024-05-22 12:43:03 +02:00
Arpad Müller	679e031cf6	Add dummy lsn lease http and page service APIs (#7815 ) We want to introduce a concept of temporary and expiring LSN leases. This adds both a http API as well as one for the page service to obtain temporary LSN leases. This adds a dummy implementation to unblock integration work of this API. A functional implementation of the lease feature is deferred to a later step. Fixes #7808 Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-05-21 23:31:20 +02:00
Alex Chi Z	e3f6a07ca3	chore(pageserver): remove metrics for in-memory ingestion (#7823 ) The metrics was added in https://github.com/neondatabase/neon/pull/7515/ to observe if https://github.com/neondatabase/neon/pull/7467 introduces any perf regressions. The change was deployed on 5/7 and no changes are observed in the metrics. So it's safe to remove the metrics now. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-21 13:33:29 -04:00
Joonas Koivunen	a8a88ba7bc	test(detach_ancestor): ensure L0 compaction in history is ok (#7813 ) detaching a timeline from its ancestor can leave the resulting timeline with more L0 layers than the compaction threshold. most of the time, the detached timeline has made progress, and next time the L0 -> L1 compaction happens near the original branch point and not near the last_record_lsn. add a test to ensure that inheriting the historical L0s does not change fullbackup. additionally: - add `wait_until_completed` to test-only timeline checkpoint and compact HTTP endpoints. with `?wait_until_completed=true` the endpoints will wait until the remote client has completed uploads. - for delta layers, describe L0-ness with the `/layer` endpoint Cc: #6994	2024-05-21 20:08:43 +03:00
Arseny Sher	f2771a99b7	Add metric for pageserver standby horizon. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-05-21 16:21:29 +03:00
Arseny Sher	478cc37a70	Propagate standby apply LSN to pageserver to hold off GC. To avoid pageserver gc'ing data needed by standby, propagate standby apply LSN through standby -> safekeeper -> broker -> pageserver flow and hold off GC for it. Iteration of GC resets the value to remove the horizon when standby goes away -- pushes are assumed to happen at least once between gc iterations. As a safety guard max allowed lag compared to normal GC horizon is hardcoded as 10GB. Add test for the feature. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-05-21 16:21:29 +03:00
John Spray	4ce6e2d2fc	pageserver: fix secondary progress stats when layers are 404 (#7814 ) ## Problem Noticed this issue in staging. When a tenant is under somewhat heavy timeline creation/deletion thrashing, it becomes quite common for secondary downloads to encounter 404s downloading layers. This is tolerated by design, because heatmaps are not guaranteed to be up to date with what layers/timelines actually exist. However, we were not updating the SecondaryProgress structure in this case, so after such a download pass, we would leave a SecondaryProgress state with lower "downloaded" stats than "total" stats. This causes the storage controller to consider this secondary location ineligible for optimization actions such as we do after shard splits. This issue has relatively low impact because a typical tenant will eventually upload a heatmap where we do download all the layers and thereby enable the controller to progress with migrations -- the heavy thrashing of timeline creation/deletion is an artifact of our nightly stress tests. ## Summary of changes - In the layer 404 case, subtract the skipped layer's stats from the totals, so that at the end of this download pass we should still end up in a complete state. - When updating `last_downloaded`, do a sanity check that our progress is complete. In debug builds, assert out if this is not the case. In prod builds, correct the stats and log a warning.	2024-05-21 13:46:04 +01:00
Alex Chi Z	6810d2aa53	feat(pageserver): do not read past image layers for vectored get (#7773 ) ## Problem Part of https://github.com/neondatabase/neon/issues/7462 On metadata keyspace, vectored get will not stop if a key is not found, and will read past the image layer. However, the semantics is different from single get, because if a key does not exist in the image layer, it means that the key does not exist in the past, or have been deleted. This pull request fixed it by recording image layer coverage during the vectored get process and stop when the full keyspace is covered by an image layer. A corresponding test case is added to ensure generating image layer reduces the number of delta layers. This optimization (or bug fix) also applies to rel block keyspaces. If a key is missing, we can know it's missing once the first image layer is reached. Page server will not attempt to read lower layers, which potentially incurs layer downloads + evictions. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-20 14:24:18 -04:00
Alex Chi Z	7701ca45dd	feat(pageserver): generate image layers for sparse keyspace (#7567 ) Part of https://github.com/neondatabase/neon/issues/7462 Sparse keyspace does not generate image layers for now. This pull request adds support for generating image layers for sparse keyspace. ## Summary of changes * Use the scan interface to generate compaction data for sparse keyspace. * Track num of delta layers reads during scan. * Read-trigger compaction: when a scan on the keyspace touches too many delta files, generate an image layer. There is one hard-coded threshold for now: max delta layers we want to touch for a scan. * L0 compaction does not need to compute holes for metadata keyspace. Known issue: the scan interface currently reads past the image layer, which causes `delta_layer_accessed` to keep increasing even if image layers are generated. The pull request to fix that will be separate, and orthogonal to this one. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-20 16:08:45 +00:00
John Spray	291fcb9e4f	pageserver: use the heatmap upload interval to set the secondary download interval (#7793 ) ## Problem The heatmap upload period is configurable, but secondary mode downloads were using a fixed download period. Closes: #6200 ## Summary of changes - Use the upload period in the heatmap to adjust the download period. In practice, this will reduce the frequency of downloads from its current 60 second period to what heatmaps use, which is 5-10m depending on environment. This is an improvement rather than being optimal: we could be smarter about periods, and schedule downloads to occur around the time we expect the next upload, rather than just using the same period, but that's something we can address in future if it comes up.	2024-05-20 09:25:25 +01:00
Alex Chi Z	e1a9669d05	feat(pagebench): add aux file bench (#7746 ) part of https://github.com/neondatabase/neon/issues/7462 ## Summary of changes This pull request adds two APIs to the pageserver management API: list_aux_files and ingest_aux_files. The aux file pagebench is intended to be used on an empty timeline because the data do not go through the safekeeper. LSNs are advanced by 8 for each ingestion, to avoid invariant checks inside the pageserver. For now, I only care about space amplification / read amplification, so the bench is designed in a very simple way: ingest 10000 files, and I will manually dump the layer map to analyze. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-17 20:04:02 +00:00
Alex Chi Z	aaf60819fa	feat(pageserver): persist aux file policy in index part (#7668 ) Part of https://github.com/neondatabase/neon/issues/7462 ## Summary of changes Tenant config is not persisted unless it's attached on the storage controller. In this pull request, we persist the aux file policy flag in the `index_part.json`. Admins can set `switch_aux_file_policy` in the storage controller or using the page server API. Upon the first aux file gets written, the write path will compare the aux file policy target with the current policy. If it is switch-able, we will do the switch. Otherwise, the original policy will be used. The test cases show what the admins can do / cannot do. The `last_aux_file_policy` is stored in `IndexPart`. Updates to the persisted policy are done via `schedule_index_upload_for_aux_file_policy_update`. On the write path, the writer will update the field. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-05-17 19:22:49 +00:00
John Spray	c84656a53e	pageserver: implement auto-splitting (#7681 ) ## Problem Currently tenants are only split into multiple shards if a human being calls the API to do it. Issue: #7388 ## Summary of changes - Add a pageserver API for returning the top tenants by size - Add a step to the controller's background loop where if there is no reconciliation or optimization to be done, it looks for things to split. - Add a test that runs pgbench on many tenants concurrently, and checks that splitting happens as expected as tenants grow, without interrupting the client I/O. This PR is quite basic: there is a tasklist in https://github.com/neondatabase/neon/issues/7388 for further work. This PR is meant to be safe (off by default), and sufficient to enable our staging environment to run lots of sharded tenants without a human having to set them up.	2024-05-17 16:01:24 +00:00
John Spray	a8e6d259cb	pageserver: fixes for layer path changes (#7786 ) ## Problem - When a layer with legacy local path format is evicted and then re-downloaded, a panic happened because the path downloaded by remote storage didn't match the path stored in Layer. - While investigating, I also realized that secondary locations would have a similar issue with evictions. Closes: #7783 ## Summary of changes - Make remote timeline client take local paths as an input: it should not have its own ideas about local paths, instead it just uses the layer path that the Layer has. - Make secondary state store an explicit local path, populated on scan of local disk at startup. This provides the same behavior as for Layer, that our local_layer_path is a _default_, but the layer path can actually be anything (e.g. an old style one). - Add tests for both cases.	2024-05-17 13:24:03 +01:00
Joonas Koivunen	c1390bfc3b	chore: update defaults for timeline_detach_ancestor (#7779 ) by having 100 copy operations in flight we climb up to 2500 requests per min or 41/s. This is still probably less than is allowed, but fast enough for our purposes.	2024-05-17 12:25:01 +02:00
Arpad Müller	4b8809b280	Tiered compaction: improvements to the windows (#7787 ) Tiered compaction employs two sliding windows over the keyspace: `KeyspaceWindow` for the image layer generation and `Window` for the delta layer generation. Do some fixes to both windows: * The distinction between the two windows is not very clear. Do the absolute minimum to mention where they are used in the rustdoc description of the struct. Maybe we should rename them (say `WindowForImage` and `WindowForDelta`) or merge them into one window implementation. * Require the keys to strictly increase. The `accum_key_values` already combines the key, so there is no logic needed in `Window::feed` for the same key repeating. This is a follow-up to address the request in https://github.com/neondatabase/neon/pull/7671#pullrequestreview-2051995541 * In `choose_next_delta`, we claimed in the comment to use 1.25 as the factor but it was 1.66 instead. Fix this discrepancy by using `*5/4` as the two operations.	2024-05-16 22:25:19 +02:00
Arpad Müller	ec069dc45e	tiered compaction: introduce PAGE_SZ constant and use it (#7785 ) pointed out by @problame : we use the literal 8192 instead of a properly defined constant. replace the literal by a PAGE_SZ constant.	2024-05-16 16:48:49 +02:00
John Spray	03c6039707	pageserver: refine tenant_id->shard lookup (#7762 ) ## Problem This is tech debt from when shard splitting was implemented, to handle more nicely the edge case of a client reconnect at the moment of the split. During shard splits, there were edge cases where we could incorrectly return NotFound to a getpage@lsn request, prompting an unwanted reconnect/backoff from the client. It is already the case that parent shards during splits are marked InProgress before child shards are created, so `resolve_attached_shard` will not match on them, thereby implicitly preferring child shards (good). However, we were not doing any elegant handling of InProgress in general: `get_active_tenant_with_timeout` was previously mostly dead code: it was inspecting the slot found by `resolve_attached_shard` and maybe waiting for InProgress, but that path is never taken because since `ef7c9c2ccc` the resolve function only ever returns attached slots. Closes: https://github.com/neondatabase/neon/issues/7044 ## Summary of changes - Change return value of `resolve_attached_shard` to distinguish between true NotFound case, and the case where we skipped slots that were InProgress. - Rework `get_active_tenant_with_timeout` to loop over calling resolve_attached_shard, waiting if it sees an InProgress result. The resulting behavior during a shard split is: - If we look up a shard early in split when parent is InProgress but children aren't created yet, we'll wait for the parent to be shut down. This corresponds to the part of the split where we wait for LSNs to catch up: so a small delay to the request, but a clean enough handling. - If we look up a shard while child shards are already present, we will match on those shards rather than the parent, as intended.	2024-05-16 08:26:34 +00:00
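The lookup behavior described in the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pageserver's implementation: the names `Resolved`, `SlotState`, `resolve_once` and `resolve_with_wait` are hypothetical, shards are reduced to a `u8` number, and "waiting" is modeled as moving to the next snapshot of the slot table.

```rust
/// Hypothetical lookup result: distinguishes a true miss from a lookup
/// that skipped slots which were still in progress.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Resolved {
    Found(u8),
    NotFound,
    InProgress,
}

/// Hypothetical slot state, reduced to the two cases the commit discusses.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SlotState {
    Attached,
    InProgress,
}

/// One lookup pass: return the first attached shard; otherwise report
/// whether any in-progress slot was skipped along the way.
fn resolve_once(slots: &[(u8, SlotState)]) -> Resolved {
    let mut skipped_in_progress = false;
    for (shard_number, state) in slots {
        match state {
            SlotState::Attached => return Resolved::Found(*shard_number),
            SlotState::InProgress => skipped_in_progress = true,
        }
    }
    if skipped_in_progress {
        Resolved::InProgress
    } else {
        Resolved::NotFound
    }
}

/// Retry loop: each element of `snapshots` stands for the slot table as seen
/// after one wait. Stops as soon as the result is no longer `InProgress`.
fn resolve_with_wait(snapshots: &[Vec<(u8, SlotState)>]) -> Resolved {
    let mut last = Resolved::NotFound;
    for slots in snapshots {
        last = resolve_once(slots);
        if last != Resolved::InProgress {
            return last;
        }
    }
    last
}
```

In this sketch, a parent slot that is in progress with no children yet yields `InProgress`, so the caller waits; once a child slot is attached, the next pass matches on the child rather than the parent.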
Alex Chi Z	4b97683338	feat(pageserver): use fnv hash for aux file encoding (#7742 ) FNV hash is simple, portable, and stable. This pull request vendors the FNV hash implementation from servo and modified it to use the u128 variant. replaces https://github.com/neondatabase/neon/pull/7644 ref https://github.com/neondatabase/neon/issues/7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-15 13:17:57 -04:00
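For readers unfamiliar with FNV, a minimal sketch of a 128-bit FNV hash is shown below. It assumes the FNV-1a ordering (XOR, then multiply) used by servo's `fnv` crate and the standard published 128-bit FNV parameters; the function name `fnv1a_128` is hypothetical, and the code vendored in the commit above may differ in detail.

```rust
/// Standard FNV 128-bit offset basis.
const FNV128_OFFSET_BASIS: u128 = 0x6c62272e07bb014262b821756295c58d;
/// Standard FNV 128-bit prime: 2^88 + 2^8 + 0x3b.
const FNV128_PRIME: u128 = (1u128 << 88) + 0x13b;

/// Hypothetical sketch of FNV-1a over a byte slice, producing a u128.
fn fnv1a_128(bytes: &[u8]) -> u128 {
    let mut hash = FNV128_OFFSET_BASIS;
    for &byte in bytes {
        // FNV-1a: XOR the byte in first, then multiply by the prime (mod 2^128).
        hash ^= u128::from(byte);
        hash = hash.wrapping_mul(FNV128_PRIME);
    }
    hash
}
```

The result depends only on the input bytes and fixed constants, which is what makes this kind of hash stable across platforms and releases.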
Jure Bajic	affc18f912	Add performance regress `test_ondemand_download_churn.py` (#7242 ) Add performance regress test for on-demand download throughput. Closes https://github.com/neondatabase/neon/issues/7146 Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-05-15 18:41:12 +02:00
Christian Schwarz	c3dd646ab3	chore!: always use async walredo, warn if sync is configured (#7754 ) refs https://github.com/neondatabase/neon/issues/7753 This PR is step (1) of removing sync walredo from Pageserver. Changes: * Remove the sync impl * If sync is configured, warn! and use async instead * Remove the metric that exposes `kind` * Remove the tenant status API that exposes `kind` Future Work ----------- After we've released this change to prod and are sure we won't roll back, we will 1. update the prod Ansible to remove the config flag from the prod pageserver.toml. 2. remove the remaining `kind` code in pageserver These two changes need no release in between. See https://github.com/neondatabase/neon/issues/7753 for details.	2024-05-15 15:04:52 +02:00
John Spray	f342b87f30	pageserver: remove Option<> around remote storage, clean up metadata file refs (#7752 ) ## Problem This is historical baggage from when the pageserver could be run with local disk only: we had a bunch of places where we had to treat remote storage as optional. Closes: https://github.com/neondatabase/neon/issues/6890 ## Changes - Remove Option<> around remote storage (in https://github.com/neondatabase/neon/pull/7722 we made remote storage clearly mandatory) - Remove code for deleting old metadata files: they're all gone now. - Remove other references to metadata files when loading directories, as none exist. I checked last 14 days of logs for "found legacy metadata", there are no instances.	2024-05-15 12:05:24 +00:00

2413 Commits