rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 21:20:37 +00:00

Author	SHA1	Message	Date
Christian Schwarz	4825b0fec3	compaction_level0_phase1: bypass PS PageCache for data blocks (#8543 ) part of https://github.com/neondatabase/neon/issues/8184 # Problem We want to bypass PS PageCache for all data block reads, but `compact_level0_phase1` currently uses `ValueRef::load` to load the WAL records from delta layers. Internally, that maps to `FileBlockReader:read_blk` which hits the PageCache [here](`e78341e1c2/pageserver/src/tenant/block_io.rs (L229-L236)`). # Solution This PR adds a mode for `compact_level0_phase1` that uses the `MergeIterator` for reading the `Value`s from the delta layer files. `MergeIterator` is a streaming k-merge that uses vectored blob_io under the hood, which bypasses the PS PageCache for data blocks. Other notable changes: * change the `DiskBtreeReader::into_stream` to buffer the node, instead of holding a `PageCache` `PageReadGuard`. * Without this, we run out of page cache slots in `test_pageserver_compaction_smoke`. * Generally, `PageReadGuard`s aren't supposed to be held across await points, so, this is a general bugfix. # Testing / Validation / Performance `MergeIterator` has not yet been used in production; it's being developed as part of * https://github.com/neondatabase/neon/issues/8002 Therefore, this PR adds a validation mode that compares the existing approach's value iterator with the new approach's stream output, item by item. If they're not identical, we log a warning / fail the unit/regression test. To avoid flooding the logs, we apply a global rate limit of once per 10 seconds. In any case, we use the existing approach's value. Expected performance impact that will be monitored in staging / nightly benchmarks / eventually pre-prod: * with validation: * increased CPU usage * ~doubled VirtualFile read bytes/second metric * no change in disk IO usage because the kernel page cache will likely have the pages buffered on the second read * without validation: * slightly higher DRAM usage because each iterator participating in the k-merge has a dedicated buffer (as opposed to before, where compactions would rely on the PS PageCaceh as a shared evicting buffer) * less disk IO if previously there were repeat PageCache misses (likely case on a busy production Pageserver) * lower CPU usage: PageCache out of the picture, fewer syscalls are made (vectored blob io batches reads) # Rollout The new code is used with validation mode enabled-by-default. This gets us validation everywhere by default, specifically in - Rust unit tests - Python tests - Nightly pagebench (shouldn't really matter) - Staging Before the next release, I'll merge the following aws.git PR that configures prod to continue using the existing behavior: * https://github.com/neondatabase/aws/pull/1663 # Interactions With Other Features This work & rollout should complete before Direct IO is enabled because Direct IO would double the IOPS & latency for each compaction read (#8240). # Future Work The streaming k-merge's memory usage is proportional to the amount of memory per participating layer. But `compact_level0_phase1` still loads all keys into memory for `all_keys_iter`. Thus, it continues to have active memory usage proportional to the number of keys involved in the compaction. Future work should replace `all_keys_iter` with a streaming keys iterator. This PR has a draft in its first commit, which I later reverted because it's not necessary to achieve the goal of this PR / issue #8184.	2024-07-31 14:17:59 +02:00
Christian Schwarz	d95b46f3f3	cleanup(compact_level0_phase1): some commentary and wrapping into block expressions (#8544 ) Byproduct of scouting done for https://github.com/neondatabase/neon/issues/8184 refs https://github.com/neondatabase/neon/issues/8184	2024-07-30 18:13:18 +02:00
Arpad Müller	1c7b06c988	Add metrics for input data considered and taken for compression (#8522 ) If compression is enabled, we currently try compressing each image larger than a specific size and if the compressed version is smaller, we write that one, otherwise we use the uncompressed image. However, this might sometimes be a wasteful process, if there is a substantial amount of images that don't compress well. The compression metrics added in #8420 `pageserver_compression_image_in_bytes_total` and `pageserver_compression_image_out_bytes_total` are well designed for answering the question how space efficient the total compression process is end-to-end, which helps one to decide whether to enable it or not. To answer the question of how much waste there is in terms of trial compression, so CPU time, we add two metrics: * one about the images that have been trial-compressed (considered), and * one about the images where the compressed image has actually been written (chosen). There is different ways of weighting them, like for example one could look at the count, or the compressed data. But the main contributor to compression CPU usage is amount of data processed, so we weight the images by their uncompressed size. In other words, the two metrics are: * `pageserver_compression_image_in_bytes_considered` * `pageserver_compression_image_in_bytes_chosen` Part of #5431	2024-07-30 09:59:15 +02:00
Christian Schwarz	4be58522fb	l0_flush: use mode=direct by default => coverage in automated tests (#8534 ) Testing in staging and pre-prod has been [going well](https://github.com/neondatabase/neon/issues/7418#issuecomment-2255474917). This PR enables mode=direct by default, thereby providing additional coverage in the automated tests: - Rust tests - Integration tests - Nightly pagebench (likely irrelevant because it's read-only) Production deployments continue to use `mode=page-cache` for the time being: https://github.com/neondatabase/aws/pull/1655 refs https://github.com/neondatabase/neon/issues/7418	2024-07-29 16:49:22 +02:00
Christian Schwarz	d09dad0ea2	pageserver: fail if `id` is present in pageserver.toml (#8489 ) Overall plan: https://www.notion.so/neondatabase/Rollout-Plan-simplified-pageserver-initialization-f935ae02b225444e8a41130b7d34e4ea?pvs=4 --- `identity.toml` is the authoritative place for `id` as of https://github.com/neondatabase/neon/pull/7766 refs https://github.com/neondatabase/neon/issues/7736	2024-07-29 15:16:32 +01:00
John Spray	5775662276	pageserver: fix return code from secondary_download_handler (#8508 ) ## Problem The secondary download HTTP API is meant to return 200 if the download is complete, and 202 if it is still in progress. In #8198 the download implementation was changed to drop out with success early if it over-runs a time budget, which resulted in 200 responses for incomplete downloads. This breaks storcon_cli's "tenant-warmup" command, which uses the OK status to indicate download complete. ## Summary of changes - Only return 200 if we get an Ok() _and_ the progress stats indicate the download is complete.	2024-07-29 15:05:30 +01:00
Arpad Müller	859f019185	Adopt list_streaming in tenant deletion (#8504 ) Uses the Stream based `list_streaming` function added by #8457 in tenant deletion, as suggested in https://github.com/neondatabase/neon/pull/7932#issuecomment-2150480180 . We don't have to worry about retries, as the function is wrapped inside an outer retry block. If there is a retryable error either during the listing or during deletion, we just do a fresh start. Also adds `+ Send` bounds as they are required by the `delete_tenant_remote` function.	2024-07-29 12:05:18 +02:00
Arpad Müller	b5e95f68b5	Persist archival information (#8479 ) Persists whether a timeline is archived or not in `index_part.json`. We only return success if the upload has actually worked successfully. Also introduces a new `index_part.json` version number. Fixes #8459 Part of #8088	2024-07-27 02:32:05 +00:00
Alex Chi Z.	dd40b19db4	fix(pageserver): give L0 compaction priorities over image layer creation (#8443 ) close https://github.com/neondatabase/neon/issues/8435 ## Summary of changes If L0 compaction did not include all L0 layers, skip image generation. There are multiple possible solutions to the original issue, i.e., an alternative is to wrap the partial L0 compaction in a loop until it compacts all L0 layers. However, considering that we should weight all tenants equally, the current solution can ensure everyone gets a chance to run compaction, and those who write too much won't get a chance to create image layers. This creates a natural backpressure feedback that they get a slower read due to no image layers are created, slowing down their writes, and eventually compaction could keep up with their writes + generate image layers. Consider deployment, we should add an alert on "skipping image layer generation", so that we won't run into the case that image layers are not generated => incidents again. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-07-26 18:09:55 +00:00
Christian Schwarz	68241f5a3e	raise wait_lsn timeout from 60s to 300s (#8529 ) Problem ------- wait_lsn timeouts result in a user-facing errors like ``` $ /tmp/neon/pg_install/v16/bin/pgbench -s3424 -i -I dtGvp user=neondb_owner dbname=neondb host=ep-tiny-wave-w23owa37.eastus2.azure.neon.build sslmode=require options='-cstatement_timeout=0 ' dropping old tables... NOTICE: table "pgbench_accounts" does not exist, skipping NOTICE: table "pgbench_branches" does not exist, skipping NOTICE: table "pgbench_history" does not exist, skipping NOTICE: table "pgbench_tellers" does not exist, skipping creating tables... generating data (server-side)... vacuuming... pgbench: error: query failed: ERROR: [NEON_SMGR] [shard 0] could not read block 214338 in rel 1663/16389/16839.0 from page server at lsn C/E1C12828 DETAIL: page server returned error: LSN timeout: Timed out while waiting for WAL record at LSN C/E1418528 to arrive, last_record_lsn 6/999D9CA8 disk consistent LSN=6/999D9CA8, WalReceiver status: (update 2024-07-25 08:30:07): connecting to node 25, safekeeper candidates (id\|update_time\|commit_lsn): [(21\|08:30:16\|C/E1C129E0), (23\|08:30:16\|C/E1C129E0), (25\|08:30:17\|C/E1C129E0)] CONTEXT: while scanning block 214338 of relation "public.pgbench_accounts" pgbench: detail: Query was: vacuum analyze pgbench_accounts ``` Solution -------- Its better to be slow than to fail the queries. If the app has a deadline, it can use `statement_timeout`. In the long term, we want to eliminate wait_lsn timeout. In the short term (this PR), we bump the wait_lsn timeout to a larger value to reduce the frequency at which these wait_lsn timeouts occur. We will observe SLOs and specifically `pageserver_wait_lsn_seconds_bucket` before we eliminate the timeout completely.	2024-07-26 16:44:57 +01:00
Christian Schwarz	8154e88732	refactor(layer load API): all errors are permanent (#8527 ) I am not aware of a case of "transient" VirtualFile errors as mentioned in https://github.com/neondatabase/neon/pull/5880 Private DM with Joonas discussing this: https://neondb.slack.com/archives/D049K7HJ9JM/p1721836424615799	2024-07-26 15:48:44 +01:00
Vlad Lazar	cdaa2816e7	pageserver: make vectored get the default read path for the pageserver (#8384 ) ## Problem Vectored get is already enabled in all prod regions without validation. The pageserver defaults are out of sync however. ## Summary of changes Update the pageserver defaults to match the prod config. Also means that when running tests locally, people don't have to use the env vars to get the prod config.	2024-07-26 14:19:52 +01:00
Arpad Müller	bb2a3f9b02	Update Rust to 1.80.0 (#8518 ) We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Release notes](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-180-2024-07-25). Prior update was in #8048	2024-07-26 11:17:33 +02:00
John Spray	6711087ddf	remote_storage: expose last_modified in listings (#8497 ) ## Problem The scrubber would like to check the highest mtime in a tenant's objects as a safety check during purges. It recently switched to use GenericRemoteStorage, so we need to expose that in the listing methods. ## Summary of changes - In Listing.keys, return a ListingObject{} including a last_modified field, instead of a RemotePath --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-07-26 10:57:52 +03:00
Arpad Müller	8e02db1ab9	Handle NotInitialized::ShuttingDown error in shard split (#8506 ) There is a race condition between timeline shutdown and the split task. Timeline shutdown first shuts down the upload queue, and only then fires the cancellation token. A parallel running timeline split operation might thus encounter a cancelled upload queue before the cancellation token is fired, and print a noisy error. Fix this by mapping `anyhow::Error{ NotInitialized::ShuttingDown }) to `FlushLayerError::Cancelled` instead of `FlushLayerError::Other(_)`. Fixes #8496	2024-07-26 02:16:10 +02:00
Alex Chi Z.	bea0468f1f	fix(pageserver): allow incomplete history in btm-gc-compaction (#8500 ) This pull request (should) fix the failure of test_gc_feedback. See the explanation in the newly-added test case. Part of https://github.com/neondatabase/neon/issues/8002 Allow incomplete history for the compaction algorithm. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-07-25 12:56:37 -04:00
Vlad Lazar	9c5ad21341	storcon: make heartbeats restart aware (#8222 ) ## Problem Re-attach blocks the pageserver http server from starting up. Hence, it can't reply to heartbeats until that's done. This makes the storage controller mark the node off-line (not good). We worked around this by setting the interval after which nodes are marked offline to 5 minutes. This isn't a long term solution. ## Summary of changes * Introduce a new `NodeAvailability` state: `WarmingUp`. This state models the following time interval: * From receiving the re-attach request until the pageserver replies to the first heartbeat post re-attach * The heartbeat delta generator becomes aware of this state and uses a separate longer interval * Flag `max-warming-up-interval` now models the longer timeout and `max-offline-interval` the shorter one to match the names of the states Closes https://github.com/neondatabase/neon/issues/7552	2024-07-25 14:09:12 +01:00
Christian Schwarz	a1256b2a67	fix: remote timeline client shutdown trips circuit breaker (#8495 ) Before this PR 1.The circuit breaker would trip on CompactionError::Shutdown. That's wrong, we want to ignore those cases. 2. remote timeline client shutdown would not be mapped to CompactionError::Shutdown in all circumstances. We observed this in staging, see https://neondb.slack.com/archives/C033RQ5SPDH/p1721829745384449 This PR fixes (1) with a simple `match` statement, and (2) by switching a bunch of `anyhow` usage over to distinguished errors that ultimately get mapped to `CompactionError::Shutdown`. I removed the implicit `#[from]` conversion from `anyhow::Error` to `CompactionError::Other` to discover all the places that were mapping remote timeline client shutdown to `anyhow::Error`. In my opinion `#[from]` is an antipattern and we should avoid it, especially for `anyhow::Error`. If some callee is going to return anyhow, the very least the caller should to is to acknowledge, through a `map_err(MyError::Other)` that they're conflating different failure reasons.	2024-07-25 09:44:31 +01:00
Christian Schwarz	d57412aaab	followup(#8359 ): pre-initialize circuitbreaker metrics (#8491 )	2024-07-25 10:24:28 +02:00
John Spray	5f4e14d27d	pageserver: fix a compilation error (#8487 ) ## Problem PR that modified compaction raced with PR that modified the GcInfo structure ## Summary of changes Fix it Co-authored-by: Vlad Lazar <vlalazar.vlad@gmail.com>	2024-07-24 16:37:15 +01:00
Vlad Lazar	2723a8156a	pageserver: faster and simpler inmem layer vec read (#8469 ) ## Problem The in-memory layer vectored read was very slow in some conditions (walingest::test_large_rel) test. Upon profiling, I realised that 80% of the time was spent building up the binary heap of reads. This stage isn't actually needed. ## Summary of changes Remove the planning stage as we never took advantage of it in order to merge reads. There should be no functional change from this patch.	2024-07-24 14:23:03 +01:00
John Spray	2ef8e57f86	pageserver: maintain gc_info incrementally (#8427 ) ## Problem Previously, Timeline::gc_info was only updated in a batch operation at the start of GC. That means that timelines didn't generally have accurate information about who their children were before the first GC, or between GC cycles. Knowledge of child branches is important for calculating layer visibility in #8398 ## Summary of changes - Split out part of refresh_gc_info into initialize_gc_info, which is now called early in startup - Include TimelineId in retain_lsns so that we can later add/remove the LSNs for particular children - When timelines are added/removed, update their parent's retain_lsns	2024-07-24 12:33:44 +02:00
John Spray	842c3d8c10	tests: simplify code around unstable `test_basebackup_with_high_slru_count` (#8477 ) ## Problem In `test_basebackup_with_high_slru_count`, the pageserver is sometimes mysteriously hanging on startup, having been started+stopped earlier in the test setup while populating template tenant data. - #7586 We can't see why this is hanging in this particular test. The test does some weird stuff though, like attaching a load of broken tenants and then doing a SIGQUIT kill of a pageserver. ## Summary of changes - Attach tenants normally instead of doing a failpoint dance to attach them as broken - Shut the pageserver down gracefully during init instead of using immediate mode - Remove the "sequential" variant of the unstable test, as this is going away soon anyway - Log before trying to acquire lock file, so that if it hangs we have a clearer sense of if that's really where it's hanging. It seems like it is, but that code does a non-blocking flock so it's surprising.	2024-07-24 11:26:24 +01:00
John Spray	f5db655447	pageserver: simplify LayerAccessStats (#8431 ) ## Problem LayerAccessStats contains a lot of detail that we don't use: short histories of most recent accesses, specifics on what kind of task accessed a layer, etc. This is all stored inside a Mutex, which is locked every time something accesses a layer. ## Summary of changes - Store timestamps at a very low resolution (to the nearest second), sufficient for use on the timescales of eviction. - Pack access time and last residence change time into a single u64 - Use the high bits of the u64 for other flags, including the new layer visibility concept. - Simplify the external-facing model for access stats to just include what we now track. Note that the `HistoryBufferWithDropCounter` is removed here because it is no longer used. I do not dislike this type, we just happen not to use it for anything else at present. Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-07-24 08:17:28 +01:00
Alex Chi Z.	18cf5cfefd	feat(pageserver): support retain_lsn in bottommost gc-compaction (#8328 ) part of https://github.com/neondatabase/neon/issues/8002 The main thing in this pull request is the new `generate_key_retention` function. It decides which deltas to retain and generate images for a given key based on its history + retain_lsn + horizon. On that, we generate a flat single level of delta layers over all deltas included in the compaction. In the future, we can decide whether to split them over the LSN axis as described in the RFC. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-07-24 00:28:43 +01:00
Konstantin Knizhnik	563d73d923	Use smgrexists() instead of access() to enforce uniqueness of generated relfilenumber (#7992 ) ## Problem Postgres is using `access()` function in `GetNewRelFileNumber` to check if assigned relfilenumber is not used for any other relation. This check will not work in Neon, because we do not have all files in local storage. ## Summary of changes Use smgrexists() instead which will check at page server if such relfilenode is used. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-23 18:41:55 +03:00
John Spray	1a4c1eba92	pageserver: add LayerVisibilityHint (#8432 ) ## Problem As described in https://github.com/neondatabase/neon/issues/8398, layer visibility is a new hint that will help us manage disk space more efficiently. ## Summary of changes - Introduce LayerVisibilityHint and store it as part of access stats - Automatically mark a layer visible if it is accessed, or when it is created. The impact on the access stats size will be reversed in https://github.com/neondatabase/neon/pull/8431 This is functionally a no-op change: subsequent PRs will add the logic that sets layers to Covered, and which uses the layer visibility as an input to eviction and heatmap generation. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-07-23 15:37:12 +01:00
John Spray	80c8ceacbc	tests: make `test_scrubber_physical_gc_ancestors` more stable (#8453 ) ## Problem This test sometimes found that ancestors were getting cleaned up before it had done any compaction. Compaction was happening implicitly via Workload. Example: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8298/10032173390/index.html#testresult/fb04786402f80822/retries ## Summary of changes - Set upload=False when writing data after shard split, to avoid doing a checkpoint - Add a checkpoint_period & explicit wait for uploads so that we ensure data lands in S3 without doing a checkpoint	2024-07-23 12:57:57 +01:00
Vlad Lazar	35854928d9	pageserver: use identity file as node id authority and remove init command and config-override flags (#7766 ) Ansible will soon write the node id to `identity.toml` in the work dir for new pageservers. On the pageserver side, we read the node id from the identity file if it is present and use that as the source of truth. If the identity file is missing, cannot be read, or does not deserialise, start-up is aborted. This PR also removes the `--init` mode and the `--config-override` flag from the `pageserver` binary. The neon_local is already not using these flags anymore. Ansible still uses them until the linked change is merged & deployed, so, this PR has to land simultaneously or after the Ansible change due to that. Related Ansible change: https://github.com/neondatabase/aws/pull/1322 Cplane change to remove config-override usages: https://github.com/neondatabase/cloud/pull/13417 Closes: https://github.com/neondatabase/neon/issues/7736 Overall plan: https://www.notion.so/neondatabase/Rollout-Plan-simplified-pageserver-initialization-f935ae02b225444e8a41130b7d34e4ea?pvs=4 Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-07-23 11:41:12 +01:00
Arpad Müller	f17fe75169	Mark body of archival_config endpoint as required (#8458 ) As pointed out in https://github.com/neondatabase/neon/pull/8414#discussion_r1684881525 Part of https://github.com/neondatabase/neon/issues/8088	2024-07-22 17:39:18 +01:00
Christian Schwarz	e8523014d4	refactor(pageserver) remove `task_mgr` for most global tasks (#8449 ) ## Motivation & Context We want to move away from `task_mgr` towards explicit tracking of child tasks. This PR is extracted from https://github.com/neondatabase/neon/pull/8339 where I refactor `PageRequestHandler` to not depend on task_mgr anymore. ## Changes This PR refactors all global tasks but `PageRequestHandler` to use some combination of `JoinHandle`/`JoinSet` + `CancellationToken`. The `task_mgr::spawn(.., shutdown_process_on_error)` functionality is preserved through the new `exit_on_panic_or_error` wrapper. Some global tasks were not using it before, but as of this PR, they are. The rationale is that all global tasks are relevant for correct operation of the overall Neon system in one way or another. ## Future Work After #8339, we can make `task_mgr::spawn` require a `TenantId` instead of an `Option<TenantId>` which concludes this step of cleanup work and will help discourage future usage of task_mgr for global tasks.	2024-07-22 17:25:06 +02:00
Yuchen Liang	595c450036	fix(scrubber): more robust metadata consistency checks (#8344 ) Part of #8128. ## Problem Scrubber uses `scan_metadata` command to flag metadata inconsistencies. To trust it at scale, we need to make sure the errors we emit is a reflection of real scenario. One check performed in the scrubber is to see whether layers listed in the latest `index_part.json` is present in object listing. Currently, the scrubber does not robustly handle the case where objects are uploaded/deleted during the scan. ## Summary of changes Condition for success: An object in the index is (1) in the object listing we acquire from S3 or (2) found in a HeadObject request (new object). - Add in the `HeadObject` requests for the layers missing from the object listing. - Keep the order of first getting the object listing and then downloading the layers. - Update check to only consider shards with highest shard count. - Skip analyzing a timeline if `deleted_at` tombstone is marked in `index_part.json`. - Add new test to see if scrubber actually detect the metadata inconsistency. _Misc_ - A timeline with no ancestor should always have some layers. - Removed experimental histograms _Caveat_ - Ancestor layer is not cleaned until #8308 is implemented. If ancestor layers reference non-existing layers in the index, the scrubber will emit false positives. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-22 14:53:33 +01:00
John Spray	98af1e365b	pageserver: remove absolute-order disk usage eviction (#8454 ) ## Problem Deployed pageserver configurations are all like this: ``` disk_usage_based_eviction: max_usage_pct: 85 min_avail_bytes: 0 period: "10s" eviction_order: type: "RelativeAccessed" args: highest_layer_count_loses_first: true ``` But we're maintaining this optional absolute order eviction, with test cases etc. ## Summary of changes - Remove absolute order eviction. Make the default eviction policy the same as how we really deploy pageservers.	2024-07-22 13:15:55 +01:00
Arpad Müller	4e547e6274	Use DefaultCredentialsChain AWS authentication in remote_storage (#8440 ) PR #8299 has switched the storage scrubber to use `DefaultCredentialsChain`. Now we do this for `remote_storage`, as it allows us to use `remote_storage` from inside kubernetes. Most of the diff is due to `GenericRemoteStorage::from_config` becoming `async fn`.	2024-07-19 21:19:30 +02:00
Arpad Müller	3d582b212a	Add archival_config endpoint to pageserver (#8414 ) This adds an archival_config endpoint to the pageserver. Currently it has no effect, and always "works", but later the intent is that it will make a timeline archived/unarchived. - [x] add yml spec - [x] add endpoint handler Part of https://github.com/neondatabase/neon/issues/8088	2024-07-19 21:01:59 +02:00
Shinya Kato	3fbb84d741	Fix openapi specification (#8273 ) ## Problem There are some swagger errors in `pageserver/src/http/openapi_spec.yml` ``` Error 431 15000 Object includes not allowed fields Error 569 3100401 should always have a 'required' Error 569 15000 Object includes not allowed fields Error 1111 10037 properties members must be schemas ``` ## Summary of changes Fixed the above errors.	2024-07-19 18:20:57 +00:00
John Spray	44781518d0	storage scrubber: GC ancestor shard layers (#8196 ) ## Problem After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space. ## Summary of changes - Extend the physical GC command with `--mode=full`, which includes cleaning up unreferenced ancestor shard layers - Add test `test_scrubber_physical_gc_ancestors` - Remove colored log output: in testing this is irritating ANSI code spam in logs, and in interactive use doesn't add much. - Refactor storage controller API client code out of storcon_client into a `storage_controller/client` crate - During physical GC of ancestors, call into the storage controller to check that the latest shards seen in S3 reflect the latest state of the tenant, and there is no shard split in progress.	2024-07-19 19:07:59 +03:00
Christian Schwarz	16071e57c6	pageserver: remove obsolete cached_metric_collection_interval (#8370 ) We're removing the usage of this long-meaningless config field in https://github.com/neondatabase/aws/pull/1599 Once that PR has been deployed to staging and prod, we can merge this PR.	2024-07-19 17:01:02 +01:00
Arpad Müller	c96e8012ce	Enable zstd in tests (#8368 ) Successor of #8288 , just enable zstd in tests. Also adds a test that creates easily compressable data. Part of #5431 --------- Co-authored-by: John Spray <john@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-07-18 19:09:57 +01:00
Alex Chi Z.	a4434cf1c0	pageserver: integrate k-merge with bottom-most compaction (#8415 ) Use the k-merge iterator in the compaction process to reduce memory footprint. part of https://github.com/neondatabase/neon/issues/8002 ## Summary of changes * refactor the bottom-most compaction code to use k-merge iterator * add Send bound on some structs as it is used across the await points --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-07-18 17:16:44 +01:00
John Spray	0c236fa465	pageserver: layer count & size metrics (#8410 ) ## Problem We lack insight into: - How much of a tenant's physical size is image vs. delta layers - Average sizes of image vs. delta layers - Total layer counts per timeline, indicating size of index_part object As well as general observability love, this is motivated by https://github.com/neondatabase/neon/issues/6738, where we need to define some sensible thresholds for storage amplification, and using total physical size may not work well (if someone does a lot of DROPs then it's legitimate for the physical-synthetic ratio to be huge), but the ratio between image layer size and delta layer size may be a better indicator of whether we're generating unreasonable quantities of image layers. ## Summary of changes - Add pageserver_layer_bytes and pageserver_layer_count metrics, labelled by timeline and `kind` (delta or image) - Add & subtract these with LayerInner's lifetime. I'm intentionally avoiding using a generic metric RAII guard object, to avoid bloating LayerInner: it already has all the information it needs to update metric on new+drop.	2024-07-17 21:55:20 +01:00
Alex Chi Z.	839a5724a4	test(pageserver): more k-merge tests on duplicated keys (#8404 ) Existing tenants and some selection of layers might produce duplicated keys. Add tests to ensure the k-merge iterator handles it correctly. We also enforced ordering of the k-merge iterator to put images before deltas. part of https://github.com/neondatabase/neon/issues/8002 --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-07-17 16:22:38 +01:00
John Spray	4a90423292	pageserver: reduce size of delta layer ValueRef (#8401 ) ## Problem ValueRef is an unnecessarily large structure, because it carries a cursor. L0 compaction currently instantiates gigabytes of these under some circumstances. ## Summary of changes - Carry a ref to the parent layer instead of a cursor, and construct a cursor on demand. This reduces RSS high watermark during L0 compaction by about 20%.	2024-07-16 21:36:17 +01:00
John Spray	f4f0869dc8	pageserver: exclude un-read layers from short residence statistic (#8396 ) ## Problem The `evictions_with_low_residence_duration` is used as an indicator of cache thrashing. However, there are situations where it is quite legitimate to only have a short residence during compaction, where a delta is downloaded, used to generate an image layer, and then discarded. This can lead to false positive alerts. ## Summary of changes - Only track low residence duration for layers that have been accessed at least once (compaction doesn't count as an access). This will give us a metric that indicates thrashing on layers that the _user_ is using, rather than those we're downloading for housekeeping purposes. Once we add "layer visibility" as an explicit property of layers, this can also be used as a cleaner condition (residence of non-visible layers should never be alertable)	2024-07-16 20:55:29 +01:00
Alex Chi Z.	0950866fa8	fix(pageserver): limit num of delta layers for l0 compaction (#8391 ) ## Problem close https://github.com/neondatabase/neon/issues/8389 ## Summary of changes A quick mitigation for tenants with fast writes. We compact at most 60 delta layers at a time, expecting a memory footprint of 15GB. We will pick the oldest 60 L0 layers. This should be a relatively safe change so no test is added. Question is whether to make this parameter configurable via tenant config. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: John Spray <john@neon.tech>	2024-07-16 20:43:24 +01:00
John Spray	a40b402957	pageserver: clean up GcCutoffs names (#8379 ) - `horizon` is a confusing term, it's not at all obvious that this means space-based retention limit, rather than the total GC history limit. Rename to `GcCutoffs::space`. - `pitr` is less confusing, but still an unecessary level of indirection from what we really mean: a time-based condition. The fact that we use that that time-history for Point In Time Recovery doesn't mean we have to refer to time as "pitr" everywhere. Rename to `GcCutoffs::time`.	2024-07-16 13:54:54 +00:00
John Spray	83e07c1a5b	pageserver: un-Arc Timeline::layers (#8386 ) ## Problem This structure was in an Arc<> unnecessarily, making it harder to reason about its lifetime (i.e. it was superficially possible for LayerManager to outlive timeline, even though no code used it that way) ## Summary of changes - Remove the Arc<>	2024-07-16 08:52:49 +01:00
Arpad Müller	ee263e6a62	Allow the new clippy::doc_lazy_continuation lint (#8388 ) The `doc_lazy_continuation` lint of clippy is still unknown on latest rust stable. Fixes fall-out from #8151.	2024-07-16 00:16:18 +00:00
Joonas Koivunen	730db859c7	feat(timeline_detach_ancestor): success idempotency (#8354 ) Right now timeline detach ancestor reports an error (409, "no ancestor") on a new attempt after successful completion. This makes it troublesome for storage controller retries. Fix it to respond with `200 OK` as if the operation had just completed quickly. Additionally, the returned timeline identifiers in the 200 OK response are now ordered so that responses between different nodes for error comparison are done by the storage controller added in #8353. Design-wise, this PR introduces a new strategy for accessing the latest uploaded IndexPart: `RemoteTimelineClient::initialized_upload_queue(&self) -> Result<UploadQueueAccessor<'_>, NotInitialized>`. It should be a more scalable way to query the latest uploaded `IndexPart` than to add a query method for each question directly on `RemoteTimelineClient`. GC blocking will need to be introduced to make the operation fully idempotent. However, it is idempotent for the cases demonstrated by tests. Cc: #6994	2024-07-15 17:47:53 +00:00
John Spray	04448ac323	pageserver: use PITR GC cutoffs as authoritative (#8365 ) ## Problem Pageserver GC uses a size-based condition (GC "horizon" in addition to time-based "PITR"). Eventually we plan to retire the size-based condition: https://github.com/neondatabase/neon/issues/6374 Currently, we always apply the more conservative of the two, meaning that tenants always retain at least 64MB of history (default horizon), even after a very long time has passed. This is particularly acute in cases where someone has dropped tables/databases, and then leaves a database idle: the horizon can prevent GCing very large quantities of historical data (we already account for this in synthetic size by ignoring gc horizon). We're not entirely removing GC horizon right now because we don't want to 100% rely on standby_horizon for robustness of physical replication, but we can tweak our logic to avoid retaining that 64MB LSN length indefinitely. ## Summary of changes - Rework `Timeline::find_gc_cutoffs`, with new logic: - If there is no PITR set, then use `DEFAULT_PITR_INTERVAL` (1 week) to calculate a time threshold. Retain either the horizon or up to that thresholds, whichever requires less data. - When there is a PITR set, and we have unambiguously resolved the timestamp to an LSN, then ignore the GC horizon entirely. For typical PITRs (1 day, 1 week), this will still easily retain enough data to avoid stressing read only replicas. The key property we end up with, whether a PITR is set or not, is that after enough time has passed, our GC cutoff on an idle timeline will catch up with the last_record_lsn. Using `DEFAULT_PITR_INTERVAL` is a bit of an arbitrary hack, but this feels like it isn't really worth the noise of exposing in TenantConfig. We could just make it a different named constant though. The end-end state will be that there is no gc_horizon at all, and that tenants with pitr_interval=0 would truly retain no history, so this constant would go away.	2024-07-15 17:43:05 +01:00

1 2 3 4 5 ...

2313 Commits