## Problem

The previous garbage cleanup functionality relied on doing a dry run, inspecting logs, and then doing a deletion. This isn't ideal, because what one actually deletes might not be the same as what one saw in the dry run. It is also risky UX to rely on the presence or absence of a single CLI flag to control deletion: ideally the deletion command should be entirely separate from the one that scans the bucket.

Related: https://github.com/neondatabase/neon/issues/5037

## Summary of changes

This is a major rework of the code, resulting in a net decrease of about 600 lines. The old code for removing garbage was built around the idea of doing discovery and purging together: a "delete_batch_producer" sent batches into a deleter. The new code implements the two procedures separately, in functions that use the async streams introduced in https://github.com/neondatabase/neon/pull/5176 to achieve fast concurrent access to S3 while retaining the readability of a single function.

- Add `find-garbage`, which writes out a JSON file of tenants/timelines to purge.
- Add `purge-garbage`, which consumes the garbage JSON file, applies some extra validations, and does the deletions.
- The purge command will refuse to execute if the garbage file indicates that only garbage was found: this guards against classes of bugs where the scrubber might incorrectly deem everything garbage.
- The purge command defaults to deleting only tenants that were found in "deleted" state in the control plane. This guards against the risk that using the wrong console API endpoint could make all tenants appear to be missing.

Outstanding work for a future PR:

- Make whatever changes are needed to adapt to the Console/Control Plane separation.
- Make purge even safer by checking S3 `Modified` times for index_part.json files (not done here, because it will depend on the generation-aware changes for finding index_part.json files).

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Co-authored-by: Shany Pozin <shany@neon.tech>
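The two safety guards described above can be sketched as follows (illustrative Python, not the scrubber's actual Rust; the garbage-list field names `scanned_count`, `entries`, and `reason` are hypothetical):

```python
# Sketch of the purge-time validations: refuse to run when everything scanned
# was deemed garbage, and by default only purge control-plane-deleted items.
# All field names here are hypothetical, for illustration only.

def safe_to_purge(garbage_list: dict, mode: str = "deletedonly") -> list:
    """Return the subset of garbage entries the purge step may delete."""
    entries = garbage_list["entries"]
    # Guard 1: if everything that was scanned came back as garbage, assume a
    # bug (e.g. a wrong control plane endpoint) rather than a bucket of junk.
    if garbage_list["scanned_count"] <= len(entries):
        raise RuntimeError("refusing to purge: all scanned items were garbage")
    # Guard 2: by default, purge only items the control plane positively
    # reported as deleted; "missing" might just mean we asked the wrong API.
    allowed = {"deleted"} if mode == "deletedonly" else {"deleted", "missing"}
    return [e for e in entries if e["reason"] in allowed]
```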
# Neon S3 scrubber

This tool directly accesses the S3 buckets used by the Neon `pageserver` and `safekeeper`, and performs housekeeping such as cleaning up objects for tenants and timelines that no longer exist.

## Usage

### Generic Parameters

#### S3

Run `aws sso login --profile dev` to get SSO access to the bucket you want to clean, and find the `SSO_ACCOUNT_ID` for your profile (`cat ~/.aws/config` may help).

- `SSO_ACCOUNT_ID`: Credentials ID to use for accessing S3 buckets
- `REGION`: Region where the bucket is located
- `BUCKET`: Bucket name

#### Console API

_This section is only relevant when using a command that requires access to Neon's internal control plane._

- `CLOUD_ADMIN_API_URL`: The base URL for checking tenant/timeline existence via the Cloud API, e.g. `https://<admin host>/admin`
- `CLOUD_ADMIN_API_TOKEN`: The token to provide when querying the admin API. Get one on the corresponding console page, e.g. `https://<admin host>/app/settings/api-keys`

### Commands

#### `find-garbage`

Walk an S3 bucket and cross-reference its contents with the Console API to identify data for tenants or timelines that should no longer exist.

- `--node-kind`: whether to inspect the safekeeper or pageserver bucket prefix
- `--depth`: whether to search only for deletable tenants, or also for deletable timelines within active tenants. Default: `tenant`
- `--output-path`: filename to write the garbage list to. Default: `garbage.json`

This command outputs a JSON file describing tenants and timelines to remove, for subsequent processing by the `purge-garbage` subcommand.

**Note that the garbage list format is not stable. The output of `find-garbage` is only intended for use by the exact same version of the tool running `purge-garbage`.**

Example:

`env SSO_ACCOUNT_ID=123456 REGION=eu-west-1 BUCKET=my-dev-bucket CLOUD_ADMIN_API_TOKEN=${NEON_CLOUD_ADMIN_API_STAGING_KEY} CLOUD_ADMIN_API_URL=[url] cargo run --release -- find-garbage --node-kind=pageserver --depth=tenant --output-path=eu-west-1-garbage.json`

#### `purge-garbage`

Consume a garbage list from `find-garbage`, and delete the related objects in the S3 bucket.

- `--input-path`: filename to read the garbage list from. Default: `garbage.json`
- `--mode`: controls whether to purge only garbage that was specifically marked as deleted in the control plane (`deletedonly`), or also to purge tenants/timelines that were not present in the control plane at all (`deletedandmissing`)

This command learns region/bucket details from the garbage file, so it is not necessary to pass them on the command line.

Example:

`env SSO_ACCOUNT_ID=123456 cargo run --release -- purge-garbage --input-path=eu-west-1-garbage.json`

Add the `--delete` argument before `purge-garbage` to enable deletion. It is intentionally omitted from the example above to avoid accidents. Without the `--delete` flag, the purge command will log all the keys that it would have deleted.

#### `scan-metadata`

Walk objects in a pageserver S3 bucket, and report statistics on the contents.

```
env SSO_ACCOUNT_ID=123456 REGION=eu-west-1 BUCKET=my-dev-bucket CLOUD_ADMIN_API_TOKEN=${NEON_CLOUD_ADMIN_API_STAGING_KEY} CLOUD_ADMIN_API_URL=[url] cargo run --release -- scan-metadata

Timelines: 31106
With errors: 3
With warnings: 13942
With garbage: 0
Index versions: 2: 13942, 4: 17162
Timeline size bytes: min 22413312, 1% 52133887, 10% 56459263, 50% 101711871, 90% 191561727, 99% 280887295, max 167535558656
Layer size bytes: min 24576, 1% 36879, 10% 36879, 50% 61471, 90% 44695551, 99% 201457663, max 275324928
Timeline layer count: min 1, 1% 3, 10% 6, 50% 16, 90% 25, 99% 39, max 1053
```
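The min/percentile/max rows in the report above follow a common summary pattern. A minimal sketch of producing such a summary (illustrative Python using a nearest-rank percentile, not the scrubber's actual Rust code):

```python
# Summarize a list of sizes/counts the way the scan-metadata report does:
# min, a few percentiles, and max. Nearest-rank percentile on sorted data.

def percentile_summary(values):
    """Return a dict of min / 1% / 10% / 50% / 90% / 99% / max."""
    s = sorted(values)
    n = len(s)

    def pct(p):
        # Index p% of the way through the sorted data, clamped to the end.
        return s[min(n - 1, p * n // 100)]

    return {"min": s[0], "1%": pct(1), "10%": pct(10), "50%": pct(50),
            "90%": pct(90), "99%": pct(99), "max": s[-1]}
```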

## Cleaning up running pageservers

If S3 state is altered manually first, the pageserver's in-memory state will contain stale data about S3, and tenants/timelines may get recreated on S3 (by any layer upload due to compaction, a pageserver restart, etc.). So before proceeding, tenants/timelines which are already deleted in the console must be removed from the pageservers.

First, group pageservers by bucket: `https://<admin host>/admin/pageservers` can be used to list the nodes of an environment, then `cat /storage/pageserver/data/pageserver.toml` on every node will show the bucket names and regions needed.

Per bucket, for every related pageserver id, find the deleted tenants:

`curl -X POST "https://<admin_host>/admin/check_pageserver/{id}" -H "Accept: application/json" -H "Authorization: Bearer ${NEON_CLOUD_ADMIN_API_STAGING_KEY}" | jq`

Use `?check_timelines=true` to also find deleted timelines, but note that this check runs a separate query for every live tenant, so it could take a long time and time out for big pageservers.

Note that some tenants/timelines could be marked as deleted in the console while the console will still query the node later to fully remove them: wait for some time before concluding that an "extra" tenant/timeline is not going away by itself.

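That waiting step is essentially a poll-with-timeout loop. A generic sketch in Python (illustrative only; `still_present` is a hypothetical callback standing in for whatever existence check you use, such as the curl call above):

```python
import time

def wait_until_gone(still_present, timeout_s=600, interval_s=30):
    """Poll until still_present() returns False, or give up after timeout_s.

    Returns True if the item disappeared on its own (no manual cleanup
    needed), False if it is still there after the timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not still_present():
            return True
        time.sleep(interval_s)
    # One final check after the deadline expires.
    return not still_present()
```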
When all IDs are collected, manually go to every pageserver and detach/delete the tenant/timeline. In the future, the cleanup tool may access pageservers directly, but for now it only has access to the console and S3.