## Problem

The S3 scrubber currently lives at https://github.com/neondatabase/s3-scrubber. We don't have tests that use it, and it has copies of some data structures that can get stale.

## Summary of changes

- Import the s3-scrubber as `s3_scrubber/`
- Replace `copied_definitions/` in the scrubber with direct access to the `utils` and `pageserver` crates
- Modify visibility of a few definitions in `pageserver` to allow the scrubber to use them
- Update scrubber code for recent changes to `IndexPart`
- Update `KNOWN_VERSIONS` for `IndexPart` and move the definition into `index.rs` so that it is easier to keep up to date

As a future refinement, it would be good to pull the remote persistence types (like `IndexPart`) out of `pageserver` into a separate library, so that the scrubber doesn't have to link against the whole pageserver and so that it's clearer which types need to be public.

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
# Neon S3 scrubber
This tool directly accesses the S3 buckets used by the Neon pageserver
and safekeeper, and does housekeeping such as cleaning up objects for tenants & timelines that no longer exist.
## Usage

### Generic Parameters

#### S3
Run `aws sso login --profile dev` to get SSO access to the bucket to clean, and find the `SSO_ACCOUNT_ID` for your profile (`cat ~/.aws/config` may help).
- `SSO_ACCOUNT_ID`: credentials id to use for accessing S3 buckets
- `REGION`: the region where the bucket is located
- `BUCKET`: bucket name
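Putting these together, a minimal setup sketch (the profile name `dev` comes from the step above; the account id and bucket values are placeholders to substitute):

```bash
# Log in via SSO to get access to the bucket to clean
aws sso login --profile dev

# Look up the sso_account_id for the profile
cat ~/.aws/config

# Export the generic S3 parameters for the scrubber
export SSO_ACCOUNT_ID=<sso_account_id from the config>
export REGION=eu-west-1
export BUCKET=<bucket name>
```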
#### Console API

This section is only relevant if using a command that requires access to Neon's internal control plane.
- `CLOUD_ADMIN_API_URL`: the base URL to use when checking tenants/timelines for existence via the Cloud API, e.g. `https://<admin host>/admin`
- `CLOUD_ADMIN_API_TOKEN`: the token to provide when querying the admin API. Get one on the corresponding console page, e.g. `https://<admin host>/app/settings/api-keys`
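As with the S3 parameters, these can simply be exported in the shell (the host and key are placeholders):

```bash
export CLOUD_ADMIN_API_URL=https://<admin host>/admin
# Token created on the console's API keys page
export CLOUD_ADMIN_API_TOKEN=<api key>
```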
### Commands

#### `tidy`
Iterates over the S3 buckets for storage nodes, checking their contents and removing data that is not present in the console. Node S3 data that is not removed is then further checked for discrepancies and, in some cases, validated.
Unless the global `--delete` argument is provided, this command only dry-runs and logs what it would have deleted.

```
tidy --node-kind=<safekeeper|pageserver> [--depth=<tenant|timeline>] [--skip-validation]
```

- `--node-kind`: whether to inspect the safekeeper or the pageserver bucket prefix
- `--depth`: whether to search only for deletable tenants, or also for deletable timelines within active tenants. Default: `tenant`
- `--skip-validation`: skip additional post-deletion checks. Default: `false`
For the selected S3 path, the tool lists the given bucket for either tenants only or for both tenants and timelines; for every entry found, the console API is queried, and any entity that is deleted or missing in the API is scheduled for deletion from S3.

If validation is enabled, only non-deleted tenants' timelines are checked. For the pageserver, every timeline's `index_part.json` on S3 is also checked for various discrepancies. No files are removed even if there are "extra" S3 files not present in `index_part.json`: due to the way the pageserver updates remote storage, it's better to do such removals manually, stopping the corresponding tenant first.
Command examples:
```bash
env SSO_ACCOUNT_ID=369495373322 REGION=eu-west-1 BUCKET=neon-dev-storage-eu-west-1 CLOUD_ADMIN_API_TOKEN=${NEON_CLOUD_ADMIN_API_STAGING_KEY} CLOUD_ADMIN_API_URL=[url] cargo run --release -- tidy --node-kind=safekeeper

env SSO_ACCOUNT_ID=369495373322 REGION=us-east-2 BUCKET=neon-staging-storage-us-east-2 CLOUD_ADMIN_API_TOKEN=${NEON_CLOUD_ADMIN_API_STAGING_KEY} CLOUD_ADMIN_API_URL=[url] cargo run --release -- tidy --node-kind=pageserver --depth=timeline
```
When the dry-run stats look satisfactory, pass `-- --delete` before the `tidy` command to disable the dry run and run the binary with deletion enabled, as in the example below.
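For instance, a deletion-enabled variant of the safekeeper run above (same values as before):

```bash
env SSO_ACCOUNT_ID=369495373322 REGION=eu-west-1 BUCKET=neon-dev-storage-eu-west-1 \
    CLOUD_ADMIN_API_TOKEN=${NEON_CLOUD_ADMIN_API_STAGING_KEY} CLOUD_ADMIN_API_URL=[url] \
    cargo run --release -- --delete tidy --node-kind=safekeeper
```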
See these lines (and the lines around them) in the logs for the final stats:

- `Finished listing the bucket for tenants`
- `Finished active tenant and timeline validation`
- `Total tenant deletion stats`
- `Total timeline deletion stats`
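If the output was captured to a file, a quick way to pull those summary lines out with some surrounding context (the log file name `tidy.log` is an assumption):

```bash
grep -B2 -A2 -E 'Finished listing the bucket for tenants|Finished active tenant and timeline validation|Total (tenant|timeline) deletion stats' tidy.log
```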
## Current implementation details
- The tool currently has no persistent state: instead, it produces very verbose logs, with every S3 delete request, every tenant/timeline id check, etc. logged. Worse, a panic or an early-erroring task may force the tool to exit without printing the final summary, though all affected ids will still be in the logs. The tool has retries inside it, so it is error-resistant to some extent, and recent runs showed no traces of errors/panics.
- Instead of checking non-deleted tenants' timelines immediately, the tool creates separate tasks (futures) for that, complicating the logic and slowing down the process; this should be fixed and done in one "task".
- The tool uses only publicly available remote resources (S3, console) and does not access the pageserver/safekeeper nodes themselves. Yet its S3 setup should be prepared for running on any pageserver/safekeeper node, using the node's S3 credentials, so node API access logic could be implemented relatively simply on top.
## Cleanup procedure

### Pageserver preparations
If the S3 state is altered manually first, the pageserver's in-memory state will contain stale data about S3, and tenants/timelines may get recreated on S3 (by any layer upload due to compaction, a pageserver restart, etc.). So before proceeding, for tenants/timelines that are already deleted in the console, we must remove them from the pageservers.
First, we need to group pageservers by bucket: `https://<admin host>/admin/pageservers` can be used to list all nodes in an environment, and then `cat /storage/pageserver/data/pageserver.toml` on every node will show the bucket names and regions needed, as sketched below.
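A rough sketch of that step (the admin API response shape and the exact TOML layout are assumptions; adjust to what you actually see):

```bash
# List all pageservers known to the console for this environment
curl -s "https://<admin host>/admin/pageservers" \
    -H "Authorization: Bearer ${NEON_CLOUD_ADMIN_API_STAGING_KEY}" | jq

# On each pageserver node, find the bucket name and region in its config
grep -i bucket /storage/pageserver/data/pageserver.toml
```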
Per bucket, for every related pageserver id, find deleted tenants:

```bash
curl -X POST "https://<admin_host>/admin/check_pageserver/{id}" -H "Accept: application/json" -H "Authorization: Bearer ${NEON_CLOUD_ADMIN_API_STAGING_KEY}" | jq
```
Use `?check_timelines=true` to also find deleted timelines, but that check runs a separate query for every live tenant, so it can be slow and time out for big pageservers; see the example below.
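The same check with timelines included:

```bash
curl -X POST "https://<admin_host>/admin/check_pageserver/{id}?check_timelines=true" \
    -H "Accept: application/json" \
    -H "Authorization: Bearer ${NEON_CLOUD_ADMIN_API_STAGING_KEY}" | jq
```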
Note that some tenants/timelines may be marked as deleted in the console while the console still intends to query the node later to fully remove them: wait for some time to make sure an "extra" tenant/timeline is not going away by itself.
When all ids are collected, manually go to every pageserver and detach/delete the tenant/timeline; a hedged sketch follows. In the future, the cleanup tool may access pageservers directly, but for now it only has access to the console and S3.
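A sketch of that manual step via the pageserver's HTTP management API, assuming it is reachable on the node; the port and endpoint paths below are assumptions to verify against the pageserver's API before use:

```bash
# Detach a tenant that the console already considers deleted
# (endpoint path and port 9898 are assumptions)
curl -X POST "http://localhost:9898/v1/tenant/<tenant_id>/detach"

# Delete a single stale timeline within a live tenant (also an assumption)
curl -X DELETE "http://localhost:9898/v1/tenant/<tenant_id>/timeline/<timeline_id>"
```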