rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-16 09:52:54 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	7cf726e36e	refactor(rtc): remove the duplicate IndexLayerMetadata (#7860 ) Once upon a time, we used to have duplicated types for runtime IndexPart and whatever we stored. Because of the serde fixes in #5335 we have no need for duplicated IndexPart type anymore, but the `IndexLayerMetadata` stayed. - remove the type - remove LayerFileMetadata::file_size() in favor of direct field access Split off from #7833. Cc: #3072.	2024-05-23 23:24:31 +03:00
Arpad Müller	4c5afb7b10	Remove SSO_ACCOUNT_ID from scrubber docs and BucketConfig (#7774 ) As of #6202 we support `AWS_PROFILE` as well, which is more convenient. Change the docs to using it instead of `SSO_ACCOUNT_ID`. Also, remove `SSO_ACCOUNT_ID` from BucketConfig as it is confusing to the code's reader: it's not the "main" way of setting up authentication for the scrubber any more. It is a breaking change for the on-disk format as we persist `sso_account_id` to disk, but it was quite inconsistent with the other methods which are not persistet. Also, I don't think we want to support the case where one version writes the json and another version reads it. Related: #7667	2024-05-16 19:35:13 +02:00
Joonas Koivunen	6351313ae9	feat: allow detaching from ancestor for timelines without writes (#7639 ) The first implementation #7456 did not include `index_part.json` changes in an attempt to keep amount of changes down. Tracks the historic reparentings and earlier detach in `index_part.json`. - `index_part.json` receives a new field `lineage: Lineage` - `Lineage` is queried through RemoteTimelineClient during basebackup, creating `PREV LSN: none` for the invalid prev record lsn just as it would had been created for a newly created timeline - as `struct IndexPart` grew, it is now boxed in places Cc: #6994	2024-05-10 22:30:05 +03:00
Christian Schwarz	ab10523cc1	remote_storage: AWS_PROFILE with endpoint overrides in ~/.aws/config (updates AWS SDKs) (#7664 ) Before this PR, using the AWS SDK profile feature for running against minio didn't work because * our SDK versions were too old and didn't include https://github.com/awslabs/aws-sdk-rust/issues/1060 and * we didn't massage the s3 client config builder correctly. This PR * udpates all the AWS SDKs we use to, respectively, the latest version I could find on crates.io (Is there a better process?) * changes the way remote_storage constructs the S3 client, and * documents how to run the test suite against real S3 & local minio. Regarding the changes to `remote_storage`: if one reads the SDK docs, it is clear that the recommended way is to use `aws_config::from_env`, then customize. What we were doing instead is to use the `aws_sdk_s3` builder directly. To get the `local-minio` in the added docs working, I needed to update both the SDKs and make the changes to the `remote_storage`. See the commit history in this PR for details. Refs: * byproduct: https://github.com/smithy-lang/smithy-rs/pull/3633 * follow-up on deprecation: https://github.com/neondatabase/neon/issues/7665 * follow-up for scrubber S3 setup: https://github.com/neondatabase/neon/issues/7667	2024-05-09 10:58:38 +02:00
John Spray	ca154d9cd8	pageserver: local layer path followups (#7640 ) - Rename "filename" types which no longer map directly to a filename (LayerFileName -> LayerName) - Add a -v1- part to local layer paths to smooth the path to future updates (we anticipate a -v2- that uses checksums later) - Rename methods that refer to the string-ized version of a LayerName to no longer be called "filename" - Refactor reconcile() function to use a LocalLayerFileMetadata type that includes the local path, rather than carrying local path separately in a tuple and unwrap()'ing it later.	2024-05-08 16:50:21 +00:00
Arseny Sher	3a2f10712a	Add more context to s3 listing error.	2024-04-30 18:19:52 +03:00
Arseny Sher	4ac4b21598	Add retries to cloud_admin client.	2024-04-30 18:19:52 +03:00
Arseny Sher	9f792f9c0b	Recheck tenant_id in find_timeline_branch. As it turns out we have at least one case of the same timeline_id in different projects.	2024-04-30 18:19:52 +03:00
Arseny Sher	7434674d86	Decrease CONSOLE_CONCURRENCY. Last run with 128 created too much load on cplane.	2024-04-30 18:19:52 +03:00
Arseny Sher	ea37234ccc	s3_scrubber: revive garbage collection for safekeepers. - pageserver_id in project details is now is optional, fix it - add active_timeline_count guard/stat similar to active_tenant_count - fix safekeeper prefix - count and log deleted keys	2024-04-30 18:19:52 +03:00
Arseny Sher	3da54e6d90	s3_scrubber: implement scan-metadata for safekeepers. It works by listing postgres table with memory dump of safekeepers state. s3 contents for each timeline are checked then against timeline_start_lsn and backup_lsn. If inconsistency is found, before complaining timeline (branch) is checked at control plane; it might have been deleted between the dump take and s3 check.	2024-04-30 18:19:52 +03:00
John Spray	2226acef7c	s3_scrubber: add `tenant-snapshot` (#7444 ) ## Problem Downloading tenant data for analysis/debug with `aws s3 cp` works well for small tenants, but for larger tenants it is unlikely that one ends up with an index that matches layer files, due to the time taken to download. ## Summary of changes - Add a `tenant-snapshot` command to the scrubber, which reads timeline indices and then downloads the layers referenced in the index, even if they were deleted. The result is a snapshot of the tenant's remote storage state that should be usable when imported (#7399 ).	2024-04-29 12:16:00 +00:00
Joonas Koivunen	752bf5a22f	build: clippy disallow futures::pin_mut macro (#7016 ) `std` has had `pin!` macro for some time, there is no need for us to use the older alternatives. Cannot disallow `tokio::pin` because tokio macros use that.	2024-03-05 10:14:37 +00:00
Arpad Müller	a2d0d44b42	Remove unused allow's (#6760 ) These allow's became redundant some time ago so remove them, or address them if addressing is very simple.	2024-02-14 18:16:05 +00:00
John Spray	b3a681d121	s3_scrubber: updates for sharding (#6281 ) This is a lightweight change to keep the scrubber providing sensible output when using sharding. - The timeline count was wrong when using sharding - When checking for tenant existence, we didn't re-use results between different shards in the same tenant Closes: https://github.com/neondatabase/neon/issues/5929	2024-01-08 09:19:10 +00:00
Joonas Koivunen	946c6a0006	scrubber: use adaptive config with retries, check subset of tenants (#6219 ) The tool still needs a lot of work. These are the easiest fix and feature: - use similar adaptive config with s3 as remote_storage, use retries - process only particular tenants Tenants need to be from the correct region, they are not deduplicated, but the feature is useful for re-checking small amount of tenants after a large run.	2024-01-02 15:22:16 +00:00
Arpad Müller	baa1323b4a	Use ProfileFileCredentialsProvider for AWS SDK configuration (#6202 ) Allows usage via `aws sso login --profile=<p>; AWS_PROFILE=<p>`. Now there is no need to manually configure things any more via `SSO_ACCOUNT_ID` and others. Now one can run the tests locally (given Neon employee access to aws): ``` aws sso login --profile dev export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty REMOTE_STORAGE_S3_REGION=eu-central-1 REMOTE_STORAGE_S3_BUCKET=neon-github-public-dev AWS_PROFILE=dev cargo test -p remote_storage -j 1 s3 -- --nocapture ``` Also makes the scrubber use the same region for auth that it does its operations in (not touching the hard coded role name and start_url values here, they are not ideal though).	2023-12-20 22:38:58 +00:00
John Spray	f260f1565e	pageserver: fixes + test updates for sharding (#6186 ) This is a precursor to: - https://github.com/neondatabase/neon/pull/6185 While that PR contains big changes to neon_local and attachment_service, this PR contains a few unrelated standalone changes generated while working on that branch: - Fix restarting a pageserver when it contains multiple shards for the same tenant - When using location_config api to attach a tenant, create its timelines dir - Update test paths where generations were previously optional to make them always-on: this avoids tests having to spuriously assert that attachment_service is not None in order to make the linter happy. - Add a TenantShardId python implementation for subsequent use in test helpers that will be made shard-aware - Teach scrubber to read across shards when checking for layer existence: this is a refactor to track the list of existent layers at tenant-level rather than locally to each timeline. This is a precursor to testing shard splitting.	2023-12-20 12:26:20 +00:00
John Spray	de1a9c6e3b	s3_scrubber: basic support for sharding (#6119 ) This doesn't make the scrubber smart enough to understand that many shards are part of the same tenants, but it makes it understand paths well enough to scrub the individual shards without thinking they're malformed. This is a prerequisite to being able to run tests with sharding enabled. Related: #5929	2023-12-15 15:48:55 +00:00
John Spray	dfb0a6fdaf	scrubber: handle initdb files, fix an issue with prefixes (#6079 ) - The code for calculating the prefix in the bucket was expecting a trailing slash (as it is in the tests), but that's an awkward expectation to impose for use in the field: make the code more flexible by only trimming a trailing character if it is indeed a slash. - initdb archives were detected by the scrubber as malformed layer files. Teach it to recognize and ignore them.	2023-12-12 16:53:08 +00:00
John Khvatov	3e094e90d7	update aws sdk to 1.0.x (#5976 ) This change will be useful for experimenting with S3 performance. Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-30 14:17:58 +02:00
Joonas Koivunen	105edc265c	fix: remove layer_removal_cs (#5108 ) Quest: https://github.com/neondatabase/neon/issues/4745. Follow-up to #4938. - add in locks for compaction and gc, so we don't have multiple executions at the same time in tests - remove layer_removal_cs - remove waiting for uploads in eviction/gc/compaction - #4938 will keep the file resident until upload completes Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-11-28 19:15:21 +02:00
Joonas Koivunen	4be6bc7251	refactor: remove unnecessary unsafe (#5802 ) unsafe impls for `Send` and `Sync` should not be added by default. in the case of `SlotGuard` removing them does not cause any issues, as the compiler automatically derives those. This PR adds requirement to document the unsafety (see [clippy::undocumented_unsafe_blocks]) and opportunistically adds `#![deny(unsafe_code)]` to most places where we don't have unsafe code right now. TRPL on Send and Sync: https://doc.rust-lang.org/book/ch16-04-extensible-concurrency-sync-and-send.html [clippy::undocumented_unsafe_blocks]: https://rust-lang.github.io/rust-clippy/master/#/undocumented_unsafe_blocks	2023-11-07 10:26:25 +00:00
duguorong009	b3d3a2587d	feat: improve the serde impl for several types(`Lsn`, `TenantId`, `TimelineId` ...) (#5335 ) Improve the serde impl for several types (`Lsn`, `TenantId`, `TimelineId`) by making them sensitive to `Serializer::is_human_readadable` (true for json, false for bincode). Fixes #3511 by: - Implement the custom serde for `Lsn` - Implement the custom serde for `Id` - Add the helper module `serde_as_u64` in `libs/utils/src/lsn.rs` - Remove the unnecessary attr `#[serde_as(as = "DisplayFromStr")]` in all possible structs Additionally some safekeeper types gained serde tests. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-06 11:40:03 +02:00
John Spray	306c4f9967	s3_scrubber: prepare for scrubbing buckets with generation-aware content (#5700 ) ## Problem The scrubber didn't know how to find the latest index_part when generations were in use. ## Summary of changes - Teach the scrubber to do the same dance that pageserver does when finding the latest index_part.json - Teach the scrubber how to understand layer files with generation suffixes. - General improvement to testability: scan_metadata has a machine readable output that the testing `S3Scrubber` wrapper can read. - Existing test coverage of scrubber was false-passing because it just didn't see any data due to prefixing of data in the bucket. Fix that. This is incremental improvement: the more confidence we can have in the scrubber, the more we can use it in integration tests to validate the state of remote storage. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-11-03 17:36:02 +00:00
John Spray	7c16b5215e	scrubber: add separate find/purge garbage commands (#5409 ) ## Problem The previous garbage cleanup functionality relied on doing a dry run, inspecting logs, and then doing a deletion. This isn't ideal, because what one actually deletes might not be the same as what one saw in the dry run. It's also risky UX to rely on presence/absence of one CLI flag to control deletion: ideally the deletion command should be totally separate from the one that scans the bucket. Related: https://github.com/neondatabase/neon/issues/5037 ## Summary of changes This is a major re-work of the code, which results in a net decrease in line count of about 600. The old code for removing garbage was build around the idea of doing discovery and purging together: a "delete_batch_producer" sent batches into a deleter. The new code writes out both procedures separately, in functions that use the async streams introduced in https://github.com/neondatabase/neon/pull/5176 to achieve fast concurrent access to S3 while retaining the readability of a single function. - Add `find-garbage`, which writes out a JSON file of tenants/timelines to purge - Add `purge-garbage` which consumes the garbage JSON file, applies some extra validations, and does deletions. - The purge command will refuse to execute if the garbage file indicates that only garbage was found: this guards against classes of bugs where the scrubber might incorrectly deem everything garbage. - The purge command defaults to only deleting tenants that were found in "deleted" state in the control plane. This guards against the risk that using the wrong console API endpoint could cause all tenants to appear to be missing. Outstanding work for a future PR: - Make whatever changes are needed to adapt to the Console/Control Plane separation. - Make purge even safer by checking S3 `Modified` times for index_part.json files (not doing this here, because it will depend on the generation-aware changes for finding index_part.json files) ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com> Co-authored-by: Shany Pozin <shany@neon.tech>	2023-10-26 20:36:28 +01:00
John Spray	36c261851f	s3_scrubber: remove atty dependency (#5171 ) ## Problem - https://github.com/neondatabase/neon/security/dependabot/28 ## Summary of changes Remove atty, and remove the `with_ansi` arg to scrubber's stdout logger.	2023-09-12 10:11:41 +01:00
Joonas Koivunen	720d59737a	rust-1.72.0 changes (#5255 ) Prepare to upgrade rust version to latest stable. - `rustfmt` has learned to format `let irrefutable = $expr else { ... };` blocks - There's a new warning about virtual (workspace) crate resolver, picked the latest resolver as I suspect everyone would expect it to be the latest; should not matter anyways - Some new clippies, which seem alright	2023-09-08 16:28:41 +03:00
John Spray	743933176e	scrubber: add `scan-metadata` and hook into integration tests (#5176 ) ## Problem - Scrubber's `tidy` command requires presence of a control plane - Scrubber has no tests at all ## Summary of changes - Add re-usable async streams for reading metadata from a bucket - Add a `scan-metadata` command that reads from those streams and calls existing `checks.rs` code to validate metadata, then returns a summary struct for the bucket. Command returns nonzero status if errors are found. - Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that tests using remote storage can request to have the scrubber run after they finish - Enable remote storarge and scrub_on_exit in test_pageserver_restart and test_pageserver_chaos This is a "toe in the water" of the overall space of validating the scrubber. Later, we should: - Enable scrubbing at end of tests using remote storage by default - Make the success condition stricter than "no errors": tests should declare what tenants+timelines they expect to see in the bucket (or sniff these from the functions tests use to create them) and we should require that the scrubber reports on these particular tenants/timelines. The `tidy` command is untouched in this PR, but it should be refactored later to use similar async streaming interface instead of the current batch-reading approach (the streams are faster with large buckets), and to also be covered by some tests. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2023-09-06 11:55:24 +01:00
John Spray	616e7046c7	s3_scrubber: import into the main `neon` repository (#5141 ) ## Problem The S3 scrubber currently lives at https://github.com/neondatabase/s3-scrubber We don't have tests that use it, and it has copies of some data structures that can get stale. ## Summary of changes - Import the s3-scrubber as `s3_scrubber/ - Replace copied_definitions/ in the scrubber with direct access to the `utils` and `pageserver` crates - Modify visibility of a few definitions in `pageserver` to allow the scrubber to use them - Update scrubber code for recent changes to `IndexPart` - Update `KNOWN_VERSIONS` for IndexPart and move the definition into index.rs so that it is easier to keep up to date As a future refinement, it would be good to pull the remote persistence types (like IndexPart) out of `pageserver` into a separate library so that the scrubber doesn't have to link against the whole pageserver, and so that it's clearer which types need to be public. Co-authored-by: Kirill Bulatov <kirill@neon.tech> Co-authored-by: Dmitry Rodionov <dmitry@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-08-31 19:01:39 +01:00

30 Commits