## Problem
When pooled connections are used, session semantics are not preserved,
including GUC settings.
Many customers have a particular problem with setting `search_path`.
However, pgbouncer 1.20 has a `track_extra_parameters` setting which allows it
to track parameters that are included in the startup packet and reported by
Postgres. Postgres has [an official list of parameters that it reports
to the
client](https://www.postgresql.org/docs/15/protocol-flow.html#PROTOCOL-ASYNC).
This PR makes Postgres also report `search_path`, which allows it to be
included in `track_extra_parameters`.
## Summary of changes
Set the `GUC_REPORT` flag for `search_path`.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
The pageserver exposes some vectored-get-related configs which are not in
use.
## Summary of changes
Remove the following pageserver configs: `get_impl`, `get_vectored_impl`,
and `validate_get_vectored`.
They are not used in the pageserver since
https://github.com/neondatabase/neon/pull/8601.
Manual overrides have been removed from the aws repo in
https://github.com/neondatabase/aws/pull/1664.
## Problem
This follows a PR that insists all input keys are representable in 16
bytes:
- https://github.com/neondatabase/neon/pull/8648
and a PR that prevents Postgres from sending us keys that use the high
bits of `field2`:
- https://github.com/neondatabase/neon/pull/8657
Motivation for this change:
1. Ingest is bottlenecked on CPU.
2. `InMemoryLayer` can create a huge (~1M value) `BTreeMap<Key, _>` for its
index.
3. Maps over `i128` are much faster than maps over an arbitrary 18-byte
struct.
It may still be worthwhile to make the index two-tier to optimize for
the case where only the last 4 bytes (blkno) of the key vary frequently,
but simply using the i128 representation of keys has a big impact for
very little effort.
Related: #8452
## Summary of changes
- Introduce a `CompactKey` type which wraps an `i128`.
- Use this instead of `Key` in `InMemoryLayer`'s index, converting back and
forth as needed.
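To illustrate the shape of this change, here is a minimal sketch; the `Key` fields and the bit layout are illustrative stand-ins, not the exact pageserver definitions:

```rust
use std::collections::BTreeMap;

/// Illustrative stand-in for the pageserver `Key` struct (18 bytes of fields).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Key {
    field1: u8,
    field2: u32,
    field3: u32,
    field4: u32,
    field5: u8,
    field6: u32,
}

/// The i128 representation used as the in-memory index key: ordering and
/// equality become single integer operations.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct CompactKey(i128);

impl Key {
    /// Pack the fields into 128 bits. The layout here is illustrative and
    /// relies on the 16-byte representability enforced by #8648/#8657
    /// (field1 fits in 7 bits, field2 in 16 bits), hence the masks.
    fn to_compact(self) -> CompactKey {
        CompactKey(
            (((self.field1 & 0x7F) as i128) << 120)
                | (((self.field2 & 0xFFFF) as i128) << 104)
                | ((self.field3 as i128) << 72)
                | ((self.field4 as i128) << 40)
                | ((self.field5 as i128) << 32)
                | (self.field6 as i128),
        )
    }
}

fn main() {
    // InMemoryLayer-style index: CompactKey -> offset of the serialized value.
    let mut index: BTreeMap<CompactKey, u64> = BTreeMap::new();
    let key = Key { field1: 0, field2: 1, field3: 2, field4: 3, field5: 0, field6: 42 };
    index.insert(key.to_compact(), 4096);
    assert!(index.contains_key(&key.to_compact()));
}
```

The speedup comes from `BTreeMap` comparing a single `i128` per node visit instead of comparing an 18-byte struct field by field.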
## Performance
All the small-value `bench_ingest` cases show improved throughput.
The one that exercises this index most directly shows a 35% throughput
increase:
```
ingest-small-values/ingest 128MB/100b seq, no delta
time: [374.29 ms 378.56 ms 383.38 ms]
thrpt: [333.88 MiB/s 338.13 MiB/s 341.98 MiB/s]
change:
time: [-26.993% -26.117% -25.111%] (p = 0.00 < 0.05)
thrpt: [+33.531% +35.349% +36.974%]
Performance has improved.
```
## Problem
We use infrastructure as code (Terraform) to deploy AWS Aurora and AWS RDS
Postgres database clusters.
Whenever we have a change in Terraform (e.g. **every year** to upgrade to a
higher Postgres version, or when we change the cluster configuration),
Terraform will apply the change and create a new AWS database cluster.
However, our benchmarking test cases also expect databases in these
clusters and tables loaded with data.
So we add auto-detection: if the AWS RDS instances are "empty", we
create the necessary databases and restore a pg_dump.
**Important Notes:**
- These steps are NOT run in each benchmarking run, but only after a new
RDS instance has been deployed.
- The benchmarking workflows use GitHub secrets to find the connection
string for the database. These secrets still need to be updated (manually
or programmatically using the GitHub CLI) if some part of the connection
string (e.g. user, password or hostname) changes.
## Summary of changes
In each benchmarking run, check whether:
- the database has already been created - if not, create it
- the database has already been restored - if not, restore it
Supported databases:
- tpch
- clickbench
- user example
Supported platforms:
- AWS RDS Postgres
- AWS Aurora Serverless Postgres
Sample workflow run (this one uses a Neon database to test the restore
step, not real AWS databases):
https://github.com/neondatabase/neon/actions/runs/10321441086/job/28574350581
Sample workflow run with real AWS database clusters:
https://github.com/neondatabase/neon/actions/runs/10346816389/job/28635997653
Verification in a second run, with real AWS database clusters, that the
restore is skipped the second time:
https://github.com/neondatabase/neon/actions/runs/10348469517/job/28640778223
## Problem
Storage controller restarts cause temporary unavailability from the
control plane POV. See RFC for more details.
## Summary of changes
* A couple of small refactors of the storage controller start-up
sequence to make extending it easier.
* A leader table is added to track the storage controller instance
that's currently the leader (if any)
* A peer client is added such that storage controllers can send
`step_down` requests to each other (implemented in
https://github.com/neondatabase/neon/pull/8512).
* Implement the leader cut-over as described in the RFC (a rough sketch
follows this list).
* Add a `start-as-candidate` flag to the storage controller to gate the
rolling restart behaviour. When the flag is `false` (the default), the
only change from the current start-up sequence is persisting the leader
entry to the database.
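For illustration only, a rough sketch of the candidate start-up path gated by the flag; all type and function names here (`LeaderRow`, `PeerClient`, `persist_leader`) are hypothetical, not the storage controller's real API:

```rust
/// Hypothetical row from the new `leader` table.
struct LeaderRow {
    address: String,
}

/// Hypothetical peer client used to talk to another storage controller.
struct PeerClient {
    address: String,
}

impl PeerClient {
    /// Ask the current leader to step down (the real request is implemented
    /// in https://github.com/neondatabase/neon/pull/8512).
    async fn step_down(&self) -> anyhow::Result<()> {
        // An HTTP request to the peer's step_down endpoint would go here.
        let _ = &self.address;
        Ok(())
    }
}

/// Sketch of start-up with `start-as-candidate` enabled.
async fn start_as_candidate(
    current_leader: Option<LeaderRow>,
    my_address: String,
) -> anyhow::Result<()> {
    if let Some(leader) = current_leader {
        // Best-effort: ask the previous leader to step down before cutting over.
        let peer = PeerClient { address: leader.address };
        if let Err(err) = peer.step_down().await {
            tracing::warn!("step_down request to previous leader failed: {err}");
        }
    }
    // Persist ourselves as the new leader, then continue the normal start-up.
    persist_leader(&my_address).await
}

/// Hypothetical helper that upserts the leader entry into the database.
async fn persist_leader(_address: &str) -> anyhow::Result<()> {
    Ok(())
}
```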
Doing a graceful shutdown by default should give us all possible
allowed_errors more consistently.
While getting the workflows to pass on
https://github.com/neondatabase/neon/pull/8632 it was noticed that
allowed_errors are rarely hit (1/4). This made me realize that we always
do an immediate stop by default. Doing a graceful shutdown would have
made the draining more apparent, and we likely would not have needed the
#8632 hotfix.
The downside of doing this is that we will see more timeouts if tests
randomly leave pause failpoints which fail the shutdown.
The net outcome should however be positive; we could even detect
too-slow shutdowns caused by a bug or deadlock.
## Problem
This test was disabled.
## Summary of changes
- Remove the skip marker.
- Explicitly avoid doing compaction & gc during checkpoints (the default
scale doesn't do anything here, but when experimenting with larger scales
it messes things up).
- Set a data size that gives a ~20s runtime on a Hetzner dev machine;
the previous one gave very noisy results because it was so small.
For reference on a Hetzner AX102:
```
------------------------------ Benchmark results -------------------------------
test_bulk_insert[neon-release-pg16].insert: 25.664 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 577 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 25.373 s
test_bulk_insert[neon-release-pg16].compaction: 0.035 s
```
It should take the syncrep flush_lsn into account, because WAL before it is
lost on endpoint restart, which makes replication miss some data if the slot
had already been advanced too far. This commit adds a test reproducing the
issue and bumps vendor/postgres to a commit with the actual fix.
## Problem
In several workflows, we have repeated code that is split into two steps:
```bash
mkdir -p $(pwd)/.docker-custom
echo DOCKER_CONFIG=/tmp/.docker-custom >> $GITHUB_ENV
...
rm -rf $(pwd)/.docker-custom
```
Such copy-paste is prone to errors; for example, in one case, instead of
`$(pwd)/.docker-custom`, we use `/tmp/.docker-custom`, which is shared
between workflows.
## Summary of changes
- Create a new action, `actions/set-docker-config-dir`, which sets
`DOCKER_CONFIG` and deletes it in the post-action step
Noticed this while debugging a test failure in #8673 which only occurs
with real S3 instead of mock S3: if you authenticate to S3 via
`AWS_PROFILE`, then it requires the `HOME` env var to be set so that it
can read inside the `~/.aws` directory.
The scrubber abstraction `StorageScrubber::scrubber_cli` in
`neon_fixtures.py` would otherwise not work. My earlier PR #6556 has
done similar things for the `neon_local` wrapper.
You can try:
```
aws sso login --profile dev
export ENABLE_REAL_S3_REMOTE_STORAGE=y REMOTE_STORAGE_S3_BUCKET=neon-github-ci-tests REMOTE_STORAGE_S3_REGION=eu-central-1 AWS_PROFILE=dev
RUST_BACKTRACE=1 BUILD_TYPE=debug DEFAULT_PG_VERSION=16 ./scripts/pytest -vv --tb=short -k test_scrubber_tenant_snapshot
```
before and after this patch: this patch fixes it.
## Problem
This page had many dead links, and was confusing for folks looking for
documentation about our product.
Closes: https://github.com/neondatabase/neon/issues/8535
## Summary of changes
- Add a link to the product docs up top
- Remove dead/placeholder links
## Problem
We install and try to use `cachepot`. But it is not configured correctly
and doesn't work (after https://github.com/neondatabase/neon/pull/2290)
## Summary of changes
- Remove `cachepot`
## Problem
Migrations of tenant shards with cold secondaries are holding up drains
during production deployments.
## Summary of changes
If a secondary location is lagging by more than 256MiB (configurable,
but that's the default), then skip cutting over to it as part of the node drain.
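A minimal sketch of the check this adds to the drain path; the names and the lag representation are illustrative, not the storage controller's actual code:

```rust
/// Default threshold mentioned above: 256 MiB.
const DEFAULT_MAX_SECONDARY_LAG_BYTES: u64 = 256 * 1024 * 1024;

/// Illustrative view of a secondary location's warmth.
struct SecondaryStatus {
    /// Bytes of layer data the secondary still needs to download, if known.
    lag_bytes: Option<u64>,
}

/// Decide whether a shard should be cut over to its secondary during a drain.
fn should_migrate_during_drain(secondary: &SecondaryStatus, max_lag_bytes: u64) -> bool {
    match secondary.lag_bytes {
        // Warm enough: cutting over will not leave the shard with a cold cache.
        Some(lag) if lag <= max_lag_bytes => true,
        // Too cold (or unknown): skip this shard so the drain isn't held up.
        _ => false,
    }
}

fn main() {
    let cold = SecondaryStatus { lag_bytes: Some(10 * 1024 * 1024 * 1024) };
    assert!(!should_migrate_during_drain(&cold, DEFAULT_MAX_SECONDARY_LAG_BYTES));
}
```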
## Problem
This type of error can happen during shutdown and was triggering a circuit
breaker alert.
## Summary of changes
- Map `NotInitialized::Stopped` to `CompactionError::ShuttingDown`, so that
we can handle it cleanly
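A simplified sketch of the mapping; both enums here are cut-down stand-ins for the real pageserver types:

```rust
#[derive(Debug)]
enum NotInitialized {
    Uninitialized,
    Stopped,
}

#[derive(Debug)]
enum CompactionError {
    /// Compaction was interrupted by shutdown; not a real failure.
    ShuttingDown,
    Other(String),
}

impl From<NotInitialized> for CompactionError {
    fn from(err: NotInitialized) -> Self {
        match err {
            // Seen when the timeline stops mid-compaction: treat it as a clean
            // shutdown rather than an error that trips the circuit breaker.
            NotInitialized::Stopped => CompactionError::ShuttingDown,
            other => CompactionError::Other(format!("{other:?}")),
        }
    }
}
```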
## Problem
Azure login fails in `pin-build-tools-image` workflow because the job
doesn't have the required permissions.
```
Error: Please make sure to give write permissions to id-token in the workflow.
Error: Login failed with Error: Error message: Unable to get ACTIONS_ID_TOKEN_REQUEST_URL env variable. Double check if the 'auth-type' is correct. Refer to https://github.com/Azure/login#readme for more information.
```
## Summary of changes
- Add `id-token: write` permission to `pin-build-tools-image`
- Add an input to force image tagging
- Unify pushing to Docker Hub with other registries
- Split the job into two to have fewer `if`s
Part of https://github.com/neondatabase/neon/issues/8653
Disable the CREATE TABLESPACE statement. It turns out it requires much less
effort to add a regress-test-mode flag than to patch the test cases,
and given that we might need to support tablespaces in the future, I
decided to add a new flag `regress_test_mode` to change the behavior of
CREATE TABLESPACE.
Tested manually that without setting `regress_test_mode`, CREATE
TABLESPACE is rejected.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
This reverts #8076 - which was already reverted from the release branch
since forever (it would have been a breaking change to release for all
users who currently set TimeZone options). It's causing conflicts now so
we should revert it here as well.
## Problem
Latency from one cloud provider to another is higher than within the
same cloud provider.
Some of our benchmarks are latency-sensitive: we run pgbench or psql
in the GitHub Actions runner, while the system under test runs in Neon
(a database project).
For realistic TPS and latency results we need to compare apples to
apples and run the database client at the same "latency distance" for
all tests.
## Summary of changes
Move job steps that test Neon databases deployed on Azure into Azure
action runners.
- bench strategy variant using azure database
- pgvector strategy variant using azure database
- pgbench-compare strategy variants using azure database
## Test run
https://github.com/neondatabase/neon/actions/runs/10314848502
## Problem
We're adding more third-party dependencies to support more diverse and
realistic test cases in `test_runner/logical_repl`. I ❤️ these
tests, they are a good thing.
The slight glitch is that Python packaging is hard, and some third-party
Python packages have issues. For example, the current kafka dependency
doesn't work on the latest Python. We can mitigate that by only importing
these more specialized dependencies in the tests that use them.
## Summary of changes
- Move the `kafka` import into a test body, so that folks running the
regular `test_runner/regress` tests don't have to have a working kafka
client package.
## Problem
This code was to mitigate risk in
https://github.com/neondatabase/neon/pull/8427
As expected, we did not hit this code path: the new continuous updates
of gc_info are working fine, so we can remove this code now.
## Summary of changes
- Remove block that double-checks retain_lsns
Avoid "leaking" the completions of BackgroundPurges by:
1. switching it to TaskTracker, which provides close+wait
2. no longer using tokio::fs::remove_dir_all, which consumes two units of
memory instead of one blocking task
Additionally, use a more graceful shutdown in tests which actually do some
background cleanup.
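A hedged sketch of the resulting pattern, assuming `tokio_util::task::TaskTracker`; the `BackgroundPurges` wrapper here is simplified:

```rust
use std::path::PathBuf;
use tokio_util::task::TaskTracker;

struct BackgroundPurges {
    tracker: TaskTracker,
}

impl BackgroundPurges {
    /// Spawn one blocking task per purge. Per the PR, std::fs::remove_dir_all
    /// inside our own blocking task avoids tokio::fs::remove_dir_all, which
    /// would consume two units of memory instead of one blocking task.
    fn spawn_purge(&self, path: PathBuf) {
        self.tracker.spawn_blocking(move || {
            if let Err(err) = std::fs::remove_dir_all(&path) {
                eprintln!("failed to purge {}: {err}", path.display());
            }
        });
    }

    /// Called on shutdown: refuse new purges and wait for in-flight ones,
    /// so completions are no longer "leaked".
    async fn shutdown(&self) {
        self.tracker.close();
        self.tracker.wait().await;
    }
}
```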
## Problem
See
https://neondb.slack.com/archives/C03QLRH7PPD/p1723038557449239?thread_ts=1722868375.476789&cid=C03QLRH7PPD
Logical replication subscriptions use `synchronous_commit=off` by default,
which causes problems with the safekeeper.
## Summary of changes
Set `synchronous_commit=on` for the logical replication subscription in
`test_subscriber_restart.py`.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
Some developers build on macOS, which doesn't have io_uring.
## Summary of changes
- Add `io_engine_for_bench`, which on Linux will give io_uring or panic
if it's unavailable, and on macOS will always panic.
We do not want to run such benchmarks with `StdFs`: the results aren't
interesting, and will actively waste the time of any developers who
start investigating performance before they realize they're using a
known-slow I/O backend.
Why not just conditionally compile this benchmark on Linux only? Because
even on Linux, I still want it to refuse to run if it can't get
io_uring.
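A rough sketch of the helper's intent; the enum is simplified and the io_uring availability probe is elided, so the real helper may differ in detail:

```rust
#[allow(dead_code)]
#[derive(Clone, Copy, Debug)]
enum IoEngineKind {
    /// Known-slow fallback; never acceptable for these benchmarks.
    StdFs,
    /// io_uring-based engine, only functional on Linux.
    TokioEpollUring,
}

/// Benchmarks must not silently fall back to StdFs.
#[allow(dead_code)]
fn io_engine_for_bench() -> IoEngineKind {
    if cfg!(target_os = "linux") {
        // A real implementation would probe io_uring availability here and
        // panic if the probe fails; that check is elided in this sketch.
        IoEngineKind::TokioEpollUring
    } else {
        panic!("io_uring is required for this benchmark; run it on Linux");
    }
}
```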
Part of #8130, [RFC: Direct IO For Pageserver](https://github.com/neondatabase/neon/blob/problame/direct-io-rfc/docs/rfcs/034-direct-io-for-pageserver.md)
## Description
Add pageserver config for evaluating/enabling direct I/O.
- Disabled: the current default; uses buffered I/O as is.
- Evaluate: still uses buffered I/O, but could do alignment checking and
perf simulation (pad latency by doing direct I/O reads/writes to a fake file).
- Enabled: uses direct I/O; behavior on alignment errors is configurable.
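One possible config shape for these three modes, sketched below; the variant and field names are illustrative and assume a serde-based config like the rest of the pageserver configuration:

```rust
use serde::Deserialize;

#[derive(Debug, Clone, Copy, Deserialize)]
#[serde(rename_all = "kebab-case")]
enum DirectIoMode {
    /// Current default: plain buffered I/O.
    Disabled,
    /// Still buffered, but with alignment checking and optional perf
    /// simulation (pad latency by doing direct I/O against a fake file).
    Evaluate,
    /// Real direct I/O; behavior on alignment errors is configurable.
    Enabled,
}

#[derive(Debug, Clone, Copy, Deserialize)]
#[serde(rename_all = "kebab-case")]
enum OnAlignmentError {
    Error,
    FallbackToBuffered,
}

#[derive(Debug, Clone, Copy, Deserialize)]
struct VirtualFileDirectIoConfig {
    mode: DirectIoMode,
    /// Only consulted when `mode` is `enabled`.
    on_alignment_error: OnAlignmentError,
}
```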
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
We've noticed increased memory usage with the latest release. Drain the
joinset of `page_service` connection handlers to avoid leaking them
until shutdown. An alternative would be to use a TaskTracker.
TaskTracker was not discussed in the original PR #8339 review, so we're not
hotfixing it in here either.
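A sketch of the draining pattern, assuming `tokio::task::JoinSet` and its `try_join_next`; the accept loop and handler body are placeholders:

```rust
use tokio::net::TcpListener;
use tokio::task::JoinSet;

async fn page_service_accept_loop(listener: TcpListener) {
    let mut connections: JoinSet<()> = JoinSet::new();
    loop {
        // Reap handlers that have already finished so their results (and
        // per-task memory) are not retained until shutdown.
        while let Some(res) = connections.try_join_next() {
            if let Err(err) = res {
                tracing::error!("connection handler panicked: {err}");
            }
        }
        match listener.accept().await {
            Ok((socket, _addr)) => {
                connections.spawn(async move {
                    // handle_connection(socket).await would go here.
                    drop(socket);
                });
            }
            Err(err) => tracing::warn!("accept failed: {err}"),
        }
    }
}
```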
Earlier I was thinking we'd need an (ancestor_lsn, timeline_id)-ordered
list of reparented timelines. It turns out we did not need it at all;
replace it with an unordered hash set. Additionally, refactor the
reparented-direct-children query out; it will later be used from more places.
Split off from #8430.
Cc: #6994
Ephemeral files clean up after themselves on drop but did not delay shutdown,
which led to problems when restarting the tenant. The solution is as proposed:
- make ephemeral files carry the gate guard to delay `Timeline::gate`
closing
- flush in-memory layers and strong references to those on
`Timeline::shutdown`
The above are realized by making LayerManager an `enum` with `Open` and
`Closed` variants, and failing requests to modify the `LayerMap` once closed.
Additionally:
- fix overly eager anyhow conversions in compaction
- unify how we freeze layers and handle errors
- optimize `likely_resident_layers` to read `LayerFileManager` hashmap
values instead of bouncing through `LayerMap`
Fixes: #7830
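A simplified sketch of the two mechanisms; every type here is a stand-in for the corresponding pageserver type (`GateGuard`, `EphemeralFile`, `LayerManager`):

```rust
/// Stand-in for the timeline gate guard: while one of these is alive,
/// `Timeline::gate` cannot finish closing.
struct GateGuard;

/// Ephemeral file that deletes itself on drop; carrying the guard means
/// shutdown waits until the cleanup has actually happened.
struct EphemeralFile {
    path: std::path::PathBuf,
    _gate_guard: GateGuard,
}

impl Drop for EphemeralFile {
    fn drop(&mut self) {
        let _ = std::fs::remove_file(&self.path);
        // `_gate_guard` is dropped here, releasing the gate.
    }
}

struct OpenLayerManager {
    // layer map, in-memory layers, ...
}

enum LayerManager {
    Open(OpenLayerManager),
    /// Entered during `Timeline::shutdown`: requests to modify the layer map
    /// are rejected instead of racing with shutdown.
    Closed,
}

impl LayerManager {
    fn modifiable(&mut self) -> Result<&mut OpenLayerManager, &'static str> {
        match self {
            LayerManager::Open(open) => Ok(open),
            LayerManager::Closed => Err("timeline is shutting down"),
        }
    }
}
```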
## Problem
1. Hard to correlate startup parameters with the endpoint that provided
them.
2. Some configurations are not needed in the `ProxyConfig` struct.
## Summary of changes
Because of some borrow checker fun, I needed to switch to an
interior-mutability implementation of our `RequestMonitoring` context
system, using https://docs.rs/try-lock/latest/try_lock/ as a cheap lock
for such a use case (it needed to be thread-safe).
Removed the logging of each startup message; instead, we just log the
startup params in a successful handshake.
Also removed some values from `ProxyConfig` and kept them as arguments
(needed for the local-proxy config).
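A sketch of the interior-mutability shape, assuming the `try-lock` crate linked above; `RequestContext` and its fields are illustrative, not proxy's real types:

```rust
use try_lock::TryLock;

struct RequestContextInner {
    startup_params: Option<String>,
    endpoint: Option<String>,
}

/// Shared, thread-safe context that can be mutated through `&self`,
/// sidestepping the borrow-checker issues mentioned above.
struct RequestContext {
    inner: TryLock<RequestContextInner>,
}

impl RequestContext {
    fn new() -> Self {
        Self {
            inner: TryLock::new(RequestContextInner {
                startup_params: None,
                endpoint: None,
            }),
        }
    }

    /// Record the startup params once the handshake succeeds.
    fn set_startup_params(&self, params: String) {
        // TryLock is a cheap, non-blocking lock: contention on the same
        // request's context would indicate a bug, so we simply skip the write.
        if let Some(mut inner) = self.inner.try_lock() {
            inner.startup_params = Some(params);
        }
    }
}
```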
Timeline cancellation running in parallel with gc yields error log lines
like:
```
Gc failed 1 times, retrying in 2s: TimelineCancelled
```
They are completely harmless, though, and normal to occur. Therefore, only
print these messages at info level. Still print them at all so that
we know what is going on when we focus on a single timeline.
Part of #8128.
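A sketch of the logging change; `GcError` here is a simplified stand-in for the pageserver's error type:

```rust
#[derive(Debug)]
enum GcError {
    TimelineCancelled,
    Other(String),
}

fn log_gc_retry(err: &GcError, attempts: u32, retry_in_secs: u64) {
    match err {
        // Harmless and expected when the timeline shuts down mid-GC: keep it
        // visible at info level for per-timeline debugging only.
        GcError::TimelineCancelled => tracing::info!(
            "Gc failed {attempts} times, retrying in {retry_in_secs}s: {err:?}"
        ),
        GcError::Other(_) => tracing::error!(
            "Gc failed {attempts} times, retrying in {retry_in_secs}s: {err:?}"
        ),
    }
}
```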
## Problem
Currently, the scrubber `scan_metadata` command returns an error
code if the metadata on remote storage is corrupted with fatal errors.
To safely deploy this command in a cronjob, we want to differentiate
between failures while running the scrubber command and erroneous
metadata. At the same time, we also want our regression tests to catch
corrupted metadata using the scrubber command.
## Summary of changes
- Return an error code only when the scrubber command itself fails
- Use explicit checks on errors and warnings to determine metadata
health in regression tests
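A sketch of the exit-code policy described above; the types and output are illustrative, not the scrubber's real interface:

```rust
struct MetadataSummary {
    fatal_errors: Vec<String>,
    warnings: Vec<String>,
}

/// Listing remote storage, parsing index_part.json, etc. would happen here;
/// only I/O or configuration failures propagate as Err.
fn run_scan_metadata() -> anyhow::Result<MetadataSummary> {
    Ok(MetadataSummary { fatal_errors: vec![], warnings: vec![] })
}

fn main() {
    match run_scan_metadata() {
        // The command itself worked: report metadata health in the output and
        // exit 0 even if the metadata is corrupt, so a cronjob only alerts on
        // scrubber failures. Regression tests check the summary explicitly.
        Ok(summary) => {
            println!(
                "fatal errors: {}, warnings: {}",
                summary.fatal_errors.len(),
                summary.warnings.len()
            );
        }
        // Only a failure to run the scan itself yields a non-zero exit code.
        Err(err) => {
            eprintln!("scan_metadata failed: {err:#}");
            std::process::exit(1);
        }
    }
}
```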
**Resolve conflict with `tenant-snapshot` command (after shard split):**
[`test_scrubber_tenant_snapshot`](https://github.com/neondatabase/neon/blob/yuchen/scrubber-scan-cleanup-before-prod/test_runner/regress/test_storage_scrubber.py#L23)
failed before applying 422a8443dd
- When taking a snapshot, the old `index_part.json` in the unsharded
tenant directory is not kept.
- The current `list_timeline_blobs` implementation considers a missing
`index_part.json` a parse error.
- During the scan we only analyze shards with the highest shard count,
so we will not get a parse error, but we do need to add the layers to
the tenant object listing; otherwise we get an "index is referencing a
layer that is not in remote storage" error.
- **Action:** Add s3_layers from `list_timeline_blobs` regardless of
parse errors
Signed-off-by: Yuchen Liang <yuchen@neon.tech>