rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-26 17:40:37 +00:00

Author	SHA1	Message	Date
Peter Bendel	6d8572ded6	Benchmarking: need to checkout actions to download Neon artifacts (#8770 ) ## Problem Database preparation workflow needs Neon artifacts but does not checkout necessary download action. We were lucky in a few runs like this one https://github.com/neondatabase/neon/actions/runs/10413970941/job/28870668020 but this is flaky and a race condition which failed here https://github.com/neondatabase/neon/actions/runs/10446395644/job/28923749772#step:4:1 ## Summary of changes Checkout code (including actions) before invoking download action Successful test run https://github.com/neondatabase/neon/actions/runs/10469356296/job/28992200694	2024-08-21 08:08:49 +01:00
Alex Chi Z.	c8b9116a97	impr(pageserver): abort on fatal I/O writer error (#8777 ) part of https://github.com/neondatabase/neon/issues/8140 The blob writer path now uses `maybe_fatal_err` Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-20 20:05:33 +01:00
John Spray	beefc7a810	pageserver: add metric pageserver_secondary_heatmap_total_size (#8768 ) ## Problem We don't have a convenient way for a human to ask "how far are secondary downloads along for this tenant". This is useful when driving migrations of tenants to the storage controller, as we first create a secondary location and want to see it warm up before we cut over. That can already be done via storcon_cli, but we would like a way that doesn't require direct API access. ## Summary of changes Add a metric that reports the total size of layers in the heatmap: this may be used in conjunction with the existing `pageserver_secondary_resident_physical_size` to estimate "warmth" of the secondary location.	2024-08-20 19:47:42 +01:00
Vlad Lazar	fa0750a37e	storcon: add peer jwt token (#8764 ) ## Problem Storage controllers did not have the right token to speak to their peers for leadership transitions. ## Summary of changes Accept a peer jwt token for the storage controller. Epic: https://github.com/neondatabase/cloud/issues/14701	2024-08-20 15:25:21 +01:00
Conrad Ludgate	0170611a97	proxy: small changes (#8752 ) ## Problem #8736 is getting too big. splitting off some simple changes here ## Summary of changes Local proxy won't always be using tls, so make it optional. Local proxy won't be using ws for now, so make it optional. Remove a dead config var.	2024-08-20 14:16:27 +01:00
Vlad Lazar	1c96957e85	storcon: run db migrations after step down sequence (#8756 ) ## Problem Previously, we would run db migrations before doing the step-down sequence. This meant that the current leader would have to deal with the schema changes and that's generally not safe. ## Summary of changes Push the step-down procedure earlier in start-up and do db migrations right after it (but before we load-up the in-memory state from the db). Epic: https://github.com/neondatabase/cloud/issues/14701	2024-08-20 14:00:36 +01:00
John Spray	02a28c01ca	Revert "safekeeper: check for non-consecutive writes in safekeeper.rs" (#8771 ) Reverts neondatabase/neon#8640 This broke `test_last_log_term_switch` via a merge race of some kind.	2024-08-20 11:34:53 +00:00
Alexander Bayandin	c96593b473	Make Postgres 16 default version (#8745 ) ## Problem The default Postgres version is set to 15 in code, while we use 16 in most of the other places (and Postgres 17 is coming) ## Summary of changes - Run `benchmarks` job with Postgres 16 (instead of Postgres 14) - Set `DEFAULT_PG_VERSION` to 16 in all places - Remove deprecated `--pg-version` pytest argument - Update `test_metadata_bincode_serde_ensure_roundtrip` for Postgres 16	2024-08-20 10:46:58 +01:00
Christian Schwarz	ef57e73fbf	task_mgr::spawn: require a `TenantId` (#8462 ) … to dis-incentivize global tasks via task_mgr in the future (As of https://github.com/neondatabase/neon/pull/8339 all remaining task_mgr usage is tenant or timeline scoped.)	2024-08-20 08:26:44 +00:00
Arseny Sher	4c5a0fdc75	safekeeper: check for non-consecutive writes in safekeeper.rs wal_storage.rs already checks this, but since this is a quite legit scenario check it at safekeeper.rs (consensus level) as well. ref https://github.com/neondatabase/neon/issues/8212	2024-08-20 07:12:56 +03:00
Arpad Müller	4b26783c94	scrubber: remove _generic postfix and two unused functions (#8761 ) Removes the `_generic` postfix from the `GenericRemoteStorage` using APIs, as `remote_storage` is the "default" now, and add a `_s3` postfix to the remaining APIs using the S3 SDK (only in tenant snapshot). Also, remove two unused functions: `list_objects_with_retries` and `stream_tenants functions`. Part of https://github.com/neondatabase/neon/issues/7547	2024-08-19 23:58:47 +02:00
Arpad Müller	6949b45e17	Update aws -> infra for repo rename (#8755 ) See slack thread: https://neondb.slack.com/archives/C039YKBRZB4/p1722501766006179	2024-08-19 17:44:10 +02:00
Arpad Müller	3b8ca477ab	Migrate physical GC and scan_metadata to remote_storage (#8673 ) Migrates most of the remaining parts of the scrubber to remote_storage: * `pageserver_physical_gc` * `scan_metadata` for pageservers (safekeepers were done in #8595) * `download()` in `tenant_snapshot`. The main `tenant_snapshot` is not migrated as it uses version history to be able to work in the face of ongoing changes. Part of #7547	2024-08-19 16:39:44 +02:00
Christian Schwarz	eb7241c798	l0_flush: remove support for mode `page-cached` (#8739 ) It's been rolled out everywhere, no configs are referencing it. All code that's made dead by the removal of the config option is removed as part of this PR. The `page_caching::PreWarmingWriter` in `::No` mode is equivalent to a `size_tracking_writer`, so, use that. part of https://github.com/neondatabase/neon/issues/7418	2024-08-19 16:35:34 +02:00
Folke Behrens	f246aa3ca7	proxy: Fix some warnings by extended clippy checks (#8748 ) * Missing blank lifetimes which is now deprecated. * Matching off unqualified enum variants that could act like variable. * Missing semicolons.	2024-08-19 10:33:46 +02:00
Arpad Müller	188bde7f07	Default image compression to zstd at level 1 (#8677 ) After the rollout has succeeded, we now set the default image compression to be enabled. We also remove its explicit mention from `neon_fixtures.py` added in #8368 as it is now the default (and we switch to `zstd(1)` which is a bit nicer on CPU time). Part of https://github.com/neondatabase/neon/issues/5431	2024-08-18 18:32:10 +01:00
Yuchen Liang	7131ac4730	refactor(scrubber): add unified command suitable for cron job (#8635 ) Part of #8128. ## Description This PR creates a unified command to run both physical gc and metadata health check as a cron job. This also enables us to add additional tasks to the cron job in the future. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-08-17 16:48:53 +01:00
Konstantin Knizhnik	2be69af6c3	Track holes to be able to reuse them once LFC limit is increased (#8575 ) ## Problem Repeatedly increasing and decreasing the LFC limit may cause unlimited growth of the LFC file, because holes punched while the LFC shrinks are not reused when the LFC is extended. ## Summary of changes Keep track of holes and reuse them when LFC size is increased. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-08-16 22:19:44 +03:00
Sasha Krassovsky	c6b6b7700a	Fix superuser check in test_snap_files (#8749 ) ## Problem Current superuser check always passes because it returns a tuple like `(False,)`, and then the `if not superuser` passes. ## Summary of changes Fixes the issue by unwrapping the tuple. Verified that it works against a project where I don't have superuser.	2024-08-16 19:13:18 +01:00
John Spray	e2d89f7991	pageserver: prioritize secondary downloads to get most recent layers first, except l0s (#8729 ) ## Problem When a secondary location is trying to catch up while a tenant is receiving new writes, it can become quite wasteful: - Downloading L0s which are soon destroyed by compaction to L1s - Downloading older layer files which are soon made irrelevant when covered by image layers. ## Summary of changes Sort the layer files in the heatmap: - L0 layers are the lowest priority - Other layers are sorted to download the highest LSNs first.	2024-08-16 14:35:02 +02:00
Arseny Sher	25e7d321f4	safekeeper: cross check divergence point in ProposerElected handling. Previously, we protected from multiple ProposerElected messages from the same walproposer with the following condition: msg.term == self.get_last_log_term() && self.flush_lsn() > msg.start_streaming_at It is not exhaustive, i.e. we could still proceed to truncating WAL even though safekeeper inserted something since the divergence point has been calculated. While it was most likely safe because walproposer can't use safekeeper position to commit WAL until last_log_term reaches the current walproposer term, let's be more careful and properly calculate the divergence point like walproposer does.	2024-08-16 15:22:46 +03:00
Vlad Lazar	3f91ea28d9	tests: add infra and test for storcon leadership transfer (#8587 ) ## Problem https://github.com/neondatabase/neon/pull/8588 implemented the mechanism for storage controller leadership transfers. However, there are no tests that exercise the behaviour. ## Summary of changes 1. Teach `neon_local` how to handle multiple storage controller instances. Each storage controller instance gets its own subdirectory (`storage_controller_1, ...`). `storage_controller start|stop` subcommands have also been extended to optionally accept an instance id. 2. Add a storage controller proxy test fixture. It's a basic HTTP server that forwards requests from pageserver and test env to the currently configured storage controller. 3. Add a test which exercises storage controller leadership transfer. 4. Finally fix a couple of bugs that the test surfaced	2024-08-16 13:05:04 +01:00
Heikki Linnakangas	7fdc3ea162	Add retroactive RFC about physical replication (#8546 ) We've had physical replication support for a long time, but we never created an RFC for the feature. This RFC does that after the fact. Even though we've already implemented the feature, let's have a design discussion as if we hadn't done that. It can still be quite insightful. This is written from a pretty compute-centric viewpoint, not much on how it works in the control plane.	2024-08-16 11:30:53 +01:00
Joonas Koivunen	4763a960d1	chore: log if we have an open layer or any frozen on shutdown (#8740 ) Some benchmarks are failing with a "long" flushing, which might be because there is a queue of in-memory layers, or something else. Add logging to narrow it down. Private slack DM ref: https://neondb.slack.com/archives/D049K7HJ9JM/p1723727305238099	2024-08-16 06:10:05 +01:00
Sasha Krassovsky	df086cd139	Add logical replication test to exercise snapfiles (#8364 )	2024-08-15 15:34:45 -07:00
Alexander Bayandin	69cb1ee479	CI(replication-tests): store test results & change notification channel (#8687 ) ## Problem We want to store Nightly Replication test results in the database and notify the relevant Slack channel about failures ## Summary of changes - Store test results in the database - Notify `on-call-compute-staging-stream` about failures	2024-08-15 22:41:58 +01:00
Alexander Bayandin	4e58fd9321	CI(label-for-external-users): use CI_ACCESS_TOKEN (#8738 ) ## Problem `secrets.GITHUB_TOKEN` (with any permissions) is not enough to get a user's membership info if they decide to hide it. ## Summary of changes - Use `secrets.CI_ACCESS_TOKEN` for `gh api` call - Use `pull_request_target` instead of `pull_request` event to get access to secrets	2024-08-15 18:37:15 +01:00
Konstantin Knizhnik	f087423a01	Handle reload config file request in LR monitor (#8732 ) ## Problem Logical replication BGW checking replication lag is not reloading config ## Summary of changes Add handling of reload config request Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-08-15 16:28:25 +03:00
Joonas Koivunen	24d347f50b	storcon: use tracing for logging panics (#8734 ) this gives spans for panics, and does not clobber loki output by writing to stderr while all of the other logging is to stdout. See: #3475	2024-08-15 16:27:07 +03:00
Joonas Koivunen	52641eb853	storcon: add spans to drain/fill ops (#8735 ) this way we do not need to repeat the %node_id everywhere, and we get no stray messages in logs from within the op.	2024-08-15 15:30:04 +03:00
Joonas Koivunen	d9a57aeed9	storcon: deny external node configuration if an operation is ongoing (#8727 ) Per #8674, disallow node configuration while drain/fill are ongoing. Implement it by adding an HTTP-only wrapper `Service::external_node_configure` which checks for an existing operation before configuring. Additionally: - allow cancelling drain/fill after a pageserver has restarted and transitioned to WarmingUp Fixes: #8674	2024-08-15 10:54:05 +01:00
Alexander Bayandin	a9c28be7d0	fix(pageserver): allow unused_imports in download.rs on macOS (#8733 ) ## Problem On macOS, clippy fails with the following error: ``` error: unused import: `crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt` --> pageserver/src/tenant/remote_timeline_client/download.rs:26:5 | 26 | use crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt; | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | = note: `-D unused-imports` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(unused_imports)]` ``` Introduced in https://github.com/neondatabase/neon/pull/8717 ## Summary of changes - allow `unused_imports` for `crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt` on macOS in download.rs	2024-08-15 10:06:28 +01:00
Vlad Lazar	fef77b0cc9	safekeeper: consider partial uploads when pulling timeline (#8628 ) ## Problem The control file contains the id of the safekeeper that uploaded it. Previously, when sending a snapshot of the control file to another sk, it would eventually be gc-ed by the receiving sk. This is incorrect because the original sk might still need it later. ## Summary of Changes When sending a snapshot and the control file contains an uploaded segment: * Create a copy of the segment in s3 with the destination sk in the object name * Tweak the streamed control file to point to the object created in the previous step Note that the snapshot endpoint now has to know the id of the requestor, so the api has been extended to include the node id of the destination sk. Closes https://github.com/neondatabase/neon/issues/8542	2024-08-15 09:02:33 +01:00
Christian Schwarz	168913bdf0	refactor(write path): newtype to enforce use of fully initialized slices (#8717 ) The `tokio_epoll_uring::Slice` / `tokio_uring::Slice` type is weird. The new `FullSlice` newtype is better. See the doc comment for details. The naming is not ideal, but we'll clean that up in a future refactoring where we move the `FullSlice` into `tokio_epoll_uring`. Then, we'll do the following: * tokio_epoll_uring::Slice is removed * `FullSlice` becomes `tokio_epoll_uring::IoBufView` * new type `tokio_epoll_uring::IoBufMutView` for the current `tokio_epoll_uring::Slice<IoBufMut>` Context ------- I did this work in preparation for https://github.com/neondatabase/neon/pull/8537. There, I'm changing the type that the `inmemory_layer.rs` passes to `DeltaLayerWriter::put_value_bytes` and thus it seemed like a good opportunity to make this cleanup first.	2024-08-14 21:57:17 +02:00
Alexander Bayandin	aa2e16f307	CI: misc cleanup & fixes (#8559 ) ## Problem A bunch of small fixes and improvements for CI, that are too small to have a separate PR for them ## Summary of changes - CI(build-and-test): fix parenthesis - CI(actionlint): fix path to workflow file - CI: remove default args from actions/checkout - CI: remove `gen3` label, using a couple `self-hosted` + `small{,-arm64}`/`large{,-arm64}` is enough - CI: prettify Slack messages, hide links behind text messages - CI(build-and-test): add more dependencies to `conclusion` job	2024-08-14 17:56:59 +01:00
Alexander Bayandin	70b18ff481	CI(neon-image): add ARM-specific RUSTFLAGS (#8566 ) ## Problem It's recommended that a couple of additional RUSTFLAGS be set up to improve the performance of Rust applications on AWS Graviton. See `57dc813626/rust.md` Note: Apple Silicon is compatible with neoverse-n1: ``` $ clang --version Apple clang version 15.0.0 (clang-1500.3.9.4) Target: arm64-apple-darwin23.6.0 Thread model: posix InstalledDir: /Applications/Xcode_15.4.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin $ $ clang --print-supported-cpus 2>&1 | grep neoverse- neoverse-512tvb neoverse-e1 neoverse-n1 neoverse-n2 neoverse-v1 neoverse-v2 ``` ## Summary of changes - Add `-Ctarget-feature=+lse -Ctarget-cpu=neoverse-n1` to RUSTFLAGS for ARM images	2024-08-14 17:03:21 +01:00
Joonas Koivunen	60fc1e8cc8	chore: even more responsive compaction cancellation (#8725 ) Some benchmarks and tests might still fail because of #8655 (tracked in #8708) because we are not fast enough to shut down ([one evidence]). Partially this is explained by the current validation mode of streaming k-merge, but otherwise because that is where we use a lot of time in compaction. Outside of L0 => L1 compaction, the image layer generation is already guarded by vectored reads doing cancellation checks. 32768 is a wild guess based on looking how many keys we put in each layer in a bench (1-2 million), but I assume it will be good enough divisor. Doing checks more often will start showing up as contention which we cannot currently measure. Doing checks less often might be reasonable. [one evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10384136483/index.html#suites/9681106e61a1222669b9d22ab136d07b/96e6d53af234924/ Earlier PR: #8706.	2024-08-14 14:48:15 +01:00
Alexander Bayandin	36c1719a07	CI(build-neon): fix accidental neon rebuild on `cargo test` (#8721 ) ## Problem During `Run rust tests` step (for debug builds), we accidentally rebuild neon twice (by `cargo test --doc` and by `cargo nextest run`). It happens because we don't set `cov_prefix` for the `cargo test --doc` command, which triggers rebuilding with different build flags, and one more rebuild by `cargo nextest run`. ## Summary of changes - Set `cov_prefix` for `cargo test --doc` to prevent unneeded rebuilds	2024-08-14 13:38:25 +01:00
John Spray	abb53ba36d	storcon_cli: don't clobber heatmap interval when setting eviction (#8722 ) ## Problem This command is kind of a hack, used when we're migrating large tenants and want to get their resident size down. It sets the tenant config to a fixed value, which omitted heatmap_period, so caused secondaries to get out of date. ## Summary of changes - Set heatmap period to the same 300s default that we use elsewhere when updating eviction settings This is not as elegant as some general purpose partial modification of the config, but it practically makes the command safer to use.	2024-08-14 13:37:03 +01:00
Conrad Ludgate	a7028d92b7	proxy: start of jwk cache (#8690 ) basic JWT implementation that caches JWKs and verifies signatures. this code is currently not reachable from proxy, I just wanted to get something merged in.	2024-08-14 13:35:29 +01:00
Joonas Koivunen	6c9e3c9551	refactor: error/anyhow::Error wrapping (#8697 ) We can get CompactionError::Other(Cancelled) via the error handling with a few ways. [evidence](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8655/10301613380/index.html#suites/cae012a1e6acdd9fdd8b81541972b6ce/653a33de17802bb1/). Hopefully fix it by: 1. replace the `map_err` which hid the `GetReadyAncestorError::Cancelled` with `From<GetReadyAncestorError> for GetVectoredError` conversion 2. simplifying the code in pgdatadir_mapping to eliminate the token anyhow wrapping for deserialization errors 3. stop wrapping GetVectoredError as anyhow errors 4. stop wrapping PageReconstructError as anyhow errors Additionally, produce warnings if we treat any other error (as was legal before this PR) as missing key. Cc: #8708.	2024-08-14 12:45:56 +01:00
Alexander Bayandin	fc3d372f3a	CI(label-for-external-users): check membership using GitHub API (#8724 ) ## Problem `author_association` doesn't properly work if a GitHub user decides not to show affiliation with the org in their profile (i.e. if it's private) ## Summary of changes - Call `/orgs/ORG/members/USERNAME` API to check whether a PR/issue author is a member of the org	2024-08-14 12:27:52 +01:00
John Spray	19d69d515c	pageserver: evict covered layers earlier (#8679 ) ## Problem When pageservers do compaction, they frequently create image layers that make earlier layers un-needed for reads, but then keep those earlier layers around for 24 hours waiting for time-based eviction to expire them. Now that we track layer visibility, we can use it as an input to eviction, and avoid the 24 hour "disk bump" that happens around pageserver restarts. ## Summary of changes - During time-based eviction, if a layer is marked Covered, use the eviction period as the threshold: i.e. these layers get to remain resident for at least one iteration of the eviction loop, but then get evicted. With current settings this means they get evicted after 1h instead of 24h. - During disk usage eviction, prioritize evicting covered layers above all other layers. Caveats: - Using the period as the threshold for time based eviction in this case is a bit of a hack, but it avoids adding yet another configuration property, and in any case the value of a new property would be somewhat arbitrary: there's no "right" length of time to keep covered layers around just in case. - We had previously planned on removing time-based eviction: this change would motivate us to keep it around, but we can still simplify the code later to just do the eviction of covered layers, rather than applying a TTL policy to all layers.	2024-08-14 12:10:15 +01:00
Joonas Koivunen	485d76ac62	timeline_detach_ancestor: adjust error handling (#8528 ) With additional phases from #8430 the `detach_ancestor::Error` became untenable. Split it up into phases, and introduce laundering for remaining `anyhow::Error` to propagate them as most often `Error::ShuttingDown`. Additionally, complete FIXMEs. Cc: #6994	2024-08-14 10:16:18 +01:00
John Spray	4049d2b7e1	scrubber: fix spurious "Missed some shards" errors (#8661 ) ## Problem The storage scrubber was reporting warnings for lots of timelines like: ``` WARN Missed some shards at count ShardCount(0) tenant_id=25eb7a83d9a2f90ac0b765b6ca84cf4c ``` These were spurious: these tenants are fine. There was a bug in accumulating the ShardIndex for each tenant, whereby multiple timelines would lead us to add the same ShardIndex more than once. Closes: #8646 ## Summary of changes - Accumulate ShardIndex in a BTreeSet instead of a Vec - Extend the test to reproduce the issue	2024-08-14 09:29:06 +01:00
Konstantin Knizhnik	7a1736ddcf	Preserve HEAP_COMBOCID when restoring t_cid from WAL (#8503 ) ## Problem See https://github.com/neondatabase/neon/issues/8499 ## Summary of changes Save HEAP_COMBOCID flag in WAL and do not clear it in redo handlers. Related Postgres PRs: https://github.com/neondatabase/postgres/pull/457 https://github.com/neondatabase/postgres/pull/458 https://github.com/neondatabase/postgres/pull/459 Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-08-14 08:13:20 +03:00
Tristan Partin	c624317b0e	Decode the database name in SQL/HTTP connections A url::Url does not hand you back a URL decoded value for path values, so we must decode them ourselves. Link: https://docs.rs/url/2.5.2/url/struct.Url.html#method.path Link: https://docs.rs/url/2.5.2/url/struct.Url.html#method.path_segments Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-08-13 16:32:58 -05:00
Tristan Partin	0f43b7c51b	Loosen type on PgProtocol::safe_psql(queries:) Using Iterable allows us to also use tuples, among other things. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-08-13 16:32:58 -05:00
Joonas Koivunen	6d6e2c6a39	feat(detach_ancestor): better retries with persistent gc blocking (#8430 ) With the persistent gc blocking, we can now retry reparenting timelines which had failed for whatever reason on the previous attempt(s). Restructure the detach_ancestor into three phases: - prepare (insert persistent gc blocking, copy lsn prefix, layers) - detach and reparent - reparenting can fail, so we might need to retry this portion - complete (remove persistent gc blocking) Cc: #6994	2024-08-13 18:51:51 +01:00
Joonas Koivunen	87a5d7db9e	test: do better job of shutting everything down (#8714 ) After #8655 we've had a few issues (mostly tracked on #8708) with the graceful shutdown. In order to shutdown more of the processes and catch more errors, for example, from all pageservers, do an immediate shutdown for those nodes which fail the initial (possibly graceful) shutdown. Cc: #6485	2024-08-13 18:49:50 +01:00
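
Note on commit c6b6b7700a (Fix superuser check in test_snap_files): the message describes a check that always passes because the query result is a tuple like `(False,)`. The sketch below reproduces that pitfall in plain Python. It is only an illustration under assumptions: the function names and the `row` value are hypothetical and are not taken from the actual test code.

```python
def superuser_check_buggy(row):
    """Mimic the pre-fix check: test the raw row instead of its column."""
    # `row` is a one-column result such as (False,). Any non-empty tuple is
    # truthy, so `not row` is always False and "not a superuser" is never
    # reported.
    return not row  # True would mean "not a superuser"


def superuser_check_fixed(row):
    """Unwrap the single column before testing it."""
    (is_superuser,) = row  # raises if the row does not have exactly one column
    return not is_superuser


row = (False,)  # hypothetical result of an "am I superuser?" query
print(superuser_check_buggy(row))  # False: the missing privilege goes unnoticed
print(superuser_check_fixed(row))  # True: the missing privilege is detected
```

Unpacking with `(is_superuser,) = row` also fails loudly if the query ever returns more than one column, which a plain `row[0]` would not.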

5901 Commits