rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-10 06:52:55 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	7429dd711c	fix the .metrics.socket filename in the ignore list	2025-06-30 23:41:09 +03:00
Heikki Linnakangas	88ac1e356b	Ignore the metrics unix domain socket in tests	2025-06-30 23:39:01 +03:00
Erik Grinaker	c3cb1ab98d	Merge branch 'main' into communicator-rewrite	2025-06-30 21:07:01 +02:00
Dmitrii Kovalkov	8e216a3a59	storcon: notify cplane on safekeeper membership change (#12390 ) ## Problem We don't notify cplane about safekeeper membership change yet. Without the notification the compute needs to know all the safekeepers on the cluster to be able to speak to them. Change notifications will allow to avoid it. - Closes: https://github.com/neondatabase/neon/issues/12188 ## Summary of changes - Implement `notify_safekeepers` method in `ComputeHook` - Notify cplane about safekeepers in `safekeeper_migrate` handler. - Update the test to make sure notifications work. ## Out of scope - There is `cplane_notified_generation` field in `timelines` table in strocon's database. It's not needed now, so it's not updated in the PR. Probably we can remove it. - e2e tests to make sure it works with a production cplane	2025-06-30 14:09:50 +00:00
Dmitrii Kovalkov	c746678bbc	storcon: implement safekeeper_migrate handler (#11849 ) This PR implements a safekeeper migration algorithm from RFC-035 https://github.com/neondatabase/neon/blob/main/docs/rfcs/035-safekeeper-dynamic-membership-change.md#change-algorithm - Closes: https://github.com/neondatabase/neon/issues/11823 It is not production-ready yet, but I think it's good enough to commit and start testing. There are some known issues which will be addressed in later PRs: - https://github.com/neondatabase/neon/issues/12186 - https://github.com/neondatabase/neon/issues/12187 - https://github.com/neondatabase/neon/issues/12188 - https://github.com/neondatabase/neon/issues/12189 - https://github.com/neondatabase/neon/issues/12190 - https://github.com/neondatabase/neon/issues/12191 - https://github.com/neondatabase/neon/issues/12192 ## Summary of changes - Implement `tenant_timeline_safekeeper_migrate` handler to drive the migration - Add possibility to specify number of safekeepers per timeline in tests (`timeline_safekeeper_count`) - Add `term` and `flush_lsn` to `TimelineMembershipSwitchResponse` - Implement compare-and-swap (CAS) operation over timeline in DB for updating membership configuration safely. - Write simple test to verify that migration code works	2025-06-30 08:30:05 +00:00
Erik Grinaker	9d9e3cd08a	Fix `test_normal_work` grpc param	2025-06-30 10:13:46 +02:00
Aleksandr Sarantsev	9bb4688c54	storcon: Remove testing feature from kick_secondary_downloads (#12383 ) ## Problem Some of the design decisions in PR #12256 were influenced by the requirements of consistency tests. These decisions introduced intermediate logic that is no longer needed and should be cleaned up. ## Summary of Changes - Remove the `feature("testing")` flag related to `kick_secondary_download`. - Set the default value of `kick_secondary_download` back to false, reflecting the intended production behavior. Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-06-30 05:41:05 +00:00
Heikki Linnakangas	a352d290eb	Plumb through both libpq and grpc connection strings to the compute Add a new 'pageserver_connection_info' field in the compute spec. It replaces the old 'pageserver_connstring' field with a more complicated struct that includes both libpq and grpc URLs, for each shard (or only one of the the URLs, depending on the configuration). It also includes a flag suggesting which one to use; compute_ctl now uses it to decide which protocol to use for the basebackup. This is compatible with everything that's in production, because the control plane never used the 'pageserver_connstring' field. That was added a long time ago with the idea that it would replace the code that digs the 'neon.pageserver_connstring' GUC from the list of Postgres settings, but we never got around to do that in the control plane. Hence, it was only used with neon_local. But the plan now is to pass the 'pageserver_connection_info' from the control plane, and once that's fully deployed everywhere, the code to parse 'neon.pageserver_connstring' in compute_ctl can be removed. The 'grpc' flag on an endpoint in endpoint config is now more of a suggestion. Compute_ctl gets both URLs, so it can choose to use libpq or grpc as it wishes. It currently always obeys the 'prefer_grpc' flag that's part of the connection info though. Postgres however uses grpc iff the new rust-based communicator is enabled. TODO/plan for the control plane: - Start to pass `pageserver_connection_info` in the spec file. - Also keep the current `neon.pageserver_connstring` setting for now, for backwards compatibility with old computes After that, the `pageserver_connection_info.prefer_grpc` flag in the spec file can be used to control whether compute_ctl uses grpc or libpq. The actual compute's grpc usage will be controlled by the `neon.enable_new_communicator` GUC. It can be set separately from 'prefer_grpc'. Later: - Once all old computes are gone, remove the code to pass `neon.pageserver_connstring`	2025-06-29 18:16:49 +03:00
Arpad Müller	4c7956fa56	Fix hang deleting offloaded timelines (#12366 ) We don't have cancellation support for timeline deletions. In other words, timeline deletion might still go on in an older generation while we are attaching it in a newer generation already, because the cancellation simply hasn't reached the deletion code. This has caused us to hit a situation with offloaded timelines in which the timeline was in an unrecoverable state: always returning an accepted response, but never a 404 like it should be. The detailed description can be found in [here](https://github.com/neondatabase/cloud/issues/30406#issuecomment-3008667859) (private repo link). TLDR: 1. we ask to delete timeline on old pageserver/generation, starts process in background 2. the storcon migrates the tenant to a different pageserver. - during attach, the pageserver still finds an index part, so it adds it to `offloaded_timelines` 4. the timeline deletion finishes, removing the index part in S3 5. there is a retry of the timeline deletion endpoint, sent to the new pageserver location. it is bound to fail however: - as the index part is gone, we print `Timeline already deleted in remote storage`. - the problem is that we then return an accepted response code, and not a 404. - this confuses the code calling us. it thinks the timeline is not deleted, so keeps retrying. - this state never gets recovered from until a reset/detach, because of the `offloaded_timelines` entry staying there. This is where this PR fixes things: if no index part can be found, we can safely assume that the timeline is gone in S3 (it's the last thing to be deleted), so we can remove it from `offloaded_timelines` and trigger a reupload of the manifest. Subsequent retries will pick that up. Why not improve the cancellation support? It is a more disruptive code change, that might have its own risks. So we don't do it for now. Fixes https://github.com/neondatabase/cloud/issues/30406	2025-06-27 15:14:55 +00:00
Vlad Lazar	cc1664ef93	pageserver: allow flush task cancelled error in sharding autosplit test (#12374 ) ## Problem Test is failing due to compaction shutdown noise (see https://github.com/neondatabase/neon/issues/12162). ## Summary of changes Allow list the noise.	2025-06-27 13:13:11 +00:00
Arpad Müller	232f2447d4	Support pull_timeline of timelines without writes (#12028 ) Make the safekeeper `pull_timeline` endpoint support timelines that haven't had any writes yet. In the storcon managed sk timelines world, if a safekeeper goes down temporarily, the storcon will schedule a `pull_timeline` call. There is no guarantee however that by when the safekeeper is online again, there have been writes to the timeline yet. The `snapshot` endpoint gives an error if the timeline hasn't had writes, so we avoid calling it if `timeline_start_lsn` indicates a freshly created timeline. Fixes #11422 Part of #11670	2025-06-26 16:29:03 +00:00
Arpad Müller	1dc01c9bed	Support cancellations of timelines with hanging ondemand downloads (#12330 ) In `test_layer_download_cancelled_by_config_location`, we simulate hung downloads via the `before-downloading-layer-stream-pausable` failpoint. Then, we cancel a timeline via the `location_config` endpoint. With the new default as of https://github.com/neondatabase/neon/pull/11712, we would be creating the timeline on safekeepers regardless if there have been writes or not, and it turns out the test relied on the timeline not existing on safekeepers, due to a cancellation bug: * as established before, the test makes the read path hang * the timeline cancellation function first cancels the walreceiver, and only then cancels the timeline's token * `WalIngest::new` is requesting a checkpoint, which hits the read path * at cancellation time, we'd be hanging inside the read, not seeing the cancellation of the walreceiver * the test would time out due to the hang This is probably also reproducible in the wild when there is S3 unavailabilies or bottlenecks. So we thought that it's worthwhile to fix the hang issue. The approach chosen in the end involves the `tokio::select` macro. In PR 11712, we originally punted on the test due to the hang and opted it out from the new default, but now we can use the new default. Part of https://github.com/neondatabase/neon/issues/12299	2025-06-25 13:40:38 +00:00
Tristan Partin	aa75722010	Set pgaudit.log=none for monitoring connections (#12137 ) pgaudit can spam logs due to all the monitoring that we do. Logs from these connections are not necessary for HIPPA compliance, so we can stop logging from those connections. Part-of: https://github.com/neondatabase/cloud/issues/29574 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-06-24 17:42:23 +00:00
Matthias van de Meent	6c6de6382a	Use enum-typed PG versions (#12317 ) This makes it possible for the compiler to validate that a match block matched all PostgreSQL versions we support. ## Problem We did not have a complete picture about which places we had to test against PG versions, and what format these versions were: The full PG version ID format (Major/minor/bugfix `MMmmbb`) as transfered in protocol messages, or only the Major release version (`MM`). This meant type confusion was rampant. With this change, it becomes easier to develop new version-dependent features, by making type and niche confusion impossible. ## Summary of changes Every use of `pg_version` is now typed as either `PgVersionId` (u32, valued in decimal `MMmmbb`) or PgMajorVersion (an enum, with a value for every major version we support, serialized and stored like a u32 with the value of that major version) --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-06-24 17:25:31 +00:00
Arpad Müller	0efff1db26	Allow cancellation errors in tests that allow timeline deletion errors (#12315 ) After merging of PR https://github.com/neondatabase/neon/pull/11712 we saw some tests be flaky, with errors showing up about the timeline having been cancelled instead of having been deleted. This is an outcome that is inherently racy with the "has been deleted" error. In some instances, https://github.com/neondatabase/neon/pull/11712 has already added the error about the timeline having been cancelled. This PR adds them to the remaining instances of https://github.com/neondatabase/neon/pull/11712, fixing the flakiness.	2025-06-23 22:26:38 +00:00
Aleksandr Sarantsev	5eecde461d	storcon: Fix migration for Attached(0) tenants (#12256 ) ## Problem `Attached(0)` tenant migrations can get stuck if the heatmap file has not been uploaded. ## Summary of Changes - Added a test to reproduce the issue. - Introduced a `kick_secondary_downloads` config flag: - Enabled in testing environments. - Disabled in production (and in the new test). - Updated `Attached(0)` locations to consider the number of secondaries in their intent when deciding whether to download the heatmap.	2025-06-23 18:55:26 +00:00
Alex Chi Z.	85164422d0	feat(pageserver): support force overriding feature flags (#12233 ) ## Problem Part of #11813 ## Summary of changes Add a test API to make it easier to manipulate the feature flags within tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-06-23 17:31:53 +00:00
Erik Grinaker	68a175d545	test_runner: fix `test_basebackup_with_high_slru_count` gzip param (#12319 ) The `--gzip-probability` parameter was removed in #12250. However, `test_basebackup_with_high_slru_count` still uses it, and keeps failing. This patch removes the use of the parameter (gzip is enabled by default).	2025-06-23 15:33:45 +00:00
Heikki Linnakangas	2d913ff125	fix some mismerges	2025-06-23 18:21:16 +03:00
Heikki Linnakangas	356ba67607	Merge remote-tracking branch 'origin/main' into HEAD I also included build script changes from https://github.com/neondatabase/neon/pull/12266, which is not yet merged but will be soon.	2025-06-23 17:46:30 +03:00
Tristan Partin	868c38f522	Rename the compute_ctl admin scope to compute_ctl:admin (#12263 ) Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-06-20 22:49:05 +00:00
Alex Chi Z.	79485e7c3a	feat(pageserver): enable gc-compaction by default everywhere (#12105 ) Enable it across tests and set it as default. Marks the first milestone of https://github.com/neondatabase/neon/issues/9114. We already enabled it in all AWS regions and planning to enable it in all Azure regions next week. will merge after we roll out in all regions. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-06-20 15:35:11 +00:00
Heikki Linnakangas	eaf1ab21c4	Store intermediate build files in `build/` rather than `pg_install/build/` (#12295 ) This way, `pg_install` contains only the final build artifacts, not intermediate files like *.o files. Seems cleaner.	2025-06-20 14:50:03 +00:00
Arpad Müller	8b197de7ff	Increase upload timeout for test_tenant_s3_restore (#12297 ) Increase the upload timeout of the test to avoid hitting timeouts (which we sometimes do). Fixes https://github.com/neondatabase/neon/issues/12212	2025-06-20 10:33:11 +00:00
Arpad Müller	ec1452a559	Switch on --timelines-onto-safekeepers in integration tests (#11712 ) Switch on the `--timelines-onto-safekeepers` param in integration tests. Some changes that were needed to enable this but which I put into other PRs to not clutter up this one: * #11786 * #11854 * #12129 * #12138 Further fixes that were needed for this: * https://github.com/neondatabase/neon/pull/11801 * https://github.com/neondatabase/neon/pull/12143 * https://github.com/neondatabase/neon/pull/12204 Not strictly needed, but helpful: * https://github.com/neondatabase/neon/pull/12155 Part of #11670 Closes #11424	2025-06-19 11:17:01 +00:00
Erik Grinaker	d8d62fb7cb	test_runner: add gRPC support (#12279 ) ## Problem `test_runner` integration tests should support gRPC. Touches #11926. ## Summary of changes * Enable gRPC for Pageservers, with dynamic port allocations. * Add a `grpc` parameter for endpoint creation, plumbed through to `neon_local endpoint create`. No tests actually use gRPC yet, since computes don't support it yet.	2025-06-18 14:05:13 +00:00
Aleksandr Sarantsev	e6a404c66d	Fix flaky test_sharding_split_failures (#12199 ) ## Problem `test_sharding_failures` is flaky due to interference from the `background_reconcile` process. The details are in the issue https://github.com/neondatabase/neon/issues/12029. ## Summary of changes - Use `reconcile_until_idle` to ensure a stable state before running test assertions - Added error tolerance in `reconcile_until_idle` test function (Failure cases: 1, 3, 19, 20) - Ignore the `Keeping extra secondaries` warning message since it i retryable (Failure case: 2) - Deduplicated code in `assert_rolled_back` and `assert_split_done` - Added a log message before printing plenty of Node `X` seen on pageserver `Y`	2025-06-18 13:27:41 +00:00
Peter Bendel	7e711ede44	Increase tenant size for large tenant oltp workload (#12260 ) ## Problem - We run the large tenant oltp workload with a fixed size (larger than existing customers' workloads). Our customer's workloads are continuously growing and our testing should stay ahead of the customers' production workloads. - we want to touch all tables in the tenant's database (updates) so that we simulate a continuous change in layer files like in a real production workload - our current oltp benchmark uses a mixture of read and write transactions, however we also want a separate test run with read-only transactions only ## Summary of changes - modify the existing workload to have a separate run with pgbench custom scripts that are read-only - create a new workload that - grows all large tables in each run (for the reuse branch in the large oltp tenant's project) - updates a percentage of rows in all large tables in each run (to enforce table bloat and auto-vacuum runs and layer rebuild in pageservers Each run of the new workflow increases the logical database size about 16 GB. We start with 6 runs per day which will give us about 96-100 GB growth per day. --------- Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>	2025-06-18 12:40:25 +00:00
Mikhail	e95f2f9a67	compute_ctl: return LSN in /terminate (#12240 ) - Add optional `?mode=fast\|immediate` to `/terminate`, `fast` is default. Immediate avoids waiting 30 seconds before returning from `terminate`. - Add `TerminateMode` to `ComputeStatus::TerminationPending` - Use `/terminate?mode=immediate` in `neon_local` instead of `pg_ctl stop` for `test_replica_promotes`. - Change `test_replica_promotes` to check returned LSN - Annotate `finish_sync_safekeepers` as `noreturn`. https://github.com/neondatabase/cloud/issues/29807	2025-06-18 12:25:19 +00:00
Dimitri Fontaine	67fbc0582e	Validate safekeeper_connstrings when parsing compute specs. (#11906 ) This check API only cheks the safekeeper_connstrings at the moment, and the validation is limited to checking we have at least one entry in there, and no duplicates. ## Problem If the compute_ctl service is started with an empty list of safekeepers, then hard-to-debug errors may happen at runtime, where it would be much easier to catch them early. ## Summary of changes Add an entry point in the compute_ctl API to validate the configuration for safekeeper_connstrings. --------- Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-06-18 10:01:05 +00:00
Suhas Thalanki	83069f6ca1	fix: terminate pgbouncer on compute suspend (#12153 ) ## Problem PgBouncer does not terminate connections on a suspend: https://github.com/neondatabase/cloud/issues/16282 ## Summary of changes 1. Adds a pid file to store the pid of PgBouncer 2. Terminates connections on a compute suspend --------- Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>	2025-06-17 22:56:05 +00:00
Mikhail	7d4f662fbf	upgrade default neon version to 1.6 (#12185 ) Changes for 1.6 were merged and deployed two months ago https://github.com/neondatabase/neon/blob/main/pgxn/neon/neon--1.6--1.5.sql. In order to deploy https://github.com/neondatabase/neon/pull/12183, we need 1.6 to be default, otherwise we can't use prewarm API on read-only replica (`ALTER EXTENSION` won't work) and we need it for promotion	2025-06-17 17:46:35 +00:00
Konstantin Knizhnik	dfa055f4be	Support event trigger for Neon users (#10624 ) ## Problem https://github.com/neondatabase/neon/issues/7570 Even triggers are supported only for superusers. ## Summary of changes Temporary switch to superuser when even trigger is created and disable execution of user's even triggers under superuser. --------- Co-authored-by: Dimitri Fontaine <dim@tapoueh.org> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-06-17 15:44:50 +00:00
Dmitrii Kovalkov	f2e96b2323	tests: prepare test_compatibility.py for --timelines-onto-safekeepers (#12204 ) ## Problem Compatibility tests may be run against a compatibility snapshot generated with --timelines-onto-safekeepers=false. We need to start the compute without a generation (or with 0 generation) if the timeline is not storcon-managed, otherwise the compute will hang. - Follow up on https://github.com/neondatabase/neon/pull/12203 - Relates to https://github.com/neondatabase/neon/pull/11712 ## Summary of changes - Handle compatibility snapshot generated with no `--timelines-onot-safekeepers` properly	2025-06-17 15:16:07 +00:00
Dmitrii Kovalkov	dee73f0cb4	pageserver: implement max_total_size_bytes limit for basebackup cache (#12230 ) ## Problem The cache was introduced as a hackathon project and the only supported limit was the number of entries. The basebackup entry size may vary. We need to have more control over disk space usage to ship it to production. - Part of https://github.com/neondatabase/cloud/issues/29353 ## Summary of changes - Store the size of entries in the cache and use it to limit `max_total_size_bytes` - Add the size of the cache in bytes to metrics.	2025-06-17 15:08:59 +00:00
Alexander Lakhin	118e13438d	Add "Build and Test Fully" workflow (#11931 ) ## Problem We don't test debug builds for v14..v16 in the regular "Build and Test" runs to perform the testing faster, but it means we can't detect assertion failures in those versions. (See https://github.com/neondatabase/neon/issues/11891, https://github.com/neondatabase/neon/issues/11997) ## Summary of changes Add a new workflow to test all build types and all versions on all architectures.	2025-06-16 13:29:39 +00:00
Erik Grinaker	782062014e	Fix `test_normal_work` endpoint restart	2025-06-16 10:16:27 +02:00
Dmitrii Kovalkov	e83f1d8ba5	tests: prepare test_historic_storage_formats for --timelines-onto-safekeepers (#12214 ) ## Problem `test_historic_storage_formats` uses `/tenant_import` to import historic data. Tenant import does not create timelines onto safekeepers, because they might already exist on some safekeeper set. If it does, then we may end up with two different quorums accepting WAL for the same timeline. If the tenant import is used in a real deployment, the administrator is responsible for looking for the proper safekeeper set and migrate timelines into storcon-managed timelines. - Relates to https://github.com/neondatabase/neon/pull/11712 ## Summary of changes - Create timelines onto safekeepers manually after tenant import in `test_historic_storage_formats` - Add a note to tenant import that timelines will be not storcon-managed after the import.	2025-06-13 06:28:18 +00:00
Dmitrii Kovalkov	60dfdf39c7	tests: prepare test_tenant_delete_stale_shards for --timelines-onto-safekeepers (#12198 ) ## Problem The test creates an endpoint and deletes its tenant. The compute cannot stop gracefully because it tries to write a checkpoint shutdown record into the WAL, but the timeline had been already deleted from safekeepers. - Relates to https://github.com/neondatabase/neon/pull/11712 ## Summary of changes Stop the compute before deleting a tenant	2025-06-12 08:10:22 +00:00
Dmitrii Kovalkov	3d5e2bf685	storcon: add tenant_timeline_locate handler (#12203 ) ## Problem Compatibility tests may be run against a compatibility snapshot generated with `--timelines-onto-safekeepers=false`. We need to start the compute without a generation (or with 0 generation) if the timeline is not storcon-managed, otherwise the compute will hang. This handler is needed to check if the timeline is storcon-managed. It's also needed for better test coverage of safekeeper migration code. - Relates to https://github.com/neondatabase/neon/pull/11712 ## Summary of changes - Implement `tenant_timeline_locate` handler in storcon to get safekeeper info from storcon's DB	2025-06-12 08:09:57 +00:00
Mikhail	1b935b1958	endpoint_storage: add ?from_endpoint= to /lfc/prewarm (#12195 ) Related: https://github.com/neondatabase/cloud/issues/24225 Add optional from_endpoint parameter to allow prewarming from other endpoint	2025-06-10 19:25:32 +00:00
a-masterov	3f16ca2c18	Respect limits for projects for the Random Operations test (#12184 ) ## Problem The project limits were not respected, resulting in errors. ## Summary of changes Now limits are checked before running an action, and if the action is not possible to run, another random action will be run. --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech>	2025-06-10 15:59:51 +00:00
Conrad Ludgate	58327ef74d	[proxy] fix sql-over-http password setting (#12177 ) ## Problem Looks like our sql-over-http tests get to rely on "trust" authentication, so the path that made sure the authkeys data was set was never being hit. ## Summary of changes Slight refactor to WakeComputeBackends, as well as making sure auth keys are propagated. Fix tests to ensure passwords are tested.	2025-06-10 08:46:29 +00:00
Konstantin Knizhnik	f42d44342d	Increase statement timeout for test_pageserver_restarts_under_workload test (#12139 ) \## Problem See https://github.com/neondatabase/neon/issues/12119#issuecomment-2942586090 Page server restarts with interval 1 seconds increases time of vacuum especially off prefetch is enabled and so cause test failure because of statement timeout expiration. ## Summary of changes Increase statement timeout to 360 seconds. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>	2025-06-10 05:32:03 +00:00
Erik Grinaker	f4d51c0f5c	Use gRPC for `test_normal_work`	2025-06-09 22:51:15 +02:00
Konstantin Knizhnik	d759fcb8bd	Increase wait LFC prewarm timeout (#12174 ) ## Problem See https://github.com/neondatabase/neon/issues/12171 ## Summary of changes Increase LFC prewarm wait timeout to 1 minute Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-06-09 18:01:30 +00:00
Erik Grinaker	3c7235669a	pageserver: don't delete parent shard files until split is committed (#12146 ) ## Problem If a shard split fails and must roll back, the tenant may hit a cold start as the parent shard's files have already been removed from local disk. External contribution with minor adjustments, see https://neondb.slack.com/archives/C08TE3203RQ/p1748246398269309. ## Summary of changes Keep the parent shard's files on local disk until the split has been committed, such that they are available if the spilt is rolled back. If all else fails, the files will be removed on the next Pageserver restart. This should also be fine in a mixed version: * New storcon, old Pageserver: the Pageserver will delete the files during the split, storcon will log an error when the cleanup detach fails. * Old storcon, new Pageserver: the Pageserver will leave the parent's files around until the next Pageserver restart. The change looks good to me, but shard splits are delicate so I'd like some extra eyes on this.	2025-06-06 15:55:14 +00:00
Erik Grinaker	e74a957045	test_runner: initial gRPC protocol support	2025-06-06 16:56:33 +02:00
Erik Grinaker	396a16a3b2	test_runner: enable gRPC Pageserver	2025-06-06 14:55:29 +02:00
Arpad Müller	df7e301a54	safekeeper: special error if a timeline has been deleted (#12155 ) We might delete timelines on safekeepers before we are deleting them on pageservers. This should be an exceptional situation, but can occur. As the first step to improve behaviour here, emit a special error that is less scary/obscure than "was not found in global map". It is for example emitted when the pageserver tries to run `IDENTIFY_SYSTEM` on a timeline that has been deleted on the safekeeper. Found when analyzing the failure of `test_scrubber_physical_gc_timeline_deletion` when enabling `--timelines-onto-safekeepers` on the pytests. Due to safekeeper restarts, there is no hard guarantee that we will keep issuing this error, so we need to think of something better if we start encountering this in staging/prod. But I would say that the introduction of `--timelines-onto-safekeepers` in the pytests and into staging won't change much about this: we are already deleting timelines from there. In `test_scrubber_physical_gc_timeline_deletion`, we'd just be leaking the timeline before on the safekeepers. Part of #11712	2025-06-06 11:54:07 +00:00

1 2 3 4 5 ...

2184 Commits