## Problem
When converting `proto::GetPageRequest` into `page_api::GetPageRequest`
and validating the request, errors are returned as `tonic::Status`. This
will tear down the GetPage stream, which is disruptive and unnecessary.
## Summary of changes
Emit invalid request errors as `GetPageResponse` with an appropriate
`status_code` instead.
Also move the conversion from `tonic::Status` to `GetPageResponse` out
into the stream handler.
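In schematic form, with stand-in types rather than the real `proto`/`page_api` definitions, the idea looks roughly like this: a failed validation becomes a normal in-stream response carrying an error `status_code`, rather than a `tonic::Status` that would terminate the stream.
```
// Stand-in types; the real request/response types live in page_api.
#[derive(Debug)]
enum StatusCode {
    Ok,
    InvalidRequest,
}

#[derive(Debug)]
struct GetPageResponse {
    status_code: StatusCode,
    message: String,
    page: Vec<u8>,
}

/// Convert and validate the wire request; a failure here must not kill the stream.
fn validate(block_number: u32) -> Result<u32, String> {
    if block_number == u32::MAX {
        Err(format!("invalid block number {block_number}"))
    } else {
        Ok(block_number)
    }
}

fn handle_request(block_number: u32) -> GetPageResponse {
    match validate(block_number) {
        Ok(block) => GetPageResponse {
            status_code: StatusCode::Ok,
            message: String::new(),
            page: read_page(block),
        },
        // Invalid request: respond in-stream with an error status_code instead
        // of returning tonic::Status, which would tear down the GetPage stream.
        Err(msg) => GetPageResponse {
            status_code: StatusCode::InvalidRequest,
            message: msg,
            page: Vec::new(),
        },
    }
}

fn read_page(_block: u32) -> Vec<u8> {
    vec![0; 8192] // placeholder page image
}
```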
## Problem
Compatibility tests may be run against a compatibility snapshot
generated with `--timelines-onto-safekeepers=false`. We need to start
the compute without a generation (or with generation 0) if the timeline
is not storcon-managed; otherwise the compute will hang.
- Follow up on https://github.com/neondatabase/neon/pull/12203
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
- Properly handle compatibility snapshots generated without
`--timelines-onto-safekeepers`
## Problem
The cache was introduced as a hackathon project and the only supported
limit was the number of entries.
The basebackup entry size may vary. We need more control over disk
space usage before shipping the cache to production.
- Part of https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Store the size of each entry in the cache and use it to enforce a
`max_total_size_bytes` limit (see the sketch below).
- Add the size of the cache in bytes to metrics.
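A rough, self-contained sketch of the byte-bounded eviction (names, key type, and eviction order are illustrative, not the actual cache): the cache tracks the total size of stored entries and evicts until it fits under `max_total_size_bytes`; that running total is also what a size-in-bytes metric would report.
```
use std::collections::VecDeque;

/// Simplified FIFO cache bounded by the total entry size in bytes.
struct BasebackupCache {
    max_total_size_bytes: u64,
    total_size_bytes: u64,
    entries: VecDeque<(String, Vec<u8>)>,
}

impl BasebackupCache {
    fn new(max_total_size_bytes: u64) -> Self {
        Self {
            max_total_size_bytes,
            total_size_bytes: 0,
            entries: VecDeque::new(),
        }
    }

    fn insert(&mut self, key: String, data: Vec<u8>) {
        self.total_size_bytes += data.len() as u64;
        self.entries.push_back((key, data));
        // Evict the oldest entries until the total size fits the byte budget.
        while self.total_size_bytes > self.max_total_size_bytes {
            match self.entries.pop_front() {
                Some((_key, evicted)) => self.total_size_bytes -= evicted.len() as u64,
                None => break,
            }
        }
        // `self.total_size_bytes` is what a cache-size-in-bytes gauge would report.
    }
}
```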
## Problem
`neon_local` should support endpoints using gRPC, by providing `grpc://`
connstrings with the Pageservers' gRPC ports.
Requires #12268.
Touches #11926.
## Summary of changes
* Add `--grpc` switch for `neon_local endpoint create`.
* Generate `grpc://` connstrings for endpoints when enabled.
Computes don't actually support `grpc://` connstrings yet, but will
soon.
gRPC is configured when the endpoint is created, not when it's started,
such that it continues to use gRPC across restarts and reconfigurations.
In particular, this is necessary for the storage controller's local
notify hook, which can't easily plumb gRPC configuration through from
the start/reconfigure commands but does have access to the endpoint's
configuration.
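A small sketch of how the persisted per-endpoint choice can drive connstring generation (field and function names are illustrative, not the actual `neon_local` code):
```
/// Illustrative endpoint config: the gRPC choice is persisted when the
/// endpoint is created, so restarts and reconfigurations keep using it.
struct EndpointConf {
    grpc: bool,
}

fn pageserver_connstring(
    conf: &EndpointConf,
    host: &str,
    libpq_port: u16,
    grpc_port: u16,
) -> String {
    if conf.grpc {
        // Point the compute at the pageserver's gRPC port.
        format!("grpc://{host}:{grpc_port}")
    } else {
        format!("postgresql://{host}:{libpq_port}")
    }
}

fn main() {
    let conf = EndpointConf { grpc: true };
    println!("{}", pageserver_connstring(&conf, "127.0.0.1", 64000, 51051));
}
```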
## Problem
The metrics `storage_controller_safekeeper_request_error` and
`storage_controller_safekeeper_request_latency` currently use
`pageserver_id` as a label.
This can be misleading, as the metrics are about safekeeper requests.
We want to replace this with a more accurate label — either
`safekeeper_id` or `node_id`.
## Summary of changes
- Introduced `SafekeeperRequestLabelGroup` with `safekeeper_id`.
- Updated the affected metrics to use the new label group.
- Fixed incorrect metric usage in `safekeeper_client.rs`.
## Follow-up
- Review usage of these metrics in alerting rules and existing Grafana
dashboards to ensure this change does not break anything.
## Problem
Pageservers now expose a gRPC API on a separate address and port. This
must be registered with the storage controller such that it can be
plumbed through to the compute via cplane.
Touches #11926.
## Summary of changes
This patch registers the gRPC address and port with the storage
controller:
* Add gRPC address to `nodes` database table and `NodePersistence`, with
a Diesel migration.
* Add gRPC address in `NodeMetadata`, `NodeRegisterRequest`,
`NodeDescribeResponse`, and `TenantLocateResponseShard`.
* Add gRPC address flags to `storcon_cli node-register`.
These changes are backwards-compatible, since all structs will ignore
unknown fields during deserialization.
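The compatibility argument relies on standard serde behavior, roughly as in this sketch (an illustrative subset of a registration payload, not the exact structs): a new optional field deserializes to `None` when absent, and receivers without the field ignore it since unknown fields are not denied.
```
use serde::{Deserialize, Serialize};

// Illustrative subset of a node registration payload.
#[derive(Serialize, Deserialize, Debug)]
struct NodeRegisterRequest {
    node_id: u64,
    listen_pg_addr: String,
    listen_pg_port: u16,
    // New optional fields: absent in payloads from old senders,
    // so they deserialize to None instead of failing.
    #[serde(default)]
    listen_grpc_addr: Option<String>,
    #[serde(default)]
    listen_grpc_port: Option<u16>,
}

fn main() -> Result<(), serde_json::Error> {
    // An old payload without the gRPC fields still deserializes fine.
    let old = r#"{"node_id":1,"listen_pg_addr":"10.0.0.1","listen_pg_port":6400}"#;
    let req: NodeRegisterRequest = serde_json::from_str(old)?;
    assert!(req.listen_grpc_addr.is_none());

    // Old receivers ignore the extra fields (no deny_unknown_fields).
    let new = r#"{"node_id":1,"listen_pg_addr":"10.0.0.1","listen_pg_port":6400,
                  "listen_grpc_addr":"10.0.0.1","listen_grpc_port":51051}"#;
    let _req: NodeRegisterRequest = serde_json::from_str(new)?;
    Ok(())
}
```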
## Problem
The gRPC base backup implementation has a few issues: chunks are not
properly bounded, and it's not possible to omit the LSN.
Touches #11728.
## Summary of changes
* Properly bound chunks by using a limited writer (sketched below).
* Use an `Option<Lsn>` rather than a `ReadLsn` (the latter requires an
LSN).
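A self-contained sketch of the limited-writer idea (not the actual implementation): a wrapper around any `std::io::Write` that refuses to exceed a per-chunk byte budget, so a single chunk can never grow unbounded.
```
use std::io::{self, Write};

/// Writer wrapper that caps the number of bytes written.
struct LimitedWriter<W> {
    inner: W,
    remaining: usize,
}

impl<W: Write> LimitedWriter<W> {
    fn new(inner: W, limit: usize) -> Self {
        Self { inner, remaining: limit }
    }
}

impl<W: Write> Write for LimitedWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if self.remaining == 0 {
            // Signal the caller to flush this chunk and start a new one.
            return Err(io::Error::new(
                io::ErrorKind::WriteZero,
                "chunk size limit reached",
            ));
        }
        let n = buf.len().min(self.remaining);
        let written = self.inner.write(&buf[..n])?;
        self.remaining -= written;
        Ok(written)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

fn main() -> io::Result<()> {
    let mut chunk = LimitedWriter::new(Vec::new(), 8);
    chunk.write_all(b"12345678")?;
    assert!(chunk.write_all(b"9").is_err()); // would exceed the chunk budget
    Ok(())
}
```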
## Problem
The `stably_attached` function is hard to read due to deeply nested
conditionals.
## Summary of changes
- Refactored `stably_attached` to use early returns and the `?` operator
for improved readability (see the simplified before/after below)
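The shape of the refactor, on a simplified stand-in rather than the real `stably_attached` code: nested `if let` chains become a flat sequence of early returns via the `?` operator on `Option`.
```
#[derive(Clone, Copy, PartialEq)]
struct NodeId(u64);

struct Intent {
    attached: Option<NodeId>,
}

struct Observed {
    attached: Option<NodeId>,
}

// Before: deeply nested conditionals.
fn stably_attached_nested(intent: &Intent, observed: &Observed) -> Option<NodeId> {
    if let Some(intent_node) = intent.attached {
        if let Some(observed_node) = observed.attached {
            if intent_node == observed_node {
                return Some(intent_node);
            }
        }
    }
    None
}

// After: early returns via the `?` operator.
fn stably_attached(intent: &Intent, observed: &Observed) -> Option<NodeId> {
    let intent_node = intent.attached?;
    let observed_node = observed.attached?;
    if intent_node != observed_node {
        return None;
    }
    Some(intent_node)
}
```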
## Problem
If the node intent includes more than one secondary, we can generate a
replace optimization using a candidate node that is already a secondary
location.
## Summary of changes
- Exclude all other secondary nodes from the scoring process to ensure
optimal candidate selection (illustrated below).
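In schematic form (types are illustrative): nodes already holding a secondary for the shard are filtered out before any candidate is scored.
```
use std::collections::HashSet;

type NodeId = u64;

/// Sketch of the fixed behavior: replacement candidates never include nodes
/// that already host a secondary location for this shard.
fn replacement_candidates(all_nodes: &[NodeId], secondaries: &HashSet<NodeId>) -> Vec<NodeId> {
    all_nodes
        .iter()
        .copied()
        .filter(|node| !secondaries.contains(node))
        .collect()
}
```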
## Problem
We could easily miss a sanitizer-detected defect that occurs due to a
race condition: failed tests are simply rerun, and if the rerun
succeeds, the overall test run is considered successful. This was more
reasonable before, when we had many more unstable tests in main, but now
we can track all test failures.
## Summary of changes
Don't rerun failed tests.
## Problem
We don't test debug builds for v14..v16 in the regular "Build and Test"
runs, to keep them fast, but this means we can't detect assertion
failures in those versions.
(See https://github.com/neondatabase/neon/issues/11891,
https://github.com/neondatabase/neon/issues/11997)
## Summary of changes
Add a new workflow to test all build types and all versions on all
architectures.
## Problem
In our environment, we don't always have access to the pagectl tool on
the pageserver. We have to download the page files to local env to
introspect them. Hence, it'll be useful to be able to parse the local
files using `pagectl`.
## Summary of changes
* Add `dump-layer-local` to `pagectl`, which takes a local path as an
argument and prints the layer content:
```
cargo run -p pagectl layer dump-layer-local ~/Desktop/000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00003E7A53EDE611-00003E7AF27BFD19-v1-00000001
```
* Bonus: Fix a bug in `pageserver/ctl/src/draw_timeline_dir.rs` where
temporary files were not filtered out.
## Problem
The `page_api` domain types are missing a few derives.
## Summary of changes
Add `Clone`, `Copy`, and `Debug` derives for all types where
appropriate.
## Problem
The comment for the `storage_controller_reconcile_long_running` metric
was copy-pasted and not updated in #9207
## Summary of changes
- Fixed comment
## Problem
The comment is in the wrong place: the `/metrics` code is above its
description comment.
## Summary of changes
- `/metrics` code is now below the comment
## Problem
The metric names misspell `safekeeper` as `safkeeper`; fix the naming.
## Summary of changes
- `storage_controller_safkeeper_reconciles_*` renamed to
`storage_controller_safekeeper_reconciles_*`
## Problem
ChaosInjector is intended to skip non-active scheduling policies, but
the current logic skips active shards instead.
## Summary of changes
- Fixed the shard eligibility condition to correctly allow chaos
injection only for shards with an `Active` scheduling policy (shown
schematically below).
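The corrected condition, in schematic form (enum and helper names are illustrative):
```
enum ShardSchedulingPolicy {
    Active,
    Essential,
    Pause,
    Stop,
}

/// A shard is eligible for chaos injection only if its scheduling policy is
/// Active. The bug was the inverted check: active shards were being skipped.
fn eligible_for_chaos(policy: &ShardSchedulingPolicy) -> bool {
    matches!(policy, ShardSchedulingPolicy::Active)
}
```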
## Problem
We can easily hit the max key limit for a tenant that has been running
long enough.
## Summary of changes
Remove the max key limit for time-travel recovery if the command is
running locally.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Part of #11813
PostHog has two endpoints for retrieving feature flags: the old
project-ID one that uses a personal API token, and the new one that uses
a special feature-flag secure token which can only retrieve feature
flags. The new API I added in this patch is not documented in the
PostHog API docs, but it's used in their Python SDK.
## Summary of changes
Add support for the "feature flag secure token API". The API has no way
of providing a project ID, so we verify that the retrieved spec is
consistent with the specified project ID by comparing the `team_id`
field.
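A sketch of that consistency check (response shape and names are illustrative, not the actual client code): since the secure-token endpoint cannot be scoped to a project, the `team_id` from the returned spec is compared against the configured project ID.
```
use serde::Deserialize;

// Illustrative shape of the relevant part of the feature flag response.
#[derive(Deserialize)]
struct FeatureFlagSpec {
    team_id: u64,
}

fn check_project_consistency(
    body: &str,
    configured_project_id: u64,
) -> Result<FeatureFlagSpec, String> {
    let spec: FeatureFlagSpec =
        serde_json::from_str(body).map_err(|e| format!("invalid response: {e}"))?;
    // The secure-token API cannot take a project ID, so verify that the
    // retrieved spec actually belongs to the configured project.
    if spec.team_id != configured_project_id {
        return Err(format!(
            "feature flag spec team_id {} does not match configured project {}",
            spec.team_id, configured_project_id
        ));
    }
    Ok(spec)
}
```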
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`test_historic_storage_formats` uses `/tenant_import` to import historic
data. Tenant import does not create timelines on safekeepers, because
they might already exist on some safekeeper set; if it did, we could end
up with two different quorums accepting WAL for the same timeline.
If tenant import is used in a real deployment, the administrator is
responsible for finding the proper safekeeper set and migrating the
timelines into storcon-managed timelines.
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
- Create timelines on safekeepers manually after tenant import in
`test_historic_storage_formats`
- Add a note to tenant import that timelines will not be storcon-managed
after the import.
## Problem
* Inside `compact_with_gc_inner`, there is a similar log line:
db24ba95d1/pageserver/src/tenant/timeline/compaction.rs (L3181-L3187)
* Also, I think it would be useful when debugging to have the ability to
select a particular sub-compaction job (e.g., `1/100`) to see all the
logs for that job.
## Summary of changes
* Attach a span to `compact_with_gc_inner` (see the example below).
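A sketch of what such a per-job span could look like with the `tracing` crate (span and field names are illustrative): every log line emitted inside the job inherits the job index, so the logs for e.g. job `1/100` can be filtered out.
```
use tracing::{info, info_span};

fn run_sub_compaction_jobs(total_jobs: usize) {
    for idx in 0..total_jobs {
        // Span carrying the sub-compaction job index; in async code this
        // would be attached with `.instrument(span)` instead of `enter()`.
        let span = info_span!("compact_with_gc_job", job = idx + 1, total = total_jobs);
        let _entered = span.enter();
        info!("starting sub-compaction job");
        // ... run the actual compaction work for this job ...
    }
}
```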
CC: @skyzh
Sometimes, when a Redis connection attempt fails during the init stage,
the proxy pod can restart continuously. This, in turn, can aggravate the
problem if Redis is overloaded.
Solves #11114.
## Problem
The location config (which includes the stripe size) is stored on
pageserver disk.
For unsharded tenants we [do not include the shard identity in the
serialized
description](ad88ec9257/pageserver/src/tenant/config.rs (L64-L66)).
When the pageserver restarts, it reads that configuration, uses the
stripe size from there, and relies on storcon input from re-attach for
generation and mode.
The default deserialization is `ShardIdentity::unsharded`, which has the
new default stripe size of 2048.
Hence, for unsharded tenants we can be running with a stripe size
different from the one in the storcon observed state. This is not a
problem until we shard split without specifying a stripe size (i.e.
manual splits via the UI or storcon_cli). When that happens, the new
shards will use the 2048 stripe size until storcon realises and switches
them back. At that point it's too late, since we've already ingested
data with the wrong stripe size.
## Summary of changes
Ideally, we would always have the full shard identity on disk. To
achieve this over two releases we do:
1. Always persist the shard identity in the location config on the PS.
2. The storage controller includes the stripe size to use in the
re-attach response.
After the first release, we will start persisting correct stripe sizes
for any tenant shard to which the storage controller explicitly sends a
location_conf. After the second release, the re-attach change kicks in
and we'll persist the shard identity for all shards.
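A minimal serde sketch of the failure mode that step 1 addresses (types and the optional-field handling are illustrative, not the real `LocationConf`): when the identity is skipped for unsharded tenants, deserialization falls back to a default whose stripe size need not match storcon's observed state.
```
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
struct ShardIdentity {
    count: u8,
    number: u8,
    stripe_size: u32,
}

impl Default for ShardIdentity {
    fn default() -> Self {
        // Unsharded identity with the *new* default stripe size.
        ShardIdentity { count: 0, number: 0, stripe_size: 2048 }
    }
}

#[derive(Serialize, Deserialize, Debug)]
struct LocationConf {
    // Old behavior (sketched as an Option): the identity was omitted for
    // unsharded tenants, so after a restart it deserializes to the default
    // above, losing the stripe size storcon had in its observed state.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    shard: Option<ShardIdentity>,
}

fn main() {
    // An unsharded tenant that was actually running with stripe_size = 32768.
    let on_disk = r#"{}"#; // the identity was skipped during serialization
    let conf: LocationConf = serde_json::from_str(on_disk).unwrap();
    let identity = conf.shard.unwrap_or_default();
    assert_eq!(identity.stripe_size, 2048); // not the 32768 storcon expects
    // Step 1 of the fix: always serialize the full identity so this fallback
    // never has to guess the stripe size.
}
```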
## Problem
Base64 0.13 is outdated.
## Summary of changes
Update base64 to 0.22. This mostly affects proxy and the proxy libs.
Also upgrade serde_with to remove another dependency on base64 0.13 from
the dependency tree.
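For reference, the main source-level change when moving from base64 0.13 to 0.22 is going through an explicit engine instead of the old `base64::encode`/`base64::decode` free functions:
```
use base64::{engine::general_purpose::STANDARD, Engine as _};

fn main() {
    let data = b"hello";

    // base64 0.13: base64::encode(data) / base64::decode(s)
    // base64 0.22: encode/decode via an explicit engine.
    let encoded = STANDARD.encode(data);
    let decoded = STANDARD.decode(&encoded).expect("valid base64");

    assert_eq!(&decoded[..], &data[..]);
}
```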
## Problem
The test creates an endpoint and deletes its tenant. The compute cannot
stop gracefully because it tries to write a shutdown checkpoint record
into the WAL, but the timeline has already been deleted from the
safekeepers.
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
Stop the compute before deleting the tenant.
## Problem
Compatibility tests may be run against a compatibility snapshot
generated with `--timelines-onto-safekeepers=false`. We need to start
the compute without a generation (or with generation 0) if the timeline
is not storcon-managed; otherwise the compute will hang.
This handler is needed to check if the timeline is storcon-managed.
It's also needed for better test coverage of safekeeper migration code.
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
- Implement `tenant_timeline_locate` handler in storcon to get
safekeeper info from storcon's DB
## Problem
We occasionally hold the layer map lock for too long.
## Summary of changes
This should help us identify the places where this happens.
Related: https://github.com/neondatabase/neon/issues/12182
## Problem
`DROP FUNCTION` doesn't allow specifying defaults for parameters.
## Summary of changes
Remove the DEFAULT clause from `pgxn/neon/neon--1.6--1.5.sql`.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
The project limits were not respected, resulting in errors.
## Summary of changes
Limits are now checked before running an action; if the action cannot be
run, another random action is chosen instead.
---------
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
## Problem
See https://github.com/neondatabase/neon/issues/12018.
Currently, `prefetch_register_bufferv` calculates the `min_ring_index`
of all vector requests.
But because of prefetch state pumping or a connection failure, previous
slots may already have been processed and reused.
## Summary of changes
Instead of returning the minimal index, this function should return the
last slot index.
Actually, the result of this function is used in only two places. The
first place uses it just for a check (and this check is redundant,
because the same check is done in `prefetch_register_bufferv` itself).
In the second place, where the index of the filled slot is actually
used, there is just one request.
So fortunately this bug can only cause an assert failure in debug
builds.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
There is this TODO in code:
https://github.com/neondatabase/neon/blob/main/pageserver/src/tenant/mgr.rs#L300-L302
This is an old TODO by @jcsp.
## Summary of changes
This PR addresses the TODO. Specifically, it removes the global static
`TENANTS`; the `TenantManager` now directly owns the tenant map,
improving encapsulation.
Essentially, this PR moves all module-level functions into the
`TenantManager` implementation.
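Schematically, the move looks like this (heavily simplified stand-in types; the real `TenantManager` holds much more state):
```
use std::collections::HashMap;
use std::sync::RwLock;

type TenantId = u64;
struct Tenant;

// Before (sketch): a module-level global that free functions reached into.
// static TENANTS: Lazy<RwLock<HashMap<TenantId, Tenant>>> = ...;

// After: the map is a field of TenantManager, and the former module-level
// functions become methods on it.
struct TenantManager {
    tenants: RwLock<HashMap<TenantId, Tenant>>,
}

impl TenantManager {
    fn new() -> Self {
        Self {
            tenants: RwLock::new(HashMap::new()),
        }
    }

    fn tenant_exists(&self, id: TenantId) -> bool {
        self.tenants.read().unwrap().contains_key(&id)
    }

    fn attach_tenant(&self, id: TenantId) {
        self.tenants.write().unwrap().insert(id, Tenant);
    }
}
```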
## Problem
It looks like our sql-over-http tests came to rely on "trust"
authentication, so the path that makes sure the authkeys data is set was
never being hit.
## Summary of changes
Slight refactor of `WakeComputeBackends`, plus making sure auth keys are
propagated. Fix tests so that passwords are actually exercised.
## Problem
`VotesCollectedMset` uses the wrong safekeeper to update `truncateLsn`.
This led to failed asserts later in the code when running safekeeper
migration tests.
- Relates to https://github.com/neondatabase/neon/issues/11823
## Summary of changes
Use the proper safekeeper to update `truncateLsn` in
`VotesCollectedMset`.
## Problem
Part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
Collect pageserver hostname property so that we can use it in the
PostHog UI. Not sure if this is the best way to do that -- open to
suggestions.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Part of https://github.com/neondatabase/neon/issues/7546
Add Azure time travel recovery support. The tricky thing is how Azure
handles deletes in its blob version API. For the following sequence:
```
upload file_1 = a
upload file_1 = b
delete file_1
upload file_1 = c
```
The "delete file_1" won't be stored as a version (unlike AWS, which does
record it). Therefore, we can never roll back to a state where file_1 is
temporarily invisible. If we roll back to a time before file_1 was
created for the first time, it will be removed correctly.
However, this is fine for pageservers, because (1) having extra files in
the tenant storage is usually fine, and (2) things like
timelines/X/index_part-Y.json are only deleted once, so they can always
be recovered to a correct state. Therefore, I don't expect any issues
when this functionality is used for pageserver recovery.
TODO: unit tests for time-travel recovery.
## Summary of changes
Add Azure blob storage time-travel recovery support.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We are seeing tenants with a lot of branches, and the number of
timelines is a good indicator of pageserver load. I added this metric to
help us better plan pageserver capacity.
## Summary of changes
Add `pageserver_timeline_states_count` with two label values: `active`
and `offloaded`.
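A sketch of such a gauge using the prometheus crate directly (the pageserver uses its own metrics wrapper, so registration and label handling differ):
```
use prometheus::{register_int_gauge_vec, IntGaugeVec};

fn timeline_states_metric() -> IntGaugeVec {
    // One gauge per timeline state, updated as timelines activate or are offloaded.
    register_int_gauge_vec!(
        "pageserver_timeline_states_count",
        "Number of timelines on this pageserver, by state",
        &["state"]
    )
    .expect("failed to register metric")
}

fn main() {
    let metric = timeline_states_metric();
    metric.with_label_values(&["active"]).set(42);
    metric.with_label_values(&["offloaded"]).set(7);
}
```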
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
If a shard split fails and must roll back, the tenant may hit a cold
start as the parent shard's files have already been removed from local
disk.
External contribution with minor adjustments, see
https://neondb.slack.com/archives/C08TE3203RQ/p1748246398269309.
## Summary of changes
Keep the parent shard's files on local disk until the split has been
committed, such that they are available if the split is rolled back. If
all else fails, the files will be removed on the next Pageserver
restart.
This should also be fine in a mixed-version deployment:
* New storcon, old Pageserver: the Pageserver will delete the files
during the split, storcon will log an error when the cleanup detach
fails.
* Old storcon, new Pageserver: the Pageserver will leave the parent's
files around until the next Pageserver restart.
The change looks good to me, but shard splits are delicate so I'd like
some extra eyes on this.
The current cache invalidation messages are far too specific. They
should be more generic, since each of them only ends up triggering a
`GetEndpointAccessControl` message anyway.
Mappings:
* `/allowed_ips_updated`, `/block_public_or_vpc_access_updated`, and
`/allowed_vpc_endpoints_updated_for_projects` ->
`/project_settings_update`.
* `/allowed_vpc_endpoints_updated_for_org` ->
`/account_settings_update`.
* `/password_updated` -> `/role_setting_update`.
I've also introduced `/endpoint_settings_update`.
All message types support singular or multiple entries, which allows us
to simplify things both on our side and on the cplane side.
I'm opening a PR to cplane to apply the above mappings, but for now we
keep using the old names so that both sides can roll out independently.
This change is inspired by my need to add yet another cached entry to
`GetEndpointAccessControl` for
https://github.com/neondatabase/cloud/issues/28333
We might delete timelines on safekeepers before we delete them on
pageservers. This should be an exceptional situation, but it can occur.
As a first step to improve behaviour here, emit a special error that is
less scary/obscure than "was not found in global map".
It is for example emitted when the pageserver tries to run
`IDENTIFY_SYSTEM` on a timeline that has been deleted on the safekeeper.
Found when analyzing the failure of
`test_scrubber_physical_gc_timeline_deletion` when enabling
`--timelines-onto-safekeepers` in the pytests.
Due to safekeeper restarts, there is no hard guarantee that we will keep
issuing this error, so we need to think of something better if we start
encountering it in staging/prod. But I would say that introducing
`--timelines-onto-safekeepers` in the pytests and into staging won't
change much about this: we are already deleting timelines from there. In
`test_scrubber_physical_gc_timeline_deletion`, we previously would just
have been leaking the timeline on the safekeepers.
Part of #11712