rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-06 05:30:38 +00:00

Author	SHA1	Message	Date
Alex Chi Z	3e63d0f9e0	test(pageserver): quantify compaction outcome (#7867 ) A simple API to collect some statistics after compaction to easily understand the result. The tool reads the layer map, and analyze range by range instead of doing single-key operations, which is more efficient than doing a benchmark to collect the result. It currently computes two key metrics: * Latest data access efficiency, which finds how many delta layers / image layers the system needs to iterate before returning any key in a key range. * (Approximate) PiTR efficiency, as in https://github.com/neondatabase/neon/issues/7770, which is simply the number of delta files in the range. The reason behind that is, assume no image layer is created, PiTR efficiency is simply the cost of collect records from the delta layers, and the replay time. Number of delta files (or in the future, estimated size of reads) is a simple yet efficient way of estimating how much effort the page server needs to reconstruct a page. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-10 10:42:13 +02:00
Yuchen Liang	630cfbe420	refactor(pageserver): designated api error type for cancelled request (#7949 ) Closes #7406. ## Problem When a `get_lsn_by_timestamp` request is cancelled, an anyhow error is exposed to handle that case, which verbosely logs the error. However, we don't benefit from having the full backtrace provided by anyhow in this case. ## Summary of changes This PR introduces a new `ApiError` type to handle errors caused by cancelled request more robustly. - A new enum variant `ApiError::Cancelled` - Currently the cancelled request is mapped to status code 500. - Need to handle this error in proxy's `http_util` as well. - Added a failpoint test to simulate cancelled `get_lsn_by_timestamp` request. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-06-06 14:00:14 +00:00
John Spray	91dd99038e	pageserver/controller: enable tenant deletion without attachment (#7957 ) ## Problem As described in #7952, the controller's attempt to reconcile a tenant before finally deleting it can get hung up waiting for the compute notification hook to accept updates. The fact that we try and reconcile a tenant at all during deletion is part of a more general design issue (#5080), where deletion was implemented as an operation on attached tenant, requiring the tenant to be attached in order to delete it, which is not in principle necessary. Closes: #7952 ## Summary of changes - In the pageserver deletion API, only do the traditional deletion path if the tenant is attached. If it's secondary, then tear down the secondary location, and then do a remote delete. If it's not attached at all, just do the remote delete. - In the storage controller, instead of ensuring a tenant is attached before deletion, do a best-effort detach of the tenant, and then call into some arbitrary pageserver to issue a deletion of remote content. The pageserver retains its existing delete behavior when invoked on attached locations. We can remove this later when all users of the API are updated to either do a detach-before-delete. This will enable removing the "weird" code paths during startup that sometimes load a tenant and then immediately delete it, and removing the deletion markers on tenants.	2024-06-05 20:22:54 +00:00
Joonas Koivunen	0112097e13	feat(rtc): maintain dirty and uploaded IndexPart (#7833 ) RemoteTimelineClient maintains a copy of "next IndexPart" as a number of fields which are like an IndexPart but this is not immediately obvious. Instead of multiple fields, maintain a `dirty` ("next IndexPart") and `clean` ("uploaded IndexPart") fields. Additional cleanup: - rename `IndexPart::disk_consistent_lsn` accessor `duplicated_disk_consistent_lsn` - no one except scrubber should be looking at it, even scrubber is a stretch - remove usage elsewhere (pagectl used by tests, metadata scan endpoint) - serialize index part before the index upload operation - avoid upload operation being retried because of serialization error - serialization error is fatal anyway for timeline -- it can only make transient local progress after that, at least the error is bubbled up now - gather exploded IndexPart fields into single actual `UploadQueueInitialized::dirty` of which the uploaded snapshot is serialized - implement the long wished monotonicity check with the `clean` IndexPart with an assertion which is not expected to fire Continued work from #7860 towards next step of #6994.	2024-06-04 17:27:08 +03:00
John Spray	9fda85b486	pageserver: remove AncestorStopping error variants (#7916 ) ## Problem In all cases, AncestorStopping is equivalent to Cancelled. This became more obvious in https://github.com/neondatabase/neon/pull/7912#discussion_r1620582309 when updating these error types. ## Summary of changes - Remove AncestorStopping, always use Cancelled instead	2024-05-31 17:02:10 +01:00
John Spray	98dadf8543	pageserver: quieten some shutdown logs around logical size and flush (#7907 ) ## Problem Looking at several noisy shutdown logs: - In https://github.com/neondatabase/neon/issues/7861 we're hitting a log error with `InternalServerError(timeline shutting down\n'` on the checkpoint API handler. - In the field, we see initial_logical_size_calculation errors on shutdown, via DownloadError - In the field, we see errors logged from layer download code (independent of the error propagated) during shutdown Closes: https://github.com/neondatabase/neon/issues/7861 ## Summary of changes The theme of these changes is to avoid propagating anyhow::Errors for cases that aren't really unexpected error cases that we might want a stacktrace for, and avoid "Other" error variants unless we really do have unexpected error cases to propagate. - On the flush_frozen_layers path, use the `FlushLayerError` type throughout, rather than munging it into an anyhow::Error. Give FlushLayerError an explicit from_anyhow helper that checks for timeline cancellation, and uses it to give a Cancelled error instead of an Other error when the timeline is shutting down. - In logical size calculation, remove BackgroundCalculationError (this type was just a Cancelled variant and an Other variant), and instead use CalculateLogicalSizeError throughout. This can express a PageReconstructError, and has a From impl that translates cancel-like page reconstruct errors to Cancelled. - Replace CalculateLogicalSizeError's Other(anyhow::Error) variant case with a Decode(DeserializeError) variant, as this was the only kind of error we actually used in the Other case. - During layer download, drop out early if the timeline is shutting down, so that we don't do an `error!()` log of the shutdown error in this case.	2024-05-31 09:18:58 +01:00
Alex Chi Z	e28e46f20b	fix(pageserver): make wal connstr a connstr (#7846 ) The list timeline API gives something like `"wal_source_connstr":"PgConnectionConfig { host: Domain(\"safekeeper-5.us-east-2.aws.neon.build\"), port: 6500, password: Some(REDACTED-STRING) }"`, which is weird. This pull request makes it somehow like a connection string. This field is not used at least in the neon database, so I assume no one is reading or parsing it. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-23 09:45:29 -04:00
Alex Chi Z	4a278cce7c	chore(pageserver): add force aux file policy switch handler (#7842 ) For existing users, we want to allow doing a force switch for their aux file policy. Part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-22 19:05:26 +00:00
Joonas Koivunen	ce44dfe353	openapi: document timeline ancestor detach (#7650 ) The openapi description with the error descriptions: - 200 is used for "detached or has been detached previously" - 400 is used for "cannot be detached right now" -- it's an odd thing, but good enough - 404 is used for tenant or timeline not found - 409 is used for "can never be detached" (root timeline) - 500 is used for transient errors (basically ill-defined shutdown errors) - 503 is used for busy (other tenant ancestor detach underway, pageserver shutdown) Cc: #6994	2024-05-22 13:55:34 +00:00
Arpad Müller	679e031cf6	Add dummy lsn lease http and page service APIs (#7815 ) We want to introduce a concept of temporary and expiring LSN leases. This adds both a http API as well as one for the page service to obtain temporary LSN leases. This adds a dummy implementation to unblock integration work of this API. A functional implementation of the lease feature is deferred to a later step. Fixes #7808 Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-05-21 23:31:20 +02:00
Joonas Koivunen	a8a88ba7bc	test(detach_ancestor): ensure L0 compaction in history is ok (#7813 ) detaching a timeline from its ancestor can leave the resulting timeline with more L0 layers than the compaction threshold. most of the time, the detached timeline has made progress, and next time the L0 -> L1 compaction happens near the original branch point and not near the last_record_lsn. add a test to ensure that inheriting the historical L0s does not change fullbackup. additionally: - add `wait_until_completed` to test-only timeline checkpoint and compact HTTP endpoints. with `?wait_until_completed=true` the endpoints will wait until the remote client has completed uploads. - for delta layers, describe L0-ness with the `/layer` endpoint Cc: #6994	2024-05-21 20:08:43 +03:00
Alex Chi Z	e1a9669d05	feat(pagebench): add aux file bench (#7746 ) part of https://github.com/neondatabase/neon/issues/7462 ## Summary of changes This pull request adds two APIs to the pageserver management API: list_aux_files and ingest_aux_files. The aux file pagebench is intended to be used on an empty timeline because the data do not go through the safekeeper. LSNs are advanced by 8 for each ingestion, to avoid invariant checks inside the pageserver. For now, I only care about space amplification / read amplification, so the bench is designed in a very simple way: ingest 10000 files, and I will manually dump the layer map to analyze. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-17 20:04:02 +00:00
Alex Chi Z	aaf60819fa	feat(pageserver): persist aux file policy in index part (#7668 ) Part of https://github.com/neondatabase/neon/issues/7462 ## Summary of changes Tenant config is not persisted unless it's attached on the storage controller. In this pull request, we persist the aux file policy flag in the `index_part.json`. Admins can set `switch_aux_file_policy` in the storage controller or using the page server API. Upon the first aux file gets written, the write path will compare the aux file policy target with the current policy. If it is switch-able, we will do the switch. Otherwise, the original policy will be used. The test cases show what the admins can do / cannot do. The `last_aux_file_policy` is stored in `IndexPart`. Updates to the persisted policy are done via `schedule_index_upload_for_aux_file_policy_update`. On the write path, the writer will update the field. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-05-17 19:22:49 +00:00
John Spray	c84656a53e	pageserver: implement auto-splitting (#7681 ) ## Problem Currently tenants are only split into multiple shards if a human being calls the API to do it. Issue: #7388 ## Summary of changes - Add a pageserver API for returning the top tenants by size - Add a step to the controller's background loop where if there is no reconciliation or optimization to be done, it looks for things to split. - Add a test that runs pgbench on many tenants concurrently, and checks that splitting happens as expected as tenants grow, without interrupting the client I/O. This PR is quite basic: there is a tasklist in https://github.com/neondatabase/neon/issues/7388 for further work. This PR is meant to be safe (off by default), and sufficient to enable our staging environment to run lots of sharded tenants without a human having to set them up.	2024-05-17 16:01:24 +00:00
John Spray	f342b87f30	pageserver: remove Option<> around remote storage, clean up metadata file refs (#7752 ) ## Problem This is historical baggage from when the pageserver could be run with local disk only: we had a bunch of places where we had to treat remote storage as optional. Closes: https://github.com/neondatabase/neon/issues/6890 ## Changes - Remove Option<> around remote storage (in https://github.com/neondatabase/neon/pull/7722 we made remote storage clearly mandatory) - Remove code for deleting old metadata files: they're all gone now. - Remove other references to metadata files when loading directories, as none exist. I checked last 14 days of logs for "found legacy metadata", there are no instances.	2024-05-15 12:05:24 +00:00
John Spray	be0c73f8e7	pageserver: improve API for invoking GC (#7655 ) ## Problem In https://github.com/neondatabase/neon/pull/7531, I had a test flaky because the GC API endpoint fails if the tenant happens not to be active yet. ## Summary of changes While adding that wait for the tenant to be active, I noticed that this endpoint is kind of strange (spawns a TaskManager task) and has a comment `// TODO: spawning is redundant now, need to hold the gate`, so this PR cleans it up to just run the GC inline while holding a gate. The GC code is updated to avoid assuming it runs inside a task manager task. Avoiding checking the task_mgr cancellation token is safe, because our timeline shutdown always cancels Timeline::cancel.	2024-05-13 17:59:59 +01:00
Joonas Koivunen	86905c1322	openapi: resolve the synthetic_size duplication (#7651 ) We had accidentally left two endpoints for `tenant`: `/synthetic_size` and `/size`. Size had the more extensive description but has returned 404 since renaming. Remove the `/size` in favor of the working one and describe the `text/html` output.	2024-05-10 17:15:11 +03:00
John Spray	ca154d9cd8	pageserver: local layer path followups (#7640 ) - Rename "filename" types which no longer map directly to a filename (LayerFileName -> LayerName) - Add a -v1- part to local layer paths to smooth the path to future updates (we anticipate a -v2- that uses checksums later) - Rename methods that refer to the string-ized version of a LayerName to no longer be called "filename" - Refactor reconcile() function to use a LocalLayerFileMetadata type that includes the local path, rather than carrying local path separately in a tuple and unwrap()'ing it later.	2024-05-08 16:50:21 +00:00
John Spray	0af66a6003	pageserver: include generation number in local layer paths (#7609 ) ## Problem In https://github.com/neondatabase/neon/pull/7531, we would like to be able to rewrite layers safely. One option is to make `Layer` able to rewrite files in place safely (e.g. by blocking evictions/deletions for an old Layer while a new one is created), but that's relatively fragile. It's more robust in general if we simply never overwrite the same local file: we can do that by putting the generation number in the filename. ## Summary of changes - Add `local_layer_path` (counterpart to `remote_layer_path`) and convert all locations that manually constructed a local layer path by joining LayerFileName to timeline path - In the layer upload path, construct remote paths with `remote_layer_path` rather than trying to build them out of a local path. - During startup, carry the full path to layer files through `init::reconcile`, and pass it into `Layer::for_resident` - Add a test to make sure we handle upgrades properly. - Comment out the generation part of `local_layer_path`, since we need to maintain forward compatibility for one release. A tiny followup PR will enable it afterwards. We could make this a bit simpler if we bulk renamed existing layers on startup instead of carrying literal paths through init, but that is operationally risky on existing servers with millions of layer files. We can always do a renaming change in future if it becomes annoying, but for the moment it's kind of nice to have a structure that enables us to change local path names again in future quite easily. We should rename `LayerFileName` to `LayerName` or somesuch, to make it more obvious that it's not a literal filename: this was already a bit confusing where that type is used in remote paths. That will be a followup, to avoid polluting this PR's diff.	2024-05-07 18:03:12 +01:00
Joonas Koivunen	3c9b484c4d	feat: Timeline detach ancestor (#7456 ) ## Problem Timelines cannot be deleted if they have children. In many production cases, a branch or a timeline has been created off the main branch for various reasons to the effect of having now a "new main" branch. This feature will make it possible to detach a timeline from its ancestor by inheriting all of the data before the branchpoint to the detached timeline and by also reparenting all of the ancestor's earlier branches to the detached timeline. ## Summary of changes - Earlier added copy_lsn_prefix functionality is used - RemoteTimelineClient learns to adopt layers by copying them from another timeline - LayerManager adds support for adding adopted layers - `timeline::Timeline::{prepare_to_detach,complete_detaching}_from_ancestor` and `timeline::detach_ancestor` are added - HTTP PUT handler Cc: #6994 Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-05-07 13:47:57 +03:00
John Spray	b5a6e68e68	storage controller: check warmth of secondary before doing proactive migration (#7583 ) ## Problem The logic in Service::optimize_all would sometimes choose to migrate a tenant to a secondary location that was only recently created, resulting in Reconciler::live_migrate hitting its 5 minute timeout warming up the location, and proceeding to attach a tenant to a location that doesn't have a warm enough local set of layer files for good performance. Closes: #7532 ## Summary of changes - Add a pageserver API for checking download progress of a secondary location - During `optimize_all`, connect to pageservers of candidate optimization secondary locations, and check they are warm. - During shard split, do heatmap uploads and start secondary downloads, so that the new shards' secondary locations start downloading ASAP, rather than waiting minutes for background downloads to kick in. I have intentionally not implemented this by continuously reading the status of locations, to avoid dealing with the scale challenge of efficiently polling & updating 10k-100k locations status. If we implement that in the future, then this code can be simplified to act based on latest state of a location rather than fetching it inline during optimize_all.	2024-05-03 14:28:23 +00:00
Arpad Müller	7a49e5d5c2	Remove tenant_id from TenantLocationConfigRequest (#7469 ) Follow-up of #7055 and #7476 to remove `tenant_id` from `TenantLocationConfigRequest` completely. All components of our system should now not specify the `tenant_id`. cc https://github.com/neondatabase/cloud/pull/11791	2024-05-02 20:18:13 +02:00
Alex Chi Z	45c625fb34	feat(pageserver): separate sparse and dense keyspace (#7503 ) extracted (and tested) from https://github.com/neondatabase/neon/pull/7468, part of https://github.com/neondatabase/neon/issues/7462. The current codebase assumes the keyspace is dense -- which means that if we have a keyspace of 0x00-0x100, we assume every key (e.g., 0x00, 0x01, 0x02, ...) exists in the storage engine. However, the assumption does not hold any more in metadata keyspace. The metadata keyspace is sparse. It is impossible to do per-key check. Ideally, we should not have the assumption of dense keyspace at all, but this would incur a lot of refactors. Therefore, we split the keyspaces we have to dense/sparse and handle them differently in the code for now. At some point in the future, we should assume all keyspaces are sparse. ## Summary of changes * Split collect_keyspace to return dense+sparse keyspace. * Do not allow generating image layers for sparse keyspace (for now -- will fix this next week, we need image layers anyways). * Generate delta layers for sparse keyspace. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-04-30 09:39:10 -04:00
John Spray	b655c7030f	neon_local: add "tenant import" (#7399 ) ## Problem Sometimes we have test data in the form of S3 contents that we would like to run live in a neon_local environment. ## Summary of changes - Add a storage controller API that imports an existing tenant. Currently this is equivalent to doing a create with a high generation number, but in future this would be something smarter to probe S3 to find the shards in a tenant and find generation numbers. - Add a `neon_local` command that invokes the import API, and then inspects timelines in the newly attached tenant to create matching branches.	2024-04-29 08:52:18 +01:00
Alex Chi Z	25d9dc6eaf	chore(pageserver): separate missing key error (#7393 ) As part of https://github.com/neondatabase/neon/pull/7375 and to improve the current vectored get implementation, we separate the missing key error out. This also saves us several Box allocations in the get page implementation. ## Summary of changes * Create a caching field of layer traversal id for each of the layer. * Remove box allocations for layer traversal id retrieval and implement MissingKey error message as before. This should be a little bit faster. * Do not format error message until `Display`. * For in-mem layer, the descriptor is different before/after frozen. I'm using once lock for that. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-04-22 10:40:35 -04:00
Vlad Lazar	1c012958c7	pageserver/http: remove status code boilerplate from swagger spec (#7385 ) ## Problem We specify a bunch of possible error codes in the pageserver api swagger spec. This is error prone and annoying to work with. https://github.com/neondatabase/cloud/pull/11907 introduced generic error handling on the control plane side, so we can now clean up the spec. ## Summary of changes * Remove generic error codes from swagger spec * Update a couple route handlers which would previously return an error without a `msg` field in the response body. Tested via https://github.com/neondatabase/cloud/pull/12340 Related https://github.com/neondatabase/cloud/issues/7238	2024-04-16 16:24:09 +01:00
John Spray	83cdbbb89a	pageserver: improve readability of shard.rs (#7330 ) No functional changes, this is a comments/naming PR. While merging sharding changes, some cleanup of the shard.rs types was deferred. In this PR: - Rename `is_zero` to `is_shard_zero` to make clear that this method doesn't literally mean that the entire object is zeros, just that it refers to the 0th shard in a tenant. - Pull definitions of types to the top of shard.rs and add a big comment giving an overview of which type is for what. Closes: https://github.com/neondatabase/neon/issues/6072	2024-04-15 11:50:26 +01:00
Joonas Koivunen	21b3e1d13b	fix(utilization): return used as does df (#7337 ) We can currently underflow `pageserver_resident_physical_size_global`, so the used disk bytes would show `u63::MAX` by mistake. The assumption of the API (and the documented behavior) was to give the layer files disk usage. Switch to reporting numbers that match `df` output. Fixes: #7336	2024-04-08 09:01:38 +03:00
John Spray	8b10407be4	pageserver: on-demand activation of tenant on GET tenant status (#7250 ) ## Problem (Follows https://github.com/neondatabase/neon/pull/7237) Some API users will query a tenant to wait for it to activate. Currently, we return the current status of the tenant, whatever that may be. Under heavy load, a pageserver starting up might take a long time to activate such a tenant. ## Summary of changes - In `tenant_status` handler, call wait_to_become_active on the tenant. If the tenant is currently waiting for activation, this causes it to skip the queue, similiar to other API handlers that require an active tenant, like timeline creation. This avoids external services waiting a long time for activation when polling GET /v1/tenant/<id>.	2024-04-03 16:53:43 +03:00
John Spray	b3b7ce457c	pageserver: remove bare mgr::get_tenant, mgr::list_tenants (#7237 ) ## Problem This is a refactor. This PR was a precursor to a much smaller change `e5bd602dc1`, where as I was writing it I found that we were not far from getting rid of the last non-deprecated code paths that use `mgr::` scoped functions to get at the TenantManager state. We're almost done cleaning this up as per https://github.com/neondatabase/neon/issues/5796. The only significant remaining mgr:: item is `get_active_tenant_with_timeout`, which is page_service's path for fetching tenants. ## Summary of changes - Remove the bool argument to get_attached_tenant_shard: this was almost always false from API use cases, and in cases when it was true, it was readily replacable with an explicit check of the returned tenant's status. - Rather than letting the timeline eviction task query any tenant it likes via `mgr::`, pass an `Arc<Tenant>` into the task. This is still an ugly circular reference, but should eventually go away: either when we switch to exclusively using disk usage eviction, or when we change metadata storage to avoid the need to imitate layer accesses. - Convert all the mgr::get_tenant call sites to use TenantManager::get_attached_tenant_shard - Move list_tenants into TenantManager.	2024-03-26 18:29:08 +00:00
John Spray	8dfe3a070c	pageserver: return 429 on timeline creation in progress (#7225 ) ## Problem Currently, we return 409 (Conflict) in two cases: - Temporary: Timeline creation cannot proceed because another timeline with the same ID is being created - Permanent: Timeline creation cannot proceed because another timeline exists with different parameters but the same ID. Callers which time out a request and retry should be able to distinguish these cases. Closes: #7208 ## Summary of changes - Expose `AlreadyCreating` errors as 429 instead of 409	2024-03-26 15:20:05 +00:00
Vlad Lazar	c75b584430	storage_controller: add metrics (#7178 ) ## Problem Storage controller had basically no metrics. ## Summary of changes 1. Migrate the existing metrics to use Conrad's [`measured`](https://docs.rs/measured/0.0.14/measured/) crate. 2. Add metrics for incoming http requests 3. Add metrics for outgoing http requests to the pageserver 4. Add metrics for outgoing pass through requests to the pageserver 5. Add metrics for database queries Note that the metrics response for the attachment service does not use chunked encoding like the rest of the metrics endpoints. Conrad has kindly extended the crate such that it can now be done. Let's leave it for a follow-up since the payload shouldn't be that big at this point. Fixes https://github.com/neondatabase/neon/issues/6875	2024-03-21 12:00:20 +00:00
Joonas Koivunen	877fd14401	fix: spanless log message (#7155 ) with `immediate_gc` the span only covered the `gc_iteration`, make it cover the whole needless spawned task, which also does waiting for layer drops and stray logging in tests. also clarify some comments while we are here. Fixes: #6910	2024-03-18 16:27:53 +02:00
John Spray	1d3ae57f18	pageserver: refactoring in TenantManager to reduce duplication (#6732 ) ## Problem Followup to https://github.com/neondatabase/neon/pull/6725 In that PR, code for purging local files from a tenant shard was duplicated. ## Summary of changes - Refactor detach code into TenantManager - `spawn_background_purge` method can now be common between detach and split operations	2024-03-18 10:37:20 +00:00
John Spray	9752ad8489	pageserver, controller: improve secondary download APIs for large shards (#7131 ) ## Problem The existing secondary download API relied on the caller to wait as long as it took to complete -- for large shards that could be a long time, so typical clients that might have a baked-in ~30s timeout would have a problem. ## Summary of changes - Take a `wait_ms` query parameter to instruct the pageserver how long to wait: if the download isn't complete in this duration, then 201 is returned instead of 200. - For both 200 and 201 responses, include response body describing download progress, in terms of layers and bytes. This is sufficient for the caller to track how much data is being transferred and log/present that status. - In storage controller live migrations, use this API to apply a much longer outer timeout, with smaller individual per-request timeouts, and log the progress of the downloads. - Add a test that injects layer download delays to exercise the new behavior	2024-03-15 19:45:58 +00:00
John Spray	22c26d610b	pageserver: remove un-needed "uninit mark" (#5717 ) Switched the order; doing https://github.com/neondatabase/neon/pull/6139 first then can remove uninit marker after. ## Problem Previously, existence of a timeline directory was treated as evidence of the timeline's logical existence. That is no longer the case since we treat remote storage as the source of truth on each startup: we can therefore do without this mark file. The mark file had also been used as a pseudo-lock to guard against concurrent creations of the same TimelineId -- now that persistence is no longer required, this is a bit unwieldy. In #6139 the `Tenant::timelines_creating` was added to protect against concurrent creations on the same TimelineId, making the uninit mark file entirely redundant. ## Summary of changes - Code that writes & reads mark file is removed - Some nearby `pub` definitions are amended to `pub(crate)` - `test_duplicate_creation` is added to demonstrate that mutual exclusion of creations still works.	2024-03-15 17:23:05 +02:00
Vlad Lazar	38767ace68	storage_controller: periodic pageserver heartbeats (#7092 ) ## Problem If a pageserver was offline when the storage controller started, there was no mechanism to update the storage controller state when the pageserver becomes active. ## Summary of changes * Add a heartbeater module. The heartbeater must be driven by an external loop. * Integrate the heartbeater into the service. - Extend the types used by the service and scheduler to keep track of a nodes' utilisation score. - Add a background loop to drive the heartbeater and update the state based on the deltas it generated - Do an initial round of heartbeats at start-up	2024-03-14 15:21:36 +00:00
John Spray	44f42627dd	pageserver/controller: error handling for shard splitting (#7074 ) ## Problem Shard splits worked, but weren't safe against failures (e.g. node crash during split) yet. Related: #6676 ## Summary of changes - Introduce async rwlocks at the scope of Tenant and Node: - exclusive tenant lock is used to protect splits - exclusive node lock is used to protect new reconciliation process that happens when setting node active - exclusive locks used in both cases when doing persistent updates (e.g. node scheduling conf) where the update to DB & in-memory state needs to be atomic. - Add failpoints to shard splitting in control plane and pageserver code. - Implement error handling in control plane for shard splits: this detaches child chards and ensures parent shards are re-attached. - Crash-safety for storage controller restarts requires little effort: we already reconcile with nodes over a storage controller restart, so as long as we reset any incomplete splits in the DB on restart (added in this PR), things are implicitly cleaned up. - Implement reconciliation with offline nodes before they transition to active: - (in this context reconciliation means something like startup_reconcile, not literally the Reconciler) - This covers cases where split abort cannot reach a node to clean it up: the cleanup will eventually happen when the node is marked active, as part of reconciliation. - This also covers the case where a node was unavailable when the storage controller started, but becomes available later: previously this allowed it to skip the startup reconcile. - Storage controller now terminates on panics. We only use panics for true "should never happen" assertions, and these cases can leave us in an un-usable state if we keep running (e.g. panicking in a shard split). In the unlikely event that we get into a crashloop as a result, we'll rely on kubernetes to back us off. - Add `test_sharding_split_failures` which exercises a variety of failure cases during shard split.	2024-03-14 09:11:57 +00:00
Arpad Müller	5309711691	Make tenant_id in TenantLocationConfigRequest optional (#7055 ) The `tenant_id` in `TenantLocationConfigRequest` in the `location_config` endpoint was only used in the storage controller/attachment service, and there it was only used for assertions and the creation part.	2024-03-13 17:30:29 +01:00
John Spray	1b41db8bdd	pageserver: enable setting stripe size inline with split request. (#7093 ) ## Summary - Currently we can set stripe size at tenant creation, but it doesn't mean anything until we have multiple shards - When onboarding an existing tenant, it will always get a default shard stripe size, so we would like to be able to pick the actual stripe size at the point we split. ## Why do this inline with a split? The alternative to this change would be to have a separate endpoint on the storage controller for setting the stripe size on a tenant, and only permit writes to that endpoint when the tenant has only a single shard. That would work, but be a little bit more work for a client, and not appreciably simpler (instead of having a special argument to the split functions, we'd have a special separate endpoint, and a requirement that the controller must sync its config down to the pageserver before calling the split API). Either approach would work, but this one feels a bit more robust end-to-end: the split API is the _very last moment_ that the stripe size is mutable, so if we aim to set it before splitting, it makes sense to do it as part of the same operation.	2024-03-12 20:41:08 +00:00
John Spray	f8483cc4a3	pageserver: update swagger for HA APIs (#7070 ) - The type of heatmap_period in tenant config was wrrong - Secondary download and heatmap upload endpoints weren't in swagger.	2024-03-11 09:32:17 +00:00
John Spray	d5a6a2a16d	storage controller: robustness improvements (#7027 ) ## Problem Closes: https://github.com/neondatabase/neon/issues/6847 Closes: https://github.com/neondatabase/neon/issues/7006 ## Summary of changes - Pageserver API calls are wrapped in timeout/retry logic: this prevents a reconciler getting hung on a pageserver API hang, and prevents reconcilers having to totally retry if one API call returns a retryable error (e.g. 503). - Add a cancellation token to `Node`, so that when we mark a node offline we will cancel any API calls in progress to that node, and avoid issuing any more API calls to that offline node. - If the dirty locations of a shard are all on offline nodes, then don't spawn a reconciler - In re-attach, if we have no observed state object for a tenant then construct one with conf: None (which means "unknown"). Then in Reconciler, implement a TODO for scanning such locations before running, so that we will avoid spuriously incrementing a generation in the case of a node that was offline while we started (this is the case that tripped up #7006) - Refactoring: make Node contents private (and thereby guarantee that updates to availability mode reliably update the cancellation token.) - Refactoring: don't pass the whole map of nodes into Reconciler (and thereby remove a bunch of .expect() calls) Some of this was discovered/tested with a new failure injection test that will come in a separate PR, once it is stable enough for CI.	2024-03-07 17:10:03 +00:00
John Spray	4a31e18c81	storage controller: include stripe size in compute notifications (#6974 ) ## Problem - The storage controller is the source of truth for a tenant's stripe size, but doesn't currently have a way to propagate that to compute: we're just using the default stripe size everywhere. Closes: https://github.com/neondatabase/neon/issues/6903 ## Summary of changes - Include stripe size in `ComputeHookNotifyRequest` - Include stripe size in `LocationConfigResponse` The stripe size is optional: it will only be advertised for multi-sharded tenants. This enables the controller to defer the choice of stripe size until we split a tenant for the first time.	2024-03-06 13:56:30 +00:00
Joonas Koivunen	4d426f6fbe	feat: support lazy, queued tenant attaches (#6907 ) Add off-by-default support for lazy queued tenant activation on attach. This should be useful on bulk migrations as some tenants will be activated faster due to operations or endpoint startup. Eventually all tenants will get activated by reusing the same mechanism we have at startup (`PageserverConf::concurrent_tenant_warmup`). The difference to lazy attached tenants to startup ones is that we leave their initial logical size calculation be triggered by WalReceiver or consumption metrics. Fixes: #6315 Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-02-29 13:26:29 +02:00
John Spray	e5384ebefc	pageserver: accelerate tenant activation on HTTP API timeline read requests (#6944 ) ## Problem Callers of the timeline creation API may issue timeline GETs ahead of creation to e.g. check if their intended timeline already exists, or to learn the LSN of a parent timeline. Although the timeline creation API already triggers activation of a timeline if it's currently waiting to activate, the GET endpoint doesn't, so such callers will encounter 503 responses for several minutes after a pageserver restarts, while tenants are lazily warming up. The original scope of which APIs will activate a timeline was quite small, but really it makes sense to do it for any API that needs a particular timeline to be active. ## Summary of changes - In the timeline detail GET handler, use wait_to_become_active, which triggers immediate activation of a tenant if it was currently waiting for the warmup semaphore, then waits up to 5 seconds for the activation to complete. If it doesn't complete promptly, we return a 503 as before. - Modify active_timeline_for_active_tenant to also use wait_to_become_active, which indirectly makes several other timeline-scope request handlers fast-activate a tenant when called. This is important because a timeline creation flow could also use e.g. get_lsn_for_timestamp as a precursor to creating a timeline. - There is some risk to this change: an excessive number of timeline GET requests could cause too many tenant activations to happen at the same time, leading to excessive queue depth to the S3 client. However, this was already the case for e.g. many concurrent timeline creations.	2024-02-28 14:53:35 +00:00
Christian Schwarz	62d77e263f	test_remote_timeline_client_calls_started_metric: fix flakiness (#6911 ) fixes https://github.com/neondatabase/neon/issues/6889 # Problem The failure in the last 3 flaky runs on `main` is ``` test_runner/regress/test_remote_storage.py:460: in test_remote_timeline_client_calls_started_metric churn("a", "b") test_runner/regress/test_remote_storage.py:457: in churn assert gc_result["layers_removed"] > 0 E assert 0 > 0 ``` That's this code `cd449d66ea/test_runner/regress/test_remote_storage.py (L448-L460)` So, the test expects GC to remove some layers but the GC doesn't. # Fix My impression is that the VACUUM isn't re-using pages aggressively enough, but I can't really prove that. Tried to analyze the layer map dump but it's too complex. So, this PR: - Creates more churn by doing the overwrite twice. - Forces image layer creation. It also drive-by removes the redundant call to timeline_compact, because, timeline_checkpoint already does that internally.	2024-02-27 10:55:10 +01:00
John Spray	8283779ee8	pageserver: remove legacy attach/detach APIs from swagger (#6883 ) ## Problem Since the location config API was added, the attach and detach endpoints are deprecated. Hiding them from consumers of the swagger definition is a precursor to removing them Neon's cloud no longer uses this api since https://github.com/neondatabase/cloud/pull/10538 Fully removing the APIs will implicitly make use of generation numbers mandatory, and should happen alongside https://github.com/neondatabase/neon/issues/5388, which will happen once we're happy that the storage controller is ready for prime time. ## Summary of changes - Remove /attach and /detach from pageserver's swagger file	2024-02-25 14:53:17 +00:00
Joonas Koivunen	bc7a82caf2	feat: bare-bones /v1/utilization (#6831 ) PR adds a simple at most 1Hz refreshed informational API for querying pageserver utilization. In this first phase, no actual background calculation is performed. Instead, the worst possible score is always returned. The returned bytes information is however correct. Cc: #6835 Cc: #5331	2024-02-22 13:58:59 +02:00
Arpad Müller	e0c12faabd	Allow initdb preservation for broken tenants (#6790 ) Often times the tenants we want to (WAL) DR are the ones which the pageserver marks as broken. Therefore, we should allow initdb preservation also for broken tenants. Fixes #6781.	2024-02-19 17:27:02 +01:00
John Spray	5667372c61	pageserver: during shard split, wait for child to activate (#6789 ) ## Problem test_sharding_split_unsharded was flaky with log errors from tenants not being active. This was happening when the split function enters wait_lsn() while the child shard might still be activating. It's flaky rather than an outright failure because activation is usually very fast. This is also a real bug fix, because in realistic scenarios we could proceed to detach the parent shard before the children are ready, leading to an availability gap for clients. ## Summary of changes - Do a short wait_to_become_active on the child shards before proceeding to wait for their LSNs to advance --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-02-18 15:55:19 +00:00

1 2 3 4 5 ...

327 Commits