rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-04 03:52:56 +00:00

Author	SHA1	Message	Date
Vlad Lazar	f5cef7bf7f	storcon: skip draining shard if its secondary is lagging too much (#8644) ## Problem Migrations of tenant shards with cold secondaries are holding up drains during production deployments. ## Summary of changes If a secondary location is lagging by more than 256MiB (configurable, but that's the default), then skip cutting it over to the secondary as part of the node drain.	2024-08-09 15:45:07 +01:00
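
The lag check described in this commit can be sketched as a small predicate. This is a minimal illustration, not the storage controller's code: `DrainDecision`, `drain_decision` and the constant name are hypothetical, and only the 256MiB default and the "more than" comparison come from the commit message.

```rust
/// Default lag threshold named in the commit message: 256 MiB.
const DEFAULT_MAX_SECONDARY_LAG_BYTES: u64 = 256 * 1024 * 1024;

/// Hypothetical outcome for one shard while draining a node.
#[derive(Debug, PartialEq, Eq)]
enum DrainDecision {
    /// The secondary is warm enough: cut the shard over to it.
    CutOver,
    /// The secondary lags too far behind: leave the shard where it is.
    Skip,
}

/// Skip the cut-over only when the lag is strictly greater than the threshold.
fn drain_decision(secondary_lag_bytes: u64, max_lag_bytes: u64) -> DrainDecision {
    if secondary_lag_bytes > max_lag_bytes {
        DrainDecision::Skip
    } else {
        DrainDecision::CutOver
    }
}
```

A lag exactly equal to the threshold still results in a cut-over under this reading of "more than".
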
Joonas Koivunen	a81fab4826	refactor(timeline_detach_ancestor): replace ordered reparented with a hashset (#8629 ) Earlier I was thinking we'd need a (ancestor_lsn, timeline_id) ordered list of reparented. Turns out we did not need it at all. Replace it with an unordered hashset. Additionally refactor the reparented direct children query out, it will later be used from more places. Split off from #8430. Cc: #6994	2024-08-07 18:19:00 +02:00
John Spray	2334fed762	storage_controller: start adding chaos hooks (#7946 ) Chaos injection bridges the gap between automated testing (where we do lots of different things with small, short-lived tenants), and staging (where we do many fewer things, but with larger, long-lived tenants). This PR adds a first type of chaos which isn't really very chaotic: it's live migration of tenants between healthy pageservers. This nevertheless provides continuous checks that things like clean, prompt shutdown of tenants works for realistically deployed pageservers with realistically large tenants.	2024-08-02 09:37:44 +01:00
John Spray	2f9ada13c4	controller: simplify reconciler generation increment logic (#8560) ## Problem This code was confusing, untested and covered: - an impossible case, where intent state is AttachedStale (we never do this) - a rare edge case (going from AttachedMulti to Attached), which we were not testing, and in any case the pageserver internally does the same Tenant reset in this transition as it would do if we incremented generation. Closes: https://github.com/neondatabase/neon/issues/8367 ## Summary of changes - Simplify the logic to only skip incrementing the generation if the location already has the expected generation and the exact same mode.	2024-07-31 18:37:47 +01:00
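
The simplified rule in the summary can be written as a single condition. The sketch below assumes hypothetical `LocationMode` and `ObservedLocation` types and a hypothetical `needs_generation_increment` helper; only the rule itself (skip the increment when the generation and the mode both already match) comes from the commit message.

```rust
/// Hypothetical subset of attachment modes, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LocationMode {
    AttachedSingle,
    AttachedMulti,
}

/// Hypothetical view of what the controller last observed on a pageserver.
struct ObservedLocation {
    generation: Option<u32>,
    mode: LocationMode,
}

/// Increment unless the location already has the expected generation
/// and exactly the wanted mode.
fn needs_generation_increment(
    observed: Option<&ObservedLocation>,
    expected_generation: u32,
    wanted_mode: LocationMode,
) -> bool {
    match observed {
        Some(o) => !(o.generation == Some(expected_generation) && o.mode == wanted_mode),
        // Nothing observed yet: assume an increment is required.
        None => true,
    }
}
```
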
Yuchen Liang	e374d6778e	feat(storcon): store scrubber metadata scan result (#8480) Part of #8128, followed by #8502. ## Problem Currently we lack a mechanism to alert on unhealthy `scan_metadata` status if we start running this scrubber command as part of a cronjob. With the storage controller client introduced to storage scrubber in #8196, it is viable to set up an alert by storing health status in the storage controller database. We intentionally do not store the full output to the database as the json blobs could make the table really huge. Instead, we store only a health status and a timestamp recording the last time the metadata health status was posted for a tenant shard. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-30 14:32:00 +01:00
Vlad Lazar	7a796a9963	storcon: introduce step down primitive (#8512 ) ## Problem We are missing the step-down primitive required to implement rolling restarts of the storage controller. ## Summary of changes Add `/control/v1/step_down` endpoint which puts the storage controller into a state where it rejects all API requests apart from `/control/v1/step_down`, `/status` and `/metrics`. When receiving the request, storage controller cancels all pending reconciles and waits for them to exit gracefully. The response contains a snapshot of the in-memory observed state. Related: * https://github.com/neondatabase/cloud/issues/14701 * https://github.com/neondatabase/neon/issues/7797 * https://github.com/neondatabase/neon/pull/8310	2024-07-26 14:54:09 +01:00
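
The request filtering described for the stepped-down state can be sketched as follows. The three allowed paths are taken from the commit message; `LeadershipState` and `should_serve` are hypothetical names, and the real controller may implement the check differently (for example in HTTP middleware).

```rust
/// Paths that stay reachable after step-down, as listed in the commit message.
const ALLOWED_AFTER_STEP_DOWN: [&str; 3] = ["/control/v1/step_down", "/status", "/metrics"];

/// Hypothetical leadership state of a storage controller instance.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LeadershipState {
    Leader,
    SteppedDown,
}

/// Returns true if a request to `path` should be served in the given state.
fn should_serve(state: LeadershipState, path: &str) -> bool {
    match state {
        LeadershipState::Leader => true,
        LeadershipState::SteppedDown => ALLOWED_AFTER_STEP_DOWN.contains(&path),
    }
}
```
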
Vlad Lazar	3977e0a7a3	storcon: shutdown with clean observed state (#8494 ) ## Problem Storcon shutdown did not produce a clean observed state. This is not a problem at the moment, but we will need to stop all reconciles with clean observed state for rolling restarts. I tried to test this by collecting the observed state during shutdown and comparing it with the in-memory observed state, but it doesn't work because a lot of tests use the cursed attach hook to create tenants directly through the ps. ## Summary of Changes Rework storcon shutdown as follows: * Reconcilers get a separate cancellation token which is a child token of the global `Service::cancel`. * Reconcilers get a separate gate * Add a mechanism to drain the reconciler result queue before * Put all of this together into a clean shutdown sequence Related https://github.com/neondatabase/cloud/issues/14701	2024-07-25 15:13:34 +01:00
Vlad Lazar	9c5ad21341	storcon: make heartbeats restart aware (#8222 ) ## Problem Re-attach blocks the pageserver http server from starting up. Hence, it can't reply to heartbeats until that's done. This makes the storage controller mark the node off-line (not good). We worked around this by setting the interval after which nodes are marked offline to 5 minutes. This isn't a long term solution. ## Summary of changes * Introduce a new `NodeAvailability` state: `WarmingUp`. This state models the following time interval: * From receiving the re-attach request until the pageserver replies to the first heartbeat post re-attach * The heartbeat delta generator becomes aware of this state and uses a separate longer interval * Flag `max-warming-up-interval` now models the longer timeout and `max-offline-interval` the shorter one to match the names of the states Closes https://github.com/neondatabase/neon/issues/7552	2024-07-25 14:09:12 +01:00
John Spray	9e23410074	tests: allow-list a controller heartbeat error (#8471) ## Problem `test_change_pageserver` stops pageservers in a way that can overlap with the controller's heartbeats: the controller can get a heartbeat success and then immediately find the node unavailable. This particular situation triggers a log that isn't in our current allow-list of messages for nodes offline. Example: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8339/10048487700/index.html#testresult/19678f27810231df/retries ## Summary of changes - Add the message to the allow list	2024-07-23 16:09:05 -04:00
John Spray	44781518d0	storage scrubber: GC ancestor shard layers (#8196 ) ## Problem After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space. ## Summary of changes - Extend the physical GC command with `--mode=full`, which includes cleaning up unreferenced ancestor shard layers - Add test `test_scrubber_physical_gc_ancestors` - Remove colored log output: in testing this is irritating ANSI code spam in logs, and in interactive use doesn't add much. - Refactor storage controller API client code out of storcon_client into a `storage_controller/client` crate - During physical GC of ancestors, call into the storage controller to check that the latest shards seen in S3 reflect the latest state of the tenant, and there is no shard split in progress.	2024-07-19 19:07:59 +03:00
Christian Schwarz	a2d170b6d0	NeonEnv.from_repo_dir: use storage_controller_db instead of `attachments.json` (#8382) When `NeonEnv.from_repo_dir` was introduced, storage controller stored its state exclusively in `attachments.json`. Since then, it has moved to using Postgres, which stores its state in `storage_controller_db`. But `NeonEnv.from_repo_dir` wasn't adjusted to do this. This PR rectifies the situation. Context for this is failures in `test_pageserver_characterize_throughput_with_n_tenants` CF: https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH Notably, `from_repo_dir` is also used by the backwards- and forwards-compatibility tests. Thus, the changes in this PR affect those tests as well. However, it turns out that the compatibility snapshot already contains the `storage_controller_db`. Thus, it should just work and in fact we can remove hacks like `fixup_storage_controller`. Follow-ups created as part of this work: * https://github.com/neondatabase/neon/issues/8399 * https://github.com/neondatabase/neon/issues/8400	2024-07-18 10:56:07 +02:00
Joonas Koivunen	730db859c7	feat(timeline_detach_ancestor): success idempotency (#8354) Right now timeline detach ancestor reports an error (409, "no ancestor") on a new attempt after successful completion. This makes it troublesome for storage controller retries. Fix it to respond with `200 OK` as if the operation had just completed quickly. Additionally, the returned timeline identifiers in the 200 OK response are now ordered, so that the storage controller (passthrough added in #8353) can compare responses from different nodes for errors. Design-wise, this PR introduces a new strategy for accessing the latest uploaded IndexPart: `RemoteTimelineClient::initialized_upload_queue(&self) -> Result<UploadQueueAccessor<'_>, NotInitialized>`. It should be a more scalable way to query the latest uploaded `IndexPart` than to add a query method for each question directly on `RemoteTimelineClient`. GC blocking will need to be introduced to make the operation fully idempotent. However, it is idempotent for the cases demonstrated by tests. Cc: #6994	2024-07-15 17:47:53 +00:00
Joonas Koivunen	324e4e008f	feat(storcon): timeline detach ancestor passthrough (#8353 ) Currently storage controller does not support forwarding timeline detach ancestor requests to pageservers. Add support for forwarding `PUT .../:tenant_id/timelines/:timeline_id/detach_ancestor`. Implement the support mostly as is, because the timeline detach ancestor will be made (mostly) idempotent in future PR. Cc: #6994	2024-07-15 18:08:24 +03:00
Conrad Ludgate	411a130675	Fix nightly warnings 2024 june (#8151) ## Problem New clippy warnings on nightly. ## Summary of changes Broken up into one commit per warning type. 1. Remove some unnecessary refs. 2. In edition 2024, inference will default to `!` and not `()`. 3. Clippy complains about doc comment indentation 4. Fix `Trait + ?Sized` where `Trait: Sized`. 5. diesel_derives triggering `non_local_definitions`	2024-07-12 13:58:04 +01:00
John Spray	814c8e8f68	storage controller: add node deletion API (#8226 ) ## Problem In anticipation of later adding a really nice drain+delete API, I initially only added an intentionally basic `/drop` API that is just about usable for deleting nodes in a pinch, but requires some ugly storage controller restarts to persuade it to restart secondaries. ## Summary of changes I started making a few tiny fixes, and ended up writing the delete API... - Quality of life nit: ordering of node + tenant listings in storcon_cli - Papercut: Fix the attach_hook using the wrong operation type for reporting slow locks - Make Service::spawn tolerate `generation_pageserver` columns that point to nonexistent node IDs. I started out thinking of this as a general resilience thing, but when implementing the delete API I realized it was actually a legitimate end state after the delete API is called (as that API doesn't wait for all reconciles to succeed). - Add a `DELETE` API for nodes, which does not gracefully drain, but does reschedule everything. This becomes safe to use when the system is in any state, but will incur availability gaps for any tenants that weren't already live-migrated away. If tenants have already been drained, this becomes a totally clean + safe way to decom a node. - Add a test and a storcon_cli wrapper for it This is meant to be a robust initial API that lets us remove nodes without doing ugly things like restarting the storage controller -- it's not quite a totally graceful node-draining routine yet. There's more work in https://github.com/neondatabase/neon/issues/8333 to get to our end-end state.	2024-07-11 17:05:47 +01:00
Vlad Lazar	d9a82468e2	storage_controller: fix ReconcilerWaiter::get_status (#8341 ) ## Problem SeqWait::would_wait_for returns Ok in the case when we would not wait for the sequence number and Err otherwise. ReconcilerWaiter::get_status uses it the wrong way around. This can cause the storage controller to go into a busy loop and make it look unavailable to the k8s controller. ## Summary of changes Use `SeqWait::would_wait_for` correctly.	2024-07-11 15:43:28 +01:00
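
The inverted check is easy to see in a reduced model. The sketch below uses a hypothetical `SeqTracker` stand-in rather than the real `SeqWait` type, and a two-state `WaiterStatus`; the only detail taken from the commit message is the convention that `would_wait_for` returns `Ok` when no wait is needed and `Err` otherwise.

```rust
/// Hypothetical stand-in for a sequence waiter; not the real `SeqWait` type.
struct SeqTracker {
    current: u64,
}

impl SeqTracker {
    /// Ok(()) if `seq` has already been reached (no wait needed),
    /// Err(current) if a caller would have to wait for it.
    fn would_wait_for(&self, seq: u64) -> Result<(), u64> {
        if seq <= self.current {
            Ok(())
        } else {
            Err(self.current)
        }
    }
}

/// Hypothetical, reduced status enum.
#[derive(Debug, PartialEq, Eq)]
enum WaiterStatus {
    Done,
    InProgress,
}

/// Correct mapping: Ok means "would not wait", so the work is done.
fn get_status(tracker: &SeqTracker, target_seq: u64) -> WaiterStatus {
    match tracker.would_wait_for(target_seq) {
        Ok(()) => WaiterStatus::Done,
        Err(_) => WaiterStatus::InProgress,
    }
}
```

Swapping the two match arms reproduces the bug: finished reconciles would be reported as still in progress, which is the kind of mistake that can keep a caller polling forever.
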
Christian Schwarz	1a49f1c15c	pageserver: move `page_service`'s `import basebackup` / `import wal` to mgmt API (#8292 ) I want to fix bugs in `page_service` ([issue](https://github.com/neondatabase/neon/issues/7427)) and the `import basebackup` / `import wal` stand in the way / make the refactoring more complicated. We don't use these methods anyway in practice, but, there have been some objections to removing the functionality completely. So, this PR preserves the existing functionality but moves it into the HTTP management API. Note that I don't try to fix existing bugs in the code, specifically not fixing * it only ever worked correctly for unsharded tenants * it doesn't clean up on error All errors are mapped to `ApiError::InternalServerError`.	2024-07-09 23:17:42 +02:00
Alex Chi Z	df3dc6e4c1	fix(pageserver): write to both v1+v2 for aux tenant import (#8316 ) close https://github.com/neondatabase/neon/issues/8202 ref https://github.com/neondatabase/neon/pull/6560 For tenant imports, we now write the aux files into both v1+v2 storage, so that the test case can pick either one for testing. Given the API is only used for testing, this looks like a safe change. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-07-08 20:05:59 +01:00
John Spray	b8bbaafc03	storage controller: fix heatmaps getting disabled during shard split (#8197 ) ## Problem At the start of do_tenant_shard_split, we drop any secondary location for the parent shards. The reconciler uses presence of secondary locations as a condition for enabling heatmaps. On the pageserver, child shards inherit their configuration from parents, but the storage controller assumes the child's ObservedState is the same as the parent's config from the prepare phase. The result is that some child shards end up with inaccurate ObservedState, and until something next migrates or restarts, those tenant shards aren't uploading heatmaps, so their secondary locations are downloading everything that was resident at the moment of the split (including ancestor layers which are often cleaned up shortly after the split). Closes: https://github.com/neondatabase/neon/issues/8189 ## Summary of changes - Use PlacementPolicy to control enablement of heatmap upload, rather than the literal presence of secondaries in IntentState: this way we avoid switching them off during shard split - test: during tenant split test, assert that the child shards have heatmap uploads enabled.	2024-06-28 18:27:13 +01:00
John Spray	063553a51b	pageserver: remove tenant create API (#8135 ) ## Problem For some time, we have created tenants with calls to location_conf. The legacy "POST /v1/tenant" path was only used in some tests. ## Summary of changes - Remove the API - Relocate TenantCreateRequest to the controller API file (this used to be used in both pageserver and controller APIs) - Rewrite tenant_create test helper to use location_config API, as control plane and storage controller do - Update docker-compose test script to create tenants with location_config API (this small commit is also present in https://github.com/neondatabase/neon/pull/7947)	2024-06-28 09:14:19 +01:00
Vlad Lazar	89cf8df93b	storcon: bump number of concurrent reconciles per operation (#8179) ## Problem Background node operations take a long time for loaded nodes. ## Summary of changes Increase number of concurrent reconciles an operation is allowed to spawn. This should make drain and fill operations faster and the new value is still well below the total limit of concurrent reconciles.	2024-06-27 13:16:41 +00:00
Arseny Sher	6f20a18e8e	Allow to change compute safekeeper list without restart. - Add --safekeepers option to neon_local reconfigure - Add it to python Endpoint reconfigure - Implement config reload in walproposer by restarting the whole bgw when safekeeper list changes. ref https://github.com/neondatabase/neon/issues/6341	2024-06-27 15:08:35 +03:00
Vlad Lazar	d557002675	storcon: don't overcommit when making node fill plan (#8171) ## Problem The fill requirement was not taken into account when looking through the shards of a given node to fill from. ## Summary of Changes Ensure that we do not fill a node past the recommendation from `Scheduler::compute_fill_requirement`.	2024-06-27 11:56:57 +01:00
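
A minimal sketch of a fill plan that respects the fill requirement is shown below. `plan_fill`, its parameters, and the use of plain integers as shard identifiers are hypothetical; the only behaviour taken from the commit message is that the check must also apply while iterating over the shards of a single source node.

```rust
/// Build a fill plan from candidate shards grouped by source node,
/// never selecting more than `fill_requirement` shards in total.
fn plan_fill(shards_by_source: &[Vec<u32>], fill_requirement: usize) -> Vec<u32> {
    let mut plan = Vec::new();
    for shards in shards_by_source {
        for &shard in shards {
            // Check inside the inner loop so that one large source node
            // cannot push the plan past the requirement.
            if plan.len() >= fill_requirement {
                return plan;
            }
            plan.push(shard);
        }
    }
    plan
}
```
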
John Spray	07f21dd6b6	pageserver: remove attach/detach apis (#8134 ) ## Problem These APIs have been deprecated for some time, but were still used from test code. Closes: https://github.com/neondatabase/neon/issues/4282 ## Summary of changes - It is still convenient to do a "tenant_attach" from a test without having to write out a location_conf body, so those test methods have been retained with implementations that call through to their location_conf equivalent.	2024-06-25 17:38:06 +01:00
Vlad Lazar	8fe3f17c47	storcon: improve drain and fill shard placement (#8119 ) ## Problem While adapting the storage controller scale test to do graceful rolling restarts via drain and fill, I noticed that secondaries are also being rescheduled, which, in turn, caused the storage controller to optimise attachments. ## Summary of changes * Introduce a transactional looking rescheduling primitive (i.e. "try to schedule to this secondary, but leave everything as is if you can't") * Use it for the drain and fill stages to avoid calling into `Scheduler::schedule` and having secondaries move around.	2024-06-22 14:20:58 +00:00
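
The "try to schedule to this secondary, but leave everything as is if you can't" primitive can be sketched by computing the new state on a copy and committing it only on success. `Intent`, `try_promote_secondary` and the `target_available` flag are hypothetical names for illustration, not the controller's types.

```rust
/// Hypothetical intent state of one shard: node ids only.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Intent {
    attached: Option<u64>,
    secondary: Vec<u64>,
}

/// Promote `target` from secondary to attached if possible.
/// On failure the intent is left untouched and `false` is returned.
fn try_promote_secondary(intent: &mut Intent, target: u64, target_available: bool) -> bool {
    if !target_available || !intent.secondary.contains(&target) {
        return false;
    }
    // Work on a copy so a failed attempt can never leave a partial update.
    let mut next = intent.clone();
    next.secondary.retain(|n| *n != target);
    if let Some(previous) = next.attached.replace(target) {
        // The old attached location is demoted instead of being rescheduled.
        next.secondary.push(previous);
    }
    *intent = next;
    true
}
```

Because no other secondary is touched, nothing else moves around as a side effect of the promotion.
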
John Spray	b74232eb4d	tests: allow-list neon_local endpoint errors from storage controller (#8123 ) ## Problem For testing, the storage controller has a built-in hack that loads neon_local endpoint config from disk, and uses it to reconfigure endpoints when the attached pageserver changes. Some tests that stop an endpoint while the storage controller is running could occasionally fail on log errors from the controller trying to use its special test-mode calls into neon local Endpoint. Example: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8117/9592392425/index.html#/testresult/9d2bb8623d0d53f8 ## Summary of changes - Give NotifyError an explicit NeonLocal variant, to avoid munging these into generic 500s (I don't want to ignore 500s in general) - Allow-list errors related to the local notification hook. The expectation is that tests using endpoints/workloads should be independently checking that those endpoints work: if neon_local generates an error inside the storage controller, that's ignorable.	2024-06-21 17:23:31 +00:00
Vlad Lazar	ee3081863e	storcon: implement endpoints for cancellation of drain and fill operations (#8029) ## Problem There's no way to cancel drain and fill operations. ## Summary of changes Implement HTTP endpoints to allow cancelling of background operations. When the operation is cancelled successfully, the node scheduling policy will revert to `Active`.	2024-06-21 17:13:51 +01:00
Vlad Lazar	01399621d5	storcon: avoid promoting too many shards of the same tenant (#8099 ) ## Problem The fill planner introduced in https://github.com/neondatabase/neon/pull/8014 selects tenant shards to promote strictly based on attached shard count load (tenant shards on nodes with the most attached shard counts are considered first). This approach runs the risk of migrating too many shards belonging to the same tenant on the same primary node. This is bad for availability and causes extra reconciles via the storage controller's background optimisations. Also see https://github.com/neondatabase/neon/pull/8014#discussion_r1642456241. ## Summary of changes Refine the fill plan to avoid promoting too many shards belonging to the same tenant on the same node. We allow for `max(1, shard_count / node_count)` shards belonging to the same tenant to be promoted.	2024-06-21 10:19:01 +01:00
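
The per-tenant cap `max(1, shard_count / node_count)` can be applied while walking the candidate list. The sketch below is an illustration under assumptions: `max_promote_per_tenant`, `cap_promotions`, the integer tenant ids and the guard against a zero node count are all hypothetical; only the formula comes from the commit message.

```rust
use std::collections::HashMap;

/// Cap from the commit message: max(1, shard_count / node_count).
/// The `.max(1)` on `node_count` is an added guard against division by zero.
fn max_promote_per_tenant(shard_count: usize, node_count: usize) -> usize {
    std::cmp::max(1, shard_count / node_count.max(1))
}

/// Filter `(tenant, shard)` candidates, already in planner priority order,
/// so that no tenant exceeds its cap on the node being filled.
fn cap_promotions(
    candidates: &[(u32, u8)],
    shard_counts: &HashMap<u32, usize>,
    node_count: usize,
) -> Vec<(u32, u8)> {
    let mut promoted: HashMap<u32, usize> = HashMap::new();
    let mut plan = Vec::new();
    for &(tenant, shard) in candidates {
        let shard_count = shard_counts.get(&tenant).copied().unwrap_or(1);
        let cap = max_promote_per_tenant(shard_count, node_count);
        let count = promoted.entry(tenant).or_insert(0);
        if *count < cap {
            *count += 1;
            plan.push((tenant, shard));
        }
    }
    plan
}
```

With 8 shards and 3 nodes the cap is 2, so at most two shards of that tenant are promoted onto the same node.
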
Jure Bajic	0792bb6785	Add tracing for shared locks in `id_lock_map` (#7618) ## Problem Storage controller shared locks do not print a warning when held for long time spans. ## Summary of changes Extends the work from https://github.com/neondatabase/neon/issues/7108, which added tracing for exclusive locks in `id_lock_map`, to cover shared locks as well. This was suggested in the comment https://github.com/neondatabase/neon/pull/7397#discussion_r1587961160	2024-06-21 09:47:04 +01:00
Vlad Lazar	f8ac3b0e0e	storcon: use attached shard counts for initial shard placement (#8061 ) ## Problem When creating a new shard the storage controller schedules via Scheduler::schedule_shard. This does not take into account the number of attached shards. What it does take into account is the node affinity: when a shard is scheduled, all its nodes (primaries and secondaries) get their affinity incremented. For two node clusters and shards with one secondary we have a pathological case where all primaries are scheduled on the same node. Now that we track the count of attached shards per node, this is trivial to fix. Still, the "proper" fix is to use the pageserver's utilization score. Closes https://github.com/neondatabase/neon/issues/8041 ## Summary of changes Use attached shard count when deciding which node to schedule a fresh shard on.	2024-06-20 17:32:01 +01:00
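
The two-node pathological case can be reproduced with a small selection function. This sketch assumes a hypothetical `NodeLoad` struct and assumes the attached shard count is used as a tie-breaker after affinity; the exact ordering of criteria in the real scheduler is not stated in the commit message.

```rust
/// Hypothetical per-node scheduling inputs.
struct NodeLoad {
    node_id: u64,
    /// How many locations (attached or secondary) of related shards live here.
    affinity: usize,
    /// How many shards are attached (primary) on this node.
    attached_shards: usize,
}

/// Pick the node for a fresh shard's attached location: lowest affinity first,
/// then fewest attached shards, then lowest node id for determinism.
fn pick_attached_node(nodes: &[NodeLoad]) -> Option<u64> {
    nodes
        .iter()
        .min_by_key(|n| (n.affinity, n.attached_shards, n.node_id))
        .map(|n| n.node_id)
}
```

In a two-node cluster where every shard has one secondary, both nodes carry the same affinity, so without the attached count the tie would always resolve to the same node.
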
Christian Schwarz	438fd2aaf3	neon_local: `background_process`: launch all processes in repo dir (or `datadir`) (#8058 ) Before this PR, storage controller and broker would run in the PWD of neon_local, i.e., most likely the checkout of neon.git. With this PR, the shared infrastructure for background processes sets the PWD. Benefits: * easy listing of processes in a repo dir using `lsof`, see added comment in the code * coredumps go in the right directory (next to the process) * generally matching common expectations, I think Changes: * set the working directory in `background_process` module * drive-by: fix reliance of storage_controller on NEON_REPO_DIR being set by neon_local for the local compute hook to work correctly	2024-06-19 13:59:36 +02:00
Vlad Lazar	e7d62a257d	test: fix tenant duplication utility generation numbers (#8096) ## Problem We have this set of test utilities which duplicate a tenant by copying everything that's in remote storage and then attaching a tenant to the pageserver and storage controller. When the "copied tenants" are created on the storage controller, they start off from generation number 0. This means that they can't see anything past that generation. This issue has existed ever since generation numbers were introduced, but we've largely been lucky for the generation to stay stable during the template tenant creation. ## Summary of Changes Extend the storage controller debug attach hook to accept a generation override. Use that in the tenant duplication logic to set the generation number to something greater than the naturally reached generation. This allows the tenants to see all layer files.	2024-06-19 11:55:59 +01:00
Vlad Lazar	5778d714f0	storcon: add drain and fill background operations for graceful cluster restarts (#8014) ## Problem Pageserver restarts cause read availability downtime for tenants. See `Motivation` section in the [RFC](https://github.com/neondatabase/neon/pull/7704). ## Summary of changes * Introduce a new `NodeSchedulingPolicy`: `PauseForRestart` * Implement the first take of drain and fill algorithms * Add a node status endpoint which can be polled to figure out when an operation is done The implementation follows the RFC, so it might be useful to peek at it as you're reviewing. Since the PR is rather chunky, I've made sure all commits build (with warnings), so you can review by commit if you prefer that. RFC: https://github.com/neondatabase/neon/pull/7704 Related https://github.com/neondatabase/neon/issues/7387	2024-06-19 11:55:30 +01:00
Vlad Lazar	16d80128ee	storcon: handle entire cluster going unavailable correctly (#8060) ## Problem A period of unavailability for all pageservers in a cluster produced the following fallout in staging: all tenants became detached and required manual operation to re-attach. Manually restarting the storage controller re-attached all tenants due to a consistency bug. Turns out there are two related bugs which caused the issue: 1. Pageserver re-attach can be processed before the first heartbeat. Hence, when handling the availability delta produced by the heartbeater, `Node::get_availability_transition` claims that there's no need to reconfigure the node. 2. We would still attempt to reschedule tenant shards when handling offline transitions even if the entire cluster is down. This puts tenant shards into a state where the reconciler believes they have to be detached (no pageserver shows up in their intent state). This is doubly wrong because we don't mark the tenant shards as detached in the database, thus causing memory vs database consistency issues. Luckily, this bug allowed all tenant shards to re-attach after restart. ## Summary of changes * For (1), abuse the fact that re-attach requests do not contain a utilisation score and use that to differentiate from a node that replied to heartbeats. * For (2), introduce a special case that skips any rescheduling if the entire cluster is unavailable. * Update the storage controller heartbeat test with an extra scenario where the entire cluster goes for lunch. Fixes https://github.com/neondatabase/neon/issues/8044	2024-06-17 11:40:35 +01:00
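
The special case for (2) amounts to a guard before any rescheduling happens. The sketch below uses a hypothetical two-state `Availability` enum and a hypothetical `should_reschedule_on_offline` helper; it only illustrates the rule that rescheduling is pointless when no node is left to schedule onto.

```rust
/// Hypothetical, reduced node availability.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Availability {
    Active,
    Offline,
}

/// When handling an offline transition, reschedule tenant shards only if
/// at least one node in the cluster is still available.
fn should_reschedule_on_offline(nodes: &[Availability]) -> bool {
    nodes.iter().any(|a| *a == Availability::Active)
}
```

Skipping the reschedule keeps the intent state pointing at the existing pageservers, so shards are not considered detached while the whole cluster is down.
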
John Spray	6843fd8f89	storage controller: always wait for tenant detach before delete (#8049 ) ## Problem This test could fail with a timeout waiting for tenant deletions. Tenant deletions could get tripped up on nodes transitioning from offline to online at the moment of the deletion. In a previous reconciliation, the reconciler would skip detaching a particular location because the node was offline, but then when we do the delete the node is marked as online and can be picked as the node to use for issuing a deletion request. This hits the "Unexpectedly still attached path", which would still work if the caller kept calling DELETE, but if a caller does a Delete,get,get,get poll, then it doesn't work because the GET calls fail after we've marked the tenant as detached. ## Summary of changes Fix the undesirable storage controller behavior highlighted by this test failure: - Change tenant deletion flow to _always_ wait for reconciliation to succeed: it was unsound to proceed and return 202 if something was still attached, because after the 202 callers can no longer GET the tenant. Stabilize the test: - Add a reconcile_until_idle to the test, so that it will not have reconciliations running in the background while we mark a node online. This test is not meant to be a chaos test: we should test that kind of complexity elsewhere. - This reconcile_until_idle also fixes another failure mode where the test might see a None for a tenant location because a reconcile was mutating it (https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7288/9500177581/index.html#suites/8fc5d1648d2225380766afde7c428d81/4acece42ae00c442/) It remains the case that a motivated tester could produce a situation where a DELETE gives a 500, when precisely the wrong node transitions from offline to available at the precise moment of a deletion (but the 500 is better than returning 202 and then failing all subsequent GETs). 
Note that nodes don't go through the offline state during normal restarts, so this is super rare. We should eventually fix this by making DELETE to the pageserver implicitly detach the tenant if it's attached, but that should wait until nobody is using the legacy-style deletes (the ones that use 202 + polling)	2024-06-14 10:37:30 +01:00
Vlad Lazar	126bcc3794	storcon: track number of attached shards for each node (#8011 ) ## Problem The storage controller does not track the number of shards attached to a given pageserver. This is a requirement for various scheduling operations (e.g. draining and filling will use this to figure out if the cluster is balanced) ## Summary of Changes Track the number of shards attached to each node. Related https://github.com/neondatabase/neon/issues/7387	2024-06-11 16:03:25 +01:00
John Spray	91dd99038e	pageserver/controller: enable tenant deletion without attachment (#7957) ## Problem As described in #7952, the controller's attempt to reconcile a tenant before finally deleting it can get hung up waiting for the compute notification hook to accept updates. The fact that we try and reconcile a tenant at all during deletion is part of a more general design issue (#5080), where deletion was implemented as an operation on an attached tenant, requiring the tenant to be attached in order to delete it, which is not in principle necessary. Closes: #7952 ## Summary of changes - In the pageserver deletion API, only do the traditional deletion path if the tenant is attached. If it's secondary, then tear down the secondary location, and then do a remote delete. If it's not attached at all, just do the remote delete. - In the storage controller, instead of ensuring a tenant is attached before deletion, do a best-effort detach of the tenant, and then call into some arbitrary pageserver to issue a deletion of remote content. The pageserver retains its existing delete behavior when invoked on attached locations. We can remove this later when all users of the API are updated to do a detach-before-delete. This will enable removing the "weird" code paths during startup that sometimes load a tenant and then immediately delete it, and removing the deletion markers on tenants.	2024-06-05 20:22:54 +00:00
John Spray	c84656a53e	pageserver: implement auto-splitting (#7681 ) ## Problem Currently tenants are only split into multiple shards if a human being calls the API to do it. Issue: #7388 ## Summary of changes - Add a pageserver API for returning the top tenants by size - Add a step to the controller's background loop where if there is no reconciliation or optimization to be done, it looks for things to split. - Add a test that runs pgbench on many tenants concurrently, and checks that splitting happens as expected as tenants grow, without interrupting the client I/O. This PR is quite basic: there is a tasklist in https://github.com/neondatabase/neon/issues/7388 for further work. This PR is meant to be safe (off by default), and sufficient to enable our staging environment to run lots of sharded tenants without a human having to set them up.	2024-05-17 16:01:24 +00:00
John Spray	af99c959ef	storage controller: use SERIALIZABLE isolation level (#7792 ) ## Problem The storage controller generally assumes that things like updating generation numbers are atomic: it should use a strict isolation level. ## Summary of changes - Wrap all database operations in a SERIALIZABLE transaction. - Retry serialization failures, as these do not indicate problems and are normal when plenty of concurrent work is happening. Using this isolation level for all reads is overkill, but much simpler than reasoning about it on a per-operation basis, and does not hurt performance. Tested this with a modified version of storage_controller_many_tenants test with 128k shards, to check that our performance is still fine: it is.	2024-05-17 16:44:33 +01:00
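
Retrying serialization failures can be sketched as a small wrapper around a fallible operation. The names `DbError` and `with_serializable_retry` are hypothetical, and the attempt limit is an assumption added for the sketch; the commit message only states that such failures are retried because they are normal under concurrency.

```rust
/// Hypothetical error type: serialization failures are distinguished
/// from every other database error.
#[derive(Debug, PartialEq, Eq)]
enum DbError {
    SerializationFailure,
    Other(String),
}

/// Run `op`, retrying only on serialization failures, at most `max_attempts` times.
/// Any other error, or the final serialization failure, is returned to the caller.
fn with_serializable_retry<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, DbError>,
) -> Result<T, DbError> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op() {
            Err(DbError::SerializationFailure) if attempt < max_attempts => continue,
            other => return other,
        }
    }
}
```

In a real controller, `op` would open a transaction at the SERIALIZABLE level, run the queries and commit; that part is omitted here because it depends on the database client in use.
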
John Spray	107f535294	storage controller: fix handling of tenants with no timelines during scheduling optimization (#7673) ## Problem Storage controller was using a zero layer count in SecondaryProgress as a proxy for "not initialized". However, in tenants with zero timelines (a legitimate state), the layer count remains zero forever. This caused https://github.com/neondatabase/neon/pull/7583 to destabilize the storage controller scale test, which creates lots of tenants, some of which don't get any timelines. ## Summary of changes - Use a None mtime instead of zero layer count to determine if a SecondaryProgress should be ignored. - Adjust the test to use a shorter heatmap upload period to let it proceed faster while waiting for scheduling optimizations to complete.	2024-05-09 12:33:09 +01:00
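
The difference between the two "not initialized" signals can be shown with a reduced progress struct. The field names and the `warmth` helper below are hypothetical and only loosely modelled on the description; the point taken from the commit message is that a missing mtime, not a zero layer count, marks a progress report that should be ignored.

```rust
use std::time::SystemTime;

/// Hypothetical, reduced view of a secondary location's download progress.
struct SecondaryProgress {
    /// None until the secondary has seen a heatmap at all.
    heatmap_mtime: Option<SystemTime>,
    layers_downloaded: usize,
    layers_total: usize,
}

/// Returns None if the progress should be ignored (not initialized yet),
/// otherwise whether every layer named in the heatmap has been downloaded.
fn warmth(progress: &SecondaryProgress) -> Option<bool> {
    // A zero layer count is legitimate for tenants without timelines,
    // so only the missing mtime is treated as "not initialized".
    progress.heatmap_mtime?;
    Some(progress.layers_downloaded >= progress.layers_total)
}
```

A tenant with no timelines reports zero layers with an mtime set, and is therefore treated as warm instead of being ignored forever.
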
John Spray	b5a6e68e68	storage controller: check warmth of secondary before doing proactive migration (#7583 ) ## Problem The logic in Service::optimize_all would sometimes choose to migrate a tenant to a secondary location that was only recently created, resulting in Reconciler::live_migrate hitting its 5 minute timeout warming up the location, and proceeding to attach a tenant to a location that doesn't have a warm enough local set of layer files for good performance. Closes: #7532 ## Summary of changes - Add a pageserver API for checking download progress of a secondary location - During `optimize_all`, connect to pageservers of candidate optimization secondary locations, and check they are warm. - During shard split, do heatmap uploads and start secondary downloads, so that the new shards' secondary locations start downloading ASAP, rather than waiting minutes for background downloads to kick in. I have intentionally not implemented this by continuously reading the status of locations, to avoid dealing with the scale challenge of efficiently polling & updating 10k-100k locations status. If we implement that in the future, then this code can be simplified to act based on latest state of a location rather than fetching it inline during optimize_all.	2024-05-03 14:28:23 +00:00
John Spray	b7385bb016	storage_controller: fix non-timeline passthrough GETs (#7602 ) ## Problem We were matching on `/tenant/:tenant_id` and `/tenant/:tenant_id/timeline`, but not non-timeline tenant sub-paths. There aren't many: this was only noticeable when using the synthetic_size endpoint by hand. ## Summary of changes - Change the wildcard from `/tenant/:tenant_id/timeline` to `/tenant/:tenant_id/*` - Add test lines that exercise this	2024-05-03 12:52:43 +01:00
Jure Bajic	00423152c6	Store operation identifier in `IdLockMap` on exclusive lock (#7397 ) ## Problem Issues around operation and tenant locks would have been hard to debug since there was little observability around them. ## Summary of changes - As suggested in the issue, a wrapper was added around `OwnedRwLockWriteGuard` called `IdentifierLock` that removes the operation currently holding the exclusive lock when it's dropped. - The value in `IdLockMap` was extended to hold a pair of locks and operations that can be accessed and locked independently. - When requesting an exclusive lock besides returning the lock on that resource, an operation is changed if the lock is acquired. Closes https://github.com/neondatabase/neon/issues/7108	2024-05-03 09:38:19 +01:00
Matt Podraza	ab95942fc2	storage controller: make the initial database wait configurable (#7591 ) This allows passing a humantime string in the CLI to configure the initial wait for the database. It defaults to the previously hard-coded value of 5 seconds.	2024-05-02 15:19:51 +00:00
Conrad Ludgate	cb4b4750ba	update to reqwest 0.12 (#7561 ) ## Problem #7557 ## Summary of changes	2024-05-02 11:16:04 +02:00
John Spray	a74b60066c	storage controller: test for large shard counts (#7475 ) ## Problem Storage controller was observed to have unexpectedly large memory consumption when loaded with many thousands of shards. This was recently fixed: - https://github.com/neondatabase/neon/pull/7493 ...but we need a general test that the controller is well behaved with thousands of shards. Closes: https://github.com/neondatabase/neon/issues/7460 Closes: https://github.com/neondatabase/neon/issues/7463 ## Summary of changes - Add test test_storage_controller_many_tenants to exercise the system's behaviour with a more substantial workload. This test measures memory consumption and reproduces #7460 before the other changes in this PR. - Tweak reconcile_all's return value to make it nonzero if it spawns no reconcilers, but _would_ have spawned some reconcilers if they weren't blocked by the reconcile concurrency limit. This makes the test's reconcile_until_idle behave as expected (i.e. not complete until the system is nice and calm). - Fix an issue where tenant migrations would leave a spurious secondary location when migrated to some location that was not already their secondary (this was an existing low-impact bug that tripped up the test's consistency checks). On the test with 8000 shards, the resident memory per shard is about 20KiB. This is not really per-shard memory: the primary source of memory growth is the number of concurrent network/db clients we create. With 8000 shards, the test takes 125s to run on my workstation.	2024-04-30 15:21:54 +00:00
John Spray	84914434e3	storage controller: send startup compute notifications in background (#7495 ) ## Problem Previously, we tried to send compute notifications in startup_reconcile before completing that function, with a time limit. Any notifications that don't happen within the time limit result in tenants having their `pending_compute_notification` flag set, which causes them to spawn a Reconciler next time the background reconciler loop runs. This causes two problems: - Spawning a lot of reconcilers after startup caused a spike in memory (this is addressed in https://github.com/neondatabase/neon/pull/7493) - After https://github.com/neondatabase/neon/pull/7493, spawning lots of reconcilers will block some other operations, e.g. a tenant creation might fail due to lack of reconciler semaphore units while the controller is busy running all the Reconcilers for its startup compute notifications. When the code was first written, ComputeHook didn't have internal ordering logic to ensure that notifications for a shard were sent in the right order. Since that was added in https://github.com/neondatabase/neon/pull/7088, we can use it to avoid waiting for notifications to complete in startup_reconcile. Related to: https://github.com/neondatabase/neon/issues/7460 ## Summary of changes - Add a `notify_background` method to ComputeHook. - Call this from startup_reconcile instead of doing notifications inline - Process completions from `notify_background` in `process_results`, and if a notification failed then set the `pending_compute_notification` flag on the shard. The result is that we will only spawn lots of Reconcilers if the compute notifications _fail_, not just because they take some significant amount of time. Test coverage for this case is in https://github.com/neondatabase/neon/pull/7475
John Spray	b655c7030f	neon_local: add "tenant import" (#7399 ) ## Problem Sometimes we have test data in the form of S3 contents that we would like to run live in a neon_local environment. ## Summary of changes - Add a storage controller API that imports an existing tenant. Currently this is equivalent to doing a create with a high generation number, but in future this would be something smarter to probe S3 to find the shards in a tenant and find generation numbers. - Add a `neon_local` command that invokes the import API, and then inspects timelines in the newly attached tenant to create matching branches.	2024-04-29 08:52:18 +01:00
John Spray	d63185fa6c	storage controller: log hygiene & better error type (#7508 ) These are testability/logging improvements spun off from #7475 - Don't log warnings for shutdown errors in compute hook - Revise logging around heartbeats and reconcile_all so that we aren't emitting such a large volume of INFO messages under normal quiet conditions. - Clean up the `last_error` of TenantShard to hold a ReconcileError instead of a String, and use that properly typed error to suppress reconciler cancel errors during reconcile_all_now. This is important for tests that iteratively call that, as otherwise they would get 500 errors when some reconciler in flight was cancelled (perhaps due to a state change on the tenant shard starting a new reconciler).	2024-04-26 08:15:59 +00:00
John Spray	e8814b6f81	controller: limit Reconciler concurrency (#7493 ) ## Problem Storage controller memory can spike very high if we have many tenants and they all try to reconcile at the same time. Related: - https://github.com/neondatabase/neon/issues/7463 - https://github.com/neondatabase/neon/issues/7460 Not closing those issues in this PR, because the test coverage for them will be in https://github.com/neondatabase/neon/pull/7475 ## Summary of changes - Add a CLI arg `--reconciler-concurrency`, defaulted to 128 - Add a semaphore to Service with this many units - In `maybe_reconcile_shard`, try to acquire semaphore unit. If we can't get one, return a ReconcileWaiter for a future sequence number, and push the TenantShardId onto a channel of delayed IDs. - In `process_result`, consume from the channel of delayed IDs if there are semaphore units available and call maybe_reconcile_shard again for these delayed shards. This has been tested in https://github.com/neondatabase/neon/pull/7475, but will land that PR separately because it contains other changes & needs the test stabilizing. This change is worth merging sooner, because it fixes a practical issue with larger shard counts.	2024-04-25 10:46:07 +01:00
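
The concurrency limit described in e8814b6f81 (#7493) can be illustrated with a small sketch. This is not the storage controller's code: `Limiter`, `Outcome`, `ShardId` and the method bodies are hypothetical names invented here, a plain counter stands in for the real semaphore, and a `VecDeque` stands in for the channel of delayed IDs. It only shows the shape of the idea: take a unit if one is free, otherwise queue the shard and retry it when a running reconciler completes.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the controller's TenantShardId.
type ShardId = u32;

/// What happened when we asked to reconcile a shard.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// A unit was free, so a reconciler would be spawned now.
    Spawned,
    /// No unit was free, so the shard was queued for later.
    Delayed,
}

/// Assumed model: a counter plays the role of the semaphore and a
/// queue plays the role of the channel of delayed shard IDs.
struct Limiter {
    available: usize,
    delayed: VecDeque<ShardId>,
}

impl Limiter {
    fn new(concurrency: usize) -> Self {
        Limiter { available: concurrency, delayed: VecDeque::new() }
    }

    /// Try to take a unit; if none is free, remember the shard instead.
    fn maybe_reconcile(&mut self, shard: ShardId) -> Outcome {
        if self.available > 0 {
            self.available -= 1;
            Outcome::Spawned
        } else {
            self.delayed.push_back(shard);
            Outcome::Delayed
        }
    }

    /// Called when one reconciler finishes: return its unit, then start
    /// as many delayed shards as the free units allow. Returns the shards
    /// that were started as a result.
    fn process_result(&mut self) -> Vec<ShardId> {
        self.available += 1;
        let mut started = Vec::new();
        while self.available > 0 {
            match self.delayed.pop_front() {
                Some(shard) => {
                    self.available -= 1;
                    started.push(shard);
                }
                None => break,
            }
        }
        started
    }
}
```

With a limit of 2, a third `maybe_reconcile` call is delayed, and the next `process_result` call starts that shard using the unit that was just returned.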


58 Commits