## Problem
Migrations of tenant shards with cold secondaries are holding up drains
in during production deployments.
## Summary of changes
If a secondary locations is lagging by more than 256MiB (configurable,
but that's the default), then skip cutting it over to the secondary as part of the node drain.
Earlier I was thinking we'd need a (ancestor_lsn, timeline_id) ordered
list of reparented. Turns out we did not need it at all. Replace it with
an unordered hashset. Additionally refactor the reparented direct
children query out, it will later be used from more places.
Split off from #8430.
Cc: #6994
Chaos injection bridges the gap between automated testing (where we do
lots of different things with small, short-lived tenants), and staging
(where we do many fewer things, but with larger, long-lived tenants).
This PR adds a first type of chaos which isn't really very chaotic: it's
live migration of tenants between healthy pageservers. This nevertheless
provides continuous checks that things like clean, prompt shutdown of
tenants works for realistically deployed pageservers with realistically
large tenants.
## Problem
This code was confusing, untested and covered:
- an impossible case, where intent state is AttacheStale (we never do
this)
- a rare edge case (going from AttachedMulti to Attached), which we were
not testing, and in any case the pageserver internally does the same
Tenant reset in this transition as it would do if we incremented
generation.
Closes: https://github.com/neondatabase/neon/issues/8367
## Summary of changes
- Simplify the logic to only skip incrementing the generation if the
location already has the expected generation and the exact same mode.
Part of #8128, followed by #8502.
## Problem
Currently we lack mechanism to alert unhealthy `scan_metadata` status if
we start running this scrubber command as part of a cronjob. With the
storage controller client introduced to storage scrubber in #8196, it is
viable to set up alert by storing health status in the storage
controller database.
We intentionally do not store the full output to the database as the
json blobs potentially makes the table really huge. Instead, only a
health status and a timestamp recording the last time metadata health
status is posted on a tenant shard.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
We are missing the step-down primitive required to implement rolling
restarts of the storage controller.
## Summary of changes
Add `/control/v1/step_down` endpoint which puts the storage controller
into a state where it rejects
all API requests apart from `/control/v1/step_down`, `/status` and
`/metrics`. When receiving the request,
storage controller cancels all pending reconciles and waits for them to
exit gracefully. The response contains
a snapshot of the in-memory observed state.
Related:
* https://github.com/neondatabase/cloud/issues/14701
* https://github.com/neondatabase/neon/issues/7797
* https://github.com/neondatabase/neon/pull/8310
## Problem
Storcon shutdown did not produce a clean observed state. This is not a
problem at the moment, but we will need to stop all reconciles with
clean observed state for rolling restarts.
I tried to test this by collecting the observed state during shutdown
and comparing it with the in-memory observed
state, but it doesn't work because a lot of tests use the cursed attach
hook to create tenants directly through the ps.
## Summary of Changes
Rework storcon shutdown as follows:
* Reconcilers get a separate cancellation token which is a child token
of the global `Service::cancel`.
* Reconcilers get a separate gate
* Add a mechanism to drain the reconciler result queue before
* Put all of this together into a clean shutdown sequence
Related https://github.com/neondatabase/cloud/issues/14701
## Problem
Re-attach blocks the pageserver http server from starting up. Hence, it
can't reply to heartbeats
until that's done. This makes the storage controller mark the node
off-line (not good). We worked
around this by setting the interval after which nodes are marked offline
to 5 minutes. This isn't a
long term solution.
## Summary of changes
* Introduce a new `NodeAvailability` state: `WarmingUp`. This state
models the following time interval:
* From receiving the re-attach request until the pageserver replies to
the first heartbeat post re-attach
* The heartbeat delta generator becomes aware of this state and uses a
separate longer interval
* Flag `max-warming-up-interval` now models the longer timeout and
`max-offline-interval` the shorter one to
match the names of the states
Closes https://github.com/neondatabase/neon/issues/7552
## Problem
`test_change_pageserver` stops pageservers in a way that can overlap
with the controller's heartbeats: the controller can get a heartbeat
success and then immediately find the node unavailable. This particular
situation triggers a log that isn't in our current allow-list of
messages for nodes offline
Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8339/10048487700/index.html#testresult/19678f27810231df/retries
## Summary of changes
- Add the message to the allow list
## Problem
After a shard split, the pageserver leaves the ancestor shard's content
in place. It may be referenced by child shards, but eventually child
shards will de-reference most ancestor layers as they write their own
data and do GC. We would like to eventually clean up those ancestor
layers to reclaim space.
## Summary of changes
- Extend the physical GC command with `--mode=full`, which includes
cleaning up unreferenced ancestor shard layers
- Add test `test_scrubber_physical_gc_ancestors`
- Remove colored log output: in testing this is irritating ANSI code
spam in logs, and in interactive use doesn't add much.
- Refactor storage controller API client code out of storcon_client into
a `storage_controller/client` crate
- During physical GC of ancestors, call into the storage controller to
check that the latest shards seen in S3 reflect the latest state of the
tenant, and there is no shard split in progress.
When `NeonEnv.from_repo_dir` was introduced, storage controller stored
its
state exclusively `attachments.json`.
Since then, it has moved to using Postgres, which stores its state in
`storage_controller_db`.
But `NeonEnv.from_repo_dir` wasn't adjusted to do this.
This PR rectifies the situation.
Context for this is failures in
`test_pageserver_characterize_throughput_with_n_tenants`
CF:
https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH
Notably, `from_repo_dir` is also used by the backwards- and
forwards-compatibility.
Thus, the changes in this PR affect those tests as well.
However, it turns out that the compatibility snapshot already contains
the `storage_controller_db`.
Thus, it should just work and in fact we can remove hacks like
`fixup_storage_controller`.
Follow-ups created as part of this work:
* https://github.com/neondatabase/neon/issues/8399
* https://github.com/neondatabase/neon/issues/8400
Right now timeline detach ancestor reports an error (409, "no ancestor")
on a new attempt after successful completion. This makes it troublesome
for storage controller retries. Fix it to respond with `200 OK` as if
the operation had just completed quickly.
Additionally, the returned timeline identifiers in the 200 OK response
are now ordered so that responses between different nodes for error
comparison are done by the storage controller added in #8353.
Design-wise, this PR introduces a new strategy for accessing the latest
uploaded IndexPart:
`RemoteTimelineClient::initialized_upload_queue(&self) ->
Result<UploadQueueAccessor<'_>, NotInitialized>`. It should be a more
scalable way to query the latest uploaded `IndexPart` than to add a
query method for each question directly on `RemoteTimelineClient`.
GC blocking will need to be introduced to make the operation fully
idempotent. However, it is idempotent for the cases demonstrated by
tests.
Cc: #6994
Currently storage controller does not support forwarding timeline detach
ancestor requests to pageservers. Add support for forwarding `PUT
.../:tenant_id/timelines/:timeline_id/detach_ancestor`. Implement the
support mostly as is, because the timeline detach ancestor will be made
(mostly) idempotent in future PR.
Cc: #6994
## Problem
new clippy warnings on nightly.
## Summary of changes
broken up each commit by warning type.
1. Remove some unnecessary refs.
2. In edition 2024, inference will default to `!` and not `()`.
3. Clippy complains about doc comment indentation
4. Fix `Trait + ?Sized` where `Trait: Sized`.
5. diesel_derives triggering `non_local_defintions`
## Problem
In anticipation of later adding a really nice drain+delete API, I
initially only added an intentionally basic `/drop` API that is just
about usable for deleting nodes in a pinch, but requires some ugly
storage controller restarts to persuade it to restart secondaries.
## Summary of changes
I started making a few tiny fixes, and ended up writing the delete
API...
- Quality of life nit: ordering of node + tenant listings in storcon_cli
- Papercut: Fix the attach_hook using the wrong operation type for
reporting slow locks
- Make Service::spawn tolerate `generation_pageserver` columns that
point to nonexistent node IDs. I started out thinking of this as a
general resilience thing, but when implementing the delete API I
realized it was actually a legitimate end state after the delete API is
called (as that API doesn't wait for all reconciles to succeed).
- Add a `DELETE` API for nodes, which does not gracefully drain, but
does reschedule everything. This becomes safe to use when the system is
in any state, but will incur availability gaps for any tenants that
weren't already live-migrated away. If tenants have already been
drained, this becomes a totally clean + safe way to decom a node.
- Add a test and a storcon_cli wrapper for it
This is meant to be a robust initial API that lets us remove nodes
without doing ugly things like restarting the storage controller -- it's
not quite a totally graceful node-draining routine yet. There's more
work in https://github.com/neondatabase/neon/issues/8333 to get to our
end-end state.
## Problem
SeqWait::would_wait_for returns Ok in the case when we would not wait
for the sequence number and Err otherwise.
ReconcilerWaiter::get_status uses it the wrong way around. This can
cause the storage controller to go into a busy loop
and make it look unavailable to the k8s controller.
## Summary of changes
Use `SeqWait::would_wait_for` correctly.
I want to fix bugs in `page_service`
([issue](https://github.com/neondatabase/neon/issues/7427)) and the
`import basebackup` / `import wal` stand in the way / make the
refactoring more complicated.
We don't use these methods anyway in practice, but, there have been some
objections to removing the functionality completely.
So, this PR preserves the existing functionality but moves it into the
HTTP management API.
Note that I don't try to fix existing bugs in the code, specifically not
fixing
* it only ever worked correctly for unsharded tenants
* it doesn't clean up on error
All errors are mapped to `ApiError::InternalServerError`.
## Problem
At the start of do_tenant_shard_split, we drop any secondary location
for the parent shards. The reconciler uses presence of secondary
locations as a condition for enabling heatmaps.
On the pageserver, child shards inherit their configuration from
parents, but the storage controller assumes the child's ObservedState is
the same as the parent's config from the prepare phase. The result is
that some child shards end up with inaccurate ObservedState, and until
something next migrates or restarts, those tenant shards aren't
uploading heatmaps, so their secondary locations are downloading
everything that was resident at the moment of the split (including
ancestor layers which are often cleaned up shortly after the split).
Closes: https://github.com/neondatabase/neon/issues/8189
## Summary of changes
- Use PlacementPolicy to control enablement of heatmap upload, rather
than the literal presence of secondaries in IntentState: this way we
avoid switching them off during shard split
- test: during tenant split test, assert that the child shards have
heatmap uploads enabled.
## Problem
For some time, we have created tenants with calls to location_conf. The
legacy "POST /v1/tenant" path was only used in some tests.
## Summary of changes
- Remove the API
- Relocate TenantCreateRequest to the controller API file (this used to
be used in both pageserver and controller APIs)
- Rewrite tenant_create test helper to use location_config API, as
control plane and storage controller do
- Update docker-compose test script to create tenants with
location_config API (this small commit is also present in
https://github.com/neondatabase/neon/pull/7947)
## Problem
Background node operations take a long time for loaded nodes.
## Summary of changes
Increase number of concurrent reconciles an operation is allowed to
spawn.
This should make drain and fill operations faster and the new value is
still well below the total limit of concurrent reconciles.
- Add --safekeepers option to neon_local reconfigure
- Add it to python Endpoint reconfigure
- Implement config reload in walproposer by restarting the whole bgw when
safekeeper list changes.
ref https://github.com/neondatabase/neon/issues/6341
## Problem
The fill requirement was not taken into account when looking through the
shards of a given node to fill from.
## Summary of Changes
Ensure that we do not fill a node past the recommendation from
`Scheduler::compute_fill_requirement`.
## Problem
These APIs have been deprecated for some time, but were still used from
test code.
Closes: https://github.com/neondatabase/neon/issues/4282
## Summary of changes
- It is still convenient to do a "tenant_attach" from a test without
having to write out a location_conf body, so those test methods have
been retained with implementations that call through to their
location_conf equivalent.
## Problem
While adapting the storage controller scale test to do graceful rolling
restarts via
drain and fill, I noticed that secondaries are also being rescheduled,
which, in turn,
caused the storage controller to optimise attachments.
## Summary of changes
* Introduce a transactional looking rescheduling primitive (i.e. "try to
schedule to this
secondary, but leave everything as is if you can't")
* Use it for the drain and fill stages to avoid calling into
`Scheduler::schedule` and having
secondaries move around.
## Problem
For testing, the storage controller has a built-in hack that loads
neon_local endpoint config from disk, and uses it to reconfigure
endpoints when the attached pageserver changes.
Some tests that stop an endpoint while the storage controller is running
could occasionally fail on log errors from the controller trying to use
its special test-mode calls into neon local Endpoint.
Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8117/9592392425/index.html#/testresult/9d2bb8623d0d53f8
## Summary of changes
- Give NotifyError an explicit NeonLocal variant, to avoid munging these
into generic 500s (I don't want to ignore 500s in general)
- Allow-list errors related to the local notification hook.
The expectation is that tests using endpoints/workloads should be
independently checking that those endpoints work: if neon_local
generates an error inside the storage controller, that's ignorable.
## Problem
There's no way to cancel drain and fill operations.
## Summary of changes
Implement HTTP endpoints to allow cancelling of background operations.
When the operationis cancelled successfully, the node scheduling policy will revert to
`Active`.
## Problem
The fill planner introduced in
https://github.com/neondatabase/neon/pull/8014 selects
tenant shards to promote strictly based on attached shard count load
(tenant shards on
nodes with the most attached shard counts are considered first). This
approach runs the
risk of migrating too many shards belonging to the same tenant on the
same primary node.
This is bad for availability and causes extra reconciles via the storage
controller's background
optimisations.
Also see
https://github.com/neondatabase/neon/pull/8014#discussion_r1642456241.
## Summary of changes
Refine the fill plan to avoid promoting too many shards belonging to the
same tenant on the same node.
We allow for `max(1, shard_count / node_count)` shards belonging to the
same tenant to be promoted.
## Problem
When creating a new shard the storage controller schedules via
Scheduler::schedule_shard. This does not take into account the number of
attached shards. What it does take into account is the node affinity:
when a shard is scheduled, all its nodes (primaries and secondaries) get
their affinity incremented.
For two node clusters and shards with one secondary we have a
pathological case where all primaries are scheduled on the same node.
Now that we track the count of attached shards per node, this is trivial
to fix. Still, the "proper" fix is to use the pageserver's utilization
score.
Closes https://github.com/neondatabase/neon/issues/8041
## Summary of changes
Use attached shard count when deciding which node to schedule a fresh
shard on.
Before this PR, storage controller and broker would run in the
PWD of neon_local, i.e., most likely the checkout of neon.git.
With this PR, the shared infrastructure for background processes
sets the PWD.
Benefits:
* easy listing of processes in a repo dir using `lsof`, see added
comment in the code
* coredumps go in the right directory (next to the process)
* generally matching common expectations, I think
Changes:
* set the working directory in `background_process` module
* drive-by: fix reliance of storage_controller on NEON_REPO_DIR being
set by neon_local for the local compute hook to work correctly
## Problem
We have this set of test utilities which duplicate a tenant by copying
everything that's in remote storage and then attaching a tenant to the
pageserver and storage controller. When the "copied tenants" are created
on the storage controller, they start off from generation number 0. This
means that they can't see anything past that generation.
This issues has existed ever since generation numbers have been
introduced, but we've largely been lucky
for the generation to stay stable during the template tenant creation.
## Summary of Changes
Extend the storage controller debug attach hook to accept a generation
override. Use that in the tenant duplication logic to set the generation
number to something greater than the naturally reached generation. This
allows the tenants to see all layer files.
## Problem
Pageserver restarts cause read availablity downtime for tenants. See
`Motivation` section in the
[RFC](https://github.com/neondatabase/neon/pull/7704).
## Summary of changes
* Introduce a new `NodeSchedulingPolicy`: `PauseForRestart`
* Implement the first take of drain and fill algorithms
* Add a node status endpoint which can be polled to figure out when an
operation is done
The implementation follows the RFC, so it might be useful to peek at it
as you're reviewing.
Since the PR is rather chunky, I've made sure all commits build (with
warnings), so you can
review by commit if you prefer that.
RFC: https://github.com/neondatabase/neon/pull/7704
Related https://github.com/neondatabase/neon/issues/7387
## Problem
A period of unavailability for all pageservers in a cluster produced the
following fallout in staging:
all tenants became detached and required manual operation to re-attach.
Manually restarting
the storage controller re-attached all tenants due to a consistency bug.
Turns out there are two related bugs which caused the issue:
1. Pageserver re-attach can be processed before the first heartbeat.
Hence, when handling
the availability delta produced by the heartbeater,
`Node::get_availability_transition` claims
that there's no need to reconfigure the node.
2. We would still attempt to reschedule tenant shards when handling
offline transitions even
if the entire cluster is down. This puts tenant shards into a state
where the reconciler believes
they have to be detached (no pageserver shows up in their intent state).
This is doubly wrong
because we don't mark the tenant shards as detached in the database,
thus causing memory vs
database consistency issues. Luckily, this bug allowed all tenant shards
to re-attach after restart.
## Summary of changes
* For (1), abuse the fact that re-attach requests do not contain an
utilisation score and use that
to differentiate from a node that replied to heartbeats.
* For (2), introduce a special case that skips any rescheduling if the
entire cluster is unavailable.
* Update the storage controller heartbeat test with an extra scenario
where the entire cluster goes
for lunch.
Fixes https://github.com/neondatabase/neon/issues/8044
## Problem
This test could fail with a timeout waiting for tenant deletions.
Tenant deletions could get tripped up on nodes transitioning from
offline to online at the moment of the deletion. In a previous
reconciliation, the reconciler would skip detaching a particular
location because the node was offline, but then when we do the delete
the node is marked as online and can be picked as the node to use for
issuing a deletion request. This hits the "Unexpectedly still attached
path", which would still work if the caller kept calling DELETE, but if
a caller does a Delete,get,get,get poll, then it doesn't work because
the GET calls fail after we've marked the tenant as detached.
## Summary of changes
Fix the undesirable storage controller behavior highlighted by this test
failure:
- Change tenant deletion flow to _always_ wait for reconciliation to
succeed: it was unsound to proceed and return 202 if something was still
attached, because after the 202 callers can no longer GET the tenant.
Stabilize the test:
- Add a reconcile_until_idle to the test, so that it will not have
reconciliations running in the background while we mark a node online.
This test is not meant to be a chaos test: we should test that kind of
complexity elsewhere.
- This reconcile_until_idle also fixes another failure mode where the
test might see a None for a tenant location because a reconcile was
mutating it
(https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7288/9500177581/index.html#suites/8fc5d1648d2225380766afde7c428d81/4acece42ae00c442/)
It remains the case that a motivated tester could produce a situation
where a DELETE gives a 500, when precisely the wrong node transitions
from offline to available at the precise moment of a deletion (but the
500 is better than returning 202 and then failing all subsequent GETs).
Note that nodes don't go through the offline state during normal
restarts, so this is super rare. We should eventually fix this by making
DELETE to the pageserver implicitly detach the tenant if it's attached,
but that should wait until nobody is using the legacy-style deletes (the
ones that use 202 + polling)
## Problem
The storage controller does not track the number of shards attached to a
given pageserver. This is a requirement for various scheduling
operations (e.g. draining and filling will use this to figure out if the
cluster is balanced)
## Summary of Changes
Track the number of shards attached to each node.
Related https://github.com/neondatabase/neon/issues/7387
## Problem
As described in #7952, the controller's attempt to reconcile a tenant
before finally deleting it can get hung up waiting for the compute
notification hook to accept updates.
The fact that we try and reconcile a tenant at all during deletion is
part of a more general design issue (#5080), where deletion was
implemented as an operation on attached tenant, requiring the tenant to
be attached in order to delete it, which is not in principle necessary.
Closes: #7952
## Summary of changes
- In the pageserver deletion API, only do the traditional deletion path
if the tenant is attached. If it's secondary, then tear down the
secondary location, and then do a remote delete. If it's not attached at
all, just do the remote delete.
- In the storage controller, instead of ensuring a tenant is attached
before deletion, do a best-effort detach of the tenant, and then call
into some arbitrary pageserver to issue a deletion of remote content.
The pageserver retains its existing delete behavior when invoked on
attached locations. We can remove this later when all users of the API
are updated to either do a detach-before-delete. This will enable
removing the "weird" code paths during startup that sometimes load a
tenant and then immediately delete it, and removing the deletion markers
on tenants.
## Problem
Currently tenants are only split into multiple shards if a human being
calls the API to do it.
Issue: #7388
## Summary of changes
- Add a pageserver API for returning the top tenants by size
- Add a step to the controller's background loop where if there is no
reconciliation or optimization to be done, it looks for things to split.
- Add a test that runs pgbench on many tenants concurrently, and checks
that splitting happens as expected as tenants grow, without interrupting
the client I/O.
This PR is quite basic: there is a tasklist in
https://github.com/neondatabase/neon/issues/7388 for further work. This
PR is meant to be safe (off by default), and sufficient to enable our
staging environment to run lots of sharded tenants without a human
having to set them up.
## Problem
The storage controller generally assumes that things like updating
generation numbers are atomic: it should use a strict isolation level.
## Summary of changes
- Wrap all database operations in a SERIALIZABLE transaction.
- Retry serialization failures, as these do not indicate problems and
are normal when plenty of concurrent work is happening.
Using this isolation level for all reads is overkill, but much simpler
than reasoning about it on a per-operation basis, and does not hurt
performance.
Tested this with a modified version of storage_controller_many_tenants
test with 128k shards, to check that our performance is still fine: it
is.
## Problem
Storage controller was using a zero layer count in SecondaryProgress as
a proxy for "not initialized". However, in tenants with zero timelines
(a legitimate state), the layer count remains zero forever.
This caused https://github.com/neondatabase/neon/pull/7583 to
destabilize the storage controller scale test, which creates lots of
tenants, some of which don't get any timelines.
## Summary of changes
- Use a None mtime instead of zero layer count to determine if a
SecondaryProgress should be ignored.
- Adjust the test to use a shorter heatmap upload period to let it
proceed faster while waiting for scheduling optimizations to complete.
## Problem
The logic in Service::optimize_all would sometimes choose to migrate a
tenant to a secondary location that was only recently created, resulting
in Reconciler::live_migrate hitting its 5 minute timeout warming up the
location, and proceeding to attach a tenant to a location that doesn't
have a warm enough local set of layer files for good performance.
Closes: #7532
## Summary of changes
- Add a pageserver API for checking download progress of a secondary
location
- During `optimize_all`, connect to pageservers of candidate
optimization secondary locations, and check they are warm.
- During shard split, do heatmap uploads and start secondary downloads,
so that the new shards' secondary locations start downloading ASAP,
rather than waiting minutes for background downloads to kick in.
I have intentionally not implemented this by continuously reading the
status of locations, to avoid dealing with the scale challenge of
efficiently polling & updating 10k-100k locations status. If we
implement that in the future, then this code can be simplified to act
based on latest state of a location rather than fetching it inline
during optimize_all.
## Problem
We were matching on `/tenant/:tenant_id` and
`/tenant/:tenant_id/timeline*`, but not non-timeline tenant sub-paths.
There aren't many: this was only noticeable when using the
synthetic_size endpoint by hand.
## Summary of changes
- Change the wildcard from `/tenant/:tenant_id/timeline*` to
`/tenant/:tenant_id/*`
- Add test lines that exercise this
## Problem
Issues around operation and tenant locks would have been hard to debug
since there was little observability around them.
## Summary of changes
- As suggested in the issue, a wrapper was added around
`OwnedRwLockWriteGuard` called `IdentifierLock` that removes the
operation currently holding the exclusive lock when it's dropped.
- The value in `IdLockMap` was extended to hold a pair of locks and
operations that can be accessed and locked independently.
- When requesting an exclusive lock besides returning the lock on that
resource, an operation is changed if the lock is acquired.
Closes https://github.com/neondatabase/neon/issues/7108
This allows passing a humantime string in the CLI to configure the
initial wait for the database.
It defaults to the previously hard-coded value of 5 seconds.
## Problem
Storage controller was observed to have unexpectedly large memory
consumption when loaded with many thousands of shards.
This was recently fixed:
- https://github.com/neondatabase/neon/pull/7493
...but we need a general test that the controller is well behaved with
thousands of shards.
Closes: https://github.com/neondatabase/neon/issues/7460
Closes: https://github.com/neondatabase/neon/issues/7463
## Summary of changes
- Add test test_storage_controller_many_tenants to exercise the system's
behaviour with a more substantial workload. This test measures memory
consumption and reproduces #7460 before the other changes in this PR.
- Tweak reconcile_all's return value to make it nonzero if it spawns no
reconcilers, but _would_ have spawned some reconcilers if they weren't
blocked by the reconcile concurrency limit. This makes the test's
reconcile_until_idle behave as expected (i.e. not complete until the
system is nice and calm).
- Fix an issue where tenant migrations would leave a spurious secondary
location when migrated to some location that was not already their
secondary (this was an existing low-impact bug that tripped up the
test's consistency checks).
On the test with 8000 shards, the resident memory per shard is about
20KiB. This is not really per-shard memory: the primary source of memory
growth is the number of concurrent network/db clients we create.
With 8000 shards, the test takes 125s to run on my workstation.
## Problem
Previously, we try to send compute notifications in startup_reconcile
before completing that function, with a time limit. Any notifications
that don't happen within the time limit result in tenants having their
`pending_compute_notification` flag set, which causes them to spawn a
Reconciler next time the background reconciler loop runs.
This causes two problems:
- Spawning a lot of reconcilers after startup caused a spike in memory
(this is addressed in https://github.com/neondatabase/neon/pull/7493)
- After https://github.com/neondatabase/neon/pull/7493, spawning lots of
reconcilers will block some other operations, e.g. a tenant creation
might fail due to lack of reconciler semaphore units while the
controller is busy running all the Reconcilers for its startup compute
notifications.
When the code was first written, ComputeHook didn't have internal
ordering logic to ensure that notifications for a shard were sent in the
right order. Since that was added in
https://github.com/neondatabase/neon/pull/7088, we can use it to avoid
waiting for notifications to complete in startup_reconcile.
Related to: https://github.com/neondatabase/neon/issues/7460
## Summary of changes
- Add a `notify_background` method to ComputeHook.
- Call this from startup_reconcile instead of doing notifications inline
- Process completions from `notify_background` in `process_results`, and
if a notification failed then set the `pending_compute_notification`
flag on the shard.
The result is that we will only spawn lots of Reconcilers if the compute
notifications _fail_, not just because they take some significant amount
of time.
Test coverage for this case is in
https://github.com/neondatabase/neon/pull/7475
## Problem
Sometimes we have test data in the form of S3 contents that we would
like to run live in a neon_local environment.
## Summary of changes
- Add a storage controller API that imports an existing tenant.
Currently this is equivalent to doing a create with a high generation
number, but in future this would be something smarter to probe S3 to
find the shards in a tenant and find generation numbers.
- Add a `neon_local` command that invokes the import API, and then
inspects timelines in the newly attached tenant to create matching
branches.
These are testability/logging improvements spun off from #7475
- Don't log warnings for shutdown errors in compute hook
- Revise logging around heartbeats and reconcile_all so that we aren't
emitting such a large volume of INFO messages under normal quite
conditions.
- Clean up the `last_error` of TenantShard to hold a ReconcileError
instead of a String, and use that properly typed error to suppress
reconciler cancel errors during reconcile_all_now. This is important for
tests that iteratively call that, as otherwise they would get 500 errors
when some reconciler in flight was cancelled (perhaps due to a state
change on the tenant shard starting a new reconciler).
## Problem
Storage controller memory can spike very high if we have many tenants
and they all try to reconcile at the same time.
Related:
- https://github.com/neondatabase/neon/issues/7463
- https://github.com/neondatabase/neon/issues/7460
Not closing those issues in this PR, because the test coverage for them
will be in https://github.com/neondatabase/neon/pull/7475
## Summary of changes
- Add a CLI arg `--reconciler-concurrency`, defaulted to 128
- Add a semaphore to Service with this many units
- In `maybe_reconcile_shard`, try to acquire semaphore unit. If we can't
get one, return a ReconcileWaiter for a future sequence number, and push
the TenantShardId onto a channel of delayed IDs.
- In `process_result`, consume from the channel of delayed IDs if there
are semaphore units available and call maybe_reconcile_shard again for
these delayed shards.
This has been tested in https://github.com/neondatabase/neon/pull/7475,
but will land that PR separately because it contains other changes &
needs the test stabilizing. This change is worth merging sooner, because
it fixes a practical issue with larger shard counts.