Not a user-facing change, but can break any existing `.neon` directories
created by neon_local, as the name of the database used by the storage
controller changes.
This PR changes all the locations apart from the path of
`control_plane/attachment_service` (waiting for an opportune moment to
do that one, because it's the most conflict-ish wrt ongoing PRs like
#6676).
## Problem
- The storage controller is the source of truth for a tenant's stripe
size, but doesn't currently have a way to propagate that to compute:
we're just using the default stripe size everywhere.
Closes: https://github.com/neondatabase/neon/issues/6903
## Summary of changes
- Include stripe size in `ComputeHookNotifyRequest`
- Include stripe size in `LocationConfigResponse`
The stripe size is optional: it will only be advertised for
multi-sharded tenants. This enables the controller to defer the choice
of stripe size until we split a tenant for the first time.
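
As a rough illustration of the optional stripe size, here is a minimal Python sketch. The class and field names (`NotifyRequest`, `ShardNotification`, `stripe_size`) are assumptions loosely modelled on `ComputeHookNotifyRequest`, not the actual Rust types or wire format.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ShardNotification:
    # Hypothetical per-shard entry: which pageserver node hosts this shard.
    shard_number: int
    node_id: int


@dataclass
class NotifyRequest:
    # Field names are assumptions, not the real ComputeHookNotifyRequest layout.
    tenant_id: str
    shards: List[ShardNotification]
    stripe_size: Optional[int] = None


def build_notify_request(
    tenant_id: str, shards: List[ShardNotification], stripe_size: int
) -> NotifyRequest:
    """Advertise the stripe size only when the tenant has more than one shard."""
    advertised = stripe_size if len(shards) > 1 else None
    return NotifyRequest(tenant_id, list(shards), advertised)


def to_json_body(req: NotifyRequest) -> dict:
    """Serialize the request, dropping the optional field when it is unset."""
    body = asdict(req)
    if body["stripe_size"] is None:
        del body["stripe_size"]
    return body
```

With this shape, a receiver that sees no stripe size can treat the tenant as unsharded, and the controller stays free to pick a stripe size at the first split.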
During onboarding, the control plane may attempt ad-hoc creation of a
secondary location to facilitate live migration. This gives us two
problems to solve:
- Accept 'Secondary' mode in /location_config and use it to put the
tenant into secondary mode on some physical pageserver, then pass
through /tenant/xyz/secondary/download requests
- Create tenants with no generation initially, since the initial
`Secondary` mode call will not provide us a generation.
This PR also fixes modification of a tenant's TenantConf during
/location_config, which was previously ignored, and refines the flow for
config modification:
- avoid bumping generations when the only reason we're reconciling an
attached location is a config change
- increment TenantState.sequence when spawning a reconciler: usually
schedule() does this, but when we do config changes that doesn't happen,
so without this change waiters would think reconciliation was done
immediately. `sequence` is a bit of a murky thing right now, as it's
dual-purposed for tracking waiters, and for checking if an existing
reconciliation is already making updates to our current sequence. I'll
follow up at some point to clarify its purpose.
- test config modification at the end of onboarding test
Extracted from https://github.com/neondatabase/neon/pull/6953
Part of https://github.com/neondatabase/neon/issues/5899
Core Change
-----------
In #6953, we need the ability to scan the log _after_ a specific line
and ignore anything before that line.
This PR changes `log_contains` to return a tuple of `(matching line,
cursor)`.
Hand that cursor to a subsequent `log_contains` call to search the log
for the next occurrence of the pattern.
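
The cursor idea can be sketched as follows. This is a simplified stand-in, not the test fixture's real implementation: it searches an in-memory list of lines rather than a log file, and `LogCursor` and the `cursor` parameter name are assumptions.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class LogCursor(NamedTuple):
    # Index of the first line that a follow-up search should look at.
    next_line: int


def log_contains(
    lines: List[str], pattern: str, cursor: Optional[LogCursor] = None
) -> Optional[Tuple[str, LogCursor]]:
    """Return (matching line, cursor) for the first match at or after the cursor."""
    start = cursor.next_line if cursor is not None else 0
    regex = re.compile(pattern)
    for idx in range(start, len(lines)):
        if regex.search(lines[idx]):
            return lines[idx], LogCursor(idx + 1)
    return None


# Usage: find the first occurrence, then continue searching after it.
log = ["starting", "upload failed: attempt 1", "noise", "upload failed: attempt 2"]
first = log_contains(log, "upload failed")
assert first is not None
first_line, cursor = first
second = log_contains(log, "upload failed", cursor)
```

Because the cursor points just past the previous match, repeated calls walk through successive occurrences and never revisit earlier lines.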
Other Changes
-------------
- Inspect all the callsites of `log_contains` to handle the new tuple
return type.
- Above inspection unveiled many callers aren't using `assert
log_contains(...) is not None` but some weaker version of the code that
breaks if `log_contains` ever returns a not-None but falsy value. Fix
that.
- Above changes unveiled that `test_remote_storage_upload_queue_retries`
was using `wait_until` incorrectly; after fixing the usage, I had to
raise the `wait_until` timeout. So, maybe this will fix its flakiness.
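
To see why the weaker assertions are fragile, consider a helper that returns the matching line itself. `find_line` below is a hypothetical stand-in, not the real `log_contains`; it only shows how a match can be not-None yet falsy.

```python
import re
from typing import List, Optional


def find_line(lines: List[str], pattern: str) -> Optional[str]:
    """Return the first line matching the pattern, or None if there is none."""
    for line in lines:
        if re.search(pattern, line):
            return line
    return None


# An empty line matches "^$", so the result is "" (not None, but falsy).
match = find_line(["", "hello"], r"^$")
weak_check_passes = bool(match)  # what a bare `assert find_line(...)` tests
strict_check_passes = match is not None  # what `is not None` tests
```

Here the bare truthiness check would fail even though a match was found, while the `is not None` form reports it correctly.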
## Problem
Starting up the pageserver before the storage controller is ready can
lead to a round of reconciliation, which leads to the previous tenant
being shut down.
This disturbs some tests.
## Summary of changes
Wait for the storage controller to become ready on neon env start-up.
Closes https://github.com/neondatabase/neon/issues/6724
## Problem
Attachment service does not do auth based on JWT scopes.
## Summary of changes
Do JWT based permission checking for requests coming into the attachment
service.
Requests into the attachment service must use different tokens based on
the endpoint:
* `/control` and `/debug` require `admin` scope
* `/upcall` requires `generations_api` scope
* `/v1/...` requires `pageserverapi` scope
Requests into the pageserver from the attachment service must use
`pageserverapi` scope.
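
The prefix-to-scope mapping above could be expressed roughly like this. It is a sketch of the scope check only: it assumes the JWT signature has already been validated and its scope extracted, that a token must carry exactly the required scope, and that unknown paths are rejected. None of those details are confirmed by this PR description.

```python
from typing import Optional

# Path prefix -> scope required for requests under that prefix.
REQUIRED_SCOPE = {
    "/control": "admin",
    "/debug": "admin",
    "/upcall": "generations_api",
    "/v1": "pageserverapi",
}


def required_scope(path: str) -> Optional[str]:
    """Return the scope needed for a request path, or None for unknown paths."""
    for prefix, scope in REQUIRED_SCOPE.items():
        # Match whole path segments so that e.g. "/v10/..." is not treated as "/v1".
        if path == prefix or path.startswith(prefix + "/"):
            return scope
    return None


def is_authorized(path: str, token_scope: str) -> bool:
    """Assumption: exact scope match is required, and unknown paths are denied."""
    needed = required_scope(path)
    return needed is not None and token_scope == needed
```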
Stacks on https://github.com/neondatabase/neon/pull/6823
- Pending a heartbeating mechanism (#6844), use /re-attach calls as a
cue to mark an offline node as active, so that a node which is
unavailable during controller startup doesn't require manual
intervention if it later starts/restarts.
- Tweak scheduling logic so that when we schedule the attached location
for a tenant, we prefer to select from secondary locations rather than
picking a fresh one.
This is an interim state until we implement #6844 and full chaos testing
for handling failures.
- Add some context to logs
- Add tests for pageserver restarts when managed by storage controller
- Make /location_config tolerate compute hook failures on shard
creations, not just modifications.
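
The scheduling preference described above (pick the attached location from existing secondary locations before choosing a fresh node) might look like the following sketch. The function name, its arguments, and the least-loaded fallback are assumptions for illustration, not the scheduler's actual policy.

```python
from typing import Dict, List, Optional, Set


def choose_attached_node(
    secondaries: List[int],
    node_shard_counts: Dict[int, int],
    available: Set[int],
) -> Optional[int]:
    """Prefer an available secondary; otherwise fall back to a fresh node."""
    for node_id in secondaries:
        if node_id in available:
            return node_id
    # Fallback (assumed policy): the available node with the fewest shards,
    # breaking ties by node id so the choice is deterministic.
    candidates = [n for n in node_shard_counts if n in available]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (node_shard_counts[n], n))
```

Promoting a secondary is attractive because that node is already tracking the tenant as a secondary location, whereas a fresh node starts with nothing for it.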
The sharding service didn't have support for S3 disaster recovery.
This PR adds a new endpoint to the attachment service, which is slightly
different from the endpoint on the pageserver, in that it takes the
shard count history of the tenant as json parameters: we need to do
time travel recovery for both the shard count at the target time and the
shard count at the current moment in time, as well as the past shard
counts that either of them still references.
Fixes #6604, part of https://github.com/neondatabase/cloud/issues/8233
---------
Co-authored-by: John Spray <john@neon.tech>
This PR stacks on https://github.com/neondatabase/neon/pull/6814
Observability:
- Because we only persist a subset of our state, and our external API is
pretty high level, it can be hard to get at the detail of what's going
on internally (e.g. the IntentState of a shard).
- Add debug endpoints for getting a full dump of all TenantState and
SchedulerNode objects
- Enrich the /control/v1/node listing endpoint to include full in-memory
detail of `Node` rather than just the `NodePersistence` subset
Consistency checks:
- The storage controller maintains separate in-memory and on-disk
states, by design. To catch subtle bugs, it is useful to occasionally
cross-check these.
- The Scheduler maintains reference counts for shard->node
relationships, which could drift if there was a bug in IntentState:
exhaustively cross-check them in tests.
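
A reference-count cross-check of this kind can be sketched as below. `ShardIntent` and the plain dict of per-node counts are simplified stand-ins for `IntentState` and the Scheduler's bookkeeping, so treat the names and shapes as assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ShardIntent:
    # Simplified stand-in for IntentState: where a shard should live.
    attached: Optional[int] = None
    secondary: List[int] = field(default_factory=list)


def expected_ref_counts(intents: Dict[str, ShardIntent]) -> Counter:
    """Recompute per-node reference counts from scratch out of the intents."""
    counts: Counter = Counter()
    for intent in intents.values():
        if intent.attached is not None:
            counts[intent.attached] += 1
        for node_id in intent.secondary:
            counts[node_id] += 1
    return counts


def check_scheduler_consistency(
    scheduler_counts: Dict[int, int], intents: Dict[str, ShardIntent]
) -> List[str]:
    """Return one message per node whose tracked count has drifted."""
    expected = expected_ref_counts(intents)
    errors = []
    for node_id in sorted(set(scheduler_counts) | set(expected)):
        have = scheduler_counts.get(node_id, 0)
        want = expected.get(node_id, 0)
        if have != want:
            errors.append(f"node {node_id}: scheduler has {have}, intents imply {want}")
    return errors
```

A test can call such a check after every interesting operation and assert that the returned list is empty.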
## Problem
When debugging/supporting this service, we sometimes need it to just
forget about a tenant or node, e.g. because of an issue cleanly tearing
them down. For example, if I create a tenant with a PlacementPolicy that
can't be scheduled on the nodes we have, we would never be able to
schedule it for a DELETE to work.
## Summary of changes
- Add APIs for dropping nodes and tenants that do no teardown other than
removing the entity from the DB and removing any references to it.
- Automatically set a node's availability to Active if it is responsive
in startup_reconcile
- Impose a 5s timeout on the HTTP request to list location conf, so that an
unresponsive node can't hang it for minutes
- Do several retries if the request fails with a retryable error, to be
tolerant of concurrent pageserver & storage controller restarts
- Add a readiness hook for use with k8s so that we can tell when the
startup reconciliation is done and the service is fully ready to do work.
- Add /metrics to the list of un-authenticated endpoints (this is
unrelated but we're touching the line in this PR already, and it fixes
auth error spam in deployed container.)
- A test for the above.
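
The bounded-retry behaviour described in the list above can be illustrated with a small helper. `RetryableError`, `call_with_retries`, and the backoff values are hypothetical. In the real service the request itself would also carry the 5s timeout, which this sketch leaves to the callable passed in.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. connection refused)."""


def call_with_retries(
    op: Callable[[], T],
    attempts: int = 3,
    backoff_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op up to `attempts` times, retrying only on RetryableError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_err: Optional[RetryableError] = None
    for attempt in range(attempts):
        try:
            return op()
        except RetryableError as err:
            last_err = err
            # Back off exponentially between attempts, but not after the last one.
            if attempt + 1 < attempts:
                sleep(backoff_s * (2**attempt))
    assert last_err is not None
    raise last_err
```

Keeping the attempt count small and finite means a node that is restarting at the same time as the controller gets a second chance, while a node that is truly down cannot stall startup indefinitely.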
Closes: #6670
Fix several test flakes:
- test_sharding_service_smoke had log failures on "Dropped LSN updates"
- test_emergency_mode had log failures on a deletion queue shutdown
check, where the check was incorrect because it was expecting channel
receiver to stay alive after cancellation token was fired.
- test_secondary_mode_eviction had racing heatmap uploads because the
test was using a live migration hook to set up locations, where that
migration was itself uploading heatmaps and generally making the
situation more complex than it needed to be.
These are the failure modes that I saw when spot checking the last few
failures of each test.
This will mostly/completely address #6511, but I'll leave that ticket
open for a couple of days and then check whether either of the tests
named in that ticket is still flaky.
Related #6511
Cleanups from https://github.com/neondatabase/neon/pull/6394
- There was a rogue `*` breaking the `GET /tenant/:tenant_id`, which
passes through to shard zero
- There was a duplicate migrate endpoint
- There are un-prefixed API endpoints that were only needed for compat
tests and can now be removed.
## Problem
When we change which physical pageservers a tenant is attached to, we
must update the control plane so that it can update computes. This will
be done via an HTTP hook, as described in
https://www.notion.so/neondatabase/Sharding-Service-Control-Plane-interface-6de56dd310a043bfa5c2f5564fa98365#1fe185a35d6d41f0a54279ac1a41bc94
## Summary of changes
- Optional CLI args `--control-plane-jwt-token` and `--compute-hook-url`
are added. If these are set, then we will use this HTTP endpoint,
instead of trying to use neon_local LocalEnv to update compute
configuration.
- Implement an HTTP-driven version of ComputeHook that calls into the
configured URL
- Notify for all tenants on startup, to ensure that we don't miss
notifications if we crash partway through a change, and carry a
`pending_compute_notification` flag at runtime to allow notifications to
fail without risking never sending the update.
- Add a test for all this
One might wonder: why not do a "forever" retry for compute hook
notifications, rather than carrying a flag on the shard to call
reconcile() again later? The reason is that we will later limit
concurrency of reconciles, when dealing with larger numbers of shards,
and if reconcile is stuck waiting for the control plane to accept a
notification request, it could jam up the whole system and prevent us
making other changes. Anyway: from the perspective of the outside world,
we _do_ retry forever, but we don't retry forever within a given
Reconciler lifetime.
The `pending_compute_notification` logic is predicated on later adding a
background task that just calls `Service::reconcile_all` on a schedule
to make sure that anything+everything that can fail a
Reconciler::reconcile call will eventually be retried.
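
The flag-based approach can be sketched as follows. The `Shard` class, `NotifyError`, and the function signatures are illustrative assumptions, and the periodic background task is represented only by repeated calls to `reconcile_all`.

```python
from dataclasses import dataclass
from typing import Callable, List


class NotifyError(Exception):
    """Hypothetical error raised when the control plane rejects a notification."""


@dataclass
class Shard:
    tenant_shard_id: str
    pending_compute_notification: bool = True


def reconcile(shard: Shard, notify: Callable[[str], None]) -> bool:
    """Try the notification once; leave the flag set if it fails."""
    if not shard.pending_compute_notification:
        return True
    try:
        notify(shard.tenant_shard_id)
    except NotifyError:
        # Do not retry here: return promptly so this reconcile frees its slot.
        return False
    shard.pending_compute_notification = False
    return True


def reconcile_all(shards: List[Shard], notify: Callable[[str], None]) -> int:
    """One scheduled pass over all shards; returns how many are still pending."""
    for shard in shards:
        reconcile(shard, notify)
    return sum(1 for s in shards if s.pending_compute_notification)
```

Each pass makes a bounded amount of work per shard, and the flag guarantees that a failed notification is attempted again on the next pass.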
Depends on: https://github.com/neondatabase/neon/pull/6468
## Problem
The sharding service will be used as a "virtual pageserver" by the
control plane -- so it needs the set of pageserver APIs that the control
plane uses, and to present them under identical URLs, including prefix
(/v1).
## Summary of changes
- Add missing APIs:
  - Tenant deletion
  - Timeline deletion
  - Node list (used in test now, later in tools)
  - `/location_config` API (for migrating tenants into the sharding
    service)
- Rework attachment service URLs:
  - `/v1` prefix is used for pageserver-compatible APIs
  - `/upcall/v1` prefix is used for APIs that are called by the pageserver
    (re-attach and validate)
  - `/debug/v1` prefix is used for endpoints that are for testing
  - `/control/v1` prefix is used for new sharding service APIs that do not
    mimic a pageserver API, such as registering and configuring nodes.
- Add test_sharding_service. The sharding service already had some
collateral coverage from its use in general tests, but this is the first
dedicated testing for it.