## Problem
Pageservers now expose a gRPC API on a separate address and port. This
must be registered with the storage controller such that it can be
plumbed through to the compute via cplane.
Touches #11926.
## Summary of changes
This patch registers the gRPC address and port with the storage
controller:
* Add gRPC address to `nodes` database table and `NodePersistence`, with
a Diesel migration.
* Add gRPC address in `NodeMetadata`, `NodeRegisterRequest`,
`NodeDescribeResponse`, and `TenantLocateResponseShard`.
* Add gRPC address flags to `storcon_cli node-register`.
These changes are backwards-compatible, since all structs will ignore
unknown fields during deserialization.
## Problem
Imports don't support schema evolution nicely. If we want to change the
stuff we keep in storcon,
we'd have to carry the old cruft around.
## Summary of changes
Version import progress. Note that the import progress version
determines the version of the import
job split and execution. This means that we can also use it as a
mechanism for deploying new import
implementations in the future.
## Problem
Import up-calls did not enforce the usage of the latest generation. The
import might have finished in one previous generation, but not in the
latest one. Hence, the controller might try to activate a timeline
before it is ready. In theory, that would be fine, but it's tricky to
reason about.
## Summary of Changes
Pageserver provides the current generation in the upcall to the storage
controller and the later validates the generation. If the generation is
stale, we return an error which stops progress of the import job. Note
that the import job will retry the upcall until the stale location is
detached.
I'll add some proper tests for this as part of the [checkpointing
PR](https://github.com/neondatabase/neon/pull/11862).
Closes https://github.com/neondatabase/neon/issues/11884
## Problem
Lifetime of imported timelines (and implicitly the import background
task) has some shortcomings:
1. Timeline activation upon import completion is tricky. Previously, a
timeline that finished importing
after a tenant detach would not get activated and there's concerns about
the safety of activating
concurrently with shut-down.
2. Import jobs can prevent tenant shut down since they hold the tenant
gate
## Summary of Changes
Track the import tasks in memory and abort them explicitly on tenant
shutdown.
Integrate more closely with the storage controller:
1. When an import task has finished all of its jobs, it notifies the
storage controller, but **does not** mark the import as done in the
index_part. When all shards have finished importing, the storage
controller will call the `/activate_post_import` idempotent endpoint for
all of them. The handler, marks the import complete in index part,
resets the tenant if required and checks if the timeline is active yet.
2. Not directly related, but the import job now gets the starting state
from the storage controller instead of the import bucket. This paves the
way for progress checkpointing.
Related: https://github.com/neondatabase/neon/issues/11568
## Problem
We had retained the ability to run in a generation-less mode to support
test_generations_upgrade, which was replaced with a cleaner backward
compat test in https://github.com/neondatabase/neon/pull/10701
## Summary of changes
- Remove all the special cases for "if no generation" or "if no control
plane api"
- Make control_plane_api config mandatory
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
Pageservers notify control plane directly when a shard import has
completed.
Control plane has to download the status of each shard from S3 and
figure out if everything is truly done,
before proceeding with branch activation.
Issues with this approach are:
* We can't control shard split behaviour on the storage controller side.
It's unsafe to split
during import.
* Control plane needs to know about shards and implement logic to check
all timelines are indeed ready.
## Summary of changes
In short, storage controller coordinates imports, and, only when
everything is done, notifies control plane.
Big rocks:
1. Store timeline imports in the storage controller database. Each
import stores the status of its shards in the database.
We hook into the timeline creation call as our entry point for this.
2. Pageservers get a new upcall endpoint to notify the storage
controller of shard import updates.
3. Storage controller handles these updates by updating persisted state.
If an update finalizes the import,
then poll pageservers until timeline activation, and, then, notify the
control plane that the import is complete.
Cplane side change with new endpoint is in
https://github.com/neondatabase/cloud/pull/26166
Closes https://github.com/neondatabase/neon/issues/11566
## Problem
Pageservers and safakeepers do not pass CA certificates to broker
client, so the client do not trust locally issued certificates.
- Part of https://github.com/neondatabase/cloud/issues/27492
## Summary of changes
- Change `ssl_ca_certs` type in PS/SK's config to `Pem` which may be
converted to both `reqwest` and `tonic` certificates.
- Pass CA certificates to storage broker client in PS and SK
## Problem
- We need to support multiple SSL CA certificates for graceful root CA
certificate rotation.
- Closes: https://github.com/neondatabase/cloud/issues/25971
## Summary of changes
- Parses `ssl_ca_file` as a pem bundle, which may contain multiple
certificates. Single pem cert is a valid pem bundle, so the change is
backward compatible.
## Problem
Pageservers use unencrypted HTTP requests for storage controller API.
- Closes: https://github.com/neondatabase/cloud/issues/25524
## Summary of changes
- Replace hyper0::server::Server with http_utils::server::Server in
storage controller.
- Add HTTPS handler for storage controller API.
- Support `ssl_ca_file` in pageserver.
## Problem
The pageserver upcall api was designed to work with control plane or the
storage controller.
We have completed the transition period and now the upcall api only
targets the storage controller.
## Summary of changes
Rename types accordingly and tweak some comments.
Before this PR, re-attach and validate would log the same warning
```
calling control plane generation validation API failed
```
on retry errors.
This can be confusing.
This PR makes the message generically valid for any upcall and adds
additional tracing spans to capture context.
Along the way, clean up some copy-pasta variable naming.
refs
-
https://github.com/neondatabase/neon/issues/10381#issuecomment-2684755827
---------
Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>
Updates storage components to edition 2024. We like to stay on the
latest edition if possible. There is no functional changes, however some
code changes had to be done to accommodate the edition's breaking
changes.
The PR has two commits:
* the first commit updates storage crates to edition 2024 and appeases
`cargo clippy` by changing code. i have accidentially ran the formatter
on some files that had other edits.
* the second commit performs a `cargo fmt`
I would recommend a closer review of the first commit and a less close
review of the second one (as it just runs `cargo fmt`).
part of https://github.com/neondatabase/neon/issues/10918
## Problem
Storage controller uses unsecure http for pageserver API.
Closes: https://github.com/neondatabase/cloud/issues/23734
Closes: https://github.com/neondatabase/cloud/issues/24091
## Summary of changes
- Add an optional `listen_https_port` field to storage controller's Node
state and its API (RegisterNode/ListNodes/etc).
- Allow updating `listen_https_port` on node registration to gradually
add https port for all nodes.
- Add `use_https_pageserver_api` CLI option to storage controller to
enable https.
- Pageserver doesn't support https for now and always reports
`https_port=None`. This will be addressed in follow-up PR.
## Problem
We saw a peculiar case where a pageserver apparently got a 0-tenant
response to `/re-attach` but we couldn't see the request landing on a
storage controller. It was hard to confirm retrospectively that the
pageserver was configured properly at the moment it sent the request.
## Summary of changes
- Log the URL to which we are sending the request
- Log the NodeId and metadata that we sent
## Problem
- In https://github.com/neondatabase/neon/pull/8784, the validate
controller API is modified to check generations directly in the
database. It batches tenants into separate queries to avoid generating a
huge statement, but
- While updating this, I realized that "control_plane_client" is a kind
of confusing name for the client code now that it primarily talks to the
storage controller (the case of talking to the control plane will go
away in a few months).
## Summary of changes
- Big rename to "ControllerUpcallClient" -- this reflects the storage
controller's api naming, where the paths used by the pageserver are in
`/upcall/`
- When sending validate requests, break them up into chunks so that we
avoid possible edge cases of generating any HTTP requests that require
database I/O across many thousands of tenants.
This PR mixes a functional change with a refactor, but the commits are
cleanly separated -- only the last commit is a functional change.
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>