# Storage Controller
## Concepts
The storage controller sits between administrative API clients and pageservers, and handles the details of mapping tenants to pageserver tenant shards. For example, creating a tenant is one API call to the storage controller, which is mapped into many API calls to many pageservers (for multiple shards, and for secondary locations).
It implements a pageserver-compatible API that may be used for CRUD operations on tenants and timelines, translating these requests into appropriate operations on the shards within a tenant, which may be on many different pageservers. Using this API, the storage controller may be used in the same way as the pageserver's administrative HTTP API, hiding the underlying details of how data is spread across multiple nodes.
The storage controller also manages generations, high availability (via secondary locations) and live migrations for tenants under its management. This is done with a reconciliation loop pattern, where tenants have an “intent” state and a “reconcile” task that tries to make the outside world match the intent.
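As a rough sketch of this intent/reconcile pattern (illustrative Rust only; the field names and types below are hypothetical, not the controller's actual data model):

```rust
// Illustrative sketch of the intent/reconcile pattern described above.
// These types are hypothetical; see the storage controller source for
// the real data model.
struct IntentState {
    attached_pageserver: Option<u64>, // where the shard should be attached
}

struct ObservedState {
    attached_pageserver: Option<u64>, // where we last saw it attached
}

// A reconcile task compares the intent against the observed world and
// issues pageserver API calls until they match, then goes idle until
// the intent changes again.
fn needs_reconcile(intent: &IntentState, observed: &ObservedState) -> bool {
    intent.attached_pageserver != observed.attached_pageserver
}
```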
## APIs
The storage controller’s HTTP server implements four logically separate APIs:
- `/v1/...` is the pageserver-compatible API. This has to be at the path root because that's where clients expect to find it on a pageserver.
- `/control/v1/...` is the storage controller's own API, which enables operations such as registering and managing pageservers, or executing shard splits.
- `/debug/v1/...` contains endpoints which are either exclusively used in tests, or are for use by engineers when supporting a deployed system.
- `/upcall/v1/...` contains endpoints that are called by pageservers. This includes the `/re-attach` and `/validate` APIs used by pageservers to ensure data safety with generation numbers.
The API is authenticated with a JWT token; tokens must have scope `pageserverapi` (i.e. the same scope as pageservers' APIs).

See `http.rs` in the source for where the HTTP APIs are implemented.
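As a quick illustration, a minimal sketch of calling the controller with a JWT, assuming the `reqwest` and `tokio` crates; the endpoint path is a hypothetical placeholder (see `http.rs` for the real routes):

```rust
// Minimal sketch, assuming the `reqwest` and `tokio` crates. The path
// below is a hypothetical placeholder; see http.rs for the real routes.
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let token = std::env::var("STORCON_JWT").expect("set STORCON_JWT");
    let resp = Client::new()
        .get("http://127.0.0.1:1234/control/v1/example") // hypothetical endpoint
        .bearer_auth(&token) // token must carry the `pageserverapi` scope
        .send()
        .await?;
    println!("status: {}", resp.status());
    Ok(())
}
```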
## Database
The storage controller uses a postgres database to persist a subset of its state. Note that the storage controller does not keep all of its state in the database: this is a design choice that enables most operations to be done efficiently in memory, rather than having to read from the database. A useful metaphor is that we persist objects like tenants and nodes, but not the relationships between them: the attachment state of a tenant's shards to nodes is kept in memory and rebuilt on startup.
The file `persistence.rs` contains all the code for accessing the database, and has a large doc comment that goes into more detail about exactly what we persist and why.
The `diesel` crate is used for defining models & migrations. Running a local cluster with `cargo neon` automatically starts a vanilla postgres process to host the storage controller's database.
### Diesel tip: migrations
If you need to modify the database schema, here’s how to create a migration:
- Install the diesel CLI with `cargo install diesel_cli`
- Use `diesel migration generate <name>` to create a new migration
- Populate the SQL files in the `migrations/` subdirectory
- Use `DATABASE_URL=... diesel migration run` to apply the migration you just wrote: this will update the `schema.rs` file automatically.
  - This requires a running database: the easiest way to get one is to run `cargo neon init ; cargo neon start`, which will leave a database available at `postgresql://localhost:1235/storage_controller`
- Commit the migration files and the changes to `schema.rs`
- If you need to iterate, you can rewind migrations with `diesel migration revert -a` and then run `diesel migration run` again.
- The migrations are built into the storage controller binary and run automatically at startup after it is deployed, so once you've committed a migration no further steps are needed.
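For orientation, `diesel migration run` regenerates `schema.rs` with `table!` macros like the following; the table and columns here are hypothetical, not the controller's real schema:

```rust
// Illustrative only: after `diesel migration run`, diesel rewrites
// schema.rs with table! macros like this one. The table and column
// names here are hypothetical, not the storage controller's schema.
diesel::table! {
    nodes (node_id) {
        node_id -> Int8,
        listen_http_addr -> Varchar,
        listen_http_port -> Int4,
    }
}
```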
## storcon_cli
The `storcon_cli` tool enables interactive management of the storage controller. This is usually only necessary for debugging, but may also be used to manage nodes (e.g. marking a node as offline). `storcon_cli --help` includes details on commands.
## Deploying
This section is aimed at engineers deploying the storage controller outside of Neon's cloud platform, as part of a self-hosted system.
General note: since the default `neon_local` environment includes a storage controller, it is a useful reference when figuring out deployment.
### Database
It is essential that the database used by the storage controller is durable (do not store it on ephemeral local disk). This database contains pageserver generation numbers, which are essential to data safety on the pageserver.
The resource requirements for the database are very low: a single CPU core and 1GiB of memory should work well for most deployments. The physical size of the database is typically under a gigabyte.
Set the URL to the database using the `--database-url` CLI option.
There is no need to run migrations manually: the storage controller automatically applies migrations when it starts up.
### Configure pageservers to use the storage controller
- The pageserver `control_plane_api` and `control_plane_api_token` should be set in the `pageserver.toml` file. The API setting should point to the "upcall" prefix: for example, `http://127.0.0.1:1234/upcall/v1/` is used in neon_local clusters.
- Create a `metadata.json` file in the same directory as `pageserver.toml`: this enables the pageserver to automatically register itself with the storage controller when it starts up. See the example below for the format of this file.
#### Example metadata.json
{"host":"acmehost.localdomain","http_host":"acmehost.localdomain","http_port":9898,"port":64000}
`port` and `host` refer to the postgres port and host, and these must be accessible from wherever postgres runs. `http_port` and `http_host` refer to the pageserver's HTTP API; this must be accessible from where the storage controller runs.
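A minimal sketch of consuming this file, assuming the `serde` and `serde_json` crates; the struct name is hypothetical, and the fields mirror the example above:

```rust
use serde::Deserialize;

// Hypothetical struct mirroring the metadata.json example above.
#[derive(Deserialize, Debug)]
struct PageserverMetadata {
    host: String,      // postgres host, reachable by computes
    port: u16,         // postgres port
    http_host: String, // HTTP API host, reachable by the storage controller
    http_port: u16,    // HTTP API port
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("metadata.json")?;
    let meta: PageserverMetadata = serde_json::from_str(&raw)?;
    println!("{meta:?}");
    Ok(())
}
```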
### Handle compute notifications
The storage controller independently moves tenant attachments between pageservers in response to changes such as a pageserver node becoming unavailable, or the tenant's shard count changing. To enable postgres clients to handle such changes, the storage controller calls an API hook when a tenant's pageserver location changes.
The hooks are configured using the storage controller's `--control-plane-url` CLI option, from which the hook URLs are computed.
Currently there are two hooks, each computed by appending its name to the provided control plane URL prefix:

- `notify-attach`, called whenever attachment for pageservers changes
- `notify-safekeepers`, called whenever attachment for safekeepers changes
If the hooks require JWT auth, the token may be provided with `--control-plane-jwt-token`.
The hooks will be invoked with a PUT request.
In the Neon cloud service, these hooks are implemented by Neon's internal cloud control plane. In `neon_local` systems, the storage controller integrates directly with `neon_local` to reconfigure local postgres processes instead of calling the compute hook.
When implementing an on-premise Neon deployment, you must implement a service that handles the compute hooks. This is not complicated: a minimal sketch appears after the notify-attach example below.
#### notify-attach body
The notify-attach request body follows the format of the `ComputeHookNotifyRequest` structure, provided below for convenience.
```rust
struct ComputeHookNotifyRequestShard {
    node_id: NodeId,
    shard_number: ShardNumber,
}

struct ComputeHookNotifyRequest {
    tenant_id: TenantId,
    stripe_size: Option<ShardStripeSize>,
    shards: Vec<ComputeHookNotifyRequestShard>,
}
```
When a notification is received:

1. Modify postgres configuration for this tenant:
   - set `neon.pageserver_connstring` to a comma-separated list of postgres connection strings to pageservers according to the `shards` list. The shards identified by `NodeId` must be converted to the address+port of the node.
   - if `stripe_size` is not None, set `neon.shard_stripe_size` to this value
2. Send SIGHUP to postgres to reload configuration
3. Respond with 200 to the notification request. Do not return success if postgres was not updated: if an error is returned, the controller will retry the notification until it succeeds.
Example body:

```json
{
  "tenant_id": "1f359dd625e519a1a4e8d7509690f6fc",
  "stripe_size": 2048,
  "shards": [
    {"node_id": 344, "shard_number": 0},
    {"node_id": 722, "shard_number": 1}
  ]
}
```
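As mentioned above, a minimal sketch of a service that handles this hook, assuming the `axum`, `serde`, and `tokio` crates; the structs mirror the documented body (with simplified field types), while the port and the config-application stub are illustrative:

```rust
// A minimal sketch of a notify-attach handler, assuming the axum, serde,
// and tokio crates. The structs mirror the documented request body with
// simplified types; the port and config mechanism are illustrative.
use axum::{http::StatusCode, routing::put, Json, Router};
use serde::Deserialize;

#[derive(Deserialize)]
struct ComputeHookNotifyRequestShard {
    node_id: u64,
    shard_number: u8,
}

#[derive(Deserialize)]
struct ComputeHookNotifyRequest {
    tenant_id: String,
    stripe_size: Option<u64>,
    shards: Vec<ComputeHookNotifyRequestShard>,
}

async fn notify_attach(Json(req): Json<ComputeHookNotifyRequest>) -> StatusCode {
    // 1. Map each NodeId in req.shards to the node's postgres address
    //    (lookup is deployment-specific) and build neon.pageserver_connstring.
    // 2. If req.stripe_size is set, also update neon.shard_stripe_size.
    // 3. SIGHUP postgres so it reloads the configuration.
    // Only return 200 once postgres was actually updated; any error
    // status makes the storage controller retry the notification.
    match apply_attach_config(&req).await {
        Ok(()) => StatusCode::OK,
        Err(_) => StatusCode::INTERNAL_SERVER_ERROR,
    }
}

// Deployment-specific stub: rewrite postgres config and signal a reload.
async fn apply_attach_config(_req: &ComputeHookNotifyRequest) -> Result<(), std::io::Error> {
    Ok(())
}

#[tokio::main]
async fn main() {
    // A notify-safekeepers route would be registered the same way.
    let app = Router::new().route("/notify-attach", put(notify_attach));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```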
#### notify-safekeepers body
The notify-safekeepers request body follows the format of the `SafekeepersNotifyRequest` structure, provided below for convenience.
```rust
pub struct SafekeeperInfo {
    pub id: NodeId,
    pub hostname: String,
}

pub struct SafekeepersNotifyRequest {
    pub tenant_id: TenantId,
    pub timeline_id: TimelineId,
    pub generation: u32,
    pub safekeepers: Vec<SafekeeperInfo>,
}
```
When a notification is received:

1. Modify postgres configuration for this tenant:
   - set `neon.safekeeper_connstrings` to an array of postgres connection strings to safekeepers according to the `safekeepers` list. The safekeepers identified by `NodeId` must be converted to the address+port of the respective safekeeper. The hostname is provided for debugging purposes only, and we reserve the right to change how it is passed.
   - set `neon.safekeepers_generation` to the provided `generation` value.
2. Send SIGHUP to postgres to reload configuration
3. Respond with 200 to the notification request. Do not return success if postgres was not updated: if an error is returned, the controller will retry the notification until it succeeds.
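A minimal sketch of deriving the two GUC values from the request body; the simplified structs and the `NodeId`-to-address lookup are assumptions for illustration, not a fixed contract:

```rust
// Minimal sketch of turning a notify-safekeepers body into the two GUC
// values described above. The structs are simplified (NodeId as u64),
// and the NodeId -> "host:port" lookup is a deployment-specific stub.
struct SafekeeperInfo {
    id: u64,
    hostname: String, // debugging only; do not rely on its format
}

struct SafekeepersNotifyRequest {
    generation: u32,
    safekeepers: Vec<SafekeeperInfo>,
}

// Returns (neon.safekeeper_connstrings, neon.safekeepers_generation).
fn safekeeper_gucs(
    req: &SafekeepersNotifyRequest,
    addr_of: impl Fn(u64) -> String, // maps a NodeId to "host:port"
) -> (String, u32) {
    let connstrings: Vec<String> =
        req.safekeepers.iter().map(|sk| addr_of(sk.id)).collect();
    (connstrings.join(","), req.generation)
}
```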