Files
neon/control_plane
John Spray 44f42627dd pageserver/controller: error handling for shard splitting (#7074)
## Problem

Shard splits worked, but weren't safe against failures (e.g. node crash
during split) yet.

Related: #6676 

## Summary of changes

- Introduce async rwlocks at the scope of Tenant and Node:
  - exclusive tenant lock is used to protect splits
- exclusive node lock is used to protect new reconciliation process that
happens when setting node active
- exclusive locks used in both cases when doing persistent updates (e.g.
node scheduling conf) where the update to DB & in-memory state needs to
be atomic.
- Add failpoints to shard splitting in control plane and pageserver
code.
- Implement error handling in control plane for shard splits: this
detaches child chards and ensures parent shards are re-attached.
- Crash-safety for storage controller restarts requires little effort:
we already reconcile with nodes over a storage controller restart, so as
long as we reset any incomplete splits in the DB on restart (added in
this PR), things are implicitly cleaned up.
- Implement reconciliation with offline nodes before they transition to
active:
- (in this context reconciliation means something like
startup_reconcile, not literally the Reconciler)
- This covers cases where split abort cannot reach a node to clean it
up: the cleanup will eventually happen when the node is marked active,
as part of reconciliation.
- This also covers the case where a node was unavailable when the
storage controller started, but becomes available later: previously this
allowed it to skip the startup reconcile.
- Storage controller now terminates on panics. We only use panics for
true "should never happen" assertions, and these cases can leave us in
an un-usable state if we keep running (e.g. panicking in a shard split).
In the unlikely event that we get into a crashloop as a result, we'll
rely on kubernetes to back us off.
- Add `test_sharding_split_failures` which exercises a variety of
failure cases during shard split.
2024-03-14 09:11:57 +00:00
..

Control Plane and Neon Local

This crate contains tools to start a Neon development environment locally. This utility can be used with the cargo neon command.

Example: Start with Postgres 16

To create and start a local development environment with Postgres 16, you will need to provide --pg-version flag to 3 of the start-up commands.

cargo neon init --pg-version 16
cargo neon start
cargo neon tenant create --set-default --pg-version 16
cargo neon endpoint create main --pg-version 16
cargo neon endpoint start main

Example: Create Test User and Database

By default, cargo neon starts an endpoint with cloud_admin and postgres database. If you want to have a role and a database similar to what we have on the cloud service, you can do it with the following commands when starting an endpoint.

cargo neon endpoint create main --pg-version 16 --update-catalog true
cargo neon endpoint start main --create-test-user true

The first command creates neon_superuser and necessary roles. The second command creates test user and neondb database. You will see a connection string that connects you to the test user after running the second command.