neon/docs/rfcs/032-shard-splitting.md

# Shard splitting

## Summary

This RFC describes a new pageserver API for splitting an existing tenant shard into
multiple shards, and describes how to use this API to safely increase the total
shard count of a tenant.

## Motivation

In the [sharding RFC](031-sharding-static.md), a mechanism was introduced to scale
tenants beyond the capacity of a single pageserver by breaking up the key space
into stripes, and distributing these stripes across many pageservers. However,
the shard count was defined once at tenant creation time and not varied thereafter.

In practice, the expected size of a database is rarely known at creation time, and
it is inefficient to enable sharding for very small tenants: we need to be
able to create a tenant with a small number of shards (such as 1), and later expand
when it becomes clear that the tenant has grown in size to a point where sharding
is beneficial.

### Prior art

Many distributed systems have the problem of choosing how many shards to create for
tenants that do not specify an expected size up-front. There are a couple of general
approaches:

- Write to a key space in order, and start a new shard when the highest key advances
  past some point. This doesn't work well for Neon, because we write to our key space
  in many different contiguous ranges (per relation), rather than in one contiguous
  range. To adapt to this kind of model, we would need a sharding scheme where each
  relation had its own range of shards, which would be inefficient for the common
  case of databases with many small relations.
- Monitor the system, and automatically re-shard at some size threshold. For
  example in Ceph, the [pg_autoscaler](https://github.com/ceph/ceph/blob/49c27499af4ee9a90f69fcc6bf3597999d6efc7b/src/pybind/mgr/pg_autoscaler/module.py)
  component monitors the size of each RADOS Pool, and adjusts the number of Placement
  Groups (Ceph's shard equivalent).

## Requirements

- A configurable capacity limit per-shard is enforced.
- Changes in shard count do not interrupt service beyond requiring postgres
  to reconnect (i.e. milliseconds).
- Human being does not have to choose shard count

## Non Goals

- Shard splitting is always a tenant-global operation: we will not enable splitting
  one shard while leaving others intact.
- The inverse operation (shard merging) is not described in this RFC. This is a lower
  priority than splitting, because databases grow more often than they shrink, and
  a database with many shards will still work properly if the stored data shrinks, just
  with slightly more overhead (e.g. redundant WAL replication)
- Shard splitting is only initiated based on capacity bounds, not load. Splitting
  a tenant based on load will make sense for some medium-capacity, high-load workloads,
  but is more complex to reason about and likely is not desirable until we have
  shard merging to reduce the shard count again if the database becomes less busy.

## Impacted Components

pageserver, storage controller

(the _storage controller_ is the evolution of what was called `attachment_service` in our test environment)

## Terminology

**Parent** shards are the shards that exist before a split. **Child** shards are
the new shards created during a split.

**Shard** is synonymous with _tenant shard_.

**Shard Index** is the 2-tuple of shard number and shard count, written in
paths as {:02x}{:02x}, e.g. `0001`.

## Background

In the implementation section, a couple of existing aspects of sharding are important
to remember:

- Shard identifiers contain the shard number and count, so that "shard 0 of 1" (`0001`) is
  a distinct shard from "shard 0 of 2" (`0002`). This is the case in key paths, local
  storage paths, and remote index metadata.
- Remote layer file paths contain the shard index of the shard that created them, and
  remote indices contain the same index to enable building the layer file path. A shard's
  index may reference layers that were created by another shard.
- Local tenant shard directories include the shard index. All layers downloaded by
  a tenant shard are stored in this shard-prefixed path, even if those layers were
  initially created by another shard: tenant shards do not read and write one anothers'
  paths.
- The `Tenant` pageserver type represents one tenant _shard_, not the whole tenant.
  This is for historical reasons and will be cleaned up in future, but the existing
  name is used here to help comprehension when reading code.

## Implementation

Note: this section focuses on the correctness of the core split process. This will
be fairly inefficient in a naive implementation, and several important optimizations
are described in a later section.

There are broadly two parts to the implementation:

1. The pageserver split API, which splits one shard on one pageserver
2. The overall tenant split proccess which is coordinated by the storage controller,
   and calls into the pageserver split API as needed.

### Pageserver Split API

The pageserver will expose a new API endpoint at `/v1/tenant/:tenant_shard_id/shard_split`
that takes the new total shard count in the body.

The pageserver split API operates on one tenant shard, on one pageserver. External
coordination is required to use it safely, this is described in the later
'Split procedure' section.

#### Preparation

First identify the shard indices for the new child shards. These are deterministic,
calculated from the parent shard's index, and the number of children being created (this
is an input to the API, and validated to be a power of two). In a trivial example, splitting
0001 in two always results in 0002 and 0102.

Child shard indices are chosen such that the childrens' parts of the keyspace will
be subsets of the parent's parts of the keyspace.

#### Step 1: write new remote indices

In remote storage, splitting is very simple: we may just write new index_part.json
objects for each child shard, containing exactly the same layers as the parent shard.

The children will have more data than they need, but this avoids any exhausive
re-writing or copying of layer files.

The index key path includes a generation number: the parent shard's current
attached generation number will also be used for the child shards' indices. This
makes the operation safely retryable: if everything crashes and restarts, we may
call the split API again on the parent shard, and the result will be some new remote
indices for the child shards, under a higher generation number.

#### Step 2: start new `Tenant` objects

A new `Tenant` object may be instantiated for each child shard, while the parent
shard still exists. When calling the tenant_spawn function for this object,
the remote index from step 1 will be read, and the child shard will start
to ingest WAL to catch up from whatever was in the remote storage at step 1.

We now wait for child shards' WAL ingestion to catch up with the parent shard,
so that we can safely tear down the parent shard without risking an availability
gap to clients reading recent LSNs.

#### Step 3: tear down parent `Tenant` object

Once child shards are running and have caught up with WAL ingest, we no longer
need the parent shard. Note that clients may still be using it -- when we
shut it down, any page_service handlers will also shut down, causing clients
to disconnect. When the client reconnects, it will re-lookup the tenant,
and hit the child shard instead of the parent (shard lookup from page_service
should bias toward higher ShardCount shards).

Note that at this stage the page service client has not yet been notified of
any split. In the trivial single split example:

- Shard 0001 is gone: Tenant object torn down
- Shards 0002 and 0102 are running on the same pageserver where Shard 0001 used to live.
- Clients will continue to connect to that server thinking that shard 0001 is there,
  and all requests will work, because any key that was in shard 0001 is definitely
  available in either shard 0002 or shard 0102.
- Eventually, the storage controller (not the pageserver) will decide to migrate
  some child shards away: at that point it will do a live migration, ensuring
  that the client has an updated configuration before it detaches anything
  from the original server.

#### Complete

When we send a 200 response to the split request, we are promising the caller:

- That the child shards are persistent in remote storage
- That the parent shard has been shut down

This enables the caller to proceed with the overall shard split operation, which
may involve other shards on other pageservers.

### Storage Controller Split procedure

Splitting a tenant requires calling the pageserver split API, and tracking
enough state to ensure recovery + completion in the event of any component (pageserver
or storage controller) crashing (or request timing out) during the split.

1. call the split API on all existing shards. Ensure that the resulting
   child shards are pinned to their pageservers until _all_ the split calls are done.
   This pinning may be implemented as a "split bit" on the tenant shards, that
   blocks any migrations, and also acts as a sign that if we restart, we must go
   through some recovery steps to resume the split.
2. Once all the split calls are done, we may unpin the child shards (clear
   the split bit). The split is now complete: subsequent steps are just migrations,
   not strictly part of the split.
3. Try to schedule new pageserver locations for the child shards, using
   a soft anti-affinity constraint to place shards from the same tenant onto different
   pageservers.

Updating computes about the new shard count is not necessary until we migrate
any of the child shards away from the parent's location.

### Recovering from failures

#### Rolling back an incomplete split

An incomplete shard split may be rolled back quite simply, by attaching the parent shards to pageservers,
and detaching child shards. This will lose any WAL ingested into the children after the parents
were detached earlier, but the parents will catch up.

No special pageserver API is needed for this. From the storage controllers point of view, the
procedure is:

1. For all parent shards in the tenant, ensure they are attached
2. For all child shards, ensure they are not attached
3. Drop child shards from the storage controller's database, and clear the split bit on the parent shards.

Any remote storage content for child shards is left behind. This is similar to other cases where
we may leave garbage objects in S3 (e.g. when we upload a layer but crash before uploading an
index that references it). Future online scrub/cleanup functionality can remove these objects, or
they will be removed when the tenant is deleted, as tenant deletion lists all objects in the prefix,
which would include any child shards that were rolled back.

If any timelines had been created on child shards, they will be lost when rolling back. To mitigate
this, we will **block timeline creation during splitting**, so that we can safely roll back until
the split is complete, without risking losing timelines.

Rolling back an incomplete split will happen automatically if a split fails due to some fatal
reason, and will not be accessible via an API:

- A pageserver fails to complete its split API request after too many retries
- A pageserver returns a fatal unexpected error such as 400 or 500
- The storage controller database returns a non-retryable error
- Some internal invariant is violated in the storage controller split code

#### Rolling back a complete split

A complete shard split may be rolled back similarly to an incomplete split, with the following
modifications:

- The parent shards will no longer exist in the storage controller database, so these must
  be re-synthesized somehow: the hard part of this is figuring the parent shards' generations. This
  may be accomplished either by probing in S3, or by retaining some tombstone state for deleted
  shards in the storage controller database.
- Any timelines that were created after the split complete will disappear when rolling back
  to the tenant shards. For this reason, rolling back after a complete split should only
  be done due to serious issues where loss of recently created timelines is acceptable, or
  in cases where we have confirmed that no timelines were created in the intervening period.
- Parent shards' layers must not have been deleted: this property will come "for free" when
  we first roll out sharding, by simply not implementing deletion of parent layers after
  a split. When we do implement such deletion (see "Cleaning up parent-shard layers" in the
  Optimizations section), it should apply a TTL to layers such that we have a
  defined walltime window in which rollback will be possible.

The storage controller will expose an API for rolling back a complete split, for use
in the field if we encounter some critical bug with a post-split tenant.

#### Retrying API calls during Pageserver Restart

When a pageserver restarts during a split API call, it may witness on-disk content for both parent and
child shards from an ongoing split. This does not intrinsically break anything, and the
pageserver may include all these shards in its `/re-attach` request to the storage controller.

In order to support such restarts, it is important that the storage controller stores
persistent records of each child shard before it calls into a pageserver, as these child shards
may require generation increments via a `/re-attach` request.

The pageserver restart will also result in a failed API call from the storage controller's point
of view. Recall that if _any_ pageserver fails to split, the overall split operation may not
complete, and all shards must remain pinned to their current pageserver locations until the
split is done.

The pageserver API calls during splitting will retry on transient errors, so that
short availability gaps do not result in a failure of the overall operation. The
split in progress will be automatically rolled back if the threshold for API
retries is reached (e.g. if a pageserver stays offline for longer than a typical
restart).

#### Rollback on Storage Controller Restart

On startup, the storage controller will inspect the split bit for tenant shards that
it loads from the database. If any splits are in progress:

- Database content will be reverted to the parent shards
- Child shards will be dropped from memory
- The parent and child shards will be included in the general startup reconciliation that
  the storage controller does: any child shards will be detached from pageservers because
  they don't exist in the storage controller's expected set of shards, and parent shards
  will be attached if they aren't already.

#### Storage controller API request failures/retries

The split request handler will implement idempotency: if the [`Tenant`] requested to split
doesn't exist, we will check for the would-be child shards, and if they already exist,
we consider the request complete.

If a request is retried while the original request is still underway, then the split
request handler will notice an InProgress marker in TenantManager, and return 503
to encourage the client to backoff/retry. This is the same as the general pageserver
API handling for calls that try to act on an InProgress shard.

#### Compute start/restart during a split

If a compute starts up during split, it will be configured with the old sharding
configuration. This will work for reads irrespective of the progress of the split
as long as no child hards have been migrated away from their original location, and
this is guaranteed in the split procedure (see earlier section).

#### Pageserver fails permanently during a split

If a pageserver permanently fails (i.e. the storage controller availability state for it
goes to Offline) while a split is in progress, the splitting operation will roll back, and
during the roll back it will skip any API calls to the offline pageserver. If the offline
pageserver becomes available again, any stale locations will be cleaned up via the normal reconciliation process (the `/re-attach` API).

### Handling secondary locations

For correctness, it is not necessary to split secondary locations. We can simply detach
the secondary locations for parent shards, and then attach new secondary locations
for child shards.

Clearly this is not optimal, as it will result in re-downloads of layer files that
were already present on disk. See "Splitting secondary locations"

### Conditions to trigger a split

The pageserver will expose a new API for reporting on shards that are candidates
for split: this will return a top-N report of the largest tenant shards by
physical size (remote size). This should exclude any tenants that are already
at the maximum configured shard count.

The API would look something like:
`/v1/top_n_tenant?shard_count_lt=8&sort_by=resident_size`

The storage controller will poll that API across all pageservers it manages at some appropriate interval (e.g. 60 seconds).

A split operation will be started when the tenant exceeds some threshold. This threshold
should be _less than_ how large we actually want shards to be, perhaps much less. That's to
minimize the amount of work involved in splitting -- if we want 100GiB shards, we shouldn't
wait for a tenant to exceed 100GiB before we split anything. Some data analysis of existing
tenant size distribution may be useful here: if we can make a statement like "usually, if
a tenant has exceeded 20GiB they're probably going to exceed 100GiB later", then we might
make our policy to split a tenant at 20GiB.

The finest split we can do is by factors of two, but we can do higher-cardinality splits
too, and this will help to reduce the overhead of repeatedly re-splitting a tenant
as it grows. An example of a very simple heuristic for early deployment of the splitting
feature would be: "Split tenants into 8 shards when their physical size exceeds 64GiB": that
would give us two kinds of tenant (1 shard and 8 shards), and the confidence that once we had
split a tenant, it will not need re-splitting soon after.

## Optimizations

### Flush parent shard to remote storage during split

Any data that is in WAL but not remote storage at time of split will need
to be replayed by child shards when they start for the first time. To minimize
this work, we may flush the parent shard to remote storage before writing the
remote indices for child shards.

It is important that this flush is subject to some time bounds: we may be splitting
in response to a surge of write ingest, so it may be time-critical to split. A
few seconds to flush latest data should be sufficient to optimize common cases without
running the risk of holding up a split for a harmful length of time when a parent
shard is being written heavily. If the flush doesn't complete in time, we may proceed
to shut down the parent shard and carry on with the split.

### Hard linking parent layers into child shard directories

Before we start the Tenant objects for child shards, we may pre-populate their
local storage directories with hard links to the layer files already present
in the parent shard's local directory. When the child shard starts and downloads
its remote index, it will find all those layer files already present on local disk.

This avoids wasting download capacity and makes splitting faster, but more importantly
it avoids taking up a factor of N more disk space when splitting 1 shard into N.

This mechanism will work well in typical flows where shards are migrated away
promptly after a split, but for the general case including what happens when
layers are evicted and re-downloaded after a split, see the 'Proactive compaction'
section below.

### Filtering during compaction

Compaction, especially image layer generation, should skip any keys that are
present in a shard's layer files, but do not match the shard's ShardIdentity's
is_key_local() check. This avoids carrying around data for longer than necessary
in post-split compactions.

This was already implemented in https://github.com/neondatabase/neon/pull/6246

### Proactive compaction

In remote storage, there is little reason to rewrite any data on a shard split:
all the children can reference parent layers via the very cheap write of the child
index_part.json.

In local storage, things are more nuanced. During the initial split there is no
capacity cost to duplicating parent layers, if we implement the hard linking
optimization described above. However, as soon as any layers are evicted from
local disk and re-downloaded, the downloaded layers will not be hard-links any more:
they'll have real capacity footprint. That isn't a problem if we migrate child shards
away from the parent node swiftly, but it risks a significant over-use of local disk
space if we do not.

For example, if we did an 8-way split of a shard, and then _didn't_ migrate 7 of
the shards elsewhere, then churned all the layers in all the shards via eviction,
then we would blow up the storage capacity used on the node by 8x. If we're splitting
a 100GB shard, that could take the pageserver to the point of exhausting disk space.

To avoid this scenario, we could implement a special compaction mode where we just
read historic layers, drop unwanted keys, and write back the layer file. This
is pretty expensive, but useful if we have split a large shard and are not going to
migrate the child shards away.

The heuristic conditions for triggering such a compaction are:

- A) eviction plus time: if a child shard
  has existed for more than a time threshold, and has been requested to perform at least one eviction, then it becomes urgent for this child shard to execute a proactive compaction to reduce its storage footprint, at the cost of I/O load.
- B) resident size plus time: we may inspect the resident layers and calculate how
  many of them include the overhead of storing pre-split keys. After some time
  threshold (different to the one in case A) we still have such layers occupying
  local disk space, then we should proactively compact them.

### Cleaning up parent-shard layers

It is functionally harmless to leave parent shard layers in remote storage indefinitely.
They would be cleaned up in the event of the tenant's deletion.

As an optimization to avoid leaking remote storage capacity (which costs money), we may
lazily clean up parent shard layers once no child shards reference them.

This may be done _very_ lazily: e.g. check every PITR interval. The cleanup procedure is:

- list all the key prefixes beginning with the tenant ID, and select those shard prefixes
  which do not belong to the most-recently-split set of shards (_ancestral shards_, i.e. `shard*count < max(shard_count) over all shards)`, and those shard prefixes which do have the latest shard count (_current shards_)
- If there are no _ancestral shard_ prefixes found, we have nothing to clean up and
  may drop out now.
- find the latest-generation index for each _current shard_, read all and accumulate the set of layers belonging to ancestral shards referenced by these indices.
- for all ancestral shards, list objects in the prefix and delete any layer which was not
  referenced by a current shard.

If this cleanup is scheduled for 1-2 PITR periods after the split, there is a good chance that child shards will have written their own image layers covering the whole keyspace, such that all parent shard layers will be deletable.

The cleanup may be done by the scrubber (external process), or we may choose to have
the zeroth shard in the latest generation do the work -- there is no obstacle to one shard
reading the other shard's indices at runtime, and we do not require visibility of the
latest index writes.

Cleanup should be artificially delayed by some period (for example 24 hours) to ensure
that we retain the option to roll back a split in case of bugs.

### Splitting secondary locations

We may implement a pageserver API similar to the main splitting API, which does a simpler
operation for secondary locations: it would not write anything to S3, instead it would simply
create the child shard directory on local disk, hard link in directories from the parent,
and set up the in memory (TenantSlot) state for the children.

Similar to attached locations, a subset of secondary locations will probably need re-locating
after the split is complete, to avoid leaving multiple child shards on the same pageservers,
where they may use excessive space for the tenant.

## FAQ/Alternatives

### What should the thresholds be set to?

Shard size limit: the pre-sharding default capacity quota for databases was 200GiB, so this could be a starting point for the per-shard size limit.

Max shard count:

- The safekeeper overhead to sharding is currently O(N) network bandwidth because
  the un-filtered WAL is sent to all shards. To avoid this growing out of control,
  a limit of 8 shards should be temporarily imposed until WAL filtering is implemented
  on the safekeeper.
- there is also little benefit to increasing the shard count beyond the number
  of pageservers in a region.

### Is it worth just rewriting all the data during a split to simplify reasoning about space?