mirror of
https://github.com/neondatabase/neon.git
synced 2025-12-22 21:59:59 +00:00
docs: sharding phase 1 RFC (#5432)
We need to shard our Tenants to support larger databases without those large databases dominating our pageservers and/or requiring dedicated pageservers. This RFC aims to define an initial capability that will permit creating large-capacity databases using a static configuration defined at time of Tenant creation. Online re-sharding is deferred as future work, as is offloading layers for historical reads. However, both of these capabilities would be implementable without further changes to the control plane or compute: this RFC aims to define the cross-component work needed to bootstrap sharding end-to-end.
This commit is contained in:
408
docs/rfcs/031-sharding-static.md
Normal file
408
docs/rfcs/031-sharding-static.md
Normal file
@@ -0,0 +1,408 @@
|
||||
# Sharding Phase 1: Static Key-space Sharding
|
||||
|
||||
## Summary
|
||||
|
||||
To enable databases with sizes approaching the capacity of a pageserver's disk,
|
||||
it is necessary to break up the storage for the database, or _shard_ it.
|
||||
|
||||
Sharding in general is a complex area. This RFC aims to define an initial
|
||||
capability that will permit creating large-capacity databases using a static configuration
|
||||
defined at time of Tenant creation.
|
||||
|
||||
## Motivation
|
||||
|
||||
Currently, all data for a Tenant, including all its timelines, is stored on a single
|
||||
pageserver. The local storage required may be several times larger than the actual
|
||||
database size, due to LSM write inflation.
|
||||
|
||||
If a database is larger than what one pageserver can hold, then it becomes impossible
|
||||
for the pageserver to hold it in local storage, as it must do to provide service to
|
||||
clients.
|
||||
|
||||
### Prior art
|
||||
|
||||
In Neon:
|
||||
|
||||
- Layer File Spreading: https://www.notion.so/neondatabase/One-Pager-Layer-File-Spreading-Konstantin-21fd9b11b618475da5f39c61dd8ab7a4
|
||||
- Layer File SPreading: https://www.notion.so/neondatabase/One-Pager-Layer-File-Spreading-Christian-eb6b64182a214e11b3fceceee688d843
|
||||
- Key Space partitioning: https://www.notion.so/neondatabase/One-Pager-Key-Space-Partitioning-Stas-8e3a28a600a04a25a68523f42a170677
|
||||
|
||||
Prior art in other distributed systems is too broad to capture here: pretty much
|
||||
any scale out storage system does something like this.
|
||||
|
||||
## Requirements
|
||||
|
||||
- Enable creating a large (for example, 16TiB) database without requiring dedicated
|
||||
pageserver nodes.
|
||||
- Share read/write bandwidth costs for large databases across pageservers, as well
|
||||
as storage capacity, in order to avoid large capacity databases acting as I/O hotspots
|
||||
that disrupt service to other tenants.
|
||||
- Our data distribution scheme should handle sparse/nonuniform keys well, since postgres
|
||||
does not write out a single contiguous ranges of page numbers.
|
||||
|
||||
_Note: the definition of 'large database' is arbitrary, but the lower bound is to ensure that a database
|
||||
that a user might create on a current-gen enterprise SSD should also work well on
|
||||
Neon. The upper bound is whatever postgres can handle: i.e. we must make sure that the
|
||||
pageserver backend is not the limiting factor in the database size_.
|
||||
|
||||
## Non Goals
|
||||
|
||||
- Independently distributing timelines within the same tenant. If a tenant has many
|
||||
timelines, then sharding may be a less efficient mechanism for distributing load than
|
||||
sharing out timelines between pageservers.
|
||||
- Distributing work in the LSN dimension: this RFC focuses on the Key dimension only,
|
||||
based on the idea that separate mechanisms will make sense for each dimension.
|
||||
|
||||
## Impacted Components
|
||||
|
||||
pageserver, control plane, postgres/smgr
|
||||
|
||||
## Terminology
|
||||
|
||||
**Key**: a postgres page number, qualified by relation. In the sense that the pageserver is a versioned key-value store,
|
||||
the page number is the key in that store. `Key` is a literal data type in existing code.
|
||||
|
||||
**LSN dimension**: this just means the range of LSNs (history), when talking about the range
|
||||
of keys and LSNs as a two dimensional space.
|
||||
|
||||
## Implementation
|
||||
|
||||
### Key sharding vs. LSN sharding
|
||||
|
||||
When we think of sharding across the two dimensional key/lsn space, this is an
|
||||
opportunity to think about how the two dimensions differ:
|
||||
|
||||
- Sharding the key space distributes the _write_ workload of ingesting data
|
||||
and compacting. This work must be carefully managed so that exactly one
|
||||
node owns a given key.
|
||||
- Sharding the LSN space distributes the _historical read_ workload. This work
|
||||
can be done by anyone without any special coordination, as long as they can
|
||||
see the remote index and layers.
|
||||
|
||||
The key sharding is the harder part, and also the more urgent one, to support larger
|
||||
capacity databases. Because distributing historical LSN read work is a relatively
|
||||
simpler problem that most users don't have, we defer it to future work. It is anticipated
|
||||
that some quite simple P2P offload model will enable distributing work for historical
|
||||
reads: a node which is low on space can call out to peer to ask it to download and
|
||||
serve reads from a historical layer.
|
||||
|
||||
### Key mapping scheme
|
||||
|
||||
Having decided to focus on key sharding, we must next decide how we will map
|
||||
keys to shards. It is proposed to use a "wide striping" approach, to obtain a good compromise
|
||||
between data locality and avoiding entire large relations mapping to the same shard.
|
||||
|
||||
We will define two spaces:
|
||||
|
||||
- Key space: unsigned integer
|
||||
- Shard space: integer from 0 to N-1, where we have N shards.
|
||||
|
||||
### Key -> Shard mapping
|
||||
|
||||
Keys are currently defined in the pageserver's getpage@lsn interface as follows:
|
||||
|
||||
```
|
||||
pub struct Key {
|
||||
pub field1: u8,
|
||||
pub field2: u32,
|
||||
pub field3: u32,
|
||||
pub field4: u32,
|
||||
pub field5: u8,
|
||||
pub field6: u32,
|
||||
}
|
||||
|
||||
|
||||
fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
|
||||
Key {
|
||||
field1: 0x00,
|
||||
field2: rel.spcnode,
|
||||
field3: rel.dbnode,
|
||||
field4: rel.relnode,
|
||||
field5: rel.forknum,
|
||||
field6: blknum,
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
_Note: keys for relation metadata are ignored here, as this data will be mirrored to all
|
||||
shards. For distribution purposes, we only care about user data keys_
|
||||
|
||||
The properties we want from our Key->Shard mapping are:
|
||||
|
||||
- Locality in `blknum`, such that adjacent `blknum` will usually map to
|
||||
the same stripe and consequently land on the same shard, even though the overall
|
||||
collection of blocks in a relation will be spread over many stripes and therefore
|
||||
many shards.
|
||||
- Avoid the same blknum on different relations landing on the same stripe, so that
|
||||
with many small relations we do not end up aliasing data to the same stripe/shard.
|
||||
- Avoid vulnerability to aliasing in the values of relation identity fields, such that
|
||||
if there are patterns in the value of `relnode`, these do not manifest as patterns
|
||||
in data placement.
|
||||
|
||||
To accomplish this, the blknum is used to select a stripe, and stripes are
|
||||
assigned to shards in a pseudorandom order via a hash. The motivation for
|
||||
pseudo-random distribution (rather than sequential mapping of stripe to shard)
|
||||
is to avoid I/O hotspots when sequentially reading multiple relations: we don't want
|
||||
all relations' stripes to touch pageservers in the same order.
|
||||
|
||||
To map a `Key` to a shard:
|
||||
|
||||
- Hash the `Key` field 4 (relNode).
|
||||
- Divide field 6 (`blknum`) field by the stripe size in pages, and combine the
|
||||
hash of this with the hash from the previous step.
|
||||
- The total hash modulo the shard count gives the shard holding this key.
|
||||
|
||||
Why don't we use the other fields in the Key?
|
||||
|
||||
- We ignore `forknum` for key mapping, because it distinguishes different classes of data
|
||||
in the same relation, and we would like to keep the data in a relation together.
|
||||
- We would like to use spcNode and dbNode, but cannot. Postgres database creation operations can refer to an existing database as a template, such that the created
|
||||
database's blocks differ only by spcNode and dbNode from the original. To enable running
|
||||
this type of creation without cross-pageserver communication, we must ensure that these
|
||||
blocks map to the same shard -- we do this by excluding spcNode and dbNode from the hash.
|
||||
|
||||
### Data placement examples
|
||||
|
||||
For example, consider the extreme large databases cases of postgres data layout in a system with 8 shards
|
||||
and a stripe size of 32k pages:
|
||||
|
||||
- A single large relation: `blknum` division will break the data up into 4096
|
||||
stripes, which will be scattered across the shards.
|
||||
- 4096 relations of of 32k pages each: each relation will map to exactly one stripe,
|
||||
and that stripe will be placed according to the hash of the key fields 4. The
|
||||
data placement will be statistically uniform across shards.
|
||||
|
||||
Data placement will be more uneven on smaller databases:
|
||||
|
||||
- A tenant with 2 shards and 2 relations of one stripe size each: there is a 50% chance
|
||||
that both relations land on the same shard and no data lands on the other shard.
|
||||
- A tenant with 8 shards and one relation of size 12 stripes: 4 shards will have double
|
||||
the data of the other four shards.
|
||||
|
||||
These uneven cases for small amounts of data do not matter, as long as the stripe size
|
||||
is an order of magnitude smaller than the amount of data we are comfortable holding
|
||||
in a single shard: if our system handles shard sizes up to 10-100GB, then it is not an issue if
|
||||
a tenant has some shards with 256MB size and some shards with 512MB size, even though
|
||||
the standard deviation of shard size within the tenant is very high. Our key mapping
|
||||
scheme provides a statistical guarantee that as the tenant's overall data size increases,
|
||||
uniformity of placement will improve.
|
||||
|
||||
### Important Types
|
||||
|
||||
#### `ShardIdentity`
|
||||
|
||||
Provides the information needed to know whether a particular key belongs
|
||||
to a particular shard:
|
||||
|
||||
- Layout version
|
||||
- Stripe size
|
||||
- Shard count
|
||||
- Shard index
|
||||
|
||||
This structure's size is constant. Note that if we had used a differnet key
|
||||
mapping scheme such as consistent hashing with explicit hash ranges assigned
|
||||
to each shard, then the ShardIdentity's size would grow with the shard count: the simpler
|
||||
key mapping scheme used here enables a small fixed size ShardIdentity.
|
||||
|
||||
### Pageserver changes
|
||||
|
||||
#### Structural
|
||||
|
||||
Everywhere the Pageserver currently deals with Tenants, it will move to dealing with
|
||||
`TenantShard`s, which are just a `Tenant` plus a `ShardIdentity` telling it which part
|
||||
of the keyspace it owns. An un-sharded tenant is just a `TenantShard` whose `ShardIdentity`
|
||||
covers the whole keyspace.
|
||||
|
||||
When the pageserver writes layers and index_part.json to remote storage, it must
|
||||
include the shard index & count in the name, to avoid collisions (the count is
|
||||
necessary for future-proofing: the count will vary in time). These keys
|
||||
will also include a generation number: the [generation numbers](025-generation-numbers.md) system will work
|
||||
exactly the same for TenantShards as it does for Tenants today: each shard will have
|
||||
its own generation number.
|
||||
|
||||
#### Storage Format: Keys
|
||||
|
||||
For tenants with >1 shard, layer files implicitly become sparse: within the key
|
||||
range described in the layer name, the layer file for a shard will only hold the
|
||||
content relevant to stripes assigned to the shard.
|
||||
|
||||
For this reason, the LayerFileName within a tenant is no longer unique: different shards
|
||||
may use the same LayerFileName to refer to different data. We may solve this simply
|
||||
by including the shard number in the keys used for layers.
|
||||
|
||||
The shard number will be included as a prefix (as part of tenant ID), like this:
|
||||
|
||||
`pageserver/v1/tenants/<tenant_id>-<shard_number><shard_count>/timelines/<timeline id>/<layer file name>-<generation>`
|
||||
|
||||
`pageserver/v1/tenants/<tenant_id>-<shard_number><shard_count>/timelines/<timeline id>/index_part.json-<generation>`
|
||||
|
||||
Reasons for this particular format:
|
||||
|
||||
- Use of a prefix is convenient for implementation (no need to carry the shard ID everywhere
|
||||
we construct a layer file name), and enables efficient listing of index_parts within
|
||||
a particular shard-timeline prefix.
|
||||
- Including the shard _count_ as well as shard number means that in future when we implement
|
||||
shard splitting, it will be possible for a parent shard and one of its children to write
|
||||
the same layer file without a name collision. For example, a parent shard 0_1 might split
|
||||
into two (0_2, 1_2), and in the process of splitting shard 0_2 could write a layer or index_part
|
||||
that is distinct from what shard 0_1 would have written at the same place.
|
||||
|
||||
In practice, we expect shard counts to be relatively small, so a `u8` will be sufficient,
|
||||
and therefore the shard part of the path can be a fixed-length hex string like `{:02X}{:02X}`,
|
||||
for example a single-shard tenant's prefix will be `0001`.
|
||||
|
||||
For backward compatibility, we may define a special `ShardIdentity` that has shard_count==0,
|
||||
and use this as a cue to construct paths with no prefix at all.
|
||||
|
||||
#### Storage Format: Indices
|
||||
|
||||
In the phase 1 described in this RFC, shards only reference layers they write themselves. However,
|
||||
when we implement shard splitting in future, it will be useful to enable shards to reference layers
|
||||
written by other shards (specifically the parent shard during a split), so that shards don't
|
||||
have to exhaustively copy all data into their own shard-prefixed keys.
|
||||
|
||||
To enable this, the `IndexPart` structure will be extended to store the (shard number, shard count)
|
||||
tuple on each layer, such that it can construct paths for layers written by other shards. This
|
||||
naturally raises the question of who "owns" such layers written by ancestral shards: this problem
|
||||
will be addressed in phase 2.
|
||||
|
||||
For backward compatibility, any index entry without shard information will be assumed to be
|
||||
in the legacy shardidentity.
|
||||
|
||||
#### WAL Ingest
|
||||
|
||||
In Phase 1, all shards will subscribe to the safekeeper to download WAL content. They will filter
|
||||
it down to the pages relevant to their shard:
|
||||
|
||||
- For ordinary user data writes, only retain a write if it matches the ShardIdentity
|
||||
- For metadata describing relations etc, all shards retain these writes.
|
||||
|
||||
The pageservers must somehow give the safekeeper correct feedback on remote_consistent_lsn:
|
||||
one solution here is for the 0th shard to periodically peek at the IndexParts for all the other shards,
|
||||
and have only the 0th shard populate remote_consistent_lsn. However, this is relatively
|
||||
expensive: if the safekeeper can be made shard-aware then it could be taught to use
|
||||
the max() of all shards' remote_consistent_lsns to decide when to trim the WAL.
|
||||
|
||||
#### Compaction/GC
|
||||
|
||||
No changes needed.
|
||||
|
||||
The pageserver doesn't have to do anything special during compaction
|
||||
or GC. It is implicitly operating on the subset of keys that map to its ShardIdentity.
|
||||
This will result in sparse layer files, containing keys only in the stripes that this
|
||||
shard owns. Where optimizations currently exist in compaction for spotting "gaps" in
|
||||
the key range, these should be updated to ignore gaps that are due to sharding, to
|
||||
avoid spuriously splitting up layers ito stripe-sized pieces.
|
||||
|
||||
### Compute Endpoints
|
||||
|
||||
Compute endpoints will need to:
|
||||
|
||||
- Accept a vector of connection strings as part of their configuration from the control plane
|
||||
- Route pageserver requests according to mapping the hash of key to the correct
|
||||
entry in the vector of connection strings.
|
||||
|
||||
Doing this in compute rather than routing requests via a single pageserver is
|
||||
necessary to enable sharding tenants without adding latency from extra hops.
|
||||
|
||||
### Control Plane
|
||||
|
||||
Tenants, or _Projects_ in the control plane, will each own a set of TenantShards (this will
|
||||
be 1 for small tenants). Logic for placement of tenant shards is just the same as the current logic for placing
|
||||
tenants.
|
||||
|
||||
Tenant lifecycle operations like deletion will require fanning-out to all the shards
|
||||
in the tenant. The same goes for timeline creation and deletion: a timeline should
|
||||
not be considered created until it has been created in all shards.
|
||||
|
||||
#### Selectively enabling sharding for large tenants
|
||||
|
||||
Initially, we will explicitly enable sharding for large tenants only.
|
||||
|
||||
In future, this hint mechanism will become optional when we implement automatic
|
||||
re-sharding of tenants.
|
||||
|
||||
## Future Phases
|
||||
|
||||
This section exists to indicate what will likely come next after this phase.
|
||||
|
||||
Phases 2a and 2b are amenable to execution in parallel.
|
||||
|
||||
### Phase 2a: WAL fan-out
|
||||
|
||||
**Problem**: when all shards consume the whole WAL, the network bandwidth used
|
||||
for transmitting the WAL from safekeeper to pageservers is multiplied by a factor
|
||||
of the shard count.
|
||||
|
||||
Network bandwidth is not our most pressing bottleneck, but it is likely to become
|
||||
a problem if we set a modest shard count (~8) on a significant number of tenants,
|
||||
especially as those larger tenants which we shard are also likely to have higher
|
||||
write bandwidth than average.
|
||||
|
||||
### Phase 2b: Shard Splitting
|
||||
|
||||
**Problem**: the number of shards in a tenant is defined at creation time and cannot
|
||||
be changed. This causes excessive sharding for most small tenants, and an upper
|
||||
bound on scale for very large tenants.
|
||||
|
||||
To address this, a _splitting_ feature will later be added. One shard can split its
|
||||
data into a number of children by doing a special compaction operation to generate
|
||||
image layers broken up child-shard-wise, and then writing out an `index_part.json` for
|
||||
each child. This will then require external coordination (by the control plane) to
|
||||
safely attach these new child shards and then move them around to distribute work.
|
||||
The opposite _merging_ operation can also be imagined, but is unlikely to be implemented:
|
||||
once a Tenant has been sharded, the marginal efficiency benefit of merging is unlikely to justify
|
||||
the risk/complexity of implementing such a rarely-encountered scenario.
|
||||
|
||||
### Phase N (future): distributed historical reads
|
||||
|
||||
**Problem**: while sharding based on key is good for handling changes in overall
|
||||
database size, it is less suitable for spiky/unpredictable changes in the read
|
||||
workload to historical layers. Sudden increases in historical reads could result
|
||||
in sudden increases in local disk capacity required for a TenantShard.
|
||||
|
||||
Example: the extreme case of this would be to run a tenant for a year, then create branches
|
||||
with ancestors at monthly intervals. This could lead to a sudden 12x inflation in
|
||||
the on-disk capacity footprint of a TenantShard, since it would be serving reads
|
||||
from all those disparate historical layers.
|
||||
|
||||
If we can respond fast enough, then key-sharding a tenant more finely can help with
|
||||
this, but splitting may be a relatively expensive operation and the increased historical
|
||||
read load may be transient.
|
||||
|
||||
A separate mechanism for handling heavy historical reads could be something like
|
||||
a gossip mechanism for pageservers to communicate
|
||||
about their workload, and then a getpageatlsn offload mechanism where one pageserver can
|
||||
ask another to go read the necessary layers from remote storage to serve the read. This
|
||||
requires relativly little coordination because it is read-only: any node can service any
|
||||
read. All reads to a particular shard would still flow through one node, but the
|
||||
disk capactity & I/O impact of servicing the read would be distributed.
|
||||
|
||||
## FAQ/Alternatives
|
||||
|
||||
### Why stripe the data, rather than using contiguous ranges of keyspace for each shard?
|
||||
|
||||
When a database is growing under a write workload, writes may predominantly hit the
|
||||
end of the keyspace, creating a bandwidth hotspot on that shard. Similarly, if the user
|
||||
is intensively re-writing a particular relation, if that relation lived in a particular
|
||||
shard then it would not achieve our goal of distributing the write work across shards.
|
||||
|
||||
### Why not proxy read requests through one pageserver, so that endpoints don't have to change?
|
||||
|
||||
1. This would not achieve scale-out of network bandwidth: a busy tenant with a large
|
||||
database would still cause a load hotspot on the pageserver routing its read requests.
|
||||
2. The additional hop through the "proxy" pageserver would add latency and overall
|
||||
resource cost (CPU, network bandwidth)
|
||||
|
||||
### Layer File Spreading: use one pageserver as the owner of a tenant, and have it spread out work on a per-layer basis to peers
|
||||
|
||||
In this model, there would be no explicit sharding of work, but the pageserver to which
|
||||
a tenant is attached would not hold all layers on its disk: instead, it would call out
|
||||
to peers to have them store some layers, and call out to those peers to request reads
|
||||
in those layers.
|
||||
|
||||
This mechanism will work well for distributing work in the LSN dimension, but in the key
|
||||
space dimension it has the major limitation of requiring one node to handle all
|
||||
incoming writes, and compactions. Even if the write workload for a large database
|
||||
fits in one pageserver, it will still be a hotspot and such tenants may still
|
||||
de-facto require their own pageserver.
|
||||
Reference in New Issue
Block a user