# Sharding Phase 1: Static Key-space Sharding
## Summary

To enable databases with sizes approaching the capacity of a pageserver's disk,
it is necessary to break up the storage for the database, or _shard_ it.

Sharding in general is a complex area. This RFC aims to define an initial
capability that will permit creating large-capacity databases using a static configuration
defined at time of Tenant creation.

## Motivation

Currently, all data for a Tenant, including all its timelines, is stored on a single
pageserver. The local storage required may be several times larger than the actual
database size, due to LSM write inflation.

If a database is larger than what one pageserver can hold, then it becomes impossible
for the pageserver to hold it in local storage, as it must do to provide service to
clients.

### Prior art

In Neon:

- Layer File Spreading: https://www.notion.so/neondatabase/One-Pager-Layer-File-Spreading-Konstantin-21fd9b11b618475da5f39c61dd8ab7a4
- Layer File Spreading: https://www.notion.so/neondatabase/One-Pager-Layer-File-Spreading-Christian-eb6b64182a214e11b3fceceee688d843
- Key Space partitioning: https://www.notion.so/neondatabase/One-Pager-Key-Space-Partitioning-Stas-8e3a28a600a04a25a68523f42a170677

Prior art in other distributed systems is too broad to capture here: pretty much
any scale-out storage system does something like this.

## Requirements

- Enable creating a large (for example, 16TiB) database without requiring dedicated
pageserver nodes.
- Share read/write bandwidth costs for large databases across pageservers, as well
as storage capacity, in order to avoid large-capacity databases acting as I/O hotspots
that disrupt service to other tenants.
- Our data distribution scheme should handle sparse/nonuniform keys well, since postgres
does not write out a single contiguous range of page numbers.

_Note: the definition of 'large database' is arbitrary, but the lower bound is to ensure that a database
that a user might create on a current-gen enterprise SSD should also work well on
Neon. The upper bound is whatever postgres can handle: i.e. we must make sure that the
pageserver backend is not the limiting factor in the database size_.

## Non Goals

- Independently distributing timelines within the same tenant. If a tenant has many
timelines, then sharding may be a less efficient mechanism for distributing load than
sharing out timelines between pageservers.
- Distributing work in the LSN dimension: this RFC focuses on the Key dimension only,
based on the idea that separate mechanisms will make sense for each dimension.

## Impacted Components

pageserver, control plane, postgres/smgr

## Terminology

**Key**: a postgres page number, qualified by relation. In the sense that the pageserver is a versioned key-value store,
the page number is the key in that store. `Key` is a literal data type in existing code.

**LSN dimension**: this just means the range of LSNs (history), when talking about the range
of keys and LSNs as a two dimensional space.

## Implementation

### Key sharding vs. LSN sharding

When we think of sharding across the two dimensional key/lsn space, this is an
opportunity to think about how the two dimensions differ:

- Sharding the key space distributes the _write_ workload of ingesting data
and compacting. This work must be carefully managed so that exactly one
node owns a given key.
- Sharding the LSN space distributes the _historical read_ workload. This work
can be done by anyone without any special coordination, as long as they can
see the remote index and layers.

Key sharding is the harder part, and also the more urgent one for supporting larger
capacity databases. Because distributing historical LSN read work is a relatively
simpler problem that most users don't have, we defer it to future work. It is anticipated
that some quite simple P2P offload model will enable distributing work for historical
reads: a node which is low on space can call out to a peer to ask it to download and
serve reads from a historical layer.

### Key mapping scheme

Having decided to focus on key sharding, we must next decide how we will map
keys to shards. It is proposed to use a "wide striping" approach, to obtain a good compromise
between data locality and avoiding entire large relations mapping to the same shard.

We will define two spaces:

- Key space: unsigned integer
- Shard space: integer from 0 to N-1, where we have N shards.

### Key -> Shard mapping

Keys are currently defined in the pageserver's getpage@lsn interface as follows:

```
pub struct Key {
    pub field1: u8,
    pub field2: u32,
    pub field3: u32,
    pub field4: u32,
    pub field5: u8,
    pub field6: u32,
}

fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
    Key {
        field1: 0x00,
        field2: rel.spcnode,
        field3: rel.dbnode,
        field4: rel.relnode,
        field5: rel.forknum,
        field6: blknum,
    }
}
```

_Note: keys for relation metadata are ignored here, as this data will be mirrored to all
shards. For distribution purposes, we only care about user data keys._

The properties we want from our Key->Shard mapping are:

- Locality in `blknum`, such that adjacent `blknum` will usually map to
the same stripe and consequently land on the same shard, even though the overall
collection of blocks in a relation will be spread over many stripes and therefore
many shards.
- Avoid the same `blknum` on different relations landing on the same stripe, so that
with many small relations we do not end up aliasing data to the same stripe/shard.
- Avoid vulnerability to aliasing in the values of relation identity fields, such that
if there are patterns in the value of `relnode`, these do not manifest as patterns
in data placement.

To accomplish this, the `blknum` is used to select a stripe, and stripes are
assigned to shards in a pseudorandom order via a hash. The motivation for
pseudo-random distribution (rather than sequential mapping of stripe to shard)
is to avoid I/O hotspots when sequentially reading multiple relations: we don't want
all relations' stripes to touch pageservers in the same order.

To map a `Key` to a shard:

- Hash the `Key` field 4 (relNode).
- Divide field 6 (`blknum`) by the stripe size in pages, and combine the
hash of this with the hash from the previous step.
- The total hash modulo the shard count gives the shard holding this key.

Why don't we use the other fields in the Key?

- We ignore `forknum` for key mapping, because it distinguishes different classes of data
in the same relation, and we would like to keep the data in a relation together.
- We would like to use spcNode and dbNode, but cannot. Postgres database creation operations can refer to an existing database as a template, such that the created
database's blocks differ only by spcNode and dbNode from the original. To enable running
this type of creation without cross-pageserver communication, we must ensure that these
blocks map to the same shard -- we do this by excluding spcNode and dbNode from the hash.

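As an illustration, here is a minimal sketch of this mapping. The function name, signature,
and the use of Rust's standard `DefaultHasher` are assumptions made for the sketch, not the
pageserver's actual implementation:

```
fn key_to_shard(stripe_size: u32, shard_count: u8, relnode: u32, blknum: u32) -> u8 {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    let mut hasher = DefaultHasher::new();
    relnode.hash(&mut hasher);         // hash of Key field 4 (relNode)
    let stripe = blknum / stripe_size; // Key field 6, reduced to a stripe number
    stripe.hash(&mut hasher);          // combine with the relNode hash
    (hasher.finish() % shard_count as u64) as u8
}
```

With a stripe size of 32k pages and 8 shards, blocks 0..32767 of a relation all fall in
stripe 0 and therefore on a single shard, while block 32768 begins a new stripe that lands
on a pseudo-randomly chosen (and usually different) shard.
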
### Data placement examples

For example, consider two extreme cases of postgres data layout in a large database, in a system with 8 shards
and a stripe size of 32k pages:

- A single large relation: `blknum` division will break the data up into 4096
stripes, which will be scattered across the shards.
- 4096 relations of 32k pages each: each relation will map to exactly one stripe,
and that stripe will be placed according to the hash of key field 4. The
data placement will be statistically uniform across shards.

Data placement will be more uneven on smaller databases:

- A tenant with 2 shards and 2 relations of one stripe size each: there is a 50% chance
that both relations land on the same shard and no data lands on the other shard.
- A tenant with 8 shards and one relation of size 12 stripes: 4 shards will have double
the data of the other four shards.

These uneven cases for small amounts of data do not matter, as long as the stripe size
is an order of magnitude smaller than the amount of data we are comfortable holding
in a single shard: if our system handles shard sizes up to 10-100GB, then it is not an issue if
a tenant has some shards with 256MB size and some shards with 512MB size, even though
the standard deviation of shard size within the tenant is very high. Our key mapping
scheme provides a statistical guarantee that as the tenant's overall data size increases,
uniformity of placement will improve.

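As a rough empirical check of the "4096 relations" example above, reusing the hypothetical
`key_to_shard` sketch from the previous section:

```
fn main() {
    let (stripe_size, shard_count) = (32 * 1024, 8u8);
    let mut per_shard = vec![0u32; shard_count as usize];
    // Place 4096 single-stripe relations (block 0 of each) and count per shard.
    for relnode in 0..4096u32 {
        let shard = key_to_shard(stripe_size, shard_count, relnode, 0);
        per_shard[shard as usize] += 1;
    }
    // Expect roughly 512 relations per shard, with some statistical variance.
    println!("{:?}", per_shard);
}
```
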
### Important Types

#### `ShardIdentity`

Provides the information needed to know whether a particular key belongs
to a particular shard:

- Layout version
- Stripe size
- Shard count
- Shard index

This structure's size is constant. Note that if we had used a different key
mapping scheme such as consistent hashing with explicit hash ranges assigned
to each shard, then the ShardIdentity's size would grow with the shard count: the simpler
key mapping scheme used here enables a small fixed size ShardIdentity.

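As a sketch (field names are assumptions based on the list above, not the actual code), the
structure and its ownership check could look like this, reusing the hypothetical
`key_to_shard` from the key-mapping section:

```
pub struct ShardIdentity {
    pub layout_version: u8, // version of the key->shard mapping algorithm
    pub stripe_size: u32,   // stripe size, in pages
    pub shard_count: u8,    // total number of shards in this tenant
    pub shard_index: u8,    // which of those shards this identity describes
}

impl ShardIdentity {
    // A key belongs to this shard if the mapping selects our index.
    pub fn is_key_local(&self, relnode: u32, blknum: u32) -> bool {
        key_to_shard(self.stripe_size, self.shard_count, relnode, blknum) == self.shard_index
    }
}
```
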
### Pageserver changes

#### Structural

Everywhere the Pageserver currently deals with Tenants, it will move to dealing with
`TenantShard`s, which are just a `Tenant` plus a `ShardIdentity` telling it which part
of the keyspace it owns. An un-sharded tenant is just a `TenantShard` whose `ShardIdentity`
covers the whole keyspace.

When the pageserver writes layers and index_part.json to remote storage, it must
include the shard index & count in the name, to avoid collisions (the count is
necessary for future-proofing: the count will vary over time). These keys
will also include a generation number: the [generation numbers](025-generation-numbers.md) system will work
exactly the same for TenantShards as it does for Tenants today: each shard will have
its own generation number.

#### Storage Format: Keys

For tenants with >1 shard, layer files implicitly become sparse: within the key
range described in the layer name, the layer file for a shard will only hold the
content relevant to stripes assigned to the shard.

For this reason, the LayerFileName within a tenant is no longer unique: different shards
may use the same LayerFileName to refer to different data. We may solve this simply
by including the shard number in the keys used for layers.

The shard number will be included as a prefix (as part of tenant ID), like this:

`pageserver/v1/tenants/<tenant_id>-<shard_number><shard_count>/timelines/<timeline id>/<layer file name>-<generation>`

`pageserver/v1/tenants/<tenant_id>-<shard_number><shard_count>/timelines/<timeline id>/index_part.json-<generation>`

Reasons for this particular format:

- Use of a prefix is convenient for implementation (no need to carry the shard ID everywhere
we construct a layer file name), and enables efficient listing of index_parts within
a particular shard-timeline prefix.
- Including the shard _count_ as well as shard number means that in future when we implement
shard splitting, it will be possible for a parent shard and one of its children to write
the same layer file without a name collision. For example, a parent shard 0_1 might split
into two (0_2, 1_2), and in the process of splitting shard 0_2 could write a layer or index_part
that is distinct from what shard 0_1 would have written at the same place.

In practice, we expect shard counts to be relatively small, so a `u8` will be sufficient,
and therefore the shard part of the path can be a fixed-length hex string like `{:02X}{:02X}`,
for example a single-shard tenant's prefix will be `0001`.

For backward compatibility, we may define a special `ShardIdentity` that has shard_count==0,
and use this as a cue to construct paths with no prefix at all.

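A sketch of constructing this prefix, with hypothetical function and parameter names; it also
applies the backward-compatibility rule of treating shard_count==0 as the legacy, unprefixed
layout:

```
fn tenant_shard_prefix(tenant_id: &str, shard_number: u8, shard_count: u8) -> String {
    if shard_count == 0 {
        // Legacy, unsharded tenant: no shard suffix at all.
        format!("pageserver/v1/tenants/{}", tenant_id)
    } else {
        // Sharded tenant: fixed-width hex, e.g. shard 0 of 1 -> "<tenant_id>-0001".
        format!(
            "pageserver/v1/tenants/{}-{:02X}{:02X}",
            tenant_id, shard_number, shard_count
        )
    }
}
```
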
#### Storage Format: Indices

In the phase 1 described in this RFC, shards only reference layers they write themselves. However,
when we implement shard splitting in future, it will be useful to enable shards to reference layers
written by other shards (specifically the parent shard during a split), so that shards don't
have to exhaustively copy all data into their own shard-prefixed keys.

To enable this, the `IndexPart` structure will be extended to store the (shard number, shard count)
tuple on each layer, such that it can construct paths for layers written by other shards. This
naturally raises the question of who "owns" such layers written by ancestral shards: this problem
will be addressed in phase 2.

For backward compatibility, any index entry without shard information will be assumed to belong
to the legacy (unsharded) `ShardIdentity`.

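As a sketch of what the per-layer index entry might carry (names and serde usage are
illustrative, not the real `IndexPart` definition):

```
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct LayerEntry {
    file_name: String,
    // (shard number, shard count) of the shard that wrote this layer. Absent in
    // indices written before sharding: `None` means the legacy, unsharded identity.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    written_by: Option<(u8, u8)>,
}
```
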
#### WAL Ingest

In Phase 1, all shards will subscribe to the safekeeper to download WAL content. They will filter
it down to the pages relevant to their shard:

- For ordinary user data writes, only retain a write if it matches the ShardIdentity.
- For metadata describing relations etc., all shards retain these writes.

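A sketch of that filtering decision, reusing the hypothetical `ShardIdentity::is_key_local`
above; the record variants are illustrative stand-ins for the real WAL decoding types:

```
enum DecodedRecord {
    PageWrite { relnode: u32, blknum: u32 }, // ordinary user data write
    RelationMetadata,                        // relation creation, truncation, etc.
}

fn shard_should_ingest(identity: &ShardIdentity, record: &DecodedRecord) -> bool {
    match record {
        // User data: keep only if the key maps to this shard.
        DecodedRecord::PageWrite { relnode, blknum } => identity.is_key_local(*relnode, *blknum),
        // Metadata is mirrored to every shard.
        DecodedRecord::RelationMetadata => true,
    }
}
```
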
The pageservers must somehow give the safekeeper correct feedback on remote_consistent_lsn:
one solution here is for the 0th shard to periodically peek at the IndexParts for all the other shards,
and have only the 0th shard populate remote_consistent_lsn. However, this is relatively
expensive: if the safekeeper can be made shard-aware then it could be taught to use
the min() of all shards' remote_consistent_lsns to decide when it is safe to trim the WAL.

#### Compaction/GC

No changes needed.

The pageserver doesn't have to do anything special during compaction
or GC. It is implicitly operating on the subset of keys that map to its ShardIdentity.
This will result in sparse layer files, containing keys only in the stripes that this
shard owns. Where optimizations currently exist in compaction for spotting "gaps" in
the key range, these should be updated to ignore gaps that are due to sharding, to
avoid spuriously splitting up layers into stripe-sized pieces.

### Compute Endpoints

Compute endpoints will need to:

- Accept a vector of connection strings as part of their configuration from the control plane
- Route pageserver requests by mapping the hash of the key to the correct
entry in the vector of connection strings.

Doing this in compute rather than routing requests via a single pageserver is
necessary to enable sharding tenants without adding latency from extra hops.

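A sketch of the routing decision, reusing the hypothetical `key_to_shard` above. The real
change is in the compute's C smgr code; Rust is used here only for consistency with the rest
of this document, and the names are assumptions:

```
struct ShardMap {
    stripe_size: u32,
    // One pageserver connection string per shard, indexed by shard number.
    connstrings: Vec<String>,
}

impl ShardMap {
    fn connstring_for(&self, relnode: u32, blknum: u32) -> &str {
        let shard = key_to_shard(self.stripe_size, self.connstrings.len() as u8, relnode, blknum);
        self.connstrings[shard as usize].as_str()
    }
}
```
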
### Control Plane

Tenants, or _Projects_ in the control plane, will each own a set of TenantShards (this will
be 1 for small tenants). Logic for placement of tenant shards is just the same as the current logic for placing
tenants.

Tenant lifecycle operations like deletion will require fanning-out to all the shards
in the tenant. The same goes for timeline creation and deletion: a timeline should
not be considered created until it has been created in all shards.

#### Selectively enabling sharding for large tenants

Initially, we will explicitly enable sharding for large tenants only.

In future, this explicit configuration will become optional, when we implement automatic
re-sharding of tenants.

## Future Phases

This section exists to indicate what will likely come next after this phase.

Phases 2a and 2b are amenable to execution in parallel.

### Phase 2a: WAL fan-out

**Problem**: when all shards consume the whole WAL, the network bandwidth used
for transmitting the WAL from safekeeper to pageservers is multiplied by a factor
of the shard count.

Network bandwidth is not our most pressing bottleneck, but it is likely to become
a problem if we set a modest shard count (~8) on a significant number of tenants,
especially as those larger tenants which we shard are also likely to have higher
write bandwidth than average.

### Phase 2b: Shard Splitting

**Problem**: the number of shards in a tenant is defined at creation time and cannot
be changed. This causes excessive sharding for most small tenants, and an upper
bound on scale for very large tenants.

To address this, a _splitting_ feature will later be added. One shard can split its
data into a number of children by doing a special compaction operation to generate
image layers broken up child-shard-wise, and then writing out an `index_part.json` for
each child. This will then require external coordination (by the control plane) to
safely attach these new child shards and then move them around to distribute work.

The opposite _merging_ operation can also be imagined, but is unlikely to be implemented:
once a Tenant has been sharded, the marginal efficiency benefit of merging is unlikely to justify
the risk/complexity of implementing such a rarely-encountered scenario.

### Phase N (future): distributed historical reads

**Problem**: while sharding based on key is good for handling changes in overall
database size, it is less suitable for spiky/unpredictable changes in the read
workload to historical layers. Sudden increases in historical reads could result
in sudden increases in local disk capacity required for a TenantShard.

Example: the extreme case of this would be to run a tenant for a year, then create branches
with ancestors at monthly intervals. This could lead to a sudden 12x inflation in
the on-disk capacity footprint of a TenantShard, since it would be serving reads
from all those disparate historical layers.

If we can respond fast enough, then key-sharding a tenant more finely can help with
this, but splitting may be a relatively expensive operation and the increased historical
read load may be transient.

A separate mechanism for handling heavy historical reads could be something like
a gossip mechanism for pageservers to communicate about their workload, and then a
getpage@lsn offload mechanism where one pageserver can ask another to go read the
necessary layers from remote storage to serve the read. This requires relatively
little coordination because it is read-only: any node can service any read. All
reads to a particular shard would still flow through one node, but the disk
capacity & I/O impact of servicing the read would be distributed.

## FAQ/Alternatives

### Why stripe the data, rather than using contiguous ranges of keyspace for each shard?

When a database is growing under a write workload, writes may predominantly hit the
end of the keyspace, creating a bandwidth hotspot on that shard. Similarly, if the user
is intensively re-writing a particular relation and that relation lived on a single
shard, then we would not achieve our goal of distributing the write work across shards.

### Why not proxy read requests through one pageserver, so that endpoints don't have to change?

1. This would not achieve scale-out of network bandwidth: a busy tenant with a large
database would still cause a load hotspot on the pageserver routing its read requests.
2. The additional hop through the "proxy" pageserver would add latency and overall
resource cost (CPU, network bandwidth).

### Layer File Spreading: use one pageserver as the owner of a tenant, and have it spread out work on a per-layer basis to peers

In this model, there would be no explicit sharding of work, but the pageserver to which
a tenant is attached would not hold all layers on its disk: instead, it would call out
to peers to have them store some layers, and call out to those peers to request reads
in those layers.

This mechanism will work well for distributing work in the LSN dimension, but in the key
space dimension it has the major limitation of requiring one node to handle all
incoming writes, and compactions. Even if the write workload for a large database
fits in one pageserver, it will still be a hotspot and such tenants may still
de-facto require their own pageserver.