mirror of
https://github.com/neondatabase/neon.git
synced 2026-05-19 06:00:38 +00:00
docs: shard splitting RFC (#6358)
Extend the previous sharding RFC with functionality for dynamically splitting shards to increase the total shard count on existing tenants.
This commit is contained in:
479
docs/rfcs/032-shard-splitting.md
Normal file
479
docs/rfcs/032-shard-splitting.md
Normal file
@@ -0,0 +1,479 @@
|
||||
# Shard splitting
|
||||
|
||||
## Summary
|
||||
|
||||
This RFC describes a new pageserver API for splitting an existing tenant shard into
|
||||
multiple shards, and describes how to use this API to safely increase the total
|
||||
shard count of a tenant.
|
||||
|
||||
## Motivation
|
||||
|
||||
In the [sharding RFC](031-sharding-static.md), a mechanism was introduced to scale
|
||||
tenants beyond the capacity of a single pageserver by breaking up the key space
|
||||
into stripes, and distributing these stripes across many pageservers. However,
|
||||
the shard count was defined once at tenant creation time and not varied thereafter.
|
||||
|
||||
In practice, the expected size of a database is rarely known at creation time, and
|
||||
it is inefficient to enable sharding for very small tenants: we need to be
|
||||
able to create a tenant with a small number of shards (such as 1), and later expand
|
||||
when it becomes clear that the tenant has grown in size to a point where sharding
|
||||
is beneficial.
|
||||
|
||||
### Prior art
|
||||
|
||||
Many distributed systems have the problem of choosing how many shards to create for
|
||||
tenants that do not specify an expected size up-front. There are a couple of general
|
||||
approaches:
|
||||
|
||||
- Write to a key space in order, and start a new shard when the highest key advances
|
||||
past some point. This doesn't work well for Neon, because we write to our key space
|
||||
in many different contiguous ranges (per relation), rather than in one contiguous
|
||||
range. To adapt to this kind of model, we would need a sharding scheme where each
|
||||
relation had its own range of shards, which would be inefficient for the common
|
||||
case of databases with many small relations.
|
||||
- Monitor the system, and automatically re-shard at some size threshold. For
|
||||
example in Ceph, the [pg_autoscaler](https://github.com/ceph/ceph/blob/49c27499af4ee9a90f69fcc6bf3597999d6efc7b/src/pybind/mgr/pg_autoscaler/module.py)
|
||||
component monitors the size of each RADOS Pool, and adjusts the number of Placement
|
||||
Groups (Ceph's shard equivalent).
|
||||
|
||||
## Requirements
|
||||
|
||||
- A configurable capacity limit per-shard is enforced.
|
||||
- Changes in shard count do not interrupt service beyond requiring postgres
|
||||
to reconnect (i.e. milliseconds).
|
||||
- Human being does not have to choose shard count
|
||||
|
||||
## Non Goals
|
||||
|
||||
- Shard splitting is always a tenant-global operation: we will not enable splitting
|
||||
one shard while leaving others intact.
|
||||
- The inverse operation (shard merging) is not described in this RFC. This is a lower
|
||||
priority than splitting, because databases grow more often than they shrink, and
|
||||
a database with many shards will still work properly if the stored data shrinks, just
|
||||
with slightly more overhead (e.g. redundant WAL replication)
|
||||
- Shard splitting is only initiated based on capacity bounds, not load. Splitting
|
||||
a tenant based on load will make sense for some medium-capacity, high-load workloads,
|
||||
but is more complex to reason about and likely is not desirable until we have
|
||||
shard merging to reduce the shard count again if the database becomes less busy.
|
||||
|
||||
## Impacted Components
|
||||
|
||||
pageserver, storage controller
|
||||
|
||||
(the _storage controller_ is the evolution of what was called `attachment_service` in our test environment)
|
||||
|
||||
## Terminology
|
||||
|
||||
**Parent** shards are the shards that exist before a split. **Child** shards are
|
||||
the new shards created during a split.
|
||||
|
||||
**Shard** is synonymous with _tenant shard_.
|
||||
|
||||
**Shard Index** is the 2-tuple of shard number and shard count, written in
|
||||
paths as {:02x}{:02x}, e.g. `0001`.
|
||||
|
||||
## Background
|
||||
|
||||
In the implementation section, a couple of existing aspects of sharding are important
|
||||
to remember:
|
||||
|
||||
- Shard identifiers contain the shard number and count, so that "shard 0 of 1" (`0001`) is
|
||||
a distinct shard from "shard 0 of 2" (`0002`). This is the case in key paths, local
|
||||
storage paths, and remote index metadata.
|
||||
- Remote layer file paths contain the shard index of the shard that created them, and
|
||||
remote indices contain the same index to enable building the layer file path. A shard's
|
||||
index may reference layers that were created by another shard.
|
||||
- Local tenant shard directories include the shard index. All layers downloaded by
|
||||
a tenant shard are stored in this shard-prefixed path, even if those layers were
|
||||
initially created by another shard: tenant shards do not read and write one anothers'
|
||||
paths.
|
||||
- The `Tenant` pageserver type represents one tenant _shard_, not the whole tenant.
|
||||
This is for historical reasons and will be cleaned up in future, but the existing
|
||||
name is used here to help comprehension when reading code.
|
||||
|
||||
## Implementation
|
||||
|
||||
Note: this section focuses on the correctness of the core split process. This will
|
||||
be fairly inefficient in a naive implementation, and several important optimizations
|
||||
are described in a later section.
|
||||
|
||||
There are broadly two parts to the implementation:
|
||||
|
||||
1. The pageserver split API, which splits one shard on one pageserver
|
||||
2. The overall tenant split proccess which is coordinated by the storage controller,
|
||||
and calls into the pageserver split API as needed.
|
||||
|
||||
### Pageserver Split API
|
||||
|
||||
The pageserver will expose a new API endpoint at `/v1/tenant/:tenant_shard_id/shard_split`
|
||||
that takes the new total shard count in the body.
|
||||
|
||||
The pageserver split API operates on one tenant shard, on one pageserver. External
|
||||
coordination is required to use it safely, this is described in the later
|
||||
'Split procedure' section.
|
||||
|
||||
#### Preparation
|
||||
|
||||
First identify the shard indices for the new child shards. These are deterministic,
|
||||
calculated from the parent shard's index, and the number of children being created (this
|
||||
is an input to the API, and validated to be a power of two). In a trivial example, splitting
|
||||
0001 in two always results in 0002 and 0102.
|
||||
|
||||
Child shard indices are chosen such that the childrens' parts of the keyspace will
|
||||
be subsets of the parent's parts of the keyspace.
|
||||
|
||||
#### Step 1: write new remote indices
|
||||
|
||||
In remote storage, splitting is very simple: we may just write new index_part.json
|
||||
objects for each child shard, containing exactly the same layers as the parent shard.
|
||||
|
||||
The children will have more data than they need, but this avoids any exhausive
|
||||
re-writing or copying of layer files.
|
||||
|
||||
The index key path includes a generation number: the parent shard's current
|
||||
attached generation number will also be used for the child shards' indices. This
|
||||
makes the operation safely retryable: if everything crashes and restarts, we may
|
||||
call the split API again on the parent shard, and the result will be some new remote
|
||||
indices for the child shards, under a higher generation number.
|
||||
|
||||
#### Step 2: start new `Tenant` objects
|
||||
|
||||
A new `Tenant` object may be instantiated for each child shard, while the parent
|
||||
shard still exists. When calling the tenant_spawn function for this object,
|
||||
the remote index from step 1 will be read, and the child shard will start
|
||||
to ingest WAL to catch up from whatever was in the remote storage at step 1.
|
||||
|
||||
We now wait for child shards' WAL ingestion to catch up with the parent shard,
|
||||
so that we can safely tear down the parent shard without risking an availability
|
||||
gap to clients reading recent LSNs.
|
||||
|
||||
#### Step 3: tear down parent `Tenant` object
|
||||
|
||||
Once child shards are running and have caught up with WAL ingest, we no longer
|
||||
need the parent shard. Note that clients may still be using it -- when we
|
||||
shut it down, any page_service handlers will also shut down, causing clients
|
||||
to disconnect. When the client reconnects, it will re-lookup the tenant,
|
||||
and hit the child shard instead of the parent (shard lookup from page_service
|
||||
should bias toward higher ShardCount shards).
|
||||
|
||||
Note that at this stage the page service client has not yet been notified of
|
||||
any split. In the trivial single split example:
|
||||
|
||||
- Shard 0001 is gone: Tenant object torn down
|
||||
- Shards 0002 and 0102 are running on the same pageserver where Shard 0001 used to live.
|
||||
- Clients will continue to connect to that server thinking that shard 0001 is there,
|
||||
and all requests will work, because any key that was in shard 0001 is definitely
|
||||
available in either shard 0002 or shard 0102.
|
||||
- Eventually, the storage controller (not the pageserver) will decide to migrate
|
||||
some child shards away: at that point it will do a live migration, ensuring
|
||||
that the client has an updated configuration before it detaches anything
|
||||
from the original server.
|
||||
|
||||
#### Complete
|
||||
|
||||
When we send a 200 response to the split request, we are promising the caller:
|
||||
|
||||
- That the child shards are persistent in remote storage
|
||||
- That the parent shard has been shut down
|
||||
|
||||
This enables the caller to proceed with the overall shard split operation, which
|
||||
may involve other shards on other pageservers.
|
||||
|
||||
### Storage Controller Split procedure
|
||||
|
||||
Splitting a tenant requires calling the pageserver split API, and tracking
|
||||
enough state to ensure recovery + completion in the event of any component (pageserver
|
||||
or storage controller) crashing (or request timing out) during the split.
|
||||
|
||||
1. call the split API on all existing shards. Ensure that the resulting
|
||||
child shards are pinned to their pageservers until _all_ the split calls are done.
|
||||
This pinning may be implemented as a "split bit" on the tenant shards, that
|
||||
blocks any migrations, and also acts as a sign that if we restart, we must go
|
||||
through some recovery steps to resume the split.
|
||||
2. Once all the split calls are done, we may unpin the child shards (clear
|
||||
the split bit). The split is now complete: subsequent steps are just migrations,
|
||||
not strictly part of the split.
|
||||
3. Try to schedule new pageserver locations for the child shards, using
|
||||
a soft anti-affinity constraint to place shards from the same tenant onto different
|
||||
pageservers.
|
||||
|
||||
Updating computes about the new shard count is not necessary until we migrate
|
||||
any of the child shards away from the parent's location.
|
||||
|
||||
### Recovering from failures
|
||||
|
||||
#### Rolling back an incomplete split
|
||||
|
||||
An incomplete shard split may be rolled back quite simply, by attaching the parent shards to pageservers,
|
||||
and detaching child shards. This will lose any WAL ingested into the children after the parents
|
||||
were detached earlier, but the parents will catch up.
|
||||
|
||||
No special pageserver API is needed for this. From the storage controllers point of view, the
|
||||
procedure is:
|
||||
|
||||
1. For all parent shards in the tenant, ensure they are attached
|
||||
2. For all child shards, ensure they are not attached
|
||||
3. Drop child shards from the storage controller's database, and clear the split bit on the parent shards.
|
||||
|
||||
Any remote storage content for child shards is left behind. This is similar to other cases where
|
||||
we may leave garbage objects in S3 (e.g. when we upload a layer but crash before uploading an
|
||||
index that references it). Future online scrub/cleanup functionality can remove these objects, or
|
||||
they will be removed when the tenant is deleted, as tenant deletion lists all objects in the prefix,
|
||||
which would include any child shards that were rolled back.
|
||||
|
||||
If any timelines had been created on child shards, they will be lost when rolling back. To mitigate
|
||||
this, we will **block timeline creation during splitting**, so that we can safely roll back until
|
||||
the split is complete, without risking losing timelines.
|
||||
|
||||
Rolling back an incomplete split will happen automatically if a split fails due to some fatal
|
||||
reason, and will not be accessible via an API:
|
||||
|
||||
- A pageserver fails to complete its split API request after too many retries
|
||||
- A pageserver returns a fatal unexpected error such as 400 or 500
|
||||
- The storage controller database returns a non-retryable error
|
||||
- Some internal invariant is violated in the storage controller split code
|
||||
|
||||
#### Rolling back a complete split
|
||||
|
||||
A complete shard split may be rolled back similarly to an incomplete split, with the following
|
||||
modifications:
|
||||
|
||||
- The parent shards will no longer exist in the storage controller database, so these must
|
||||
be re-synthesized somehow: the hard part of this is figuring the parent shards' generations. This
|
||||
may be accomplished either by probing in S3, or by retaining some tombstone state for deleted
|
||||
shards in the storage controller database.
|
||||
- Any timelines that were created after the split complete will disappear when rolling back
|
||||
to the tenant shards. For this reason, rolling back after a complete split should only
|
||||
be done due to serious issues where loss of recently created timelines is acceptable, or
|
||||
in cases where we have confirmed that no timelines were created in the intervening period.
|
||||
- Parent shards' layers must not have been deleted: this property will come "for free" when
|
||||
we first roll out sharding, by simply not implementing deletion of parent layers after
|
||||
a split. When we do implement such deletion (see "Cleaning up parent-shard layers" in the
|
||||
Optimizations section), it should apply a TTL to layers such that we have a
|
||||
defined walltime window in which rollback will be possible.
|
||||
|
||||
The storage controller will expose an API for rolling back a complete split, for use
|
||||
in the field if we encounter some critical bug with a post-split tenant.
|
||||
|
||||
#### Retrying API calls during Pageserver Restart
|
||||
|
||||
When a pageserver restarts during a split API call, it may witness on-disk content for both parent and
|
||||
child shards from an ongoing split. This does not intrinsically break anything, and the
|
||||
pageserver may include all these shards in its `/re-attach` request to the storage controller.
|
||||
|
||||
In order to support such restarts, it is important that the storage controller stores
|
||||
persistent records of each child shard before it calls into a pageserver, as these child shards
|
||||
may require generation increments via a `/re-attach` request.
|
||||
|
||||
The pageserver restart will also result in a failed API call from the storage controller's point
|
||||
of view. Recall that if _any_ pageserver fails to split, the overall split operation may not
|
||||
complete, and all shards must remain pinned to their current pageserver locations until the
|
||||
split is done.
|
||||
|
||||
The pageserver API calls during splitting will retry on transient errors, so that
|
||||
short availability gaps do not result in a failure of the overall operation. The
|
||||
split in progress will be automatically rolled back if the threshold for API
|
||||
retries is reached (e.g. if a pageserver stays offline for longer than a typical
|
||||
restart).
|
||||
|
||||
#### Rollback on Storage Controller Restart
|
||||
|
||||
On startup, the storage controller will inspect the split bit for tenant shards that
|
||||
it loads from the database. If any splits are in progress:
|
||||
|
||||
- Database content will be reverted to the parent shards
|
||||
- Child shards will be dropped from memory
|
||||
- The parent and child shards will be included in the general startup reconciliation that
|
||||
the storage controller does: any child shards will be detached from pageservers because
|
||||
they don't exist in the storage controller's expected set of shards, and parent shards
|
||||
will be attached if they aren't already.
|
||||
|
||||
#### Storage controller API request failures/retries
|
||||
|
||||
The split request handler will implement idempotency: if the [`Tenant`] requested to split
|
||||
doesn't exist, we will check for the would-be child shards, and if they already exist,
|
||||
we consider the request complete.
|
||||
|
||||
If a request is retried while the original request is still underway, then the split
|
||||
request handler will notice an InProgress marker in TenantManager, and return 503
|
||||
to encourage the client to backoff/retry. This is the same as the general pageserver
|
||||
API handling for calls that try to act on an InProgress shard.
|
||||
|
||||
#### Compute start/restart during a split
|
||||
|
||||
If a compute starts up during split, it will be configured with the old sharding
|
||||
configuration. This will work for reads irrespective of the progress of the split
|
||||
as long as no child hards have been migrated away from their original location, and
|
||||
this is guaranteed in the split procedure (see earlier section).
|
||||
|
||||
#### Pageserver fails permanently during a split
|
||||
|
||||
If a pageserver permanently fails (i.e. the storage controller availability state for it
|
||||
goes to Offline) while a split is in progress, the splitting operation will roll back, and
|
||||
during the roll back it will skip any API calls to the offline pageserver. If the offline
|
||||
pageserver becomes available again, any stale locations will be cleaned up via the normal reconciliation process (the `/re-attach` API).
|
||||
|
||||
### Handling secondary locations
|
||||
|
||||
For correctness, it is not necessary to split secondary locations. We can simply detach
|
||||
the secondary locations for parent shards, and then attach new secondary locations
|
||||
for child shards.
|
||||
|
||||
Clearly this is not optimal, as it will result in re-downloads of layer files that
|
||||
were already present on disk. See "Splitting secondary locations"
|
||||
|
||||
### Conditions to trigger a split
|
||||
|
||||
The pageserver will expose a new API for reporting on shards that are candidates
|
||||
for split: this will return a top-N report of the largest tenant shards by
|
||||
physical size (remote size). This should exclude any tenants that are already
|
||||
at the maximum configured shard count.
|
||||
|
||||
The API would look something like:
|
||||
`/v1/top_n_tenant?shard_count_lt=8&sort_by=resident_size`
|
||||
|
||||
The storage controller will poll that API across all pageservers it manages at some appropriate interval (e.g. 60 seconds).
|
||||
|
||||
A split operation will be started when the tenant exceeds some threshold. This threshold
|
||||
should be _less than_ how large we actually want shards to be, perhaps much less. That's to
|
||||
minimize the amount of work involved in splitting -- if we want 100GiB shards, we shouldn't
|
||||
wait for a tenant to exceed 100GiB before we split anything. Some data analysis of existing
|
||||
tenant size distribution may be useful here: if we can make a statement like "usually, if
|
||||
a tenant has exceeded 20GiB they're probably going to exceed 100GiB later", then we might
|
||||
make our policy to split a tenant at 20GiB.
|
||||
|
||||
The finest split we can do is by factors of two, but we can do higher-cardinality splits
|
||||
too, and this will help to reduce the overhead of repeatedly re-splitting a tenant
|
||||
as it grows. An example of a very simple heuristic for early deployment of the splitting
|
||||
feature would be: "Split tenants into 8 shards when their physical size exceeds 64GiB": that
|
||||
would give us two kinds of tenant (1 shard and 8 shards), and the confidence that once we had
|
||||
split a tenant, it will not need re-splitting soon after.
|
||||
|
||||
## Optimizations
|
||||
|
||||
### Flush parent shard to remote storage during split
|
||||
|
||||
Any data that is in WAL but not remote storage at time of split will need
|
||||
to be replayed by child shards when they start for the first time. To minimize
|
||||
this work, we may flush the parent shard to remote storage before writing the
|
||||
remote indices for child shards.
|
||||
|
||||
It is important that this flush is subject to some time bounds: we may be splitting
|
||||
in response to a surge of write ingest, so it may be time-critical to split. A
|
||||
few seconds to flush latest data should be sufficient to optimize common cases without
|
||||
running the risk of holding up a split for a harmful length of time when a parent
|
||||
shard is being written heavily. If the flush doesn't complete in time, we may proceed
|
||||
to shut down the parent shard and carry on with the split.
|
||||
|
||||
### Hard linking parent layers into child shard directories
|
||||
|
||||
Before we start the Tenant objects for child shards, we may pre-populate their
|
||||
local storage directories with hard links to the layer files already present
|
||||
in the parent shard's local directory. When the child shard starts and downloads
|
||||
its remote index, it will find all those layer files already present on local disk.
|
||||
|
||||
This avoids wasting download capacity and makes splitting faster, but more importantly
|
||||
it avoids taking up a factor of N more disk space when splitting 1 shard into N.
|
||||
|
||||
This mechanism will work well in typical flows where shards are migrated away
|
||||
promptly after a split, but for the general case including what happens when
|
||||
layers are evicted and re-downloaded after a split, see the 'Proactive compaction'
|
||||
section below.
|
||||
|
||||
### Filtering during compaction
|
||||
|
||||
Compaction, especially image layer generation, should skip any keys that are
|
||||
present in a shard's layer files, but do not match the shard's ShardIdentity's
|
||||
is_key_local() check. This avoids carrying around data for longer than necessary
|
||||
in post-split compactions.
|
||||
|
||||
This was already implemented in https://github.com/neondatabase/neon/pull/6246
|
||||
|
||||
### Proactive compaction
|
||||
|
||||
In remote storage, there is little reason to rewrite any data on a shard split:
|
||||
all the children can reference parent layers via the very cheap write of the child
|
||||
index_part.json.
|
||||
|
||||
In local storage, things are more nuanced. During the initial split there is no
|
||||
capacity cost to duplicating parent layers, if we implement the hard linking
|
||||
optimization described above. However, as soon as any layers are evicted from
|
||||
local disk and re-downloaded, the downloaded layers will not be hard-links any more:
|
||||
they'll have real capacity footprint. That isn't a problem if we migrate child shards
|
||||
away from the parent node swiftly, but it risks a significant over-use of local disk
|
||||
space if we do not.
|
||||
|
||||
For example, if we did an 8-way split of a shard, and then _didn't_ migrate 7 of
|
||||
the shards elsewhere, then churned all the layers in all the shards via eviction,
|
||||
then we would blow up the storage capacity used on the node by 8x. If we're splitting
|
||||
a 100GB shard, that could take the pageserver to the point of exhausting disk space.
|
||||
|
||||
To avoid this scenario, we could implement a special compaction mode where we just
|
||||
read historic layers, drop unwanted keys, and write back the layer file. This
|
||||
is pretty expensive, but useful if we have split a large shard and are not going to
|
||||
migrate the child shards away.
|
||||
|
||||
The heuristic conditions for triggering such a compaction are:
|
||||
|
||||
- A) eviction plus time: if a child shard
|
||||
has existed for more than a time threshold, and has been requested to perform at least one eviction, then it becomes urgent for this child shard to execute a proactive compaction to reduce its storage footprint, at the cost of I/O load.
|
||||
- B) resident size plus time: we may inspect the resident layers and calculate how
|
||||
many of them include the overhead of storing pre-split keys. After some time
|
||||
threshold (different to the one in case A) we still have such layers occupying
|
||||
local disk space, then we should proactively compact them.
|
||||
|
||||
### Cleaning up parent-shard layers
|
||||
|
||||
It is functionally harmless to leave parent shard layers in remote storage indefinitely.
|
||||
They would be cleaned up in the event of the tenant's deletion.
|
||||
|
||||
As an optimization to avoid leaking remote storage capacity (which costs money), we may
|
||||
lazily clean up parent shard layers once no child shards reference them.
|
||||
|
||||
This may be done _very_ lazily: e.g. check every PITR interval. The cleanup procedure is:
|
||||
|
||||
- list all the key prefixes beginning with the tenant ID, and select those shard prefixes
|
||||
which do not belong to the most-recently-split set of shards (_ancestral shards_, i.e. `shard*count < max(shard_count) over all shards)`, and those shard prefixes which do have the latest shard count (_current shards_)
|
||||
- If there are no _ancestral shard_ prefixes found, we have nothing to clean up and
|
||||
may drop out now.
|
||||
- find the latest-generation index for each _current shard_, read all and accumulate the set of layers belonging to ancestral shards referenced by these indices.
|
||||
- for all ancestral shards, list objects in the prefix and delete any layer which was not
|
||||
referenced by a current shard.
|
||||
|
||||
If this cleanup is scheduled for 1-2 PITR periods after the split, there is a good chance that child shards will have written their own image layers covering the whole keyspace, such that all parent shard layers will be deletable.
|
||||
|
||||
The cleanup may be done by the scrubber (external process), or we may choose to have
|
||||
the zeroth shard in the latest generation do the work -- there is no obstacle to one shard
|
||||
reading the other shard's indices at runtime, and we do not require visibility of the
|
||||
latest index writes.
|
||||
|
||||
Cleanup should be artificially delayed by some period (for example 24 hours) to ensure
|
||||
that we retain the option to roll back a split in case of bugs.
|
||||
|
||||
### Splitting secondary locations
|
||||
|
||||
We may implement a pageserver API similar to the main splitting API, which does a simpler
|
||||
operation for secondary locations: it would not write anything to S3, instead it would simply
|
||||
create the child shard directory on local disk, hard link in directories from the parent,
|
||||
and set up the in memory (TenantSlot) state for the children.
|
||||
|
||||
Similar to attached locations, a subset of secondary locations will probably need re-locating
|
||||
after the split is complete, to avoid leaving multiple child shards on the same pageservers,
|
||||
where they may use excessive space for the tenant.
|
||||
|
||||
## FAQ/Alternatives
|
||||
|
||||
### What should the thresholds be set to?
|
||||
|
||||
Shard size limit: the pre-sharding default capacity quota for databases was 200GiB, so this could be a starting point for the per-shard size limit.
|
||||
|
||||
Max shard count:
|
||||
|
||||
- The safekeeper overhead to sharding is currently O(N) network bandwidth because
|
||||
the un-filtered WAL is sent to all shards. To avoid this growing out of control,
|
||||
a limit of 8 shards should be temporarily imposed until WAL filtering is implemented
|
||||
on the safekeeper.
|
||||
- there is also little benefit to increasing the shard count beyond the number
|
||||
of pageservers in a region.
|
||||
|
||||
### Is it worth just rewriting all the data during a split to simplify reasoning about space?
|
||||
Reference in New Issue
Block a user