mirror of
https://github.com/neondatabase/neon.git
synced 2025-12-22 21:59:59 +00:00
rfc: add 041-rel-sparse-keyspace (#10412)
Based on the PoC patch I've done in #10316, I'd like to put an RFC in advance to ensure everyone is on the same page, and start incrementally port the code to the main branch. https://github.com/neondatabase/neon/issues/9516 [Rendered](https://github.com/neondatabase/neon/blob/skyzh/rfc-041-rel-sparse-keyspace/docs/rfcs/041-rel-sparse-keyspace.md) --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Erik Grinaker <erik@neon.tech>
This commit is contained in:
201
docs/rfcs/041-rel-sparse-keyspace.md
Normal file
201
docs/rfcs/041-rel-sparse-keyspace.md
Normal file
@@ -0,0 +1,201 @@
|
||||
# Sparse Keyspace for Relation Directories
|
||||
|
||||
## Summary
|
||||
|
||||
This is an RFC describing a new storage strategy for storing relation directories.
|
||||
|
||||
## Motivation
|
||||
|
||||
Postgres maintains a directory structure for databases and relations. In Neon, we store these information
|
||||
by serializing the directory data in a single key (see `pgdatadir_mapping.rs`).
|
||||
|
||||
```rust
|
||||
// DbDir:
|
||||
// 00 00000000 00000000 00000000 00 00000000
|
||||
|
||||
// RelDir:
|
||||
// 00 SPCNODE DBNODE 00000000 00 00000001 (Postgres never uses relfilenode 0)
|
||||
```
|
||||
|
||||
We have a dedicated structure on the ingestion path to serialize the relation directory into this single key.
|
||||
|
||||
```rust
|
||||
#[derive(Debug, Serialize, Deserialize, Default)]
|
||||
pub(crate) struct RelDirectory {
|
||||
// Set of relations that exist. (relfilenode, forknum)
|
||||
//
|
||||
// TODO: Store it as a btree or radix tree or something else that spans multiple
|
||||
// key-value pairs, if you have a lot of relations
|
||||
pub(crate) rels: HashSet<(Oid, u8)>,
|
||||
}
|
||||
```
|
||||
|
||||
The current codebase has the following three access patterns for the relation directory.
|
||||
|
||||
1. Check if a relation exists.
|
||||
2. List all relations.
|
||||
3. Create/drop a relation.
|
||||
|
||||
For (1), we currently have to get the reldir key, deserialize it, and check whether the relation exists in the
|
||||
hash set. For (2), we get the reldir key and the hash set. For (3), we need first to get
|
||||
and deserialize the key, add the new relation record to the hash set, and then serialize it and write it back.
|
||||
|
||||
If we have 100k relations in a database, we would have a 100k-large hash set. Then, every
|
||||
relation created and dropped would have deserialized and serialized this 100k-large hash set. This makes the
|
||||
relation create/drop process to be quadratic. When we check if a relation exists in the ingestion path,
|
||||
we would have to deserialize this super big 100k-large key before checking if a single relation exists.
|
||||
|
||||
In this RFC, we will propose a new way to store the reldir data in the sparse keyspace and propose how
|
||||
to seamlessly migrate users to use the new keyspace.
|
||||
|
||||
The PoC patch is implemented in [PR10316](https://github.com/neondatabase/neon/pull/10316).
|
||||
|
||||
## Key Mapping
|
||||
|
||||
We will use the recently introduced sparse keyspace to store actual data. Sparse keyspace was proposed in
|
||||
[038-aux-file-v2.md](038-aux-file-v2.md). The original reldir has one single value of `HashSet<(Oid, u8)>`
|
||||
for each of the databases (identified as `spcnode, dbnode`). We encode the `Oid` (`relnode, forknum`),
|
||||
into the key.
|
||||
|
||||
```plain
|
||||
(REL_DIR_KEY_PREFIX, spcnode, dbnode, relnode, forknum, 1) -> deleted
|
||||
(REL_DIR_KEY_PREFIX, spcnode, dbnode, relnode, forknum, 1) -> exists
|
||||
```
|
||||
|
||||
Assume all reldir data are stored in this new keyspace; the 3 reldir operations we mentioned before can be
|
||||
implemented as follows.
|
||||
|
||||
1. Check if a relation exists: check if the key maps to "exists".
|
||||
2. List all relations: scan the sprase keyspace over the `rel_dir_key_prefix`. Extract relnode and forknum from the key.
|
||||
3. Create/drop a relation: write "exists" or "deleted" to the corresponding key of the relation. The delete tombstone will
|
||||
be removed during image layer generation upon compaction.
|
||||
|
||||
Note that "exists" and "deleted" will be encoded as a single byte as two variants of an enum.
|
||||
The mapping is implemented as `rel_tag_sparse_key` in the PoC patch.
|
||||
|
||||
## Changes to Sparse Keyspace
|
||||
|
||||
Previously, we only used sparse keyspaces for the aux files, which did not carry over when branching. The reldir
|
||||
information needs to be preserved from the parent branch to the child branch. Therefore, the read path needs
|
||||
to be updated accordingly to accommodate such "inherited sparse keys". This is done in
|
||||
[PR#10313](https://github.com/neondatabase/neon/pull/10313).
|
||||
|
||||
## Coexistence of the Old and New Keyspaces
|
||||
|
||||
Migrating to the new keyspace will be done gradually: when we flip a config item to enable the new reldir keyspace, the
|
||||
ingestion path will start to write to the new keyspace and the old reldir data will be kept in the old one. The read
|
||||
path needs to combine the data from both keyspaces.
|
||||
|
||||
Theoretically, we could do a rewrite at the startup time that scans all relation directories and copies that data into the
|
||||
new keyspace. However, this could take a long time, especially if we have thousands of tenants doing the migration
|
||||
process simultaneously after the pageserver restarts. Therefore, we propose the coexistence strategy so that the
|
||||
migration can happen seamlessly and imposes no potential downtime for the user.
|
||||
|
||||
With the coexistence assumption, the 3 reldir operations will be implemented as follows:
|
||||
|
||||
1. Check if a relation exists
|
||||
- Check the new keyspace if the key maps to any value. If it maps to "exists" or "deleted", directly
|
||||
return it to the user.
|
||||
- Otherwise, deserialize the old reldir key and get the result.
|
||||
2. List all relations: scan the sparse keyspace over the `rel_dir_key_prefix` and deserialize the old reldir key.
|
||||
Combine them to obtain the final result.
|
||||
3. Create/drop a relation: write "exists" or "deleted" to the corresponding key of the relation into the new keyspace.
|
||||
- We assume no overwrite of relations will happen (i.e., the user won't create a relation at the same Oid). This will be implemented as a runtime check.
|
||||
- For relation creation, we add `sparse_reldir_tableX -> exists` to the keyspace.
|
||||
- For relation drop, we first check if the relation is recorded in the old keyspace. If yes, we deserialize the old reldir key,
|
||||
remove the relation, and then write it back. Otherwise, we put `sparse_reldir_tableX -> deleted` to the keyspace.
|
||||
- The delete tombstone will be removed during image layer generation upon compaction.
|
||||
|
||||
This process ensures that the transition will not introduce any downtime and all new updates are written to the new keyspace. The total
|
||||
amount of data in the storage would be `O(relations_modifications)` and we can guarantee `O(current_relations)` after compaction.
|
||||
There could be some relations that exist in the old reldir key for a long time. Refer to the "Full Migration" section on how to deal
|
||||
with them. Plus, for relation modifications, it will have `O(old_relations)` complexity until we do the full migration, which gives
|
||||
us `O(1)` complexity after fully opt-in the sparse keyspace.
|
||||
|
||||
The process also implies that a relation will only exists either in the old reldir key or in the new sparse keyspace. It is not possible
|
||||
to have a table to be recorded in the old reldir key while later having a delete tombstone for it in the sparse keyspace at any LSN.
|
||||
|
||||
We will introduce a config item and an index_part record to record the current status of the migration process.
|
||||
|
||||
- Config item `enable_reldir_v2`: controls whether the ingestion path writes the reldir info into the new keyspace.
|
||||
- `index_part.json` field `reldir_v2_status`: whether the timeline has written any key into the new reldir keyspace.
|
||||
|
||||
If `enable_reldir_v2` is set to `true` and the timeline ingests the first key into the new reldir keyspace, it will update
|
||||
`index_part.json` to set `reldir_v2_status` to `Status::Migrating`. Even if `enable_reldir_v2` gets flipped back to
|
||||
`false` (i.e., when the pageserver restarts and such config isn't persisted), the read/write path will still
|
||||
read/write to the new keyspace to avoid data inconsistency. This also indicates that the migration is one-way only:
|
||||
once v2 is enabled, the user cannot go back to v1.
|
||||
|
||||
## Next Steps
|
||||
|
||||
### Full Migration
|
||||
|
||||
This won't be implemented in the project's first phase but might be implemented in the future. Having both v1 and
|
||||
v2 existing in the system would force us to keep the code to deserialize the old reldir key forever. To entirely deprecate this
|
||||
code path, we must ensure the timeline has no old reldir data.
|
||||
|
||||
We can trigger a special image layer generation process at the gc-horizon. The generated image layers will cover several keyspaces:
|
||||
the old reldir key in each of the databases, and the new reldir sparse keyspace. It will remove the old reldir key while
|
||||
copying them into the corresponding keys in the sparse keyspace in the resulting image. This special process happens in
|
||||
the background during compaction. For example, assume this special process is triggered at LSN 0/180. The `create_image_layers`
|
||||
process discovers the following keys at this LSN.
|
||||
|
||||
```plain
|
||||
db1/reldir_key -> (table 1, table 2, table 3)
|
||||
...db1 rel keys
|
||||
db2/reldir_key -> (table 4, table 5, table 6)
|
||||
...db2 rel keys
|
||||
sparse_reldir_db2_table7 -> exists
|
||||
sparse_reldir_db1_table8 -> deleted
|
||||
```
|
||||
|
||||
It will generate the following keys:
|
||||
|
||||
```plain
|
||||
db1/reldir_key -> () # we have to keep the key because it is part of `collect_keyspace`.
|
||||
...db1 rel keys
|
||||
db2/reldir_key -> ()
|
||||
...db2 rel keys
|
||||
|
||||
-- start image layer for the sparse keyspace at sparse_reldir_prefix at LSN 0/180
|
||||
sparse_reldir_db1_table1 -> exists
|
||||
sparse_reldir_db1_table2 -> exists
|
||||
sparse_reldir_db1_table3 -> exists
|
||||
sparse_reldir_db2_table4 -> exists
|
||||
sparse_reldir_db2_table5 -> exists
|
||||
sparse_reldir_db2_table6 -> exists
|
||||
sparse_reldir_db2_table7 -> exists
|
||||
-- end image layer for the sparse keyspace at sparse_reldir_prefix+1
|
||||
|
||||
# The `sparse_reldir_db1_table8` key gets dropped as part of the image layer generation code for the sparse keyspace.
|
||||
# Note that the read path will stop reading if a key is not found in the image layer covering the key range so there
|
||||
# are no correctness issue.
|
||||
```
|
||||
|
||||
We must verify that no pending modifications to the old reldir exists in the delta/image layers above the gc-horizon before
|
||||
we start this process (We can do a vectored read to get the full key history of the old reldir key and ensure there are no more images
|
||||
above the gc-horizon). Otherwise, it will violate the property that "a relation will only exists either in the old reldir key or
|
||||
in the new sparse keyspace". After we run this migration process, we can mark `reldir_v2_status` in the `index_part.json` to
|
||||
`Status::Migrated`, and the read path won't need to read from the old reldir anymore. Once the status is set to `Migrated`, we
|
||||
don't need to add the key into `collect_keyspace` and therefore all of them will be removed from all future image layers.
|
||||
|
||||
The migration process can be proactively triggered across all attached/detached tenants to help us fully remove the old reldir code.
|
||||
|
||||
### Consolidate Relation Size Keys
|
||||
|
||||
We have relsize at the end of all relation nodes.
|
||||
|
||||
```plain
|
||||
// RelSize:
|
||||
// 00 SPCNODE DBNODE RELNODE FORK FFFFFFFF
|
||||
```
|
||||
|
||||
This means that computing logical size requires us to do several single-key gets across the keyspace,
|
||||
potentially requiring downloading many layer files. We could consolidate them into a single
|
||||
keyspace, improving logical size calculation performance.
|
||||
|
||||
### Migrate DBDir Keys
|
||||
|
||||
We assume the number of databases created by the users will be small, and therefore, the current way
|
||||
of storing the database directory would be acceptable. In the future, we could also migrate DBDir keys into
|
||||
the sparse keyspace to support large amount of databases.
|
||||
Reference in New Issue
Block a user