Don't use the term 'snapshot' to describe layers

anastasia
2021-09-02 14:45:58 +03:00
committed by lubennikovaav
parent c6678c5dea
commit fabf5ec664
11 changed files with 85 additions and 78 deletions

View File

@@ -26,7 +26,7 @@ A checkpoint record in the WAL marks a point in the WAL sequence at which it is
NOTE: This is an overloaded term.
Whenever enough WAL has been accumulated in memory, the page server
writes out the changes in memory into new layer files. This process
writes out the changes from in-memory layers into new layer files. This process
is called "checkpointing". The page server only creates layer files for
relations that have been modified since the last checkpoint.
@@ -41,17 +41,28 @@ Stateless Postgres node that stores data in pageserver.
Each of the separate segmented file sets in which a relation is stored. The main fork is where the actual data resides. There also exist two secondary forks for metadata: the free space map and the visibility map.
Each PostgreSQL fork is considered a separate relish.
### Layer file
### Layer
Each layer corresponds to one RELISH_SEG_SIZE slice of a relish in a range of LSNs.
There are two kinds of layers, in-memory and on-disk layers. In-memory
layers are used to ingest incoming WAL, and provide fast access
to the recent page versions. On-disk layers are stored as files on disk, and
are immutable.
### Layer file (on-disk layer)
Layered repository on-disk format is based on immutable files. The
files are called "layer files". Each file corresponds to one 10 MB
files are called "layer files". Each file corresponds to one RELISH_SEG_SIZE
segment of a PostgreSQL relation fork. There are two kinds of layer
files: image files and delta files. An image file contains a
"snapshot" of the segment at a particular LSN, and a delta file
contains WAL records applicable to the segment, in a range of LSNs.
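
To make the image/delta distinction above concrete, here is a minimal sketch of how an on-disk layer and its LSN coverage could be modeled. It is illustrative only and not part of this commit; every name other than RELISH_SEG_SIZE, "image" and "delta" is made up.

```rust
// Illustrative model of the two kinds of on-disk layers described above.
// A layer covers one RELISH_SEG_SIZE slice of a relish in a range of LSNs.

type Lsn = u64;

/// Simplified identifier for one segment-sized slice of a relish.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct SegTag {
    relish_id: u32, // stands in for the full relish tag
    segno: u32,     // which RELISH_SEG_SIZE slice of the relish
}

/// An on-disk layer is immutable and is either an image or a delta.
enum OnDiskLayer {
    /// Full image of the segment at one particular LSN.
    Image { seg: SegTag, lsn: Lsn },
    /// WAL records applicable to the segment, in a range of LSNs.
    Delta { seg: SegTag, start_lsn: Lsn, end_lsn: Lsn },
}

impl OnDiskLayer {
    /// LSN range covered by the layer; an image covers a single point.
    fn lsn_range(&self) -> (Lsn, Lsn) {
        match self {
            OnDiskLayer::Image { lsn, .. } => (*lsn, *lsn),
            OnDiskLayer::Delta { start_lsn, end_lsn, .. } => (*start_lsn, *end_lsn),
        }
    }
}

fn main() {
    let seg = SegTag { relish_id: 1, segno: 0 };
    let delta = OnDiskLayer::Delta { seg, start_lsn: 100, end_lsn: 200 };
    assert_eq!(delta.lsn_range(), (100, 200));
}
```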
### Layer map
The layer map tracks what layers exist for all the relishes in a timeline.
### Layered repository
Zenith repository implementation that keeps data in layers.
### LSN
@@ -121,7 +132,7 @@ Each SLRU segment is considered a separate relish.
### Tenant (Multitenancy)
Tenant represents a single customer, interacting with Zenith.
WAL redo activity, timelines, snapshots are managed for each tenant independently.
WAL redo activity, timelines, layers are managed for each tenant independently.
One pageserver can serve multiple tenants at once.
One safekeeper

View File

@@ -37,7 +37,7 @@ On the page server tenants introduce one level of indirection, so data directory
├── de182bc61fb11a5a6b390a8aed3a804a
└── ee6016ec31116c1b7c33dfdfca38891f
```
WAL redo activity, timelines, snapshots are managed for each tenant independently.
WAL redo activity and timelines are managed for each tenant independently.
For the local environment used, for example, in tests, there is also a new level of indirection for tenants. It affects the `pgdatadirs` directory: it now contains a `tenants` subdirectory, so the structure looks like this:

View File

@@ -6,13 +6,14 @@ which pages they apply to, and accumulates the incoming changes in
memory. Every now and then, the accumulated changes are written out to
new files.
The files are called "snapshot files". Each snapshot file corresponds
to one 10 MB slice of a PostgreSQL relation fork. The snapshot files
The files are called "layer files". Each layer file corresponds
to one RELISH_SEG_SIZE slice of a PostgreSQL relation fork or
non-rel file in a range of LSNs. The layer files
for each timeline are stored in the timeline's subdirectory under
.zenith/tenants/<tenantid>/timelines.
There are two kind of snapshot file: base images, and deltas. A base
image file contains a snapshot of a segment as it was at one LSN,
There are two kinds of layer files: base images, and deltas. A base
image file contains an image of a segment as it was at one LSN,
whereas a delta file contains modifications to a segment - mostly in
the form of WAL records - in a range of LSNs
@@ -44,7 +45,7 @@ managed, except that the first part of file names is different.
Internally, the relations and non-relation files that are managed in
the versioned store are together called "relishes".
If a file has been dropped, the last snapshot file for it is created
If a file has been dropped, the last layer file for it is created
with the _DROPPED suffix, e.g.
rel_1663_13990_2609_0_10_000000000169C348_0000000001702000_DROPPED
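
The underscore-separated fields in a name like this follow the pattern documented for layer files: rel_<spcnode>_<dbnode>_<relnode>_<forknum>_<segno>_<startLSN>_<endLSN>, with an optional _DROPPED suffix and the LSNs written as fixed-width hex. The sketch below is not the pageserver's actual filename parser; it only illustrates that assumed layout.

```rust
// Illustrative decoder for a dropped-relation layer file name like
//   rel_1663_13990_2609_0_10_000000000169C348_0000000001702000_DROPPED
// Assumed field order (from the docs above):
//   rel_<spcnode>_<dbnode>_<relnode>_<forknum>_<segno>_<startLSN>_<endLSN>[_DROPPED]

#[derive(Debug)]
struct DeltaFileName {
    spcnode: u32,
    dbnode: u32,
    relnode: u32,
    forknum: u8,
    segno: u32,
    start_lsn: u64, // LSNs are encoded as fixed-width hex in the file name
    end_lsn: u64,
    dropped: bool,
}

fn parse_delta_file_name(name: &str) -> Option<DeltaFileName> {
    let rest = name.strip_prefix("rel_")?;
    let mut parts: Vec<&str> = rest.split('_').collect();
    let dropped = parts.last() == Some(&"DROPPED");
    if dropped {
        parts.pop();
    }
    if parts.len() != 7 {
        return None;
    }
    Some(DeltaFileName {
        spcnode: parts[0].parse().ok()?,
        dbnode: parts[1].parse().ok()?,
        relnode: parts[2].parse().ok()?,
        forknum: parts[3].parse().ok()?,
        segno: parts[4].parse().ok()?,
        start_lsn: u64::from_str_radix(parts[5], 16).ok()?,
        end_lsn: u64::from_str_radix(parts[6], 16).ok()?,
        dropped,
    })
}

fn main() {
    let n = "rel_1663_13990_2609_0_10_000000000169C348_0000000001702000_DROPPED";
    println!("{:?}", parse_delta_file_name(n).unwrap());
}
```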
@@ -67,7 +68,7 @@ for 'orders' table on 'main' branch is represented like this:
main/orders_100_200
# Creating snapshot files
# Creating layer files
Let's start with a simple example with a system that contains one
branch called 'main' and two tables, 'orders' and 'customers'. The end
@@ -86,10 +87,10 @@ end of WAL at 250 are kept in memory. If the page server crashes, the
latest records between 200-250 need to be re-read from the WAL.
Whenever enough WAL has been accumulated in memory, the page server
writes out the changes in memory into new snapshot files. This process
writes out the changes in memory into new layer files. This process
is called "checkpointing" (not to be confused with the PostgreSQL
checkpoints, that's a different thing). The page server only creates
snapshot files for relations that have been modified since the last
layer files for relations that have been modified since the last
checkpoint. For example, if the current end of WAL is at LSN 450, and
the last checkpoint happened at LSN 400 but there hasn't been any
recent changes to 'customers' table, you would have these files on
@@ -108,7 +109,7 @@ disk:
If the customers table is modified later, a new file is created for it
at the next checkpoint. The new file will cover the "gap" from the
last snapshot file, so the LSN ranges are always contiguous:
last layer file, so the LSN ranges are always contiguous:
main/orders_100
main/orders_100_200
@@ -130,13 +131,13 @@ page server needs to reconstruct the requested page, as it was at the
requested LSN. To do that, the page server first checks the recent
in-memory layer; if the requested page version is found there, it can
be returned immediately without looking at the files on
disk. Otherwise the page server needs to locate the snapshot file that
disk. Otherwise the page server needs to locate the layer file that
contains the requested page version.
For example, if a request comes in for table 'orders' at LSN 250, the
page server would load the 'main/orders_200_300' file into memory, and
reconstruct and return the requested page from it, as it was at
LSN 250. Because the snapshot file consists of a full image of the
LSN 250. Because the layer file consists of a full image of the
relation at the start LSN and the WAL, reconstructing the page
involves replaying any WAL records applicable to the page between LSNs
200-250, starting from the base image at LSN 200.
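
In other words, reconstruction is a fold over the layer's contents: start from the base image at the layer's start LSN and apply, in LSN order, every WAL record for that page up to the requested LSN. Here is a simplified sketch with made-up types; the real pageserver performs the replay through its WAL redo machinery rather than a local apply_record function.

```rust
// Simplified page reconstruction as described above: take the base image at the
// layer's start LSN, then replay the page's WAL records up to the requested LSN.
// `apply_record` is a stand-in for the real WAL-redo machinery.

type Lsn = u64;
type PageImage = Vec<u8>;

struct WalRecord {
    lsn: Lsn,
    payload: Vec<u8>,
}

fn apply_record(page: &mut PageImage, rec: &WalRecord) {
    // Placeholder: real redo interprets the record and modifies the page.
    let _ = (page, rec);
}

/// Reconstruct a page as of `request_lsn` from the base image plus the page's
/// WAL records, assumed sorted by LSN.
fn reconstruct_page(base_image: &PageImage, records: &[WalRecord], request_lsn: Lsn) -> PageImage {
    let mut page = base_image.clone();
    for rec in records.iter().filter(|r| r.lsn <= request_lsn) {
        apply_record(&mut page, rec);
    }
    page
}

fn main() {
    // E.g. a layer covering LSNs 200..300, queried at LSN 250: only the records
    // with lsn <= 250 are replayed on top of the image taken at LSN 200.
    let base = vec![0u8; 8192];
    let records = vec![
        WalRecord { lsn: 210, payload: vec![] },
        WalRecord { lsn: 260, payload: vec![] },
    ];
    let page = reconstruct_page(&base, &records, 250);
    assert_eq!(page.len(), 8192);
}
```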
@@ -171,7 +172,7 @@ Then, the 'orders' table is updated differently on the 'main' and
Because the 'customers' table hasn't been modified on the child
branch, there is no file for it there. If you request a page for it on
the 'child' branch, the page server will not find any snapshot file
the 'child' branch, the page server will not find any layer file
for it in the 'child' directory, so it will recurse to look into the
parent 'main' branch instead.
@@ -217,7 +218,7 @@ branch at a historic LSN, is how we support PITR in Zenith.
# Garbage collection
In this scheme, we keep creating new snapshot files over time. We also
In this scheme, we keep creating new layer files over time. We also
need a mechanism to remove old files that are no longer needed,
because disk space isn't infinite.
@@ -245,7 +246,7 @@ of the branch is LSN 525, so that the GC horizon is currently at
main/customers_200
We can remove the following files because the end LSNs of those files are
older than GC horizon 375, and there are more recent snapshot files for the
older than GC horizon 375, and there are more recent layer files for the
table:
main/orders_100 DELETE
@@ -262,7 +263,7 @@ table:
main/customers_200 KEEP, NO NEWER VERSION
'main/customers_100_200' is old enough, but it cannot be
removed because there is no newer snapshot file for the table.
removed because there is no newer layer file for the table.
Things get slightly more complicated with multiple branches. All of
the above still holds, but in addition to recent files we must also
@@ -308,7 +309,7 @@ new base image and delta file for it on the child:
After this, the 'main/orders_100' and 'main/orders_100_200' files could
be removed. They are no longer needed by the child branch, because there
is a newer snapshot file there. TODO: This optimization hasn't been
is a newer layer file there. TODO: This optimization hasn't been
implemented! The GC algorithm will currently keep the file on the
'main' branch anyway, for as long as the child branch exists.
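
Putting the rules in this section together: a layer file is removable only when its end LSN is older than the GC horizon, a newer layer file exists for the same segment, and no branch point still depends on it. Below is a rough sketch of that per-file decision using hypothetical types; it is not the actual gc_iteration code.

```rust
// Rough sketch of the per-file GC decision described above. A file is removable
// only when it is older than the GC horizon, a newer layer file for the same
// segment exists, and no branch point still depends on it.

type Lsn = u64;

struct LayerFileInfo {
    end_lsn: Lsn,
    has_newer_layer: bool,  // is there a more recent layer file for this segment?
    needed_by_branch: bool, // is this the latest file at/before some branch point?
}

fn can_remove(file: &LayerFileInfo, gc_horizon: Lsn) -> bool {
    file.end_lsn < gc_horizon && file.has_newer_layer && !file.needed_by_branch
}

fn main() {
    // 'main/orders_100_200' with the GC horizon at 375: old enough and a newer
    // layer file exists, so it can be removed. The most recent file for a table
    // (like 'main/customers_200' above) is kept even though it is old.
    let old_orders = LayerFileInfo { end_lsn: 200, has_newer_layer: true, needed_by_branch: false };
    let latest_customers = LayerFileInfo { end_lsn: 200, has_newer_layer: false, needed_by_branch: false };
    assert!(can_remove(&old_orders, 375));
    assert!(!can_remove(&latest_customers, 375));
}
```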
@@ -346,7 +347,7 @@ It would also be OK to have overlapping LSN ranges for the same relation:
main/orders_300_400
main/orders_400
The code that reads the snapshot files should cope with this, but this
The code that reads the layer files should cope with this, but this
situation doesn't arise either, because the checkpointing code never
does that. It could be useful, however, as a transient state when
garbage collecting around branch points, or explicit recovery
@@ -360,6 +361,6 @@ points. For example, if we start with this:
And there is a branch or explicit recovery point at LSN 150, we could
replace 'main/orders_100_200' with 'main/orders_150' to keep a
snapshot only at that exact point that's still needed, removing the
layer only at that exact point that's still needed, removing the
other page versions around it. But such compaction has not been
implemented yet.

View File

@@ -174,7 +174,7 @@ impl Layer for DeltaLayer {
{
// Open the file and lock the metadata in memory
// TODO: avoid opening the snapshot file for each read
// TODO: avoid opening the file for each read
let (_path, book) = self.open_book()?;
let page_version_reader = book.chapter_reader(PAGE_VERSIONS_CHAPTER)?;
let inner = self.load()?;

View File

@@ -2,8 +2,9 @@
//! It is stored in a file on disk.
//!
//! On disk, the image files are stored in timelines/<timelineid> directory.
//! Currently, there are no subdirectories, and each snapshot file is named like this:
//! Currently, there are no subdirectories, and each image layer file is named like this:
//!
//! Note that segno is
//! <spcnode>_<dbnode>_<relnode>_<forknum>_<segno>_<LSN>
//!
//! For example:
@@ -15,10 +16,10 @@
//! Only metadata is loaded into memory by the load function.
//! When images are needed, they are read directly from disk.
//!
//! For blocky segments, the images are stored in BLOCKY_IMAGES_CHAPTER.
//! For blocky relishes, the images are stored in BLOCKY_IMAGES_CHAPTER.
//! All the images are required to be BLOCK_SIZE, which allows for random access.
//!
//! For non-blocky segments, the image can be found in NONBLOCKY_IMAGE_CHAPTER.
//! For non-blocky relishes, the image can be found in NONBLOCKY_IMAGE_CHAPTER.
//!
use crate::layered_repository::filename::{ImageFileName, PathOrConf};
use crate::layered_repository::storage_layer::{
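
Because every image in BLOCKY_IMAGES_CHAPTER is exactly BLOCK_SIZE, the image of block N starts at byte offset N * BLOCK_SIZE and a single seek suffices to read it. The sketch below illustrates that random-access property over any Read + Seek source; it is not the bookfile chapter API used by the real code.

```rust
// Illustration of the random-access property described above: fixed-size images
// mean block `blkno` lives at offset blkno * BLOCK_SIZE within the chapter.

use std::io::{Cursor, Read, Seek, SeekFrom};

const BLOCK_SIZE: usize = 8192;

fn read_blocky_image<R: Read + Seek>(reader: &mut R, blkno: u64) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; BLOCK_SIZE];
    reader.seek(SeekFrom::Start(blkno * BLOCK_SIZE as u64))?;
    reader.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // Fake chapter contents: two blocks, filled with 0s and 1s respectively.
    let mut chapter = vec![0u8; BLOCK_SIZE];
    chapter.extend(vec![1u8; BLOCK_SIZE]);
    let mut cursor = Cursor::new(chapter);
    assert_eq!(read_blocky_image(&mut cursor, 1)?[0], 1);
    Ok(())
}
```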

View File

@@ -472,7 +472,7 @@ impl InMemoryLayer {
}
///
/// Write the this in-memory layer to disk, as a snapshot layer.
/// Write this in-memory layer to disk.
///
/// The cutoff point for the layer that's written to disk is 'end_lsn'.
///

View File

@@ -1,5 +1,5 @@
//!
//! The layer map tracks what layers exist for all the relations in a timeline.
//! The layer map tracks what layers exist for all the relishes in a timeline.
//!
//! When the timeline is first accessed, the server lists all layer files
//! in the timelines/<timelineid> directory, and populates this map with
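
As an illustration of what tracking "what layers exist for all the relishes in a timeline" amounts to, the layer map can be thought of as mapping each relish segment to its layers and answering which layer covers a given LSN. The sketch below uses hypothetical types and is much simpler than the real LayerMap.

```rust
// Hypothetical layer-map sketch: for each relish segment, remember the layers
// covering it, and find the one whose LSN range contains a requested LSN.

use std::collections::HashMap;

type Lsn = u64;

#[derive(PartialEq, Eq, Hash, Clone, Copy)]
struct SegTag {
    relish_id: u32,
    segno: u32,
}

#[derive(Clone, Copy)]
struct LayerDesc {
    start_lsn: Lsn,
    end_lsn: Lsn, // exclusive end of the range the layer covers
}

#[derive(Default)]
struct LayerMap {
    layers: HashMap<SegTag, Vec<LayerDesc>>,
}

impl LayerMap {
    fn insert(&mut self, seg: SegTag, layer: LayerDesc) {
        self.layers.entry(seg).or_default().push(layer);
    }

    /// Find a layer for `seg` whose LSN range contains `lsn`, if any.
    fn search(&self, seg: SegTag, lsn: Lsn) -> Option<LayerDesc> {
        self.layers
            .get(&seg)?
            .iter()
            .copied()
            .find(|l| l.start_lsn <= lsn && lsn < l.end_lsn)
    }
}

fn main() {
    let mut map = LayerMap::default();
    let seg = SegTag { relish_id: 1, segno: 0 };
    map.insert(seg, LayerDesc { start_lsn: 200, end_lsn: 300 });
    assert!(map.search(seg, 250).is_some());
    assert!(map.search(seg, 350).is_none());
}
```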

View File

@@ -97,16 +97,12 @@ pub enum PageReconstructResult {
}
///
/// A Layer holds all page versions for one segment of a relish, in a range of LSNs.
/// There are two kinds of layers, in-memory and snapshot layers. In-memory
/// A Layer corresponds to one RELISH_SEG_SIZE slice of a relish in a range of LSNs.
/// There are two kinds of layers, in-memory and on-disk layers. In-memory
/// layers are used to ingest incoming WAL, and provide fast access
/// to the recent page versions. Snaphot layers are stored on disk, and
/// to the recent page versions. On-disk layers are stored as files on disk, and
/// are immutable. This trait presents the common functionality of
/// in-memory and snapshot layers.
///
/// Each layer contains a full snapshot of the segment at the start
/// LSN. In addition to that, it contains WAL (or more page images)
/// needed to recontruct any page version up to the end LSN.
/// in-memory and on-disk layers.
///
pub trait Layer: Send + Sync {
// These functions identify the relish segment and the LSN range

View File

@@ -346,7 +346,7 @@ impl PageServerHandler {
pgb.write_message(&BeMessage::CopyOutResponse)?;
info!("sent CopyOut");
/* Send a tarball of the latest snapshot on the timeline */
/* Send a tarball of the latest layer on the timeline */
{
let mut writer = CopyDataSink { pgb };
let mut basebackup = basebackup::Basebackup::new(&mut writer, &timeline, lsn);
@@ -582,18 +582,18 @@ impl postgres_backend::Handler for PageServerHandler {
let result = repo.gc_iteration(Some(timelineid), gc_horizon, true)?;
pgb.write_message_noflush(&BeMessage::RowDescription(&[
RowDescriptor::int8_col(b"snapshot_relfiles_total"),
RowDescriptor::int8_col(b"snapshot_relfiles_needed_by_cutoff"),
RowDescriptor::int8_col(b"snapshot_relfiles_needed_by_branches"),
RowDescriptor::int8_col(b"snapshot_relfiles_not_updated"),
RowDescriptor::int8_col(b"snapshot_relfiles_removed"),
RowDescriptor::int8_col(b"snapshot_relfiles_dropped"),
RowDescriptor::int8_col(b"snapshot_nonrelfiles_total"),
RowDescriptor::int8_col(b"snapshot_nonrelfiles_needed_by_cutoff"),
RowDescriptor::int8_col(b"snapshot_nonrelfiles_needed_by_branches"),
RowDescriptor::int8_col(b"snapshot_nonrelfiles_not_updated"),
RowDescriptor::int8_col(b"snapshot_nonrelfiles_removed"),
RowDescriptor::int8_col(b"snapshot_nonrelfiles_dropped"),
RowDescriptor::int8_col(b"layer_relfiles_total"),
RowDescriptor::int8_col(b"layer_relfiles_needed_by_cutoff"),
RowDescriptor::int8_col(b"layer_relfiles_needed_by_branches"),
RowDescriptor::int8_col(b"layer_relfiles_not_updated"),
RowDescriptor::int8_col(b"layer_relfiles_removed"),
RowDescriptor::int8_col(b"layer_relfiles_dropped"),
RowDescriptor::int8_col(b"layer_nonrelfiles_total"),
RowDescriptor::int8_col(b"layer_nonrelfiles_needed_by_cutoff"),
RowDescriptor::int8_col(b"layer_nonrelfiles_needed_by_branches"),
RowDescriptor::int8_col(b"layer_nonrelfiles_not_updated"),
RowDescriptor::int8_col(b"layer_nonrelfiles_removed"),
RowDescriptor::int8_col(b"layer_nonrelfiles_dropped"),
RowDescriptor::int8_col(b"elapsed"),
]))?
.write_message_noflush(&BeMessage::DataRow(&[

View File

@@ -45,7 +45,6 @@ pub fn import_timeline_from_postgres_datadir(
match direntry.file_name().to_str() {
None => continue,
// These special files appear in the snapshot, but are not needed by the page server
Some("pg_control") => {
import_nonrel_file(timeline, lsn, RelishTag::ControlFile, &direntry.path())?;
// Extract checkpoint record from pg_control and store it as a separate object
@@ -93,7 +92,6 @@ pub fn import_timeline_from_postgres_datadir(
match direntry.file_name().to_str() {
None => continue,
// These special files appear in the snapshot, but are not needed by the page server
Some("PG_VERSION") => continue,
Some("pg_filenode.map") => import_nonrel_file(
timeline,
@@ -153,7 +151,7 @@ fn import_relfile(
let p = parse_relfilename(path.file_name().unwrap().to_str().unwrap());
if let Err(e) = p {
warn!("unrecognized file in snapshot: {:?} ({})", path, e);
warn!("unrecognized file in postgres datadir: {:?} ({})", path, e);
return Err(e.into());
}
let (relnode, forknum, segno) = p.unwrap();

View File

@@ -6,19 +6,19 @@ pytest_plugins = ("fixtures.zenith_fixtures")
def print_gc_result(row):
print("GC duration {elapsed} ms".format_map(row));
print(" REL total: {snapshot_relfiles_total}, needed_by_cutoff {snapshot_relfiles_needed_by_cutoff}, needed_by_branches: {snapshot_relfiles_needed_by_branches}, not_updated: {snapshot_relfiles_not_updated}, removed: {snapshot_relfiles_removed}, dropped: {snapshot_relfiles_dropped}".format_map(row))
print(" NONREL total: {snapshot_nonrelfiles_total}, needed_by_cutoff {snapshot_nonrelfiles_needed_by_cutoff}, needed_by_branches: {snapshot_nonrelfiles_needed_by_branches}, not_updated: {snapshot_nonrelfiles_not_updated}, removed: {snapshot_nonrelfiles_removed}, dropped: {snapshot_nonrelfiles_dropped}".format_map(row))
print(" REL total: {layer_relfiles_total}, needed_by_cutoff {layer_relfiles_needed_by_cutoff}, needed_by_branches: {layer_relfiles_needed_by_branches}, not_updated: {layer_relfiles_not_updated}, removed: {layer_relfiles_removed}, dropped: {layer_relfiles_dropped}".format_map(row))
print(" NONREL total: {layer_nonrelfiles_total}, needed_by_cutoff {layer_nonrelfiles_needed_by_cutoff}, needed_by_branches: {layer_nonrelfiles_needed_by_branches}, not_updated: {layer_nonrelfiles_not_updated}, removed: {layer_nonrelfiles_removed}, dropped: {layer_nonrelfiles_dropped}".format_map(row))
#
# Test Garbage Collection of old snapshot files
# Test Garbage Collection of old layer files
#
# This test is pretty tightly coupled with the current implementation of layered
# storage, in layered_repository.rs.
#
def test_snapfiles_gc(zenith_cli, pageserver, postgres, pg_bin):
zenith_cli.run(["branch", "test_snapfiles_gc", "empty"])
pg = postgres.create_start('test_snapfiles_gc')
def test_layerfiles_gc(zenith_cli, pageserver, postgres, pg_bin):
zenith_cli.run(["branch", "test_layerfiles_gc", "empty"])
pg = postgres.create_start('test_layerfiles_gc')
with closing(pg.connect()) as conn:
with conn.cursor() as cur:
@@ -55,8 +55,8 @@ def test_snapfiles_gc(zenith_cli, pageserver, postgres, pg_bin):
row = pscur.fetchone()
print_gc_result(row);
# remember the number of files
snapshot_relfiles_remain = row['snapshot_relfiles_total'] - row['snapshot_relfiles_removed']
assert snapshot_relfiles_remain > 0
layer_relfiles_remain = row['layer_relfiles_total'] - row['layer_relfiles_removed']
assert layer_relfiles_remain > 0
# Insert a row.
print("Inserting one row and running GC")
@@ -64,12 +64,12 @@ def test_snapfiles_gc(zenith_cli, pageserver, postgres, pg_bin):
pscur.execute(f"do_gc {pageserver.initial_tenant} {timeline} 0")
row = pscur.fetchone()
print_gc_result(row);
assert row['snapshot_relfiles_total'] == snapshot_relfiles_remain + 1
assert row['snapshot_relfiles_removed'] == 1
assert row['snapshot_relfiles_dropped'] == 0
assert row['layer_relfiles_total'] == layer_relfiles_remain + 1
assert row['layer_relfiles_removed'] == 1
assert row['layer_relfiles_dropped'] == 0
# Insert two more rows and run GC.
# This should create a new snapshot file with the new contents, and
# This should create a new layer file with the new contents, and
# remove the old one.
print("Inserting two more rows and running GC")
cur.execute("INSERT INTO foo VALUES (2)")
@@ -78,11 +78,11 @@ def test_snapfiles_gc(zenith_cli, pageserver, postgres, pg_bin):
pscur.execute(f"do_gc {pageserver.initial_tenant} {timeline} 0")
row = pscur.fetchone()
print_gc_result(row);
assert row['snapshot_relfiles_total'] == snapshot_relfiles_remain + 1
assert row['snapshot_relfiles_removed'] == 1
assert row['snapshot_relfiles_dropped'] == 0
assert row['layer_relfiles_total'] == layer_relfiles_remain + 1
assert row['layer_relfiles_removed'] == 1
assert row['layer_relfiles_dropped'] == 0
# Do it again. Should again create a new snapshot file and remove old one.
# Do it again. Should again create a new layer file and remove old one.
print("Inserting two more rows and running GC")
cur.execute("INSERT INTO foo VALUES (2)")
cur.execute("INSERT INTO foo VALUES (3)")
@@ -90,18 +90,18 @@ def test_snapfiles_gc(zenith_cli, pageserver, postgres, pg_bin):
pscur.execute(f"do_gc {pageserver.initial_tenant} {timeline} 0")
row = pscur.fetchone()
print_gc_result(row);
assert row['snapshot_relfiles_total'] == snapshot_relfiles_remain + 1
assert row['snapshot_relfiles_removed'] == 1
assert row['snapshot_relfiles_dropped'] == 0
assert row['layer_relfiles_total'] == layer_relfiles_remain + 1
assert row['layer_relfiles_removed'] == 1
assert row['layer_relfiles_dropped'] == 0
# Run GC again, with no changes in the database. Should not remove anything.
print("Run GC again, with nothing to do")
pscur.execute(f"do_gc {pageserver.initial_tenant} {timeline} 0")
row = pscur.fetchone()
print_gc_result(row);
assert row['snapshot_relfiles_total'] == snapshot_relfiles_remain
assert row['snapshot_relfiles_removed'] == 0
assert row['snapshot_relfiles_dropped'] == 0
assert row['layer_relfiles_total'] == layer_relfiles_remain
assert row['layer_relfiles_removed'] == 0
assert row['layer_relfiles_dropped'] == 0
#
# Test DROP TABLE checks that relation data and metadata was deleted by GC from object storage
@@ -114,11 +114,11 @@ def test_snapfiles_gc(zenith_cli, pageserver, postgres, pg_bin):
print_gc_result(row);
# Each relation fork is counted separately, hence 3.
assert row['snapshot_relfiles_dropped'] == 3
assert row['layer_relfiles_dropped'] == 3
# The catalog updates also create new snapshot files of the catalogs, which
# The catalog updates also create new layer files of the catalogs, which
# are counted as 'removed'
assert row['snapshot_relfiles_removed'] > 0
assert row['layer_relfiles_removed'] > 0
# TODO: perhaps we should count catalog and user relations separately,
# to make this kind of testing more robust