diff --git a/docs/glossary.md b/docs/glossary.md index c8c43e46fa..cfe487d2e0 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -26,7 +26,7 @@ A checkpoint record in the WAL marks a point in the WAL sequence at which it is NOTE: This is an overloaded term. Whenever enough WAL has been accumulated in memory, the page server [] -writes out the changes in memory into new layer files[]. This process +writes out the changes from in-memory layers into new layer files[]. This process is called "checkpointing". The page server only creates layer files for relations that have been modified since the last checkpoint. @@ -41,17 +41,28 @@ Stateless Postgres node that stores data in pageserver. Each of the separate segmented file sets in which a relation is stored. The main fork is where the actual data resides. There also exist two secondary forks for metadata: the free space map and the visibility map. Each PostgreSQL fork is considered a separate relish. -### Layer file +### Layer + +Each layer corresponds to one RELISH_SEG_SIZE slice of a relish in a range of LSNs. +There are two kinds of layers, in-memory and on-disk layers. In-memory +layers are used to ingest incoming WAL, and provide fast access +to the recent page versions. On-disk layers are stored as files on disk, and +are immutable. +### Layer file (on-disk layer) Layered repository on-disk format is based on immutable files. The -files are called "layer files". Each file corresponds to one 10 MB +files are called "layer files". Each file corresponds to one RELISH_SEG_SIZE segment of a PostgreSQL relation fork. There are two kinds of layer files: image files and delta files. An image file contains a "snapshot" of the segment at a particular LSN, and a delta file contains WAL records applicable to the segment, in a range of LSNs. +### Layer map + +The layer map tracks what layers exist for all the relishes in a timeline. ### Layered repository +Zenith repository implementation that keeps data in layers. ### LSN @@ -121,7 +132,7 @@ Each SLRU segment is considered a separate relish[]. ### Tenant (Multitenancy) Tenant represents a single customer, interacting with Zenith. -Wal redo[] activity, timelines[], snapshots[] are managed for each tenant independently. +Wal redo[] activity, timelines[], layers[] are managed for each tenant independently. One pageserver[] can serve multiple tenants at once. One safekeeper diff --git a/docs/multitenancy.md b/docs/multitenancy.md index 7016509375..c9a95116c5 100644 --- a/docs/multitenancy.md +++ b/docs/multitenancy.md @@ -37,7 +37,7 @@ On the page server tenants introduce one level of indirection, so data directory ├── de182bc61fb11a5a6b390a8aed3a804a └── ee6016ec31116c1b7c33dfdfca38891f ``` -Wal redo activity, timelines, snapshots are managed for each tenant independently. +Wal redo activity and timelines are managed for each tenant independently. For local environment used for example in tests there also new level of indirection for tenants. It touches `pgdatadirs` directory. Now it contains `tenants` subdirectory so the structure looks the following way: diff --git a/pageserver/src/layered_repository/README.md b/pageserver/src/layered_repository/README.md index cede6725e0..ef09c084d6 100644 --- a/pageserver/src/layered_repository/README.md +++ b/pageserver/src/layered_repository/README.md @@ -6,13 +6,14 @@ which pages they apply to, and accumulates the incoming changes in memory. Every now and then, the accumulated changes are written out to new files. -The files are called "snapshot files". 
Each snapshot file corresponds -to one 10 MB slice of a PostgreSQL relation fork. The snapshot files +The files are called "layer files". Each layer file corresponds +to one RELISH_SEG_SIZE slice of a PostgreSQL relation fork or +non-rel file in a range of LSNs. The layer files for each timeline are stored in the timeline's subdirectory under .zenith/tenants//timelines. -There are two kind of snapshot file: base images, and deltas. A base -image file contains a snapshot of a segment as it was at one LSN, +There are two kind of layer file: base images, and deltas. A base +image file contains a layer of a segment as it was at one LSN, whereas a delta file contains modifications to a segment - mostly in the form of WAL records - in a range of LSN @@ -44,7 +45,7 @@ managed, except that the first part of file names is different. Internally, the relations and non-relation files that are managed in the versioned store are together called "relishes". -If a file has been dropped, the last snapshot file for it is created +If a file has been dropped, the last layer file for it is created with the _DROPPED suffix, e.g. rel_1663_13990_2609_0_10_000000000169C348_0000000001702000_DROPPED @@ -67,7 +68,7 @@ for 'orders' table on 'main' branch is represented like this: main/orders_100_200 -# Creating snapshot files +# Creating layer files Let's start with a simple example with a system that contains one branch called 'main' and two tables, 'orders' and 'customers'. The end @@ -86,10 +87,10 @@ end of WAL at 250 are kept in memory. If the page server crashes, the latest records between 200-250 need to be re-read from the WAL. Whenever enough WAL has been accumulated in memory, the page server -writes out the changes in memory into new snapshot files. This process +writes out the changes in memory into new layer files. This process is called "checkpointing" (not to be confused with the PostgreSQL checkpoints, that's a different thing). The page server only creates -snapshot files for relations that have been modified since the last +layer files for relations that have been modified since the last checkpoint. For example, if the current end of WAL is at LSN 450, and the last checkpoint happened at LSN 400 but there hasn't been any recent changes to 'customers' table, you would have these files on @@ -108,7 +109,7 @@ disk: If the customers table is modified later, a new file is created for it at the next checkpoint. The new file will cover the "gap" from the -last snapshot file, so the LSN ranges are always contiguous: +last layer file, so the LSN ranges are always contiguous: main/orders_100 main/orders_100_200 @@ -130,13 +131,13 @@ page server needs to reconstruct the requested page, as it was at the requested LSN. To do that, the page server first checks the recent in-memory layer; if the requested page version is found there, it can be returned immediatedly without looking at the files on -disk. Otherwise the page server needs to locate the snapshot file that +disk. Otherwise the page server needs to locate the layer file that contains the requested page version. For example, if a request comes in for table 'orders' at LSN 250, the page server would load the 'main/orders_200_300' file into memory, and reconstruct and return the requested page from it, as it was at -LSN 250. Because the snapshot file consists of a full image of the +LSN 250. 
Because the layer file consists of a full image of the relation at the start LSN and the WAL, reconstructing the page involves replaying any WAL records applicable to the page between LSNs 200-250, starting from the base image at LSN 200. @@ -171,7 +172,7 @@ Then, the 'orders' table is updated differently on the 'main' and Because the 'customers' table hasn't been modified on the child branch, there is no file for it there. If you request a page for it on -the 'child' branch, the page server will not find any snapshot file +the 'child' branch, the page server will not find any layer file for it in the 'child' directory, so it will recurse to look into the parent 'main' branch instead. @@ -217,7 +218,7 @@ branch at a historic LSN, is how we support PITR in Zenith. # Garbage collection -In this scheme, we keep creating new snapshot files over time. We also +In this scheme, we keep creating new layer files over time. We also need a mechanism to remove old files that are no longer needed, because disk space isn't infinite. @@ -245,7 +246,7 @@ of the branch is LSN 525, so that the GC horizon is currently at main/customers_200 We can remove the following files because the end LSNs of those files are -older than GC horizon 375, and there are more recent snapshot files for the +older than GC horizon 375, and there are more recent layer files for the table: main/orders_100 DELETE @@ -262,7 +263,7 @@ table: main/customers_200 KEEP, NO NEWER VERSION 'main/customers_100_200' is old enough, but it cannot be -removed because there is no newer snapshot file for the table. +removed because there is no newer layer file for the table. Things get slightly more complicated with multiple branches. All of the above still holds, but in addition to recent files we must also @@ -308,7 +309,7 @@ new base image and delta file for it on the child: After this, the 'main/orders_100' and 'main/orders_100_200' file could be removed. It is no longer needed by the child branch, because there -is a newer snapshot file there. TODO: This optimization hasn't been +is a newer layer file there. TODO: This optimization hasn't been implemented! The GC algorithm will currently keep the file on the 'main' branch anyway, for as long as the child branch exists. @@ -346,7 +347,7 @@ It would also be OK to have overlapping LSN ranges for the same relation: main/orders_300_400 main/orders_400 -The code that reads the snapshot files should cope with this, but this +The code that reads the layer files should cope with this, but this situation doesn't arise either, because the checkpointing code never does that. It could be useful, however, as a transient state when garbage collecting around branch points, or explicit recovery @@ -360,6 +361,6 @@ points. For example, if we start with this: And there is a branch or explicit recovery point at LSN 150, we could replace 'main/orders_100_200' with 'main/orders_150' to keep a -snapshot only at that exact point that's still needed, removing the +layer only at that exact point that's still needed, removing the other page versions around it. But such compaction has not been implemented yet. 
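To make the read path described in the README hunk above concrete, here is a minimal, self-contained sketch of the reconstruction step it describes: start from the base image at the layer's start LSN and replay the WAL records that fall at or before the requested LSN. The `PageDelta` type and the `apply_record` stand-in are illustrative assumptions only; in the real pageserver the records are opaque WAL and redo is handed to a dedicated WAL-redo process, not done inline like this.

```rust
/// One stored delta: a WAL record (opaque bytes here) applicable to the page at some LSN.
struct PageDelta {
    lsn: u64,
    record: Vec<u8>,
}

/// Placeholder for WAL redo: the real pageserver applies the record via a
/// WAL-redo Postgres process; here we only simulate that the image changes.
fn apply_record(mut image: Vec<u8>, record: &[u8]) -> Vec<u8> {
    image.extend_from_slice(record); // stand-in for real redo
    image
}

/// Reconstruct the page as it was at `request_lsn`, starting from the base image
/// taken at the layer's start LSN and replaying the deltas (sorted by LSN) up to it.
fn reconstruct_page(base_image: Vec<u8>, deltas: &[PageDelta], request_lsn: u64) -> Vec<u8> {
    deltas
        .iter()
        .take_while(|d| d.lsn <= request_lsn)
        .fold(base_image, |img, d| apply_record(img, &d.record))
}

fn main() {
    // Layer covering LSNs 200..300, request at LSN 250: only the deltas up to 250 are replayed.
    let deltas = vec![
        PageDelta { lsn: 210, record: vec![1, 2, 3] },
        PageDelta { lsn: 240, record: vec![4] },
        PageDelta { lsn: 290, record: vec![5, 6] },
    ];
    let page = reconstruct_page(vec![0u8; 16], &deltas, 250);
    println!("reconstructed page is {} bytes", page.len());
}
```

The same shape applies when the requested version lives in the in-memory layer: the only difference is where the base image and records are read from.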
diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index b697d3da6c..fccf477041 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -174,7 +174,7 @@ impl Layer for DeltaLayer { { // Open the file and lock the metadata in memory - // TODO: avoid opening the snapshot file for each read + // TODO: avoid opening the file for each read let (_path, book) = self.open_book()?; let page_version_reader = book.chapter_reader(PAGE_VERSIONS_CHAPTER)?; let inner = self.load()?; diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index 9bfc7bccdc..8796e43d9c 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -2,8 +2,9 @@ //! It is stored in a file on disk. //! //! On disk, the image files are stored in timelines/ directory. -//! Currently, there are no subdirectories, and each snapshot file is named like this: +//! Currently, there are no subdirectories, and each image layer file is named like this: //! +//! Note that segno is //! _____ //! //! For example: @@ -15,10 +16,10 @@ //! Only metadata is loaded into memory by the load function. //! When images are needed, they are read directly from disk. //! -//! For blocky segments, the images are stored in BLOCKY_IMAGES_CHAPTER. +//! For blocky relishes, the images are stored in BLOCKY_IMAGES_CHAPTER. //! All the images are required to be BLOCK_SIZE, which allows for random access. //! -//! For non-blocky segments, the image can be found in NONBLOCKY_IMAGE_CHAPTER. +//! For non-blocky relishes, the image can be found in NONBLOCKY_IMAGE_CHAPTER. //! use crate::layered_repository::filename::{ImageFileName, PathOrConf}; use crate::layered_repository::storage_layer::{ diff --git a/pageserver/src/layered_repository/inmemory_layer.rs b/pageserver/src/layered_repository/inmemory_layer.rs index 476b53ffc4..d2bcc73be8 100644 --- a/pageserver/src/layered_repository/inmemory_layer.rs +++ b/pageserver/src/layered_repository/inmemory_layer.rs @@ -472,7 +472,7 @@ impl InMemoryLayer { } /// - /// Write the this in-memory layer to disk, as a snapshot layer. + /// Write the this in-memory layer to disk. /// /// The cutoff point for the layer that's written to disk is 'end_lsn'. /// diff --git a/pageserver/src/layered_repository/layer_map.rs b/pageserver/src/layered_repository/layer_map.rs index 2c9325b354..6fd8756d0e 100644 --- a/pageserver/src/layered_repository/layer_map.rs +++ b/pageserver/src/layered_repository/layer_map.rs @@ -1,5 +1,5 @@ //! -//! The layer map tracks what layers exist for all the relations in a timeline. +//! The layer map tracks what layers exist for all the relishes in a timeline. //! //! When the timeline is first accessed, the server lists of all layer files //! in the timelines/ directory, and populates this map with diff --git a/pageserver/src/layered_repository/storage_layer.rs b/pageserver/src/layered_repository/storage_layer.rs index 09e4df334c..cfae4c81c1 100644 --- a/pageserver/src/layered_repository/storage_layer.rs +++ b/pageserver/src/layered_repository/storage_layer.rs @@ -97,16 +97,12 @@ pub enum PageReconstructResult { } /// -/// A Layer holds all page versions for one segment of a relish, in a range of LSNs. -/// There are two kinds of layers, in-memory and snapshot layers. In-memory +/// A Layer corresponds to one RELISH_SEG_SIZE slice of a relish in a range of LSNs. 
+/// There are two kinds of layers, in-memory and on-disk layers. In-memory /// layers are used to ingest incoming WAL, and provide fast access -/// to the recent page versions. Snaphot layers are stored on disk, and +/// to the recent page versions. On-disk layers are stored as files on disk, and /// are immutable. This trait presents the common functionality of -/// in-memory and snapshot layers. -/// -/// Each layer contains a full snapshot of the segment at the start -/// LSN. In addition to that, it contains WAL (or more page images) -/// needed to recontruct any page version up to the end LSN. +/// in-memory and on-disk layers. /// pub trait Layer: Send + Sync { // These functions identify the relish segment and the LSN range diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index cf3542e30a..30a3b267ff 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -346,7 +346,7 @@ impl PageServerHandler { pgb.write_message(&BeMessage::CopyOutResponse)?; info!("sent CopyOut"); - /* Send a tarball of the latest snapshot on the timeline */ + /* Send a tarball of the latest layer on the timeline */ { let mut writer = CopyDataSink { pgb }; let mut basebackup = basebackup::Basebackup::new(&mut writer, &timeline, lsn); @@ -582,18 +582,18 @@ impl postgres_backend::Handler for PageServerHandler { let result = repo.gc_iteration(Some(timelineid), gc_horizon, true)?; pgb.write_message_noflush(&BeMessage::RowDescription(&[ - RowDescriptor::int8_col(b"snapshot_relfiles_total"), - RowDescriptor::int8_col(b"snapshot_relfiles_needed_by_cutoff"), - RowDescriptor::int8_col(b"snapshot_relfiles_needed_by_branches"), - RowDescriptor::int8_col(b"snapshot_relfiles_not_updated"), - RowDescriptor::int8_col(b"snapshot_relfiles_removed"), - RowDescriptor::int8_col(b"snapshot_relfiles_dropped"), - RowDescriptor::int8_col(b"snapshot_nonrelfiles_total"), - RowDescriptor::int8_col(b"snapshot_nonrelfiles_needed_by_cutoff"), - RowDescriptor::int8_col(b"snapshot_nonrelfiles_needed_by_branches"), - RowDescriptor::int8_col(b"snapshot_nonrelfiles_not_updated"), - RowDescriptor::int8_col(b"snapshot_nonrelfiles_removed"), - RowDescriptor::int8_col(b"snapshot_nonrelfiles_dropped"), + RowDescriptor::int8_col(b"layer_relfiles_total"), + RowDescriptor::int8_col(b"layer_relfiles_needed_by_cutoff"), + RowDescriptor::int8_col(b"layer_relfiles_needed_by_branches"), + RowDescriptor::int8_col(b"layer_relfiles_not_updated"), + RowDescriptor::int8_col(b"layer_relfiles_removed"), + RowDescriptor::int8_col(b"layer_relfiles_dropped"), + RowDescriptor::int8_col(b"layer_nonrelfiles_total"), + RowDescriptor::int8_col(b"layer_nonrelfiles_needed_by_cutoff"), + RowDescriptor::int8_col(b"layer_nonrelfiles_needed_by_branches"), + RowDescriptor::int8_col(b"layer_nonrelfiles_not_updated"), + RowDescriptor::int8_col(b"layer_nonrelfiles_removed"), + RowDescriptor::int8_col(b"layer_nonrelfiles_dropped"), RowDescriptor::int8_col(b"elapsed"), ]))? 
.write_message_noflush(&BeMessage::DataRow(&[ diff --git a/pageserver/src/restore_local_repo.rs b/pageserver/src/restore_local_repo.rs index eb639c0375..4598661bf9 100644 --- a/pageserver/src/restore_local_repo.rs +++ b/pageserver/src/restore_local_repo.rs @@ -45,7 +45,6 @@ pub fn import_timeline_from_postgres_datadir( match direntry.file_name().to_str() { None => continue, - // These special files appear in the snapshot, but are not needed by the page server Some("pg_control") => { import_nonrel_file(timeline, lsn, RelishTag::ControlFile, &direntry.path())?; // Extract checkpoint record from pg_control and store is as separate object @@ -93,7 +92,6 @@ pub fn import_timeline_from_postgres_datadir( match direntry.file_name().to_str() { None => continue, - // These special files appear in the snapshot, but are not needed by the page server Some("PG_VERSION") => continue, Some("pg_filenode.map") => import_nonrel_file( timeline, @@ -153,7 +151,7 @@ fn import_relfile( let p = parse_relfilename(path.file_name().unwrap().to_str().unwrap()); if let Err(e) = p { - warn!("unrecognized file in snapshot: {:?} ({})", path, e); + warn!("unrecognized file in postgres datadir: {:?} ({})", path, e); return Err(e.into()); } let (relnode, forknum, segno) = p.unwrap(); diff --git a/test_runner/batch_others/test_snapfiles_gc.py b/test_runner/batch_others/test_snapfiles_gc.py index 99e4b2747d..957194d7d5 100644 --- a/test_runner/batch_others/test_snapfiles_gc.py +++ b/test_runner/batch_others/test_snapfiles_gc.py @@ -6,19 +6,19 @@ pytest_plugins = ("fixtures.zenith_fixtures") def print_gc_result(row): print("GC duration {elapsed} ms".format_map(row)); - print(" REL total: {snapshot_relfiles_total}, needed_by_cutoff {snapshot_relfiles_needed_by_cutoff}, needed_by_branches: {snapshot_relfiles_needed_by_branches}, not_updated: {snapshot_relfiles_not_updated}, removed: {snapshot_relfiles_removed}, dropped: {snapshot_relfiles_dropped}".format_map(row)) - print(" NONREL total: {snapshot_nonrelfiles_total}, needed_by_cutoff {snapshot_nonrelfiles_needed_by_cutoff}, needed_by_branches: {snapshot_nonrelfiles_needed_by_branches}, not_updated: {snapshot_nonrelfiles_not_updated}, removed: {snapshot_nonrelfiles_removed}, dropped: {snapshot_nonrelfiles_dropped}".format_map(row)) + print(" REL total: {layer_relfiles_total}, needed_by_cutoff {layer_relfiles_needed_by_cutoff}, needed_by_branches: {layer_relfiles_needed_by_branches}, not_updated: {layer_relfiles_not_updated}, removed: {layer_relfiles_removed}, dropped: {layer_relfiles_dropped}".format_map(row)) + print(" NONREL total: {layer_nonrelfiles_total}, needed_by_cutoff {layer_nonrelfiles_needed_by_cutoff}, needed_by_branches: {layer_nonrelfiles_needed_by_branches}, not_updated: {layer_nonrelfiles_not_updated}, removed: {layer_nonrelfiles_removed}, dropped: {layer_nonrelfiles_dropped}".format_map(row)) # -# Test Garbage Collection of old snapshot files +# Test Garbage Collection of old layer files # # This test is pretty tightly coupled with the current implementation of layered # storage, in layered_repository.rs. 
# -def test_snapfiles_gc(zenith_cli, pageserver, postgres, pg_bin): - zenith_cli.run(["branch", "test_snapfiles_gc", "empty"]) - pg = postgres.create_start('test_snapfiles_gc') +def test_layerfiles_gc(zenith_cli, pageserver, postgres, pg_bin): + zenith_cli.run(["branch", "test_layerfiles_gc", "empty"]) + pg = postgres.create_start('test_layerfiles_gc') with closing(pg.connect()) as conn: with conn.cursor() as cur: @@ -55,8 +55,8 @@ def test_snapfiles_gc(zenith_cli, pageserver, postgres, pg_bin): row = pscur.fetchone() print_gc_result(row); # remember the number of files - snapshot_relfiles_remain = row['snapshot_relfiles_total'] - row['snapshot_relfiles_removed'] - assert snapshot_relfiles_remain > 0 + layer_relfiles_remain = row['layer_relfiles_total'] - row['layer_relfiles_removed'] + assert layer_relfiles_remain > 0 # Insert a row. print("Inserting one row and running GC") @@ -64,12 +64,12 @@ def test_snapfiles_gc(zenith_cli, pageserver, postgres, pg_bin): pscur.execute(f"do_gc {pageserver.initial_tenant} {timeline} 0") row = pscur.fetchone() print_gc_result(row); - assert row['snapshot_relfiles_total'] == snapshot_relfiles_remain + 1 - assert row['snapshot_relfiles_removed'] == 1 - assert row['snapshot_relfiles_dropped'] == 0 + assert row['layer_relfiles_total'] == layer_relfiles_remain + 1 + assert row['layer_relfiles_removed'] == 1 + assert row['layer_relfiles_dropped'] == 0 # Insert two more rows and run GC. - # This should create a new snapshot file with the new contents, and + # This should create a new layer file with the new contents, and # remove the old one. print("Inserting two more rows and running GC") cur.execute("INSERT INTO foo VALUES (2)") @@ -78,11 +78,11 @@ def test_snapfiles_gc(zenith_cli, pageserver, postgres, pg_bin): pscur.execute(f"do_gc {pageserver.initial_tenant} {timeline} 0") row = pscur.fetchone() print_gc_result(row); - assert row['snapshot_relfiles_total'] == snapshot_relfiles_remain + 1 - assert row['snapshot_relfiles_removed'] == 1 - assert row['snapshot_relfiles_dropped'] == 0 + assert row['layer_relfiles_total'] == layer_relfiles_remain + 1 + assert row['layer_relfiles_removed'] == 1 + assert row['layer_relfiles_dropped'] == 0 - # Do it again. Should again create a new snapshot file and remove old one. + # Do it again. Should again create a new layer file and remove old one. print("Inserting two more rows and running GC") cur.execute("INSERT INTO foo VALUES (2)") cur.execute("INSERT INTO foo VALUES (3)") @@ -90,18 +90,18 @@ def test_snapfiles_gc(zenith_cli, pageserver, postgres, pg_bin): pscur.execute(f"do_gc {pageserver.initial_tenant} {timeline} 0") row = pscur.fetchone() print_gc_result(row); - assert row['snapshot_relfiles_total'] == snapshot_relfiles_remain + 1 - assert row['snapshot_relfiles_removed'] == 1 - assert row['snapshot_relfiles_dropped'] == 0 + assert row['layer_relfiles_total'] == layer_relfiles_remain + 1 + assert row['layer_relfiles_removed'] == 1 + assert row['layer_relfiles_dropped'] == 0 # Run GC again, with no changes in the database. Should not remove anything. 
print("Run GC again, with nothing to do") pscur.execute(f"do_gc {pageserver.initial_tenant} {timeline} 0") row = pscur.fetchone() print_gc_result(row); - assert row['snapshot_relfiles_total'] == snapshot_relfiles_remain - assert row['snapshot_relfiles_removed'] == 0 - assert row['snapshot_relfiles_dropped'] == 0 + assert row['layer_relfiles_total'] == layer_relfiles_remain + assert row['layer_relfiles_removed'] == 0 + assert row['layer_relfiles_dropped'] == 0 # # Test DROP TABLE checks that relation data and metadata was deleted by GC from object storage @@ -114,11 +114,11 @@ def test_snapfiles_gc(zenith_cli, pageserver, postgres, pg_bin): print_gc_result(row); # Each relation fork is counted separately, hence 3. - assert row['snapshot_relfiles_dropped'] == 3 + assert row['layer_relfiles_dropped'] == 3 - # The catalog updates also create new snapshot files of the catalogs, which + # The catalog updates also create new layer files of the catalogs, which # are counted as 'removed' - assert row['snapshot_relfiles_removed'] > 0 + assert row['layer_relfiles_removed'] > 0 # TODO: perhaps we should count catalog and user relations separately, # to make this kind of testing more robust