Store base images in separate ImageLayers

Previously, a SnapshotLayer and the corresponding file on disk contained
the base image of every page in the segment at the start LSN, and all the
changes (= WAL records) in the range between the start and end LSNs. That
was a bit awkward, because we had to keep the base image of every page in
memory until we had accumulated enough WAL after the base image to write
out the layer. When it's time to write out a layer, we would rather replay
the WAL to reconstruct the most recent version of each page, to save that
effort later. That's on the assumption that the client will usually
request the most recent version, not some older one.

Split the SnapshotLayer into two structs: ImageLayer and DeltaLayer. An
image layer contains a "snapshot" of the segment at one specific LSN, and
no WAL records, whereas a delta layer contains WAL records in a range of
LSNs. To reconstruct a page version from a delta layer by performing
WAL redo, you also need the previous image layer. The delta layers are
thus "incremental" against the previous layer.
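
To make the split concrete, here is a minimal sketch of how a read
chains through the two layer kinds. The types are hypothetical
stand-ins, not the actual pageserver structs; the real code goes
through the Layer trait's get_page_reconstruct_data(), which appears
in the diffs below.

    use bytes::Bytes;

    // Hypothetical, simplified stand-ins for ImageLayer and DeltaLayer.
    #[allow(dead_code)]
    enum Layer {
        // A full image of every page in the segment at one LSN.
        Image { lsn: u64, pages: Vec<Bytes> },
        // WAL records in a range of LSNs, plus a link to the predecessor
        // layer that provides the base images.
        Delta {
            start_lsn: u64,
            end_lsn: u64,
            // (block number, record LSN, record payload)
            records: Vec<(u32, u64, Bytes)>,
            predecessor: Box<Layer>,
        },
    }

    fn reconstruct(layer: &Layer, blknum: u32, lsn: u64) -> Bytes {
        match layer {
            Layer::Image { pages, .. } => pages[blknum as usize].clone(),
            Layer::Delta { records, predecessor, .. } => {
                // Fetch the base image from the predecessor layer...
                let mut page = reconstruct(predecessor, blknum, lsn);
                // ...and replay the applicable WAL records on top of it.
                for (blk, rec_lsn, rec) in records {
                    if *blk == blknum && *rec_lsn <= lsn {
                        page = apply_wal_record(page, rec);
                    }
                }
                page
            }
        }
    }

    // Placeholder for the real WAL-redo machinery.
    fn apply_wal_record(page: Bytes, _rec: &Bytes) -> Bytes {
        page
    }

    fn main() {
        let base = Layer::Image {
            lsn: 100,
            pages: vec![Bytes::from_static(b"page 0, materialized at LSN 100")],
        };
        let delta = Layer::Delta {
            start_lsn: 100,
            end_lsn: 200,
            records: vec![(0, 150, Bytes::from_static(b"WAL record at LSN 150"))],
            predecessor: Box::new(base),
        };
        // Replays the record at LSN 150 on top of the base image at LSN 100.
        let _page = reconstruct(&delta, 0, 175);
    }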

So where previously we would create snapshot files like this:

    rel_100_200
    rel_200_300
    rel_300_400

We now create image and delta files like this:

    rel_100      # image
    rel_100_200  # delta
    rel_200
    rel_200_300
    rel_300
    rel_300_400
    rel_400

That's more files, but as discussed above, this allows storing more
up-to-date page versions on disk, which should reduce the latency of
responding to a GetPage request. It also allows more fine-grained garbage
collection. In the above example, once the old page versions are no longer
needed and if the relation is not modified anymore, we only need to keep
the latest image file, 'rel_400'; everything else can be removed.
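
The retention rule in that example fits in a few lines. Here is a toy
sketch with a hypothetical can_remove() helper, not the gc_timeline()
implementation from the diff below; it assumes the GC cutoff has
advanced to LSN 450 and the newest image of the segment is rel_400:

    enum Kind {
        Image,
        Delta,
    }

    fn can_remove(kind: Kind, end_lsn: u64, cutoff: u64, newest_image_lsn: u64) -> bool {
        match kind {
            // A delta is removable once it lies wholly below the cutoff and
            // an image at or after its end LSN can serve any later request.
            Kind::Delta => end_lsn < cutoff && newest_image_lsn >= end_lsn,
            // An image is removable only if a strictly newer image exists;
            // the latest image must be kept as the base for future deltas.
            Kind::Image => end_lsn < cutoff && newest_image_lsn > end_lsn,
        }
    }

    fn main() {
        let files = [
            ("rel_100", Kind::Image, 100),
            ("rel_100_200", Kind::Delta, 200),
            ("rel_200", Kind::Image, 200),
            ("rel_200_300", Kind::Delta, 300),
            ("rel_300", Kind::Image, 300),
            ("rel_300_400", Kind::Delta, 400),
            ("rel_400", Kind::Image, 400),
        ];
        for (name, kind, end_lsn) in files {
            // Prints "remove = true" for everything except rel_400.
            println!("{}: remove = {}", name, can_remove(kind, end_lsn, 450, 400));
        }
    }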

Implements https://github.com/zenithdb/zenith/issues/339
Author: Heikki Linnakangas
Date:   2021-08-25 17:58:00 +03:00
parent 40c79988a8
commit 4902d1daa8
10 changed files with 1066 additions and 359 deletions

--- layered_repository.rs ---

@@ -1,13 +1,12 @@
//!
//! Zenith repository implementation that keeps old data in "snapshot files", and
//! the recent changes in memory. See layered_repository/snapshot_layer.rs and
//! layered_repository/inmemory_layer.rs, respectively. The functions here are
//! responsible for locating the correct layer for the get/put call, tracing
//! timeline branching history as needed.
//! Zenith repository implementation that keeps old data in files on disk, and
//! the recent changes in memory. See layered_repository/*_layer.rs files.
//! The functions here are responsible for locating the correct layer for the
//! get/put call, tracing timeline branching history as needed.
//!
//! The snapshot files are stored in the .zenith/tenants/<tenantid>/timelines/<timelineid>
//! The files are stored in the .zenith/tenants/<tenantid>/timelines/<timelineid>
//! directory. See layered_repository/README for how the files are managed.
//! In addition to the snapshot files, there is a metadata file in the same
//! In addition to the layer files, there is a metadata file in the same
//! directory that contains information about the timeline, in particular its
//! parent timeline, and the last LSN that has been written to disk.
//!
@@ -34,20 +33,23 @@ use crate::walredo::WalRedoManager;
use crate::PageServerConf;
use crate::{ZTenantId, ZTimelineId};
use zenith_metrics::{register_histogram, Histogram};
use zenith_metrics::{register_histogram_vec, HistogramVec};
use zenith_utils::bin_ser::BeSer;
use zenith_utils::lsn::{AtomicLsn, Lsn};
use zenith_utils::seqwait::SeqWait;
mod delta_layer;
mod filename;
mod image_layer;
mod inmemory_layer;
mod layer_map;
mod snapshot_layer;
mod storage_layer;
use delta_layer::DeltaLayer;
use image_layer::ImageLayer;
use inmemory_layer::InMemoryLayer;
use layer_map::LayerMap;
use snapshot_layer::SnapshotLayer;
use storage_layer::{Layer, PageReconstructData, SegmentTag, RELISH_SEG_SIZE};
static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; 8192]);
@@ -74,6 +76,15 @@ lazy_static! {
.expect("failed to define a metric");
}
// Metrics collected on operations on the storage repository.
lazy_static! {
static ref RECONSTRUCT_TIME: Histogram = register_histogram!(
"pageserver_getpage_reconstruct_time",
"FIXME Time spent on storage operations"
)
.expect("failed to define a metric");
}
///
/// Repository consists of multiple timelines. Keep them in a hash table.
///
@@ -198,7 +209,7 @@ impl LayeredRepository {
self.walredo_mgr.clone(),
)?;
// List the snapshot layers on disk, and load them into the layer map
// List the layers on disk, and load them into the layer map
timeline.load_layer_map()?;
// Load any new WAL after the last checkpoint into memory.
@@ -318,7 +329,7 @@ impl LayeredRepository {
// 2. Scan all timelines, and on each timeline, make note of the
// all the points where other timelines have been branched off.
// We will refrain from removing page versions at those LSNs.
// 3. For each timeline, scan all snapshot files on the timeline.
// 3. For each timeline, scan all layer files on the timeline.
// Remove all files for which a newer file exists and which
// don't cover any branch point LSNs.
//
@@ -509,7 +520,8 @@ impl Timeline for LayeredTimeline {
let seg = SegmentTag::from_blknum(rel, blknum);
if let Some((layer, lsn)) = self.get_layer_for_read(seg, lsn)? {
self.materialize_page(seg, blknum, lsn, &*layer)
RECONSTRUCT_TIME
.observe_closure_duration(|| self.materialize_page(seg, blknum, lsn, &*layer))
} else {
bail!("relish {} not found at {}", rel, lsn);
}
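
An aside on the metric introduced in this hunk: observe_closure_duration()
is the Histogram helper from the prometheus crate, which zenith_metrics
appears to wrap, judging by the register_histogram! calls above. A
standalone sketch of the timing pattern, with a made-up metric name:

    use prometheus::{register_histogram, Histogram};

    fn main() {
        let reconstruct_time: Histogram = register_histogram!(
            "demo_getpage_reconstruct_time",
            "Time spent reconstructing a page"
        )
        .expect("failed to define a metric");

        // Runs the closure, records its wall-clock duration in the
        // histogram, and passes the closure's return value through.
        let page = reconstruct_time.observe_closure_duration(|| vec![0u8; 8192]);
        assert_eq!(page.len(), 8192);
    }
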
@@ -886,7 +898,7 @@ impl LayeredTimeline {
}
///
/// Load the list of snapshot files from disk, populating the layer map
/// Scan the timeline directory to populate the layer map
///
fn load_layer_map(&self) -> anyhow::Result<()> {
info!(
@@ -894,23 +906,53 @@ impl LayeredTimeline {
self.timelineid
);
let mut layers = self.layers.lock().unwrap();
let snapfilenames =
filename::list_snapshot_files(self.conf, self.timelineid, self.tenantid)?;
let (imgfilenames, mut deltafilenames) =
filename::list_files(self.conf, self.timelineid, self.tenantid)?;
for filename in snapfilenames.iter() {
let layer = SnapshotLayer::load_snapshot_layer(self.conf, self.timelineid, self.tenantid, filename)?;
// First create ImageLayer structs for each image file.
for filename in imgfilenames.iter() {
let layer = ImageLayer::new(self.conf, self.timelineid, self.tenantid, filename);
info!(
"found layer {} {}-{} {} on timeline {}",
"found layer {} {} on timeline {}",
layer.get_seg_tag(),
layer.get_start_lsn(),
layer.get_end_lsn(),
layer.is_dropped(),
self.timelineid
);
layers.insert_historic(Arc::new(layer));
}
// Then for the Delta files. The delta files are created in order starting
// from the oldest file, because each DeltaLayer needs a reference to its
// predecessor.
deltafilenames.sort();
for filename in deltafilenames.iter() {
let predecessor = layers.get(&filename.seg, filename.start_lsn);
let predecessor_str: String = if let Some(prec) = &predecessor {
prec.filename().display().to_string()
} else {
"none".to_string()
};
let layer = DeltaLayer::new(
self.conf,
self.timelineid,
self.tenantid,
filename,
predecessor,
);
info!(
"found layer {} on timeline {}, predecessor: {}",
layer.filename().display(),
self.timelineid,
predecessor_str,
);
layers.insert_historic(Arc::new(layer));
}
Ok(())
}
@@ -1040,10 +1082,9 @@ impl LayeredTimeline {
prev_layer.get_start_lsn(),
prev_layer.get_end_lsn()
);
layer = InMemoryLayer::copy_snapshot(
layer = InMemoryLayer::create_successor_layer(
self.conf,
&self,
&*prev_layer,
prev_layer,
self.timelineid,
self.tenantid,
start_lsn,
@@ -1121,7 +1162,7 @@ impl LayeredTimeline {
let mut layers = self.layers.lock().unwrap();
// Take the in-memory layer with the oldest WAL record. If it's older
// than the threshold, write it out to disk as a new snapshot file.
// than the threshold, write it out to disk as a new image and delta file.
// Repeat until all remaining in-memory layers are within the threshold.
//
// That's necessary to limit the amount of WAL that needs to be kept
@@ -1145,21 +1186,23 @@ impl LayeredTimeline {
}
// freeze it
let (new_historic, new_open) = oldest_layer.freeze(last_valid_lsn, &self)?;
let (new_historics, new_open) = oldest_layer.freeze(last_valid_lsn, &self)?;
// replace this layer with the new layers that 'freeze' returned
layers.pop_oldest_open();
if let Some(n) = new_open {
layers.insert_open(n);
}
layers.insert_historic(new_historic);
for n in new_historics {
layers.insert_historic(n);
}
}
// Call unload() on all frozen layers, to release memory.
// TODO: On-disk layers shouldn't consume much memory to begin with,
// so this shouldn't be necessary. But currently the SnapshotLayer
// code slurps the whole file into memory, so they do in fact consume
// a lot of memory.
// so this shouldn't be necessary. But currently the DeltaLayer and
// ImageLayer code slurps the whole file into memory, so they do in
// fact consume a lot of memory.
for layer in layers.iter_historic_layers() {
layer.unload()?;
}
@@ -1199,7 +1242,7 @@ impl LayeredTimeline {
}
///
/// Garbage collect snapshot files on a timeline that are no longer needed.
/// Garbage collect layer files on a timeline that are no longer needed.
///
/// The caller specifies how much history is needed with the two arguments:
///
@@ -1217,7 +1260,7 @@ impl LayeredTimeline {
/// to figure out what read-only nodes might actually need.)
///
/// Currently, we don't make any attempt at removing unneeded page versions
/// within a snapshot file. We can only remove the whole file if it's fully
/// within a layer file. We can only remove the whole file if it's fully
/// obsolete.
///
pub fn gc_timeline(&self, retain_lsns: Vec<Lsn>, cutoff: Lsn) -> Result<GcResult> {
@@ -1229,9 +1272,9 @@ impl LayeredTimeline {
self.timelineid, cutoff
);
let mut layers_to_remove: Vec<Arc<SnapshotLayer>> = Vec::new();
let mut layers_to_remove: Vec<Arc<dyn Layer>> = Vec::new();
// Scan all snapshot files in the directory. For each file, if a newer file
// Scan all layer files in the directory. For each file, if a newer file
// exists, we can remove the old one.
//
// Determine for each file if it needs to be retained
@@ -1242,9 +1285,9 @@ impl LayeredTimeline {
let seg = l.get_seg_tag();
if seg.rel.is_relation() {
result.snapshot_relfiles_total += 1;
result.ondisk_relfiles_total += 1;
} else {
result.snapshot_nonrelfiles_total += 1;
result.ondisk_nonrelfiles_total += 1;
}
// Is it newer than cutoff point?
@@ -1257,9 +1300,9 @@ impl LayeredTimeline {
cutoff
);
if seg.rel.is_relation() {
result.snapshot_relfiles_needed_by_cutoff += 1;
result.ondisk_relfiles_needed_by_cutoff += 1;
} else {
result.snapshot_nonrelfiles_needed_by_cutoff += 1;
result.ondisk_nonrelfiles_needed_by_cutoff += 1;
}
continue 'outer;
}
@@ -1276,20 +1319,21 @@ impl LayeredTimeline {
*retain_lsn
);
if seg.rel.is_relation() {
result.snapshot_relfiles_needed_by_branches += 1;
result.ondisk_relfiles_needed_by_branches += 1;
} else {
result.snapshot_nonrelfiles_needed_by_branches += 1;
result.ondisk_nonrelfiles_needed_by_branches += 1;
}
continue 'outer;
}
}
// Unless the relation was dropped, is there a later snapshot file for this relation?
if !l.is_dropped() && !layers.newer_layer_exists(l.get_seg_tag(), l.get_end_lsn()) {
// Unless the relation was dropped, is there a later image file for this relation?
if !l.is_dropped() && !layers.newer_image_layer_exists(l.get_seg_tag(), l.get_end_lsn())
{
if seg.rel.is_relation() {
result.snapshot_relfiles_not_updated += 1;
result.ondisk_relfiles_not_updated += 1;
} else {
result.snapshot_nonrelfiles_not_updated += 1;
result.ondisk_nonrelfiles_not_updated += 1;
}
continue 'outer;
}
@@ -1314,15 +1358,15 @@ impl LayeredTimeline {
if doomed_layer.is_dropped() {
if doomed_layer.get_seg_tag().rel.is_relation() {
result.snapshot_relfiles_dropped += 1;
result.ondisk_relfiles_dropped += 1;
} else {
result.snapshot_nonrelfiles_dropped += 1;
result.ondisk_nonrelfiles_dropped += 1;
}
} else {
if doomed_layer.get_seg_tag().rel.is_relation() {
result.snapshot_relfiles_removed += 1;
result.ondisk_relfiles_removed += 1;
} else {
result.snapshot_nonrelfiles_removed += 1;
result.ondisk_nonrelfiles_removed += 1;
}
}
}

--- layered_repository/README ---

@@ -11,45 +11,48 @@ to one 10 MB slice of a PostgreSQL relation fork. The snapshot files
for each timeline are stored in the timeline's subdirectory under
.zenith/tenants/<tenantid>/timelines.
The files are named like this:
There are two kinds of snapshot file: base images and deltas. A base
image file contains a snapshot of a segment as it was at one LSN,
whereas a delta file contains modifications to a segment - mostly in
the form of WAL records - in a range of LSNs.
base image file:
rel_<spcnode>_<dbnode>_<relnode>_<forknum>_<segno>_<start LSN>
delta file:
rel_<spcnode>_<dbnode>_<relnode>_<forknum>_<segno>_<start LSN>_<end LSN>
For example:
rel_1663_13990_2609_0_10_000000000169C348
rel_1663_13990_2609_0_10_000000000169C348_0000000001702000
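
So a name ending in one LSN is a base image, and a name ending in two
LSNs is a delta. A hypothetical mini-classifier illustrating just the
name shapes (the real parsers are ImageFileName::from_str and
DeltaFileName::from_str in filename.rs; the _DROPPED suffix is ignored
here for brevity):

    fn classify(fname: &str) -> Option<&'static str> {
        // Count trailing 16-digit hex LSN components, scanning from the end.
        let lsns = fname
            .rsplit('_')
            .take_while(|p| p.len() == 16 && u64::from_str_radix(p, 16).is_ok())
            .count();
        match lsns {
            1 => Some("image"),
            2 => Some("delta"),
            _ => None,
        }
    }

    fn main() {
        assert_eq!(
            classify("rel_1663_13990_2609_0_10_000000000169C348"),
            Some("image")
        );
        assert_eq!(
            classify("rel_1663_13990_2609_0_10_000000000169C348_0000000001702000"),
            Some("delta")
        );
    }
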
Some non-relation files are also stored in repository. For example,
a CLOG segment would be named like this:
In addition to the relations, with the "rel_*" prefix, we use the same
format for storing various smaller files from the PostgreSQL data
directory. They use different suffixes, and the naming scheme up
to the LSNs varies. The Zenith source code uses the term "relish" to
mean "a relation, or other file that's treated like a relation in the
storage". For example, a base image of a CLOG segment would be named
like this:
pg_xact_0000_0_00000000198B06B0_00000000198C2550
pg_xact_0000_0_00000000198B06B0
There is no difference in how the relation and non-relation files are
managed, except that the first part of file names is different.
Internally, the relations and non-relation files that are managed in
the versioned store are together called "relishes".
Each snapshot file contains a full snapshot, that is, full copy of all
pages in the relation, as of the "start LSN". It also contains all WAL
records applicable to the relation between the start and end
LSNs. With this information, the page server can reconstruct any page
version of the relation in the LSN range.
If a file has been dropped, the last snapshot file for it is created
with the _DROPPED suffix, e.g.
rel_1663_13990_2609_0_10_000000000169C348_0000000001702000_DROPPED
In addition to the relations, with "rel_*" prefix, we use the same
format for storing various smaller files from the PostgreSQL data
directory. They will use different suffixes and the naming scheme
up to the LSN range varies. The Zenith source code uses the term
"relish" to mean "a relation, or other file that's treated like a
relation in the storage"
## Notation used in this document
The full path of a snapshot file looks like this:
The full path of a delta file looks like this:
.zenith/tenants/941ddc8604413b88b3d208bddf90396c/timelines/4af489b06af8eed9e27a841775616962/rel_1663_13990_2609_0_10_000000000169C348_0000000001702000
@@ -57,9 +60,10 @@ For simplicity, the examples below use a simplified notation for the
paths. The tenant ID is left out, the timeline ID is replaced with
the human-readable branch name, and spcnode+dbnode+relnode+forkum+segno
with a human-readable table name. The LSNs are also shorter. For
example, a snapshot file for 'orders' table on 'main' branch, with LSN
range 100-200 would be:
example, a base image file at LSN 100 and a delta file between 100-200
for 'orders' table on 'main' branch is represented like this:
main/orders_100
main/orders_100_200
@@ -68,10 +72,14 @@ range 100-200 would be:
Let's start with a simple example with a system that contains one
branch called 'main' and two tables, 'orders' and 'customers'. The end
of WAL is currently at LSN 250. In this starting situation, you would
have two files on disk:
have these files on disk:
main/orders_100
main/orders_100_200
main/orders_200
main/customers_100
main/customers_100_200
main/customers_200
In addition to those files, the recent changes between LSN 200 and the
end of WAL at 250 are kept in memory. If the page server crashes, the
@@ -87,20 +95,33 @@ the last checkpoint happened at LSN 400 but there hasn't been any
recent changes to 'customers' table, you would have these files on
disk:
main/orders_100
main/orders_100_200
main/orders_200
main/orders_200_300
main/orders_300
main/orders_300_400
main/orders_400
main/customers_100
main/customers_100_200
main/customers_200
If the customers table is modified later, a new file is created for it
at the next checkpoint. The new file will cover the "gap" from the
last snapshot file, so the LSN ranges are always contiguous:
main/orders_100
main/orders_100_200
main/orders_200
main/orders_200_300
main/orders_300
main/orders_300_400
main/orders_400
main/customers_100
main/customers_100_200
main/customers_200
main/customers_200_500
main/customers_500
## Reading page versions
@@ -120,18 +141,6 @@ relation at the start LSN and the WAL, reconstructing the page
involves replaying any WAL records applicable to the page between LSNs
200-250, starting from the base image at LSN 200.
A request at a file boundary can be satisfied using either file. For
example, if there are two files on disk:
main/orders_100_200
main/orders_200_300
And a request comes with LSN 200, either file can be used for it. It
is better to use the later file, however, because it contains an
already materialized version of all the pages at LSN 200. Using the
first file, you would need to apply any WAL records between 100 and
200 to reconstruct the requested page.
# Multiple branches
Imagine that a child branch is created at LSN 250:
@@ -145,12 +154,20 @@ Imagine that a child branch is created at LSN 250:
Then, the 'orders' table is updated differently on the 'main' and
'child' branches. You now have this situation on disk:
main/orders_100
main/orders_100_200
main/orders_200
main/orders_200_300
main/orders_300
main/orders_300_400
main/orders_400
main/customers_100
main/customers_100_200
main/customers_200
child/orders_250_300
child/orders_300
child/orders_300_400
child/orders_400
Because the 'customers' table hasn't been modified on the child
branch, there is no file for it there. If you request a page for it on
@@ -163,24 +180,34 @@ is linear, and the request's LSN identifies unambiguously which file
you need to look at. For example, the history for the 'orders' table
on the 'main' branch consists of these files:
main/orders_100
main/orders_100_200
main/orders_200
main/orders_200_300
main/orders_300
main/orders_300_400
main/orders_400
And from the 'child' branch's point of view, it consists of these
files:
main/orders_100
main/orders_100_200
main/orders_200
main/orders_200_300
child/orders_250_300
child/orders_300
child/orders_300_400
child/orders_400
The branch metadata includes the point where the child branch was
created, LSN 250. If a page request comes with LSN 275, we read the
page version from the 'child/orders_250_300' file. If the request LSN
is 225, we read it from the 'main/orders_200_300' file instead. The
page versions between 250-300 in the 'main/orders_200_300' file are
ignored when operating on the child branch.
page version from the 'child/orders_250_300' file. We might also
need to reconstruct the page version as it was at LSN 250, in order
to replay the WAL up to LSN 275, using 'main/orders_200_300' and
'main/orders_200'. The page versions between 250-300 in the
'main/orders_200_300' file are ignored when operating on the child
branch.
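
A toy sketch of that branch-aware lookup, with hypothetical types
rather than the pageserver's actual timeline machinery:

    struct Branch<'a> {
        name: &'a str,
        // LSN at which this branch was forked off its parent.
        fork_lsn: u64,
        parent: Option<&'a Branch<'a>>,
    }

    fn timeline_for<'a>(branch: &'a Branch<'a>, request_lsn: u64) -> &'a str {
        // Walk towards the root until the requested LSN falls at or after
        // the current branch's fork point.
        let mut b = branch;
        while let Some(parent) = b.parent {
            if request_lsn >= b.fork_lsn {
                break;
            }
            b = parent;
        }
        b.name
    }

    fn main() {
        let main_branch = Branch { name: "main", fork_lsn: 0, parent: None };
        let child = Branch { name: "child", fork_lsn: 250, parent: Some(&main_branch) };
        // LSN 275 is past the fork point: the child's own layers apply.
        assert_eq!(timeline_for(&child, 275), "child");
        // LSN 225 predates the fork: fall back to the 'main' layers.
        assert_eq!(timeline_for(&child, 225), "main");
    }
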
Note: It doesn't make any difference if the child branch is created
when the end of the main branch was at LSN 250, or later when the tip of
@@ -204,17 +231,37 @@ Let's look at the single branch scenario again. Imagine that the end
of the branch is LSN 525, so that the GC horizon is currently at
525-150 = 375
main/orders_100
main/orders_100_200
main/orders_200
main/orders_200_300
main/orders_300
main/orders_300_400
main/orders_400
main/orders_400_500
main/orders_500
main/customers_100
main/customers_100_200
main/customers_200
We can remove files 'main/orders_100_200' and 'main/orders_200_300',
because the end LSNs of those files are older than GC horizon 375, and
there are more recent snapshot files for the table. 'main/orders_300_400'
and 'main/orders_400_500' are still within the horizon, so they must be
retained. 'main/customers_100_200' is old enough, but it cannot be
We can remove the following files because the end LSNs of those files are
older than GC horizon 375, and there are more recent snapshot files for the
table:
main/orders_100 DELETE
main/orders_100_200 DELETE
main/orders_200 DELETE
main/orders_200_300 DELETE
main/orders_300 STILL NEEDED BY orders_300_400
main/orders_300_400 KEEP, NEWER THAN GC HORIZON
main/orders_400 ..
main/orders_400_500 ..
main/orders_500 ..
main/customers_100 DELETE
main/customers_100_200 DELETE
main/customers_200 KEEP, NO NEWER VERSION
'main/customers_100_200' is old enough, but it cannot be
removed because there is no newer snapshot file for the table.
Things get slightly more complicated with multiple branches. All of
@@ -223,41 +270,47 @@ retain older snapshot files that are still needed by child branches.
For example, if child branch is created at LSN 150, and the 'customers'
table is updated on the branch, you would have these files:
main/orders_100_200
main/orders_200_300
main/orders_300_400
main/orders_400_500
main/customers_100_200
child/customers_150_300
main/orders_100 KEEP, NEEDED BY child BRANCH
main/orders_100_200 KEEP, NEEDED BY child BRANCH
main/orders_200 DELETE
main/orders_200_300 DELETE
main/orders_300 KEEP, NEWER THAN GC HORIZON
main/orders_300_400 KEEP, NEWER THAN GC HORIZON
main/orders_400 KEEP, NEWER THAN GC HORIZON
main/orders_400_500 KEEP, NEWER THAN GC HORIZON
main/orders_500 KEEP, NEWER THAN GC HORIZON
main/customers_100 DELETE
main/customers_100_200 DELETE
main/customers_200 KEEP, NO NEWER VERSION
child/customers_150_300 DELETE
child/customers_300 KEEP, NO NEWER VERSION
In this situation, the 'main/orders_100_200' file cannot be removed,
even though it is older than the GC horizon, because it is still
needed by the child branch. 'main/orders_200_300' can still be
removed. So after garbage collection, these files would remain:
main/orders_100_200
main/orders_300_400
main/orders_400_500
main/customers_100_200
child/customers_150_300
In this situation, 'main/orders_100' and 'main/orders_100_200' cannot
be removed, even though they are older than the GC horizon, because
they are still needed by the child branch. 'main/orders_200'
and 'main/orders_200_300' can still be removed.
If 'orders' is modified later on the 'child' branch, we will create a
snapshot file for it on the child:
new base image and delta file for it on the child:
main/orders_100
main/orders_100_200
main/orders_300
main/orders_300_400
main/orders_400
main/orders_400_500
main/customers_100_200
child/customers_150_300
main/orders_500
main/customers_200
child/customers_300
child/orders_150_400
child/orders_400
After this, the 'main/orders_100_200' file can be removed. It is no
longer needed by the child branch, because there is a newer snapshot
file there. TODO: This optimization hasn't been implemented! The GC
algorithm will currently keep the file on the 'main' branch anyway, for
as long as the child branch exists.
After this, the 'main/orders_100' and 'main/orders_100_200' files could
be removed. They are no longer needed by the child branch, because there
is a newer snapshot file there. TODO: This optimization hasn't been
implemented! The GC algorithm will currently keep the files on the
'main' branch anyway, for as long as the child branch exists.
# TODO: On LSN ranges
@@ -265,21 +318,33 @@ as long as the child branch exists.
In principle, each relation can be checkpointed separately, i.e. the
LSN ranges of the files don't need to line up. So this would be legal:
main/orders_100
main/orders_100_200
main/orders_200
main/orders_200_300
main/orders_300
main/orders_300_400
main/orders_400
main/customers_150
main/customers_150_250
main/customers_250
main/customers_250_500
main/customers_500
However, the code currently always checkpoints all relations together.
So that situation doesn't arise in practice.
It would also be OK to have overlapping LSN ranges for the same relation:
main/orders_100
main/orders_100_200
main/orders_200
main/orders_200_300
main/orders_300
main/orders_250_350
main/orders_350
main/orders_300_400
main/orders_400
The code that reads the snapshot files should cope with this, but this
situation doesn't arise either, because the checkpointing code never
@@ -287,12 +352,14 @@ does that. It could be useful, however, as a transient state when
garbage collecting around branch points, or explicit recovery
points. For example, if we start with this:
main/orders_100
main/orders_100_200
main/orders_200
main/orders_200_300
main/orders_300_400
main/orders_300
And there is a branch or explicit recovery point at LSN 150, we could
replace 'main/orders_100_200' with 'main/orders_150_150' to keep a
replace 'main/orders_100_200' with 'main/orders_150' to keep a
snapshot only at that exact point that's still needed, removing the
other page versions around it. But such compaction has not been
implemented yet.

--- layered_repository/delta_layer.rs (renamed from snapshot_layer.rs) ---

@@ -1,46 +1,43 @@
//!
//! A SnapshotLayer represents one snapshot file on disk. One file holds all page
//! version and size information of one relation, in a range of LSN.
//! The name "snapshot file" is a bit of a misnomer because a snapshot file doesn't
//! contain a snapshot at a specific LSN, but rather all the page versions in a range
//! of LSNs.
//! A DeltaLayer represents a collection of WAL records or page images in a range of
//! LSNs, for one segment. It is stored on a file on disk.
//!
//! Currently, a snapshot file contains full information needed to reconstruct any
//! page version in the LSN range, without consulting any other snapshot files. When
//! a new snapshot file is created for writing, the full contents of relation are
//! materialized as it is at the beginning of the LSN range. That can be very expensive,
//! we should find a way to store differential files. But this keeps the read-side
//! of things simple. You can find the correct snapshot file based on RelishTag and
//! timeline+LSN, and once you've located it, you have all the data you need to in that
//! file.
//! Usually a delta layer only contains differences - in the form of WAL records against
//! a base LSN. However, if a segment is newly created, by creating a new relation or
//! extending an old one, there might be no base image. In that case, all the entries in
//! the delta layer must be page images or WAL records with the 'will_init' flag set, so
//! that they can be replayed without referring to an older page version. Also in some
//! circumstances, the predecessor layer might actually be another delta layer. That
//! can happen when you create a new branch in the middle of a delta layer, and the WAL
//! records on the new branch are put in a new delta layer.
//!
//! When a snapshot file needs to be accessed, we slurp the whole file into memory, into
//! the SnapshotLayer struct. See load() and unload() functions.
//! When a delta file needs to be accessed, we slurp the whole file into memory, into
//! the DeltaLayer struct. See load() and unload() functions.
//!
//! On disk, the snapshot files are stored in timelines/<timelineid> directory.
//! Currently, there are no subdirectories, and each snapshot file is named like this:
//! On disk, the delta files are stored in timelines/<timelineid> directory.
//! Currently, there are no subdirectories, and each delta file is named like this:
//!
//! <spcnode>_<dbnode>_<relnode>_<forknum>_<start LSN>_<end LSN>
//! <spcnode>_<dbnode>_<relnode>_<forknum>_<segno>_<start LSN>_<end LSN>
//!
//! For example:
//!
//! 1663_13990_2609_0_000000000169C348_000000000169C349
//! 1663_13990_2609_0_5_000000000169C348_000000000169C349
//!
//! If a relation is dropped, we add a '_DROPPED' to the end of the filename to indicate that.
//! So the above example would become:
//!
//! 1663_13990_2609_0_000000000169C348_000000000169C349_DROPPED
//! 1663_13990_2609_0_5_000000000169C348_000000000169C349_DROPPED
//!
//! In that case, the end LSN indicates when it was dropped; we don't store it in the
//! file contents in any way.
//!
//! A snapshot file is constructed using the 'bookfile' crate. Each file consists of two
//! A delta file is constructed using the 'bookfile' crate. Each file consists of two
//! parts: the page versions and the relation sizes. They are stored as separate chapters.
//!
use crate::layered_repository::filename::DeltaFileName;
use crate::layered_repository::storage_layer::{
Layer, PageReconstructData, PageVersion, SegmentTag,
};
use crate::layered_repository::filename::{SnapshotFileName};
use crate::PageServerConf;
use crate::{ZTenantId, ZTimelineId};
use anyhow::{bail, Result};
@@ -51,28 +48,28 @@ use std::fs::File;
use std::io::Write;
use std::ops::Bound::Included;
use std::path::PathBuf;
use std::sync::{Mutex, MutexGuard};
use std::sync::{Arc, Mutex, MutexGuard};
use bookfile::{Book, BookWriter};
use zenith_utils::bin_ser::BeSer;
use zenith_utils::lsn::Lsn;
// Magic constant to identify a Zenith snapshot file
static SNAPSHOT_FILE_MAGIC: u32 = 0x5A616E01;
// Magic constant to identify a Zenith delta file
static DELTA_FILE_MAGIC: u32 = 0x5A616E01;
static PAGE_VERSIONS_CHAPTER: u64 = 1;
static REL_SIZES_CHAPTER: u64 = 2;
///
/// SnapshotLayer is the in-memory data structure associated with an
/// on-disk snapshot file. We keep a SnapshotLayer in memory for each
/// DeltaLayer is the in-memory data structure associated with an
/// on-disk delta file. We keep a DeltaLayer in memory for each
/// file, in the LayerMap. If a layer is in "loaded" state, we have a
/// copy of the file in memory, in 'inner'. Otherwise the struct is
/// just a placeholder for a file that exists on disk, and it needs to
/// be loaded before using it in queries.
///
pub struct SnapshotLayer {
pub struct DeltaLayer {
conf: &'static PageServerConf,
pub tenantid: ZTenantId,
pub timelineid: ZTimelineId,
@@ -81,15 +78,19 @@ pub struct SnapshotLayer {
//
// This entry contains all the changes from 'start_lsn' to 'end_lsn'. The
// start is inclusive, and end is exclusive.
//
pub start_lsn: Lsn,
pub end_lsn: Lsn,
dropped: bool,
inner: Mutex<SnapshotLayerInner>,
/// Base layer preceding this layer.
predecessor: Option<Arc<dyn Layer>>,
inner: Mutex<DeltaLayerInner>,
}
pub struct SnapshotLayerInner {
pub struct DeltaLayerInner {
/// If false, the 'page_versions' and 'relsizes' have not been
/// loaded into memory yet.
loaded: bool,
@@ -102,7 +103,7 @@ pub struct SnapshotLayerInner {
relsizes: BTreeMap<Lsn, u32>,
}
impl Layer for SnapshotLayer {
impl Layer for DeltaLayer {
fn get_timeline_id(&self) -> ZTimelineId {
return self.timelineid;
}
@@ -123,6 +124,18 @@ impl Layer for SnapshotLayer {
return self.end_lsn;
}
fn filename(&self) -> PathBuf {
PathBuf::from(
DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
dropped: self.dropped,
}
.to_string(),
)
}
/// Look up given page in the cache.
fn get_page_reconstruct_data(
&self,
@@ -159,6 +172,24 @@ impl Layer for SnapshotLayer {
}
}
// Use the base image, if needed
if let Some(need_lsn) = need_base_image_lsn {
if let Some(predecessor) = &self.predecessor {
need_base_image_lsn = predecessor.get_page_reconstruct_data(
blknum,
need_lsn,
reconstruct_data,
)?;
} else {
bail!(
"no base img found for {} at blk {} at LSN {}",
self.seg,
blknum,
lsn
);
}
}
// release lock on 'inner'
}
@@ -167,26 +198,22 @@ impl Layer for SnapshotLayer {
/// Get size of the relation at given LSN
fn get_seg_size(&self, lsn: Lsn) -> Result<u32> {
assert!(lsn >= self.start_lsn);
// Scan the BTreeMap backwards, starting from the given entry.
let inner = self.load()?;
let mut iter = inner.relsizes.range((Included(&Lsn(0)), Included(&lsn)));
let result;
if let Some((_entry_lsn, entry)) = iter.next_back() {
let result = *entry;
drop(inner);
trace!("get_seg_size: {} at {} -> {}", self.seg, lsn, result);
Ok(result)
result = *entry;
// Use the base image if needed
} else if let Some(predecessor) = &self.predecessor {
result = predecessor.get_seg_size(lsn)?;
} else {
error!(
"No size found for {} at {} in snapshot layer {} {}-{}",
self.seg, lsn, self.seg, self.start_lsn, self.end_lsn
);
bail!(
"No size found for {} at {} in snapshot layer",
self.seg,
lsn
);
result = 0;
}
Ok(result)
}
/// Does this segment exist at given LSN?
@@ -199,15 +226,37 @@ impl Layer for SnapshotLayer {
// Otherwise, it exists.
Ok(true)
}
///
/// Release most of the memory used by this layer. If it's accessed again later,
/// it will need to be loaded back.
///
fn unload(&self) -> Result<()> {
let mut inner = self.inner.lock().unwrap();
inner.page_versions = BTreeMap::new();
inner.relsizes = BTreeMap::new();
inner.loaded = false;
Ok(())
}
fn delete(&self) -> Result<()> {
// delete underlying file
fs::remove_file(self.path())?;
Ok(())
}
fn is_incremental(&self) -> bool {
true
}
}
impl SnapshotLayer {
impl DeltaLayer {
fn path(&self) -> PathBuf {
Self::path_for(
self.conf,
self.timelineid,
self.tenantid,
&SnapshotFileName {
&DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
@@ -220,13 +269,13 @@ impl SnapshotLayer {
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
fname: &SnapshotFileName,
fname: &DeltaFileName,
) -> PathBuf {
conf.timeline_path(&timelineid, &tenantid)
.join(fname.to_string())
}
/// Create a new snapshot file, using the given btreemaps containing the page versions and
/// Create a new delta file, using the given btreemaps containing the page versions and
/// relsizes.
///
/// This is used to write the in-memory layer to disk. The in-memory layer uses the same
@@ -240,10 +289,11 @@ impl SnapshotLayer {
start_lsn: Lsn,
end_lsn: Lsn,
dropped: bool,
predecessor: Option<Arc<dyn Layer>>,
page_versions: BTreeMap<(u32, Lsn), PageVersion>,
relsizes: BTreeMap<Lsn, u32>,
) -> Result<SnapshotLayer> {
let snapfile = SnapshotLayer {
) -> Result<DeltaLayer> {
let delta_layer = DeltaLayer {
conf: conf,
timelineid: timelineid,
tenantid: tenantid,
@@ -251,23 +301,24 @@ impl SnapshotLayer {
start_lsn: start_lsn,
end_lsn,
dropped,
inner: Mutex::new(SnapshotLayerInner {
inner: Mutex::new(DeltaLayerInner {
loaded: true,
page_versions: page_versions,
relsizes: relsizes,
}),
predecessor,
};
let inner = snapfile.inner.lock().unwrap();
let inner = delta_layer.inner.lock().unwrap();
// Write the in-memory btreemaps into a file
let path = snapfile.path();
let path = delta_layer.path();
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let file = File::create(&path)?;
let book = BookWriter::new(file, SNAPSHOT_FILE_MAGIC)?;
let book = BookWriter::new(file, DELTA_FILE_MAGIC)?;
// Write out page versions
// Write out the other page versions
let mut chapter = book.new_chapter(PAGE_VERSIONS_CHAPTER);
let buf = BTreeMap::ser(&inner.page_versions)?;
chapter.write_all(&buf)?;
@@ -285,13 +336,13 @@ impl SnapshotLayer {
drop(inner);
Ok(snapfile)
Ok(delta_layer)
}
///
/// Load the contents of the file into memory
///
fn load(&self) -> Result<MutexGuard<SnapshotLayerInner>> {
fn load(&self) -> Result<MutexGuard<DeltaLayerInner>> {
// quick exit if already loaded
let mut inner = self.inner.lock().unwrap();
@@ -303,7 +354,7 @@ impl SnapshotLayer {
self.conf,
self.timelineid,
self.tenantid,
&SnapshotFileName {
&DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
@@ -322,7 +373,7 @@ impl SnapshotLayer {
debug!("loaded from {}", &path.display());
*inner = SnapshotLayerInner {
*inner = DeltaLayerInner {
loaded: true,
page_versions,
relsizes,
@@ -331,16 +382,15 @@ impl SnapshotLayer {
Ok(inner)
}
/// Create SnapshotLayers representing all files on disk
///
// TODO: returning an Iterator would be more idiomatic
pub fn load_snapshot_layer(
/// Create a DeltaLayer struct representing an existing file on disk.
pub fn new(
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
filename: &SnapshotFileName,
) -> Result<SnapshotLayer> {
let snapfile = SnapshotLayer {
filename: &DeltaFileName,
predecessor: Option<Arc<dyn Layer>>,
) -> DeltaLayer {
DeltaLayer {
conf,
timelineid,
tenantid,
@@ -348,32 +398,13 @@ impl SnapshotLayer {
start_lsn: filename.start_lsn,
end_lsn: filename.end_lsn,
dropped: filename.dropped,
inner: Mutex::new(SnapshotLayerInner {
inner: Mutex::new(DeltaLayerInner {
loaded: false,
page_versions: BTreeMap::new(),
relsizes: BTreeMap::new(),
}),
};
Ok(snapfile)
}
pub fn delete(&self) -> Result<()> {
// delete underlying file
fs::remove_file(self.path())?;
Ok(())
}
///
/// Release most of the memory used by this layer. If it's accessed again later,
/// it will need to be loaded back.
///
pub fn unload(&self) -> Result<()> {
let mut inner = self.inner.lock().unwrap();
inner.page_versions = BTreeMap::new();
inner.relsizes = BTreeMap::new();
inner.loaded = false;
Ok(())
predecessor,
}
}
/// debugging function to print out the contents of the layer

--- layered_repository/filename.rs ---

@@ -1,4 +1,7 @@
use crate::layered_repository::storage_layer::{SegmentTag};
//!
//! Helper functions for dealing with filenames of the image and delta layer files.
//!
use crate::layered_repository::storage_layer::SegmentTag;
use crate::relish::*;
use crate::PageServerConf;
use crate::{ZTenantId, ZTimelineId};
@@ -9,24 +12,29 @@ use anyhow::Result;
use log::*;
use zenith_utils::lsn::Lsn;
// Note: LayeredTimeline::load_layer_map() relies on this sort order
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
pub struct SnapshotFileName {
pub struct DeltaFileName {
pub seg: SegmentTag,
pub start_lsn: Lsn,
pub end_lsn: Lsn,
pub dropped: bool,
}
impl SnapshotFileName {
/// Represents the filename of a DeltaLayer
///
/// <spcnode>_<dbnode>_<relnode>_<forknum>_<seg>_<start LSN>_<end LSN>
///
/// or if it was dropped:
///
/// <spcnode>_<dbnode>_<relnode>_<forknum>_<seg>_<start LSN>_<end LSN>_DROPPED
///
impl DeltaFileName {
///
/// Parse a string as a delta file name. Returns None if the filename does not
/// match the expected pattern.
///
fn from_str(fname: &str) -> Option<Self> {
// Split the filename into parts
//
// <spcnode>_<dbnode>_<relnode>_<forknum>_<seg>_<start LSN>_<end LSN>
//
// or if it was dropped:
//
// <spcnode>_<dbnode>_<relnode>_<forknum>_<seg>_<start LSN>_<end LSN>_DROPPED
//
let rel;
let mut parts;
if let Some(rest) = fname.strip_prefix("rel_") {
@@ -88,16 +96,14 @@ impl SnapshotFileName {
if suffix == "DROPPED" {
dropped = true;
} else {
warn!("unrecognized filename in timeline dir: {}", fname);
return None;
}
}
if parts.next().is_some() {
warn!("unrecognized filename in timeline dir: {}", fname);
return None;
}
Some(SnapshotFileName {
Some(DeltaFileName {
seg,
start_lsn,
end_lsn,
@@ -142,31 +148,158 @@ impl SnapshotFileName {
}
}
impl fmt::Display for SnapshotFileName {
impl fmt::Display for DeltaFileName {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "{}", self.to_string())
}
}
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
pub struct ImageFileName {
pub seg: SegmentTag,
pub lsn: Lsn,
}
/// Create SnapshotFileName structs representing all files on disk
///
// TODO: returning an Iterator would be more idiomatic
pub fn list_snapshot_files(
/// Represents the filename of an ImageLayer
///
/// <spcnode>_<dbnode>_<relnode>_<forknum>_<seg>_<LSN>
///
impl ImageFileName {
///
/// Parse a string as an image file name. Returns None if the filename does not
/// match the expected pattern.
///
fn from_str(fname: &str) -> Option<Self> {
let rel;
let mut parts;
if let Some(rest) = fname.strip_prefix("rel_") {
parts = rest.split('_');
rel = RelishTag::Relation(RelTag {
spcnode: parts.next()?.parse::<u32>().ok()?,
dbnode: parts.next()?.parse::<u32>().ok()?,
relnode: parts.next()?.parse::<u32>().ok()?,
forknum: parts.next()?.parse::<u8>().ok()?,
});
} else if let Some(rest) = fname.strip_prefix("pg_xact_") {
parts = rest.split('_');
rel = RelishTag::Slru {
slru: SlruKind::Clog,
segno: u32::from_str_radix(parts.next()?, 16).ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_multixact_members_") {
parts = rest.split('_');
rel = RelishTag::Slru {
slru: SlruKind::MultiXactMembers,
segno: u32::from_str_radix(parts.next()?, 16).ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_multixact_offsets_") {
parts = rest.split('_');
rel = RelishTag::Slru {
slru: SlruKind::MultiXactOffsets,
segno: u32::from_str_radix(parts.next()?, 16).ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_filenodemap_") {
parts = rest.split('_');
rel = RelishTag::FileNodeMap {
spcnode: parts.next()?.parse::<u32>().ok()?,
dbnode: parts.next()?.parse::<u32>().ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_twophase_") {
parts = rest.split('_');
rel = RelishTag::TwoPhase {
xid: parts.next()?.parse::<u32>().ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_control_checkpoint_") {
parts = rest.split('_');
rel = RelishTag::Checkpoint;
} else if let Some(rest) = fname.strip_prefix("pg_control_") {
parts = rest.split('_');
rel = RelishTag::ControlFile;
} else {
return None;
}
let segno = parts.next()?.parse::<u32>().ok()?;
let seg = SegmentTag { rel, segno };
let lsn = Lsn::from_hex(parts.next()?).ok()?;
if parts.next().is_some() {
return None;
}
Some(ImageFileName { seg, lsn })
}
fn to_string(&self) -> String {
let basename = match self.seg.rel {
RelishTag::Relation(reltag) => format!(
"rel_{}_{}_{}_{}",
reltag.spcnode, reltag.dbnode, reltag.relnode, reltag.forknum
),
RelishTag::Slru {
slru: SlruKind::Clog,
segno,
} => format!("pg_xact_{:04X}", segno),
RelishTag::Slru {
slru: SlruKind::MultiXactMembers,
segno,
} => format!("pg_multixact_members_{:04X}", segno),
RelishTag::Slru {
slru: SlruKind::MultiXactOffsets,
segno,
} => format!("pg_multixact_offsets_{:04X}", segno),
RelishTag::FileNodeMap { spcnode, dbnode } => {
format!("pg_filenodemap_{}_{}", spcnode, dbnode)
}
RelishTag::TwoPhase { xid } => format!("pg_twophase_{}", xid),
RelishTag::Checkpoint => format!("pg_control_checkpoint"),
RelishTag::ControlFile => format!("pg_control"),
};
format!(
"{}_{}_{:016X}",
basename,
self.seg.segno,
u64::from(self.lsn),
)
}
}
impl fmt::Display for ImageFileName {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "{}", self.to_string())
}
}
/// Scan the timeline directory and create ImageFileName and DeltaFileName
/// structs representing all files on disk
///
/// TODO: returning an Iterator would be more idiomatic
pub fn list_files(
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
) -> Result<Vec<SnapshotFileName>> {
) -> Result<(Vec<ImageFileName>, Vec<DeltaFileName>)> {
let path = conf.timeline_path(&timelineid, &tenantid);
let mut snapfiles: Vec<SnapshotFileName> = Vec::new();
let mut deltafiles: Vec<DeltaFileName> = Vec::new();
let mut imgfiles: Vec<ImageFileName> = Vec::new();
for direntry in fs::read_dir(path)? {
let fname = direntry?.file_name();
let fname = fname.to_str().unwrap();
if let Some(snapfilename) = SnapshotFileName::from_str(fname) {
snapfiles.push(snapfilename);
if let Some(deltafilename) = DeltaFileName::from_str(fname) {
deltafiles.push(deltafilename);
} else if let Some(imgfilename) = ImageFileName::from_str(fname) {
imgfiles.push(imgfilename);
} else if fname == "wal" || fname == "metadata" {
// ignore these
} else {
warn!("unrecognized filename in timeline dir: {}", fname);
}
}
return Ok(snapfiles);
return Ok((imgfiles, deltafiles));
}

--- layered_repository/image_layer.rs (new file) ---

@@ -0,0 +1,348 @@
//! An ImageLayer represents an image or a snapshot of a segment at one particular LSN.
//! It is stored in a file on disk.
//!
//! On disk, the image files are stored in timelines/<timelineid> directory.
//! Currently, there are no subdirectories, and each image file is named like this:
//!
//! <spcnode>_<dbnode>_<relnode>_<forknum>_<segno>_<LSN>
//!
//! For example:
//!
//! 1663_13990_2609_0_5_000000000169C348
//!
//! An image file is constructed using the 'bookfile' crate.
//!
//! When an image file needs to be accessed, we slurp the whole file into memory,
//! into the ImageLayerInner struct. See load() and unload() functions.
//! TODO: That's very inefficient, we should be smarter.
//!
use crate::layered_repository::filename::ImageFileName;
use crate::layered_repository::storage_layer::{Layer, PageReconstructData, SegmentTag};
use crate::layered_repository::LayeredTimeline;
use crate::layered_repository::RELISH_SEG_SIZE;
use crate::PageServerConf;
use crate::{ZTenantId, ZTimelineId};
use anyhow::{bail, Result};
use bytes::Bytes;
use log::*;
use std::fs;
use std::fs::File;
use std::io::Write;
use std::path::PathBuf;
use std::sync::{Mutex, MutexGuard};
use bookfile::{Book, BookWriter};
use zenith_utils::bin_ser::BeSer;
use zenith_utils::lsn::Lsn;
// Magic constant to identify a Zenith segment image file
static IMAGE_FILE_MAGIC: u32 = 0x5A616E01 + 1;
static BASE_IMAGES_CHAPTER: u64 = 1;
///
/// ImageLayer is the in-memory data structure associated with an on-disk image
/// file. We keep an ImageLayer in memory for each file, in the LayerMap. If a
/// layer is in "loaded" state, we have a copy of the file in memory, in 'inner'.
/// Otherwise the struct is just a placeholder for a file that exists on disk,
/// and it needs to be loaded before using it in queries.
///
pub struct ImageLayer {
conf: &'static PageServerConf,
pub tenantid: ZTenantId,
pub timelineid: ZTimelineId,
pub seg: SegmentTag,
// This entry contains an image of all pages as of this LSN
pub lsn: Lsn,
inner: Mutex<ImageLayerInner>,
}
pub struct ImageLayerInner {
/// If false, the 'page_versions' and 'relsizes' have not been
/// loaded into memory yet.
loaded: bool,
/// The data is held in this vector of Bytes buffers, with one
/// Bytes for each block. It's indexed by block number (counted from
/// the beginning of the segment)
base_images: Vec<Bytes>,
}
impl Layer for ImageLayer {
fn filename(&self) -> PathBuf {
PathBuf::from(
ImageFileName {
seg: self.seg,
lsn: self.lsn,
}
.to_string(),
)
}
fn get_timeline_id(&self) -> ZTimelineId {
return self.timelineid;
}
fn get_seg_tag(&self) -> SegmentTag {
return self.seg;
}
fn is_dropped(&self) -> bool {
return false;
}
fn get_start_lsn(&self) -> Lsn {
return self.lsn;
}
fn get_end_lsn(&self) -> Lsn {
return self.lsn;
}
/// Look up given page in the file
fn get_page_reconstruct_data(
&self,
blknum: u32,
lsn: Lsn,
reconstruct_data: &mut PageReconstructData,
) -> Result<Option<Lsn>> {
let need_base_image_lsn: Option<Lsn>;
assert!(lsn >= self.lsn);
{
let inner = self.load()?;
let base_blknum: usize = (blknum % RELISH_SEG_SIZE) as usize;
if let Some(img) = inner.base_images.get(base_blknum) {
reconstruct_data.page_img = Some(img.clone());
need_base_image_lsn = None;
} else {
bail!(
"no base img found for {} at blk {} at LSN {}",
self.seg,
base_blknum,
lsn
);
}
// release lock on 'inner'
}
Ok(need_base_image_lsn)
}
/// Get size of the segment
fn get_seg_size(&self, _lsn: Lsn) -> Result<u32> {
let inner = self.load()?;
let result = inner.base_images.len() as u32;
Ok(result)
}
/// Does this segment exist at given LSN?
fn get_seg_exists(&self, _lsn: Lsn) -> Result<bool> {
Ok(true)
}
///
/// Release most of the memory used by this layer. If it's accessed again later,
/// it will need to be loaded back.
///
fn unload(&self) -> Result<()> {
let mut inner = self.inner.lock().unwrap();
inner.base_images = Vec::new();
inner.loaded = false;
Ok(())
}
fn delete(&self) -> Result<()> {
// delete underlying file
fs::remove_file(self.path())?;
Ok(())
}
fn is_incremental(&self) -> bool {
false
}
}
impl ImageLayer {
fn path(&self) -> PathBuf {
Self::path_for(
self.conf,
self.timelineid,
self.tenantid,
&ImageFileName {
seg: self.seg,
lsn: self.lsn,
},
)
}
fn path_for(
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
fname: &ImageFileName,
) -> PathBuf {
conf.timeline_path(&timelineid, &tenantid)
.join(fname.to_string())
}
/// Create a new image file, using the given array of pages.
fn create(
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
seg: SegmentTag,
lsn: Lsn,
base_images: Vec<Bytes>,
) -> Result<ImageLayer> {
let layer = ImageLayer {
conf: conf,
timelineid: timelineid,
tenantid: tenantid,
seg: seg,
lsn: lsn,
inner: Mutex::new(ImageLayerInner {
loaded: true,
base_images: base_images,
}),
};
let inner = layer.inner.lock().unwrap();
// Write the images into a file
let path = layer.path();
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let file = File::create(&path)?;
let book = BookWriter::new(file, IMAGE_FILE_MAGIC)?;
let mut chapter = book.new_chapter(BASE_IMAGES_CHAPTER);
let buf = Vec::ser(&inner.base_images)?;
chapter.write_all(&buf)?;
let book = chapter.close()?;
book.close()?;
trace!("saved {}", &path.display());
drop(inner);
Ok(layer)
}
// Create a new image file by materializing every page in a source layer
// at given LSN.
pub fn create_from_src(
conf: &'static PageServerConf,
timeline: &LayeredTimeline,
src: &dyn Layer,
lsn: Lsn,
) -> Result<ImageLayer> {
let seg = src.get_seg_tag();
let timelineid = timeline.timelineid;
let startblk;
let size;
if seg.rel.is_blocky() {
size = src.get_seg_size(lsn)?;
startblk = seg.segno * RELISH_SEG_SIZE;
} else {
size = 1;
startblk = 0;
}
trace!(
"creating new ImageLayer for {} on timeline {} at {}",
seg,
timelineid,
lsn,
);
let mut base_images: Vec<Bytes> = Vec::new();
for blknum in startblk..(startblk + size) {
let img = timeline.materialize_page(seg, blknum, lsn, &*src)?;
base_images.push(img);
}
Self::create(conf, timelineid, timeline.tenantid, seg, lsn, base_images)
}
///
/// Load the contents of the file into memory
///
fn load(&self) -> Result<MutexGuard<ImageLayerInner>> {
// quick exit if already loaded
let mut inner = self.inner.lock().unwrap();
if inner.loaded {
return Ok(inner);
}
let path = Self::path_for(
self.conf,
self.timelineid,
self.tenantid,
&ImageFileName {
seg: self.seg,
lsn: self.lsn,
},
);
let file = File::open(&path)?;
let book = Book::new(file)?;
let chapter = book.read_chapter(BASE_IMAGES_CHAPTER)?;
let base_images = Vec::des(&chapter)?;
debug!("loaded from {}", &path.display());
*inner = ImageLayerInner {
loaded: true,
base_images,
};
Ok(inner)
}
/// Create an ImageLayer struct representing an existing file on disk
pub fn new(
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
filename: &ImageFileName,
) -> ImageLayer {
ImageLayer {
conf,
timelineid,
tenantid,
seg: filename.seg,
lsn: filename.lsn,
inner: Mutex::new(ImageLayerInner {
loaded: false,
base_images: Vec::new(),
}),
}
}
/// debugging function to print out the contents of the layer
#[allow(unused)]
pub fn dump(&self) -> String {
let mut result = format!("----- image layer for {} at {} ----\n", self.seg, self.lsn);
//let inner = self.inner.lock().unwrap();
//for (k, v) in inner.page_versions.iter() {
// result += &format!("blk {} at {}: {}/{}\n", k.0, k.1, v.page_image.is_some(), v.record.is_some());
//}
result
}
}

--- layered_repository/inmemory_layer.rs ---

@@ -2,11 +2,12 @@
//! An in-memory layer stores recently received page versions in memory. The page versions
//! are held in a BTreeMap, and there's another BTreeMap to track the size of the relation.
//!
use crate::layered_repository::filename::DeltaFileName;
use crate::layered_repository::storage_layer::{
Layer, PageReconstructData, PageVersion, SegmentTag, RELISH_SEG_SIZE,
};
use crate::layered_repository::LayeredTimeline;
use crate::layered_repository::SnapshotLayer;
use crate::layered_repository::{DeltaLayer, ImageLayer};
use crate::repository::WALRecord;
use crate::PageServerConf;
use crate::{ZTenantId, ZTimelineId};
@@ -15,6 +16,7 @@ use bytes::Bytes;
use log::*;
use std::collections::BTreeMap;
use std::ops::Bound::Included;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use zenith_utils::lsn::Lsn;
@@ -32,11 +34,15 @@ pub struct InMemoryLayer {
///
start_lsn: Lsn,
/// LSN of the oldest page version stored in this layer
oldest_pending_lsn: Lsn,
/// The above fields never change. The parts that do change are in 'inner',
/// and protected by mutex.
inner: Mutex<InMemoryLayerInner>,
/// Predecessor layer
img_layer: Option<Arc<dyn Layer>>,
}
pub struct InMemoryLayerInner {
@@ -56,21 +62,45 @@ pub struct InMemoryLayerInner {
}
impl InMemoryLayerInner {
fn get_seg_size(&self, seg: SegmentTag, lsn: Lsn) -> Result<u32> {
fn get_seg_size(&self, lsn: Lsn) -> u32 {
// Scan the BTreeMap backwards, starting from the given entry.
let mut iter = self.segsizes.range((Included(&Lsn(0)), Included(&lsn)));
if let Some((_entry_lsn, entry)) = iter.next_back() {
let result = *entry;
trace!("get_seg_size: {} at {} -> {}", seg, lsn, result);
Ok(result)
*entry
} else {
bail!("No size found for {} at {} in memory", seg, lsn);
0
}
}
}
impl Layer for InMemoryLayer {
// An in-memory layer doesn't really have a filename as it's not stored on disk,
// but we construct a filename as if it was a delta layer
fn filename(&self) -> PathBuf {
let inner = self.inner.lock().unwrap();
let end_lsn;
let dropped;
if let Some(drop_lsn) = inner.drop_lsn {
end_lsn = drop_lsn;
dropped = true;
} else {
end_lsn = Lsn(u64::MAX);
dropped = false;
}
let delta_filename = DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: end_lsn,
dropped: dropped,
}
.to_string();
PathBuf::from(format!("inmem-{}", delta_filename))
}
fn get_timeline_id(&self) -> ZTimelineId {
return self.timelineid;
}
@@ -137,7 +167,22 @@ impl Layer for InMemoryLayer {
}
}
// release lock on 'page_versions'
// Use the base image, if needed
if let Some(need_lsn) = need_base_image_lsn {
if let Some(img_layer) = &self.img_layer {
need_base_image_lsn =
img_layer.get_page_reconstruct_data(blknum, need_lsn, reconstruct_data)?;
} else {
bail!(
"no base img found for {} at blk {} at LSN {}",
self.seg,
blknum,
lsn
);
}
}
// release lock on 'inner'
}
Ok(need_base_image_lsn)
@@ -145,8 +190,10 @@ impl Layer for InMemoryLayer {
/// Get size of the relation at given LSN
fn get_seg_size(&self, lsn: Lsn) -> Result<u32> {
assert!(lsn >= self.start_lsn);
let inner = self.inner.lock().unwrap();
inner.get_seg_size(self.seg, lsn)
Ok(inner.get_seg_size(lsn))
}
/// Does this segment exist at given LSN?
@@ -163,6 +210,23 @@ impl Layer for InMemoryLayer {
// Otherwise, it exists
Ok(true)
}
/// Cannot unload anything in an in-memory layer, since there's no backing
/// store. To release memory used by an in-memory layer, use 'freeze' to turn
/// it into an on-disk layer.
fn unload(&self) -> Result<()> {
Ok(())
}
/// Nothing to do here. When you drop the last reference to the layer, it will
/// be deallocated.
fn delete(&self) -> Result<()> {
Ok(())
}
fn is_incremental(&self) -> bool {
self.img_layer.is_some()
}
}
impl InMemoryLayer {
@@ -201,6 +265,7 @@ impl InMemoryLayer {
page_versions: BTreeMap::new(),
segsizes: BTreeMap::new(),
}),
img_layer: None,
})
}
@@ -260,7 +325,7 @@ impl InMemoryLayer {
// use inner get_seg_size, since calling self.get_seg_size will try to acquire self.inner.lock
// which we've just acquired above
let oldsize = inner.get_seg_size(self.seg, lsn).unwrap_or(0);
let oldsize = inner.get_seg_size(lsn);
if newsize > oldsize {
trace!(
"enlarging segment {} from {} to {} blocks at {}",
@@ -305,58 +370,43 @@ impl InMemoryLayer {
/// Initialize a new InMemoryLayer, copying the state at the given
/// point in time from the given existing layer.
///
pub fn copy_snapshot(
pub fn create_successor_layer(
conf: &'static PageServerConf,
timeline: &LayeredTimeline,
src: &dyn Layer,
src: Arc<dyn Layer>,
timelineid: ZTimelineId,
tenantid: ZTenantId,
start_lsn: Lsn,
oldest_pending_lsn: Lsn,
) -> Result<InMemoryLayer> {
trace!(
"initializing new InMemoryLayer for writing {} on timeline {} at {}",
src.get_seg_tag(),
timelineid,
start_lsn
);
let mut page_versions = BTreeMap::new();
let mut segsizes = BTreeMap::new();
let seg = src.get_seg_tag();
let startblk;
let size;
if seg.rel.is_blocky() {
size = src.get_seg_size(start_lsn)?;
segsizes.insert(start_lsn, size);
startblk = seg.segno * RELISH_SEG_SIZE;
} else {
size = 1;
startblk = 0;
}
trace!(
"initializing new InMemoryLayer for writing {} on timeline {} at {}",
seg,
timelineid,
start_lsn,
);
for blknum in startblk..(startblk + size) {
let img = timeline.materialize_page(seg, blknum, start_lsn, src)?;
let pv = PageVersion {
page_image: Some(img),
record: None,
};
page_versions.insert((blknum, start_lsn), pv);
// For convenience, copy the segment size from the predecessor layer
let mut segsizes = BTreeMap::new();
if seg.rel.is_blocky() {
let size = src.get_seg_size(start_lsn)?;
segsizes.insert(start_lsn, size);
}
Ok(InMemoryLayer {
conf,
timelineid,
tenantid,
seg: src.get_seg_tag(),
seg,
start_lsn,
oldest_pending_lsn,
inner: Mutex::new(InMemoryLayerInner {
drop_lsn: None,
page_versions: page_versions,
page_versions: BTreeMap::new(),
segsizes,
}),
img_layer: Some(src),
})
}
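
A hedged sketch of the chaining this constructor sets up (type names
simplified; the real constructor also carries tenant and timeline IDs and
copies the segment size): a brand-new relish starts with no predecessor,
while a successor layer holds an Arc to the image layer it builds on, which
is also exactly what makes it incremental.

    use std::sync::Arc;

    trait Layer {
        fn is_incremental(&self) -> bool;
    }

    struct ImageLayer; // full snapshot of a segment at one LSN
    impl Layer for ImageLayer {
        fn is_incremental(&self) -> bool {
            false
        }
    }

    struct InMemoryLayer {
        start_lsn: u64,
        img_layer: Option<Arc<dyn Layer>>, // predecessor, if any
    }

    impl Layer for InMemoryLayer {
        // Incremental iff it builds on a predecessor image.
        fn is_incremental(&self) -> bool {
            self.img_layer.is_some()
        }
    }

    impl InMemoryLayer {
        /// Brand-new relish: starts from scratch, no predecessor.
        fn create(start_lsn: u64) -> Self {
            InMemoryLayer { start_lsn, img_layer: None }
        }

        /// Successor: records only new changes; the page images at
        /// `start_lsn` live in the predecessor image layer.
        fn create_successor_layer(src: Arc<dyn Layer>, start_lsn: u64) -> Self {
            InMemoryLayer { start_lsn, img_layer: Some(src) }
        }
    }

    fn main() {
        let scratch = InMemoryLayer::create(100);
        assert!(!scratch.is_incremental());

        let img: Arc<dyn Layer> = Arc::new(ImageLayer);
        let open = InMemoryLayer::create_successor_layer(img, 200);
        assert!(open.is_incremental());
        assert_eq!(open.start_lsn, 200);
    }
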
@@ -365,18 +415,21 @@ impl InMemoryLayer {
///
/// The cutoff point for the layer that's written to disk is 'end_lsn'.
///
/// Returns new layers that replace this one. Always returns a
/// SnapshotLayer containing the page versions that were written to disk,
/// but if there were page versions newer than 'end_lsn', also return a new
/// in-memory layer containing those page versions. The caller replaces
/// this layer with the returned layers in the layer map.
/// Returns new layers that replace this one. Always returns a new image
/// layer, written to disk, containing the page versions at the cutoff LSN,
/// and usually also a DeltaLayer that includes all the WAL records between
/// the start LSN and the cutoff. (The delta layer is not needed when a new
/// relish is created at a single LSN, in which case the start and end LSNs
/// are the same.) If there were page versions newer than 'end_lsn', also
/// returns a new in-memory layer containing those page versions. The caller
/// replaces this layer with the returned layers in the layer map.
///
pub fn freeze(
&self,
cutoff_lsn: Lsn,
// This is needed just to call materialize_page()
timeline: &LayeredTimeline,
) -> Result<(Arc<SnapshotLayer>, Option<Arc<InMemoryLayer>>)> {
) -> Result<(Vec<Arc<dyn Layer>>, Option<Arc<InMemoryLayer>>)> {
info!(
"freezing in memory layer for {} on timeline {} at {}",
self.seg, self.timelineid, cutoff_lsn
@@ -416,7 +469,10 @@ impl InMemoryLayer {
before_page_versions = BTreeMap::new();
after_page_versions = BTreeMap::new();
for ((blknum, lsn), pv) in inner.page_versions.iter() {
if *lsn > end_lsn {
if *lsn == end_lsn {
// Page versions at the cutoff LSN will be stored in the
// materialized image layer.
} else if *lsn > end_lsn {
after_page_versions.insert((*blknum, *lsn), pv.clone());
} else {
before_page_versions.insert((*blknum, *lsn), pv.clone());
@@ -432,35 +488,46 @@ impl InMemoryLayer {
// we can release the lock now.
drop(inner);
// Write the page versions before the cutoff to disk.
let snapfile = SnapshotLayer::create(
self.conf,
self.timelineid,
self.tenantid,
self.seg,
self.start_lsn,
end_lsn,
dropped,
before_page_versions,
before_segsizes,
)?;
let mut frozen_layers: Vec<Arc<dyn Layer>> = Vec::new();
trace!(
"freeze: created snapshot layer {} {}-{}",
snapfile.get_seg_tag(),
snapfile.get_start_lsn(),
snapfile.get_end_lsn()
);
// write a new base image layer at the cutoff point
let imgfile = ImageLayer::create_from_src(self.conf, timeline, self, end_lsn)?;
let imgfile_rc: Arc<dyn Layer> = Arc::new(imgfile);
frozen_layers.push(Arc::clone(&imgfile_rc));
trace!("freeze: created image layer {} at {}", self.seg, end_lsn);
if self.start_lsn != end_lsn {
// Write the page versions before the cutoff to disk.
let delta_layer = DeltaLayer::create(
self.conf,
self.timelineid,
self.tenantid,
self.seg,
self.start_lsn,
end_lsn,
dropped,
self.img_layer.clone(),
before_page_versions,
before_segsizes,
)?;
let delta_layer_rc: Arc<dyn Layer> = Arc::new(delta_layer);
frozen_layers.push(delta_layer_rc);
trace!(
"freeze: created delta layer {} {}-{}",
self.seg,
self.start_lsn,
end_lsn
);
} else {
assert!(before_page_versions.is_empty());
}
// If there were any "new" page versions, initialize a new in-memory layer to hold
// them
let new_open = if !after_segsizes.is_empty() || !after_page_versions.is_empty() {
trace!("freeze: created new in-mem layer {} {}-", self.seg, end_lsn);
let new_open = Self::copy_snapshot(
let new_open = Self::create_successor_layer(
self.conf,
timeline,
&snapfile,
imgfile_rc,
self.timelineid,
self.tenantid,
end_lsn,
@@ -470,15 +537,14 @@ impl InMemoryLayer {
new_inner.page_versions.append(&mut after_page_versions);
new_inner.segsizes.append(&mut after_segsizes);
drop(new_inner);
trace!("freeze: created new in-mem layer {} {}-", self.seg, end_lsn);
Some(Arc::new(new_open))
} else {
None
};
let new_historic = Arc::new(snapfile);
Ok((new_historic, new_open))
Ok((frozen_layers, new_open))
}
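
The heart of freeze() is the three-way split of the buffered page versions
at the cutoff. A simplified sketch, assuming the same BTreeMap layout as
above (the real code also splits segment sizes and handles dropped
segments):

    use std::collections::BTreeMap;

    type Lsn = u64;
    type BlockNum = u32;
    type PageVersion = Vec<u8>; // stand-in for the real struct

    fn split_at_cutoff(
        page_versions: &BTreeMap<(BlockNum, Lsn), PageVersion>,
        end_lsn: Lsn,
    ) -> (
        BTreeMap<(BlockNum, Lsn), PageVersion>, // before: goes into the DeltaLayer
        BTreeMap<(BlockNum, Lsn), PageVersion>, // after: seeds the new open layer
    ) {
        let mut before = BTreeMap::new();
        let mut after = BTreeMap::new();
        for ((blknum, lsn), pv) in page_versions {
            if *lsn == end_lsn {
                // Materialized into the new ImageLayer; not stored again.
            } else if *lsn > end_lsn {
                after.insert((*blknum, *lsn), pv.clone());
            } else {
                before.insert((*blknum, *lsn), pv.clone());
            }
        }
        (before, after)
    }
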
/// debugging function to print out the contents of the layer

View File

@@ -1,16 +1,16 @@
//!
//! The layer map tracks what layers exist for all the relations in a timeline.
//!
//! When the timeline is first accessed, the server lists all snapshot files
//! When the timeline is first accessed, the server lists all layer files
//! in the timelines/<timelineid> directory, and populates this map with
//! SnapshotLayers corresponding to each file. When new WAL is received,
//! we create InMemoryLayers to hold the incoming records. Now and then,
//! in the checkpoint() function, the in-memory layers are frozen, forming
//! new snapshot layers and corresponding files are written to disk.
//! ImageLayer and DeltaLayer structs corresponding to each file. When new WAL
//! is received, we create InMemoryLayers to hold the incoming records. Now and
//! then, in the checkpoint() function, the in-memory layers are frozen, forming
//! new image and delta layers and corresponding files are written to disk.
//!
use crate::layered_repository::storage_layer::{Layer, SegmentTag};
use crate::layered_repository::{InMemoryLayer, SnapshotLayer};
use crate::layered_repository::InMemoryLayer;
use crate::relish::*;
use anyhow::Result;
use log::*;
@@ -43,7 +43,7 @@ pub struct LayerMap {
/// BTreeMap keyed by the layer's start LSN.
struct SegEntry {
pub open: Option<Arc<InMemoryLayer>>,
pub historic: BTreeMap<Lsn, Arc<SnapshotLayer>>,
pub historic: BTreeMap<Lsn, Arc<dyn Layer>>,
}
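
A sketch of how a lookup resolves against a SegEntry (simplified; the real
map must also trace branch history through ancestor timelines): the open
in-memory layer is preferred if it covers the requested LSN, otherwise the
newest historic layer that starts at or before it.

    use std::collections::BTreeMap;
    use std::sync::Arc;

    type Lsn = u64;

    trait Layer {
        fn get_start_lsn(&self) -> Lsn;
    }

    struct SegEntry {
        open: Option<Arc<dyn Layer>>,
        historic: BTreeMap<Lsn, Arc<dyn Layer>>,
    }

    impl SegEntry {
        fn get(&self, lsn: Lsn) -> Option<Arc<dyn Layer>> {
            if let Some(open) = &self.open {
                if open.get_start_lsn() <= lsn {
                    return Some(Arc::clone(open));
                }
            }
            // Newest historic layer starting at or before `lsn`.
            self.historic
                .range(..=lsn)
                .next_back()
                .map(|(_, layer)| Arc::clone(layer))
        }
    }
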
/// Entry held in LayerMap.open_segs, with boilerplate comparison
@@ -156,7 +156,7 @@ impl LayerMap {
///
/// Insert an on-disk layer
///
pub fn insert_historic(&mut self, layer: Arc<SnapshotLayer>) {
pub fn insert_historic(&mut self, layer: Arc<dyn Layer>) {
let tag = layer.get_seg_tag();
let start_lsn = layer.get_start_lsn();
@@ -179,7 +179,7 @@ impl LayerMap {
///
/// This should be called when the corresponding file on disk has been deleted.
///
pub fn remove_historic(&mut self, layer: &SnapshotLayer) {
pub fn remove_historic(&mut self, layer: &dyn Layer) {
let tag = layer.get_seg_tag();
let start_lsn = layer.get_start_lsn();
@@ -221,17 +221,21 @@ impl LayerMap {
Ok(rels)
}
/// Is there a newer layer for given segment?
pub fn newer_layer_exists(&self, seg: SegmentTag, lsn: Lsn) -> bool {
/// Is there a newer image layer for given segment?
///
/// This is used for garbage collection, to determine if an old layer can
/// be deleted. We ignore in-memory layers because they are not durable
/// on disk, and delta layers because they depend on an older layer.
pub fn newer_image_layer_exists(&self, seg: SegmentTag, lsn: Lsn) -> bool {
if let Some(segentry) = self.segs.get(&seg) {
if let Some(_open) = &segentry.open {
return true;
}
for (newer_lsn, layer) in segentry
.historic
.range((Included(lsn), Included(Lsn(u64::MAX))))
{
// Ignore delta layers.
if layer.is_incremental() {
continue;
}
if layer.get_end_lsn() > lsn {
trace!(
"found later layer for {}, {} {}-{}",
@@ -279,11 +283,11 @@ impl Default for LayerMap {
pub struct HistoricLayerIter<'a> {
segiter: std::collections::hash_map::Iter<'a, SegmentTag, SegEntry>,
iter: Option<std::collections::btree_map::Iter<'a, Lsn, Arc<SnapshotLayer>>>,
iter: Option<std::collections::btree_map::Iter<'a, Lsn, Arc<dyn Layer>>>,
}
impl<'a> Iterator for HistoricLayerIter<'a> {
type Item = Arc<SnapshotLayer>;
type Item = Arc<dyn Layer>;
fn next(&mut self) -> std::option::Option<<Self as std::iter::Iterator>::Item> {
loop {

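A hedged sketch of that check with simplified types: scan the historic
layers at or above the given LSN and accept only non-incremental ones,
since a delta layer cannot by itself replace the layer being considered
for deletion.

    use std::collections::BTreeMap;
    use std::ops::Bound::Included;
    use std::sync::Arc;

    type Lsn = u64;

    trait Layer {
        fn get_end_lsn(&self) -> Lsn;
        fn is_incremental(&self) -> bool;
    }

    fn newer_image_layer_exists(historic: &BTreeMap<Lsn, Arc<dyn Layer>>, lsn: Lsn) -> bool {
        for (_start_lsn, layer) in historic.range((Included(lsn), Included(Lsn::MAX))) {
            if layer.is_incremental() {
                continue; // delta layers don't count; they depend on an older image
            }
            if layer.get_end_lsn() > lsn {
                return true;
            }
        }
        false
    }
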
View File

@@ -9,6 +9,7 @@ use anyhow::Result;
use bytes::Bytes;
use serde::{Deserialize, Serialize};
use std::fmt;
use std::path::PathBuf;
use zenith_utils::lsn::Lsn;
@@ -102,6 +103,11 @@ pub trait Layer: Send + Sync {
fn get_end_lsn(&self) -> Lsn;
fn is_dropped(&self) -> bool;
/// Filename used to store this layer on disk. (Even in-memory layers
/// implement this, to print a handy unique identifier for the layer in
/// log messages, even though they are never stored on disk.)
fn filename(&self) -> PathBuf;
///
/// Return data needed to reconstruct given page at LSN.
///
@@ -121,8 +127,22 @@ pub trait Layer: Send + Sync {
reconstruct_data: &mut PageReconstructData,
) -> Result<Option<Lsn>>;
// Functions that correspond to the Timeline trait functions.
/// Return size of the segment at given LSN. (Only for blocky relations.)
fn get_seg_size(&self, lsn: Lsn) -> Result<u32>;
/// Does the segment exist at the given LSN, or was it dropped before it?
fn get_seg_exists(&self, lsn: Lsn) -> Result<bool>;
/// Does this layer only contain some data for the segment (incremental),
/// or does it contain a version of every page? This is important to know
/// for garbage collecting old layers: an incremental layer depends on
/// the previous non-incremental layer.
fn is_incremental(&self) -> bool;
/// Release memory used by this layer. There is no corresponding 'load'
/// function; loading is done implicitly when you call one of the get-functions.
fn unload(&self) -> Result<()>;
/// Permanently remove this layer from disk.
fn delete(&self) -> Result<()>;
}
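
One plausible shape for unload() on an on-disk layer, sketched with the
usual Mutex<Option<...>> lazy-load pattern; this is an illustration under
assumed names, not this commit's implementation:

    use std::sync::Mutex;

    struct LayerInner {
        contents: Vec<u8>, // deserialized contents of the layer file
    }

    struct OnDiskLayer {
        path: std::path::PathBuf,
        inner: Mutex<Option<LayerInner>>,
    }

    impl OnDiskLayer {
        /// Called implicitly by the get-functions: load on first use.
        fn load(&self) -> std::io::Result<()> {
            let mut inner = self.inner.lock().unwrap();
            if inner.is_none() {
                let contents = std::fs::read(&self.path)?;
                *inner = Some(LayerInner { contents });
            }
            Ok(())
        }

        /// Drop the in-memory copy; the file on disk stays authoritative.
        fn unload(&self) {
            *self.inner.lock().unwrap() = None;
        }
    }
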

View File

@@ -604,43 +604,38 @@ impl postgres_backend::Handler for PageServerHandler {
RowDescriptor::int8_col(b"elapsed"),
]))?
.write_message_noflush(&BeMessage::DataRow(&[
Some(&result.snapshot_relfiles_total.to_string().as_bytes()),
Some(&result.ondisk_relfiles_total.to_string().as_bytes()),
Some(
&result
.snapshot_relfiles_needed_by_cutoff
.ondisk_relfiles_needed_by_cutoff
.to_string()
.as_bytes(),
),
Some(
&result
.snapshot_relfiles_needed_by_branches
.ondisk_relfiles_needed_by_branches
.to_string()
.as_bytes(),
),
Some(&result.snapshot_relfiles_not_updated.to_string().as_bytes()),
Some(&result.snapshot_relfiles_removed.to_string().as_bytes()),
Some(&result.snapshot_relfiles_dropped.to_string().as_bytes()),
Some(&result.snapshot_nonrelfiles_total.to_string().as_bytes()),
Some(&result.ondisk_relfiles_not_updated.to_string().as_bytes()),
Some(&result.ondisk_relfiles_removed.to_string().as_bytes()),
Some(&result.ondisk_relfiles_dropped.to_string().as_bytes()),
Some(&result.ondisk_nonrelfiles_total.to_string().as_bytes()),
Some(
&result
.snapshot_nonrelfiles_needed_by_cutoff
.ondisk_nonrelfiles_needed_by_cutoff
.to_string()
.as_bytes(),
),
Some(
&result
.snapshot_nonrelfiles_needed_by_branches
.ondisk_nonrelfiles_needed_by_branches
.to_string()
.as_bytes(),
),
Some(
&result
.snapshot_nonrelfiles_not_updated
.to_string()
.as_bytes(),
),
Some(&result.snapshot_nonrelfiles_removed.to_string().as_bytes()),
Some(&result.snapshot_nonrelfiles_dropped.to_string().as_bytes()),
Some(&result.ondisk_nonrelfiles_not_updated.to_string().as_bytes()),
Some(&result.ondisk_nonrelfiles_removed.to_string().as_bytes()),
Some(&result.ondisk_nonrelfiles_dropped.to_string().as_bytes()),
Some(&result.elapsed.as_millis().to_string().as_bytes()),
]))?
.write_message(&BeMessage::CommandComplete(b"SELECT 1"))?;

View File

@@ -55,39 +55,38 @@ pub trait Repository: Send + Sync {
///
#[derive(Default)]
pub struct GcResult {
pub snapshot_relfiles_total: u64,
pub snapshot_relfiles_needed_by_cutoff: u64,
pub snapshot_relfiles_needed_by_branches: u64,
pub snapshot_relfiles_not_updated: u64,
pub snapshot_relfiles_removed: u64, // # of snapshot files removed because they have been made obsolete by newer snapshot files.
pub snapshot_relfiles_dropped: u64, // # of snapshot files removed because the relation was dropped
pub ondisk_relfiles_total: u64,
pub ondisk_relfiles_needed_by_cutoff: u64,
pub ondisk_relfiles_needed_by_branches: u64,
pub ondisk_relfiles_not_updated: u64,
pub ondisk_relfiles_removed: u64, // # of layer files removed because they have been made obsolete by newer ondisk files.
pub ondisk_relfiles_dropped: u64, // # of layer files removed because the relation was dropped
pub snapshot_nonrelfiles_total: u64,
pub snapshot_nonrelfiles_needed_by_cutoff: u64,
pub snapshot_nonrelfiles_needed_by_branches: u64,
pub snapshot_nonrelfiles_not_updated: u64,
pub snapshot_nonrelfiles_removed: u64, // # of snapshot files removed because they have been made obsolete by newer snapshot files.
pub snapshot_nonrelfiles_dropped: u64, // # of snapshot files removed because the relation was dropped
pub ondisk_nonrelfiles_total: u64,
pub ondisk_nonrelfiles_needed_by_cutoff: u64,
pub ondisk_nonrelfiles_needed_by_branches: u64,
pub ondisk_nonrelfiles_not_updated: u64,
pub ondisk_nonrelfiles_removed: u64, // # of layer files removed because they have been made obsolete by newer ondisk files.
pub ondisk_nonrelfiles_dropped: u64, // # of layer files removed because the relation was dropped
pub elapsed: Duration,
}
impl AddAssign for GcResult {
fn add_assign(&mut self, other: Self) {
self.snapshot_relfiles_total += other.snapshot_relfiles_total;
self.snapshot_relfiles_needed_by_cutoff += other.snapshot_relfiles_needed_by_cutoff;
self.snapshot_relfiles_needed_by_branches += other.snapshot_relfiles_needed_by_branches;
self.snapshot_relfiles_not_updated += other.snapshot_relfiles_not_updated;
self.snapshot_relfiles_removed += other.snapshot_relfiles_removed;
self.snapshot_relfiles_dropped += other.snapshot_relfiles_dropped;
self.ondisk_relfiles_total += other.ondisk_relfiles_total;
self.ondisk_relfiles_needed_by_cutoff += other.ondisk_relfiles_needed_by_cutoff;
self.ondisk_relfiles_needed_by_branches += other.ondisk_relfiles_needed_by_branches;
self.ondisk_relfiles_not_updated += other.ondisk_relfiles_not_updated;
self.ondisk_relfiles_removed += other.ondisk_relfiles_removed;
self.ondisk_relfiles_dropped += other.ondisk_relfiles_dropped;
self.snapshot_nonrelfiles_total += other.snapshot_nonrelfiles_total;
self.snapshot_nonrelfiles_needed_by_cutoff += other.snapshot_nonrelfiles_needed_by_cutoff;
self.snapshot_nonrelfiles_needed_by_branches +=
other.snapshot_nonrelfiles_needed_by_branches;
self.snapshot_nonrelfiles_not_updated += other.snapshot_nonrelfiles_not_updated;
self.snapshot_nonrelfiles_removed += other.snapshot_nonrelfiles_removed;
self.snapshot_nonrelfiles_dropped += other.snapshot_nonrelfiles_dropped;
self.ondisk_nonrelfiles_total += other.ondisk_nonrelfiles_total;
self.ondisk_nonrelfiles_needed_by_cutoff += other.ondisk_nonrelfiles_needed_by_cutoff;
self.ondisk_nonrelfiles_needed_by_branches += other.ondisk_nonrelfiles_needed_by_branches;
self.ondisk_nonrelfiles_not_updated += other.ondisk_nonrelfiles_not_updated;
self.ondisk_nonrelfiles_removed += other.ondisk_nonrelfiles_removed;
self.ondisk_nonrelfiles_dropped += other.ondisk_nonrelfiles_dropped;
self.elapsed += other.elapsed;
}
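
A usage sketch for the accumulator (assuming the GcResult struct and
AddAssign impl above): garbage collection runs per timeline, and the
per-timeline results are summed into one report.

    fn accumulate_gc_results(results: Vec<GcResult>) -> GcResult {
        let mut total = GcResult::default();
        for r in results {
            total += r; // uses the AddAssign impl above
        }
        total
    }
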