Add a README file to explain how the snapshot files are managed.

This explains how and when snapshot files are created, how they're
used to find the correct page version, and how garbage collection
works. I tried to resist the temptation to write how it *should* work,
and purely document how it currently works in this branch.
Heikki Linnakangas
2021-07-16 14:44:41 +03:00
parent 8d0086f749
commit f8e533bbdf

@@ -0,0 +1,267 @@
# Overview
The on-disk format is based on immutable files. The page server
receives a stream of incoming WAL, parses the WAL records to determine
which pages they apply to, and accumulates the incoming changes in
memory. Every now and then, the accumulated changes are written out to
new files.

The files are called "snapshot files". Each snapshot file corresponds
to one PostgreSQL relation fork. The snapshot files for each timeline
are stored in the timeline's subdirectory under .zenith/timelines.
The files are named like this:

    <spcnode>_<dbnode>_<relnode>_<forknum>_<start LSN>_<end LSN>

For example:

    1663_13990_2609_0_000000000169C348_0000000001702000

Each snapshot file contains a full snapshot, that is, a full copy of all
pages in the relation, as of the "start LSN". It also contains all WAL
records applicable to the relation between the start and end
LSNs. With this information, the page server can reconstruct any page
version of the relation in the LSN range.

If a relation has been dropped, the last snapshot file for it is created
with the _DROPPED suffix, e.g.

    1663_13990_2609_0_000000000169C348_0000000001702000_DROPPED

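
As an illustration of the naming scheme, the following Python sketch splits a snapshot file name into its fields. It assumes, based only on the examples above, that the first four fields are decimal numbers and that the LSNs are written in hexadecimal. `SnapshotFileName` and `parse_snapshot_filename` are hypothetical names used for illustration, not part of the page server code.

```python
from typing import NamedTuple


class SnapshotFileName(NamedTuple):
    spcnode: int
    dbnode: int
    relnode: int
    forknum: int
    start_lsn: int
    end_lsn: int
    dropped: bool


def parse_snapshot_filename(name: str) -> SnapshotFileName:
    """Split a snapshot file name into its components."""
    parts = name.split("_")
    # A trailing _DROPPED marks the last file of a dropped relation.
    dropped = parts[-1] == "DROPPED"
    if dropped:
        parts = parts[:-1]
    if len(parts) != 6:
        raise ValueError(f"not a snapshot file name: {name!r}")
    # Assumption: the relation identifiers are decimal, as in the example.
    spcnode, dbnode, relnode, forknum = (int(p) for p in parts[:4])
    # Assumption: the LSNs are hexadecimal, as in the example.
    start_lsn, end_lsn = (int(p, 16) for p in parts[4:])
    return SnapshotFileName(
        spcnode, dbnode, relnode, forknum, start_lsn, end_lsn, dropped
    )
```
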
## Notation used in this document
The full path of a snapshot file looks like this:

    .zenith/timelines/4af489b06af8eed9e27a841775616962/1663_13990_2609_0_000000000169C348_0000000001702000

For simplicity, the examples below use a simplified notation for the
paths. The timeline ID is replaced with the human-readable branch
name, and spcnode+dbnode+relnode+forknum with a human-readable table
name. The LSNs are also shorter. For example, a snapshot file for
the 'orders' table on the 'main' branch, with LSN range 100-200, would be:

    main/orders_100_200

# Creating snapshot files
Let's start with a simple example with a system that contains one
branch called 'main' and two tables, 'orders' and 'customers'. The end
of WAL is currently at LSN 250. In this starting situation, you would
have two files on disk:

    main/orders_100_200
    main/customers_100_200

In addition to those files, the recent changes between LSN 200 and the
end of WAL at 250 are kept in memory. If the page server crashes, the
latest records between 200-250 need to be re-read from the WAL.

Whenever enough WAL has been accumulated in memory, the page server
writes out the changes in memory into new snapshot files. This process
is called "checkpointing" (not to be confused with PostgreSQL
checkpoints, which are a different thing). The page server only creates
snapshot files for relations that have been modified since the last
checkpoint. For example, if the current end of WAL is at LSN 450, and
the last checkpoint happened at LSN 400 but there haven't been any
recent changes to the 'customers' table, you would have these files on
disk:

    main/orders_100_200
    main/orders_200_300
    main/orders_300_400
    main/customers_100_200

If the customers table is modified later, a new file is created for it
at the next checkpoint. The new file will cover the "gap" from the
last snapshot file, so the LSN ranges are always contiguous:

    main/orders_100_200
    main/orders_200_300
    main/orders_300_400
    main/customers_100_200
    main/customers_200_500

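
The rule that a new file covers the gap since the relation's previous file can be sketched as follows. This is a simplified model in the document's notation, not the page server's code: files are `(relation, start_lsn, end_lsn)` tuples, and `checkpoint` is a hypothetical helper. How the start LSN is chosen for a relation that has no earlier file is not covered here, so the sketch rejects that case.

```python
def checkpoint(files, modified, checkpoint_lsn):
    """Return the list of files on disk after a checkpoint at checkpoint_lsn.

    files:    list of (relation, start_lsn, end_lsn) tuples already on disk
    modified: set of relation names changed since the last checkpoint
    """
    new_files = list(files)
    for rel in sorted(modified):
        ends = [end for (r, _start, end) in files if r == rel]
        if not ends:
            # Not described in this document; left out of the sketch.
            raise ValueError(f"no earlier snapshot file for {rel!r}")
        # Start where the newest existing file for this relation ended, so
        # the LSN ranges stay contiguous even if earlier checkpoints
        # skipped the relation.
        new_files.append((rel, max(ends), checkpoint_lsn))
    return new_files
```

With the four files from the example and only 'customers' modified, a checkpoint at LSN 500 adds `("customers", 200, 500)`, which corresponds to 'main/customers_200_500'.
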
## Reading page versions
Whenever a GetPage@LSN request comes in from the compute node, the
page server needs to reconstruct the requested page, as it was at the
requested LSN. To do that, the page server first checks the recent
in-memory layer; if the requested page version is found there, it can
be returned immediately without looking at the files on
disk. Otherwise the page server needs to locate the snapshot file that
contains the requested page version.

For example, if a request comes in for table 'orders' at LSN 250, the
page server would load the 'main/orders_200_300' file into memory, and
reconstruct and return the requested page from it, as it was at
LSN 250. Because the snapshot file consists of a full image of the
relation at the start LSN and the WAL, reconstructing the page
involves replaying any WAL records applicable to the page between LSNs
200-250, starting from the base image at LSN 200.
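
The reconstruction step can be sketched with a toy in-memory model of a snapshot file. Everything here is an assumption made for illustration: `SnapshotFile` and `get_page_at_lsn` are hypothetical names, each WAL record is modelled as a plain function that transforms a page image (real WAL redo is far more involved), and a record whose LSN equals the requested LSN is treated as already applied.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotFile:
    start_lsn: int
    end_lsn: int
    # Block number -> page image as of start_lsn.
    base_pages: dict
    # (lsn, block number, redo function) tuples, sorted by lsn.
    wal: list = field(default_factory=list)


def get_page_at_lsn(snap, blkno, lsn):
    """Rebuild one page as of `lsn` from the base image plus WAL."""
    if lsn < snap.start_lsn:
        raise ValueError("requested LSN is before the start of this file")
    # Start from the full image taken at the file's start LSN.
    page = snap.base_pages[blkno]
    for rec_lsn, rec_blkno, redo in snap.wal:
        if rec_lsn > lsn:
            # Records are sorted, so nothing later can apply.
            break
        if rec_blkno == blkno:
            page = redo(page)
    return page
```
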
# Multiple branches
Imagine that a child branch is created at LSN 250:

                @250
    ----main--+-------------------------->
               \
                +---child-------------->

Then, the 'orders' table is updated differently on the 'main' and
'child' branches. You now have this situation on disk:

    main/orders_100_200
    main/orders_200_300
    main/orders_300_400
    main/customers_100_200
    child/orders_250_300
    child/orders_300_400

Because the 'customers' table hasn't been modified on the child
branch, there is no file for it there. If you request a page for it on
the 'child' branch, the page server will not find any snapshot file
for it in the 'child' directory, so it will recurse to look into the
parent 'main' branch instead.

From the 'child' branch's point of view, the history for each relation
is linear, and the request's LSN identifies unambiguously which file
you need to look at. For example, the history for the 'orders' table
on the 'main' branch consists of these files:

    main/orders_100_200
    main/orders_200_300
    main/orders_300_400

And from the 'child' branch's point of view, it consists of these
files:

    main/orders_100_200
    main/orders_200_300
    child/orders_250_300
    child/orders_300_400

The branch metadata includes the point where the child branch was
created, LSN 250. If a page request comes with LSN 275, we read the
page version from the 'child/orders_250_300' file. If the request LSN
is 225, we read it from the 'main/orders_200_300' file instead. The
page versions between 250-300 in the 'main/orders_200_300' file are
ignored when operating on the child branch.
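
The lookup across branches can be sketched like this. The data layout and the name `find_snapshot_file` are hypothetical. The sketch also assumes one selection rule that the text implies but does not spell out: within a branch, the newest file whose start LSN is at or before the requested LSN is used, and when the search moves to the parent, the LSN is capped at the branch point.

```python
def find_snapshot_file(branches, files, branch, rel, lsn):
    """Return (path, effective_lsn) for the file to read, or None.

    branches: dict of branch name -> (parent name or None, branch point LSN)
    files:    dict of branch name -> list of (relation, start_lsn, end_lsn)
    """
    while branch is not None:
        candidates = [
            (start, end)
            for (r, start, end) in files.get(branch, [])
            if r == rel and start <= lsn
        ]
        if candidates:
            # Assumption: the newest file starting at or before the LSN wins.
            start, end = max(candidates)
            return f"{branch}/{rel}_{start}_{end}", lsn
        parent, branch_lsn = branches[branch]
        if parent is not None:
            # Page versions on the parent after the branch point are ignored.
            lsn = min(lsn, branch_lsn)
        branch = parent
    return None
```

With the files from the example, a request on 'child' for 'orders' at LSN 275 resolves to 'child/orders_250_300', a request at LSN 225 resolves to 'main/orders_200_300', and any request for 'customers' on 'child' falls through to 'main/customers_100_200'.
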

Note: It doesn't make any difference if the child branch is created
when the end of the main branch was at LSN 250, or later when the tip of
the main branch had already moved on. The latter case, creating a
branch at a historic LSN, is how we support PITR in Zenith.
# Garbage collection
In this scheme, we keep creating new snapshot files over time. We also
need a mechanism to remove old files that are no longer needed,
because disk space isn't infinite.

What files are still needed? Currently, the page server supports PITR
and branching from any branch at any LSN that is "recent enough" from
the tip of the branch. "Recent enough" is defined as an LSN horizon,
which by default is 64 MB (see DEFAULT_GC_HORIZON). For this
example, let's assume that the LSN horizon is 150 units.

Let's look at the single branch scenario again. Imagine that the end
of the branch is LSN 525, so that the GC horizon is currently at
525-150 = 375:

    main/orders_100_200
    main/orders_200_300
    main/orders_300_400
    main/orders_400_500
    main/customers_100_200

We can remove files 'main/orders_100_200' and 'main/orders_200_300',
because the end LSNs of those files are older than GC horizon 375, and
there are more recent snapshot files for the table. 'main/orders_300_400'
and 'main/orders_400_500' are still within the horizon, so they must be
retained. 'main/customers_100_200' is old enough, but it cannot be
removed because there is no newer snapshot file for the table.
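
The single-branch rule can be written down as a small sketch. Files are modelled as `(relation, start_lsn, end_lsn)` tuples, and `removable_files` is a hypothetical helper, not the actual GC code. Whether a file whose end LSN is exactly equal to the horizon counts as "older" is not stated here, so the strict comparison below is an assumption.

```python
def removable_files(files, branch_end_lsn, gc_horizon):
    """Return the files that garbage collection may remove on one branch."""
    cutoff = branch_end_lsn - gc_horizon
    removable = []
    for rel, start, end in files:
        # Assumption: "older than the horizon" means strictly below it.
        older_than_cutoff = end < cutoff
        # A file is kept if it is the newest one for its relation.
        has_newer = any(r == rel and s >= end for (r, s, _e) in files)
        if older_than_cutoff and has_newer:
            removable.append((rel, start, end))
    return removable
```

For the five files above, with the end of the branch at LSN 525 and a horizon of 150, this returns the entries for 'main/orders_100_200' and 'main/orders_200_300'.
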

Things get slightly more complicated with multiple branches. All of
the above still holds, but in addition to recent files we must also
retain older snapshot files that are still needed by child branches.
For example, if a child branch is created at LSN 150, and the 'customers'
table is updated on the branch, you would have these files:

    main/orders_100_200
    main/orders_200_300
    main/orders_300_400
    main/orders_400_500
    main/customers_100_200
    child/customers_150_300

In this situation, the 'main/orders_100_200' file cannot be removed,
even though it is older than the GC horizon, because it is still
needed by the child branch. 'main/orders_200_300' can still be
removed. So after garbage collection, these files would remain:

    main/orders_100_200
    main/orders_300_400
    main/orders_400_500
    main/customers_100_200
    child/customers_150_300

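
Extending the rule to branch points can be sketched as follows for the files of a single parent branch. This is an illustrative model, not the actual GC code: `removable_files_with_branches` is a hypothetical helper, files are `(relation, start_lsn, end_lsn)` tuples, and a file is treated as needed by a child if the child's branch point falls inside the file's LSN range, regardless of which files exist on the child. The example does not restate the end of the branch, so the usage note assumes it is still 525.

```python
def removable_files_with_branches(files, branch_end_lsn, gc_horizon,
                                  child_branch_lsns):
    """Return the parent-branch files that garbage collection may remove.

    child_branch_lsns: LSNs at which child branches were created.
    """
    cutoff = branch_end_lsn - gc_horizon
    removable = []
    for rel, start, end in files:
        older_than_cutoff = end < cutoff
        has_newer = any(r == rel and s >= end for (r, s, _e) in files)
        # Assumption: a file covering a branch point is always retained,
        # for as long as that child branch exists.
        needed_by_child = any(start <= lsn < end for lsn in child_branch_lsns)
        if older_than_cutoff and has_newer and not needed_by_child:
            removable.append((rel, start, end))
    return removable
```

For the 'main' files above, with the end of the branch at 525, a horizon of 150, and one child created at LSN 150, only the entry for 'main/orders_200_300' is returned.
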
If 'orders' is modified later on the 'child' branch, we will create a
snapshot file for it on the child:

    main/orders_100_200
    main/orders_300_400
    main/orders_400_500
    main/customers_100_200
    child/customers_150_300
    child/orders_150_400

After this, the 'main/orders_100_200' file could in principle be
removed. It is no longer needed by the child branch, because there is
a newer snapshot file there. TODO: This optimization hasn't been
implemented! The GC algorithm will currently keep the file on the
'main' branch anyway, for as long as the child branch exists.
# On LSN ranges
In principle, each relation can be checkpointed separately, i.e. the
LSN ranges of the files don't need to line up. So this would be legal:

    main/orders_100_200
    main/orders_200_300
    main/orders_300_400
    main/customers_150_250
    main/customers_250_500

However, the code currently always checkpoints all relations together.
So that situation doesn't arise in practice.

It would also be OK to have overlapping LSN ranges for the same relation:

    main/orders_100_200
    main/orders_200_300
    main/orders_250_350
    main/orders_300_400

The code that reads the snapshot files should cope with this, but this
situation doesn't arise either, because the checkpointing code never
does that. It could be useful, however, as a transient state when
garbage collecting around branch points, or explicit recovery
points. For example, if we start with this:

    main/orders_100_200
    main/orders_200_300
    main/orders_300_400

And there is a branch or explicit recovery point at LSN 150, we could
replace 'main/orders_100_200' with 'main/orders_150_150' to keep a
snapshot only at that exact point that's still needed, removing the
other page versions around it.