diff --git a/pageserver/src/layered_repository/README.md b/pageserver/src/layered_repository/README.md
new file mode 100644
index 0000000000..404c925eb1
--- /dev/null
+++ b/pageserver/src/layered_repository/README.md
@@ -0,0 +1,267 @@
+# Overview
+
+The on-disk format is based on immutable files. The page server
+receives a stream of incoming WAL, parses the WAL records to determine
+which pages they apply to, and accumulates the incoming changes in
+memory. Every now and then, the accumulated changes are written out to
+new files.
+
+The files are called "snapshot files". Each snapshot file corresponds
+to one PostgreSQL relation fork. The snapshot files for each timeline
+are stored in the timeline's subdirectory under .zenith/timelines.
+
+The files are named like this:
+
+    <spcnode>_<dbnode>_<relnode>_<forknum>_<start LSN>_<end LSN>
+
+For example:
+
+    1663_13990_2609_0_000000000169C348_0000000001702000
+
+Each snapshot file contains a full snapshot, that is, a full copy of all
+pages in the relation, as of the "start LSN". It also contains all WAL
+records applicable to the relation between the start and end
+LSNs. With this information, the page server can reconstruct any page
+version of the relation in the LSN range.
+
+If a relation has been dropped, the last snapshot file for it is created
+with the _DROPPED suffix, e.g.
+
+    1663_13990_2609_0_000000000169C348_0000000001702000_DROPPED
+
+## Notation used in this document
+
+The full path of a snapshot file looks like this:
+
+    .zenith/timelines/4af489b06af8eed9e27a841775616962/1663_13990_2609_0_000000000169C348_0000000001702000
+
+For simplicity, the examples below use a simplified notation for the
+paths. The timeline ID is replaced with the human-readable branch
+name, and spcnode+dbnode+relnode+forknum with a human-readable table
+name. The LSNs are also shorter.
+For example, a snapshot file for the 'orders' table on the 'main' branch,
+with LSN range 100-200, would be:
+
+    main/orders_100_200
+
+
+# Creating snapshot files
+
+Let's start with a simple example with a system that contains one
+branch called 'main' and two tables, 'orders' and 'customers'. The end
+of WAL is currently at LSN 250. In this starting situation, you would
+have two files on disk:
+
+    main/orders_100_200
+    main/customers_100_200
+
+In addition to those files, the recent changes between LSN 200 and the
+end of WAL at 250 are kept in memory. If the page server crashes, the
+latest records between 200-250 need to be re-read from the WAL.
+
+Whenever enough WAL has been accumulated in memory, the page server
+writes out the changes in memory into new snapshot files. This process
+is called "checkpointing" (not to be confused with PostgreSQL
+checkpoints, which are a different thing). The page server only creates
+snapshot files for relations that have been modified since the last
+checkpoint. For example, if the current end of WAL is at LSN 450, and
+the last checkpoint happened at LSN 400 but there haven't been any
+recent changes to the 'customers' table, you would have these files on
+disk:
+
+    main/orders_100_200
+    main/orders_200_300
+    main/orders_300_400
+    main/customers_100_200
+
+If the 'customers' table is modified later, a new file is created for it
+at the next checkpoint. The new file will cover the "gap" from the
+last snapshot file, so the LSN ranges are always contiguous:
+
+    main/orders_100_200
+    main/orders_200_300
+    main/orders_300_400
+    main/customers_100_200
+    main/customers_200_500
+
+## Reading page versions
+
+Whenever a GetPage@LSN request comes in from the compute node, the
+page server needs to reconstruct the requested page, as it was at the
+requested LSN.
+To do that, the page server first checks the recent
+in-memory layer; if the requested page version is found there, it can
+be returned immediately without looking at the files on
+disk. Otherwise, the page server needs to locate the snapshot file that
+contains the requested page version.
+
+For example, if a request comes in for table 'orders' at LSN 250, the
+page server would load the 'main/orders_200_300' file into memory, and
+reconstruct and return the requested page from it, as it was at
+LSN 250. Because the snapshot file consists of a full image of the
+relation at the start LSN and the WAL, reconstructing the page
+involves replaying any WAL records applicable to the page between LSNs
+200-250, starting from the base image at LSN 200.
+
+
+# Multiple branches
+
+Imagine that a child branch is created at LSN 250:
+
+            @250
+    ----main--+-------------------------->
+               \
+                +---child-------------->
+
+
+Then, the 'orders' table is updated differently on the 'main' and
+'child' branches. You now have this situation on disk:
+
+    main/orders_100_200
+    main/orders_200_300
+    main/orders_300_400
+    main/customers_100_200
+    child/orders_250_300
+    child/orders_300_400
+
+Because the 'customers' table hasn't been modified on the child
+branch, there is no file for it there. If you request a page for it on
+the 'child' branch, the page server will not find any snapshot file
+for it in the 'child' directory, so it will recurse to look into the
+parent 'main' branch instead.
+
+From the 'child' branch's point of view, the history for each relation
+is linear, and the request's LSN identifies unambiguously which file
+you need to look at.
+For example, the history for the 'orders' table
+on the 'main' branch consists of these files:
+
+    main/orders_100_200
+    main/orders_200_300
+    main/orders_300_400
+
+And from the 'child' branch's point of view, it consists of these
+files:
+
+    main/orders_100_200
+    main/orders_200_300
+    child/orders_250_300
+    child/orders_300_400
+
+The branch metadata includes the point where the child branch was
+created, LSN 250. If a page request comes with LSN 275, we read the
+page version from the 'child/orders_250_300' file. If the request LSN
+is 225, we read it from the 'main/orders_200_300' file instead. The
+page versions between 250-300 in the 'main/orders_200_300' file are
+ignored when operating on the child branch.
+
+Note: It doesn't make any difference if the child branch is created
+when the end of the main branch was at LSN 250, or later when the tip of
+the main branch had already moved on. The latter case, creating a
+branch at a historic LSN, is how we support PITR in Zenith.
+
+
+# Garbage collection
+
+In this scheme, we keep creating new snapshot files over time. We also
+need a mechanism to remove old files that are no longer needed,
+because disk space isn't infinite.
+
+What files are still needed? Currently, the page server supports PITR
+and branching from any branch at any LSN that is "recent enough" from
+the tip of the branch. "Recent enough" is defined as an LSN horizon,
+which by default is 64 MB (see DEFAULT_GC_HORIZON). For this
+example, let's assume that the LSN horizon is 150 units.
+
+Let's look at the single branch scenario again. Imagine that the end
+of the branch is LSN 525, so that the GC horizon is currently at
+525-150 = 375:
+
+    main/orders_100_200
+    main/orders_200_300
+    main/orders_300_400
+    main/orders_400_500
+    main/customers_100_200
+
+We can remove files 'main/orders_100_200' and 'main/orders_200_300',
+because the end LSNs of those files are older than GC horizon 375, and
+there are more recent snapshot files for the table.
+'main/orders_300_400' and 'main/orders_400_500' are still within the
+horizon, so they must be retained. 'main/customers_100_200' is old
+enough, but it cannot be removed because there is no newer snapshot
+file for the table.
+
+Things get slightly more complicated with multiple branches. All of
+the above still holds, but in addition to recent files we must also
+retain older snapshot files that are still needed by child branches.
+For example, if a child branch is created at LSN 150, and the 'customers'
+table is updated on the branch, you would have these files:
+
+    main/orders_100_200
+    main/orders_200_300
+    main/orders_300_400
+    main/orders_400_500
+    main/customers_100_200
+    child/customers_150_300
+
+In this situation, the 'main/orders_100_200' file cannot be removed,
+even though it is older than the GC horizon, because it is still
+needed by the child branch. 'main/orders_200_300' can still be
+removed. So after garbage collection, these files would remain:
+
+    main/orders_100_200
+
+    main/orders_300_400
+    main/orders_400_500
+    main/customers_100_200
+    child/customers_150_300
+
+If 'orders' is modified later on the 'child' branch, we will create a
+snapshot file for it on the child:
+
+    main/orders_100_200
+
+    main/orders_300_400
+    main/orders_400_500
+    main/customers_100_200
+    child/customers_150_300
+    child/orders_150_400
+
+After this, the 'main/orders_100_200' file can be removed. It is no
+longer needed by the child branch, because there is a newer snapshot
+file there. TODO: This optimization hasn't been implemented! The GC
+algorithm will currently keep the file on the 'main' branch anyway, for
+as long as the child branch exists.
+
+
+# On LSN ranges
+
+In principle, each relation can be checkpointed separately, i.e. the
+LSN ranges of the files don't need to line up.
+So this would be legal:
+
+    main/orders_100_200
+    main/orders_200_300
+    main/orders_300_400
+    main/customers_150_250
+    main/customers_250_500
+
+However, the code currently always checkpoints all relations together.
+So that situation doesn't arise in practice.
+
+It would also be OK to have overlapping LSN ranges for the same relation:
+
+    main/orders_100_200
+    main/orders_200_300
+    main/orders_250_350
+    main/orders_300_400
+
+The code that reads the snapshot files should cope with this, but this
+situation doesn't arise either, because the checkpointing code never
+does that. It could be useful, however, as a transient state when
+garbage collecting around branch points, or explicit recovery
+points. For example, if we start with this:
+
+    main/orders_100_200
+    main/orders_200_300
+    main/orders_300_400
+
+And there is a branch or explicit recovery point at LSN 150, we could
+replace 'main/orders_100_200' with 'main/orders_150_150' to keep a
+snapshot only at that exact point that's still needed, removing the
+other page versions around it.
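
The lookup rules described in this README (pick the file by LSN, tolerate overlapping ranges, fall back to the parent branch at the branch point) can be illustrated with a small sketch. This is not the page server's code: it is a Python toy over the simplified `branch/table_start_end` notation, and `SnapshotFile`, `parse_name`, `find_file` and the `parents` mapping are hypothetical names introduced only for illustration. It assumes that the newest file starting at or before the requested LSN is the one to read, that the in-memory layer is out of scope, and that the request lies within retained (not garbage-collected) history.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SnapshotFile:
    """One snapshot file in the simplified notation (hypothetical type)."""
    branch: str
    table: str
    start_lsn: int
    end_lsn: int


def parse_name(path: str) -> SnapshotFile:
    """Parse 'branch/table_start_end', e.g. 'main/orders_100_200'."""
    branch, name = path.split("/")
    table, start, end = name.rsplit("_", 2)
    return SnapshotFile(branch, table, int(start), int(end))


def find_file(
    files: List[SnapshotFile],
    parents: Dict[str, Tuple[str, int]],
    branch: str,
    table: str,
    lsn: int,
) -> Optional[SnapshotFile]:
    """Return the file to read for (branch, table, lsn), or None.

    `parents` maps a child branch to (parent branch, branch-point LSN).
    """
    while True:
        candidates = [
            f for f in files
            if f.branch == branch and f.table == table and f.start_lsn <= lsn
        ]
        if candidates:
            # Assumption: with contiguous or overlapping ranges, the file
            # with the latest start needs the least WAL replay. If lsn is
            # past its end, the relation is assumed unchanged since then.
            return max(candidates, key=lambda f: f.start_lsn)
        if branch not in parents:
            return None
        # Nothing usable on this branch: continue in the parent, but never
        # look past the point where the child branched off.
        branch, branch_lsn = parents[branch]
        lsn = min(lsn, branch_lsn)
```

With the files from the "Multiple branches" example and `parents = {"child": ("main", 250)}`, a request for 'orders' on 'child' at LSN 275 resolves to 'child/orders_250_300', at LSN 225 to 'main/orders_200_300', and a request for 'customers' on 'child' falls through to 'main/customers_100_200'.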