Rebased version of #5234, part of #6768

This consists of three parts:

1. A refactoring and a new contract for implementing and testing compaction.

   The logic is now in a separate crate, with no dependency on the 'pageserver' crate. It defines an interface that the real pageserver must implement in order to call the compaction algorithm. The interface models things like delta and image layers, but only the parts that the compaction algorithm needs to make decisions. That makes it easier to unit test the algorithm and to experiment with different implementations.

   I did not convert the current code to the new abstraction, however. When the compaction algorithm is set to "Legacy", we just use the old code. It might be worthwhile to convert the old code to the new abstraction, so that we can compare the behavior of the new algorithm against the old one using the same simulated cases. If we do that, we have to be careful that the converted code really is equivalent to the old one.

   This includes only trivial changes to the main pageserver code. All the new code is behind a tenant config option, so this should be pretty safe to merge, even if the new implementation is buggy, as long as we don't enable it.

2. A new compaction algorithm, implemented using the new abstraction.

   The new algorithm is tiered compaction. It is inspired by the PoC at PR #4539, although I did not use that code directly, as I needed the new implementation to fit the new abstraction. The algorithm here is less advanced; I did not implement partial image layers, for example. I kept it simple on purpose, so that as we add bells and whistles, we can see their effects using the included simulator.

   One difference from #4539 and typical LSM tree implementations is how we keep track of the LSM tree levels. This PR has no permanent concept of a level, tier, or sorted run at all; there are just delta and image layers. However, when compaction starts, we look at the layers that exist and arrange them into levels, depending on their shapes. That arrangement is ephemeral: when compaction finishes, we forget it. This allows the new algorithm to work without any extra bookkeeping, which makes it easier to transition from the old algorithm to the new one, and back again.

   There is just a new tenant config option to choose the compaction algorithm. The default is "Legacy", meaning the current algorithm in 'main'. If you set it to "Tiered", the new algorithm is used.

3. A simulator, which implements the new abstraction.

   The simulator can be used to analyze write and storage amplification without running a test with the full pageserver. It can also draw an SVG animation of the simulation, to visualize how layers are created and deleted. To run the simulator:

       cargo run --bin compaction-simulator run-suite

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
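Part 1 above does not spell out the interface itself. As a rough illustration only, the abstraction could take a shape like the sketch below; the trait and method names here are hypothetical, not the crate's actual API:

```rust
use std::ops::Range;

/// A key/LSN rectangle, as seen by the compaction algorithm.
/// (Hypothetical sketch; names are illustrative only.)
pub trait CompactionLayer<K> {
    fn key_range(&self) -> Range<K>;
    fn lsn_range(&self) -> Range<u64>;
    /// Approximate size on disk, used to decide when a layer is "full".
    fn file_size(&self) -> u64;
    /// True for delta layers, false for image layers.
    fn is_delta(&self) -> bool;
}

/// What the real pageserver (or the simulator) must provide so that the
/// algorithm can inspect existing layers and emit new ones.
pub trait CompactionExecutor {
    type Key: Ord + Copy;
    type Layer: CompactionLayer<Self::Key>;

    /// All layers overlapping the given key/LSN region.
    fn layers_in(&self, keys: Range<Self::Key>, lsns: Range<u64>) -> Vec<Self::Layer>;

    /// Write out a new delta layer covering the given region, built from the inputs.
    fn create_delta_layer(
        &mut self,
        keys: Range<Self::Key>,
        lsns: Range<u64>,
        inputs: &[Self::Layer],
    );

    /// Delete a layer that has been fully replaced by new layers.
    fn delete_layer(&mut self, layer: &Self::Layer);
}
```

The point is that the algorithm only sees key ranges, LSN ranges, and sizes; how layers are actually stored and written is left to whoever implements the abstraction, whether that is the real pageserver or the simulator.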
TODO
- If the key space can be perfectly partitioned at some key, perform planning on each partition separately. For example, if we are compacting a level with layers like this:

                    :
      +--+  +----+  :  +------+
      |  |  |    |  :  |      |
      +--+  +----+  :  +------+
                    :
      +-----+  +-+  :  +--------+
      |     |  | |  :  |        |
      +-----+  +-+  :  +--------+
                    :

  At the dotted line, there is a natural split in the key space, such that all layers are either on the left or the right of it. We can compact the partitions separately. We could choose to create image layers for one partition but not the other one, for example.
- The layers don't all have to be exactly the same size; we can choose to cut a layer short, or stretch it a little larger than the target size, if it helps the overall system. We can make perfect partitions (see the previous bullet point) happen more frequently by choosing the cut points wisely. For example, try to cut layers at the boundaries of underlying image layers. And "snap to grid", i.e. don't cut layers at arbitrary keys, but e.g. only where key % 10000 = 0 (see the sketch after this list).
- Avoid rewriting layers when we'd just create an identical layer to an input layer.
- Parallelism. The code is already split into planning and execution: we first split the compaction work into "Jobs", and then execute them. It would be straightforward to execute multiple jobs in parallel.
- Materialize extra pages in delta layers during compaction. This would reduce read amplification. There has been the idea of partial image layers; materializing extra pages in the delta layers achieves the same goal, without introducing a new concept.
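As a concrete illustration of the "snap to grid" idea in the cut-point bullet above, here is a minimal sketch. The u64 key type, the grid size, and the "don't shrink by more than half" heuristic are assumptions for illustration, not what the planner actually does:

```rust
// Assumed grid spacing for illustration.
const KEY_GRID: u64 = 10_000;

/// Pick the actual cut key, given the key where the target layer size was
/// reached (`candidate`) and the key where the layer started.
fn snap_cut_key(candidate: u64, layer_start: u64) -> u64 {
    // Round the candidate down to the nearest grid boundary.
    let snapped = candidate - candidate % KEY_GRID;
    // Only snap if the boundary is inside the layer and snapping removes
    // less than half of the layer's key range.
    if snapped > layer_start && (candidate - snapped) < (candidate - layer_start) / 2 {
        snapped
    } else {
        candidate
    }
}

fn main() {
    // Snapping removes only a small slice of a large layer: take the grid key.
    assert_eq!(snap_cut_key(123_456, 0), 120_000);
    // Snapping would remove more than half of this short layer: keep the original key.
    assert_eq!(snap_cut_key(123_456, 118_000), 123_456);
}
```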
Simulator
- Expand the simulator for more workloads
- Automate a test suite that runs the simulator with different workloads and spits out a table of results
- Model read amplification
- More sanity checking. One idea is to keep a reference count of each MockRecord, i.e. use Arc instead of plain MockRecord, and panic if a MockRecord that is newer than the PITR horizon is completely dropped. That would indicate that the record was lost (a sketch of this follows below).
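A possible shape for the MockRecord reference-counting check in the last bullet, sketched under the assumption of a u64 LSN and a single global horizon; the simulator's real types may well differ:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Global (for the simulation) PITR horizon, advanced as simulated time passes.
static PITR_HORIZON: AtomicU64 = AtomicU64::new(0);

struct MockRecord {
    lsn: u64,
    // ... payload fields elided ...
}

impl Drop for MockRecord {
    fn drop(&mut self) {
        // Drop only runs once the last Arc clone is gone, i.e. no layer holds
        // the record anymore. If it is still within the PITR window, it was lost.
        if self.lsn > PITR_HORIZON.load(Ordering::Relaxed) {
            panic!(
                "record at LSN {} dropped while still newer than the PITR horizon",
                self.lsn
            );
        }
    }
}

fn main() {
    PITR_HORIZON.store(100, Ordering::Relaxed);

    let old = Arc::new(MockRecord { lsn: 50 });
    drop(old); // fine: the record is behind the horizon

    let new = Arc::new(MockRecord { lsn: 200 });
    let still_referenced = Arc::clone(&new);
    drop(new); // fine: another clone still holds the record

    // Once the horizon has moved past the record, dropping the last reference is fine too.
    PITR_HORIZON.store(300, Ordering::Relaxed);
    drop(still_referenced);
}
```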