Whenever we start processing a request, we now enter a tracing "span"
that includes context information like the tenant and timeline ID, and
the operation we're performing. That context information gets attached
to every log message we create within the span. That way, we don't need
to include basic context information like that in every log message, and
it also becomes easier to filter the logs programmatically.
This removes the eplicit timeline and tenant IDs from most log messages,
as you get that information from the enclosing span now.
Also improve log messages in general, dialing down the level of some
messages that are not very useful, and adding information to others.
We now obey the RUST_LOG env variable, if it's set.
The 'tracing' crate allows for different log formatters, like JSON or
bunyan output. The one we use now is human-readable multi-line format,
which is nice when reading the log directly, but hard for
post-processing. For production, we'll probably want JSON output and
some tools for working with it, but that's left as a TODO. The log
format is easy to change.
The caller is now responsible for lookin up the predecessor layer,
instead. This makes the code simpler, as you don't need to update the
predecessor reference when a layer is frozen or written to disk.
There was a bug in that, as Konstantin noted on discord:
Assume that freeze doesn't create new inmem layer
(maybe_new_open=None). Then we temporary place in historics frozen
layer. Assume that now new put_wal_record request arrives. There is
no open in-mem layer, so it has to create new one. It is looking for
previous layer for read and set it as new in-mem layer
predecessor. But as far as I understand, prev layer should be our
temporary frozen layer. Which will be then removed from
historics.
That leaves the predecessor field of the new in-memory layer pointing
at the frozen in-memory layer that has been removed from the layer map,
preventing it from being removed from memory.
This makes two subtle changes:
1. When the first new layer is created on a branch for a segment that
existed on the ancestor branch, the start_lsn of the new layer is now
the branch point + 1. We were previously slightly confused on what
the branch point LSN meant. It means that all the WAL up to and
*including* the LSN on the old branch is visible to the new branch.
If we mark the start LSN of the new layer as equal to the branch point,
that's wrong, because if there is a WAL record with that LSN on the
predecessor layer, the new layer would hide it. This bug was hidden
when the layer on the new branch contained a direct reference to the
layer in the old branch, as get_page_reconstruct_data() followed that
reference directly when it didn't find the page version in the new
layer. But now that the caller performs the lookup, it will look up
the new layer that doesn't contain the record, and you get an error.
2. InMemoryLayer now always stores the segment size at the beginning
of the layer's LSN range. Previously, get_seg_size() might have
recursed into the predecessor layer to get the size, but now we
avoid that by always copying over the last size from the previous
layer, when a new layer is created.
Commit ca9af37478 removed the import_timeline_wal() call from here.
After that, the info!() message is bogus, as we no longer load the WAL
from local disk. Also, the logical size assertion is pointless now.
The metadata file is now always 512 bytes. The last 4 bytes are a
crc32c checksum of the previous 508 bytes. Padding zeroes are added
between the serde serialization and the start of the checksum.
A single write call is used, and the file is fsyncd after.
On file creation, the parent directory is fsyncd as well.
Similar to what commit 7fb7f67b did to 'freeze', dealing with the
dropped segment separately from the rest of the logic makes the code
easier to follow. It is also needed by the next commit that replaces
the code to build new BTreeMap with an iterator; we cannot pass one
of two kinds of closures as argument, it has to always be the same one.
Having separate DeltaLayer::create() calls for the case of dropped
segment and the other cases works around that.
Reduces the CPU time spent in the write() syscalls. I noticed that we were
spending a lot of CPU time in libc::write, coming from request_redo(), in
the 'bulk_insert' test. According to some quick profiling with 'perf',
this reduces the CPU time spent in request_redo() from about 30% to 15%.
For some reason, it doesn't reduce the overall runtime of the 'bulk_insert'
test much, maybe by one second if you squint (from about 37s to 36s), so
there must be some other bottleneck, like I/O. But this is surely still
a good idea, just based on the reduced CPU cycles.
Commit message copied below:
* Allow LeSer/BeSer impls missing Serialize/Deserialize
Currently, using `LeSer` or `BeSer` requires that the type implements
both `Serialize` and `DeserializeOwned`, even if we're only using the
trait for one of those functionalities.
Moving the bounds to the methods gives the convenience of the traits
without requiring unnecessary derives.
* Remove unused #[derive(Serialize/Deserialize)]
This should hopefully reduce compile times - if only by a little bit.
Some of these were already unused (we weren't using LeSer/BeSer for the
types), but most are have *become* unused with the change to
LeSer/BeSer.
* Allow LeSer/BeSer impls missing Serialize/Deserialize
Currently, using `LeSer` or `BeSer` requires that the type implements
both `Serialize` and `DeserializeOwned`, even if we're only using the
trait for one of those functionalities.
Moving the bounds to the methods gives the convenience of the traits
without requiring unnecessary derives.
* Remove unused #[derive(Serialize/Deserialize)]
This should hopefully reduce compile times - if only by a little bit.
Some of these were already unused (we weren't using LeSer/BeSer for the
types), but most are have *become* unused with the change to
LeSer/BeSer.
This introduces a new tree data structure for holding intervals, and
queries of the form "which intervals contain the given point?". It then
uses that to store the Layers in the layer map, instead of the BTreeMap.
While we don't currently create overlapping layers in the page server,
that situation might arise in the future if we start to create extra
layers for performance purposes, or as part of some multi-stage
garbage collection operation that creates new layers in some interval
and then removes old ones. The situation might also arise if you have
multiple page servers running on the same timeline, freezing layers at
different points, and both uploading them to S3.
So even though overlapping layers might not happen currently, let's
avoid getting confused if it does happen for some reason.
Fixes https://github.com/zenithdb/zenith/issues/517.
After this, a layer's start bound is always defined to be inclusive, and
end bound exclusive.
For example, if you have a layer in the range 100-200, that layer can be
used for GetPage@LSN requests at LSN 100, 199, or anything in between.
But for LSN 200, you need to look at the next layer (if one exists).
This is one part of a fix for https://github.com/zenithdb/zenith/issues/517.
After this, the page server shouldn't create layers for the same segment
with the same LSN, which avoids the issue. However, the same thing would
still happen, if you managed to create layers with same start LSN again.
That could happen e.g. if you had two page servers running, or in some
weird crash/restart scenario, or due to bugs or features added later. The
next commit makes the layer map more robust, so that it tolerates that
situation without deleting wrong files.
- Turn dropped layers into non-writeable in get_layer_for_write().
- Handle non-writeable dropped layers in checkpointer. They don't need freezing, so just remove them from list of open_segs and write out to disk.
- Remove code that handles dropped layers in freeze() function. It is not used anymore.
Some dropped layers serve as tombstones for earlier layers and thus cannot be garbage collected.
Add new fields to GcResult for layers that are preserved as tombstones