Commit Graph

499 Commits

Author SHA1 Message Date
Patrick Insinger
b26b471250 Use binary format for PageVersion
Still lots of work to do here. Also a lot of copying going on that
should be avoided.
2021-11-04 15:14:34 -07:00
Patrick Insinger
445288d7c1 Change chunked_buffer API 2021-11-04 15:09:32 -07:00
Heikki Linnakangas
556422f9bb Use a chunked byte buffer to hold page versions in in-memory layer.
Change the way the page versions are stored in an in-memory layer.
Instead of storing the original PageVersion structs, serialize them
into an expandable buffer. That has a couple of advantages: first, the
serialized form is more compact, which greatly reduces the memory
usage. Secondly, we can track the memory used more accurately.

This moves the serialization work from the checkpointer to the WAL
receiver. It also means that if a page in an in-memory layer is
accessed, it needs to be deserialized on the GetPage call. That adds
a bit of latency, but it's not significant compared to the overhead of
replaying WAL records. The simpler memory accounting is worth it.

Per Konstantin's idea a long time ago.
2021-11-04 15:09:32 -07:00
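
The chunked byte buffer described in 556422f9bb might look roughly like the sketch below. This is only an illustration under stated assumptions: the type name `ChunkedBuffer`, its methods, and the fixed chunk size are hypothetical and are not taken from the zenith source. It shows why memory accounting gets simpler: usage is the number of chunks times the chunk size.

```rust
/// Hypothetical sketch of an append-only buffer made of fixed-size chunks.
/// Names and layout are assumptions, not the actual zenith API.
pub struct ChunkedBuffer {
    chunk_size: usize,
    chunks: Vec<Vec<u8>>,
    len: usize,
}

impl ChunkedBuffer {
    pub fn new(chunk_size: usize) -> Self {
        assert!(chunk_size > 0);
        ChunkedBuffer { chunk_size, chunks: Vec::new(), len: 0 }
    }

    /// Append serialized bytes; returns the (offset, length) where they landed.
    pub fn write(&mut self, mut data: &[u8]) -> (usize, usize) {
        let start = self.len;
        let total = data.len();
        while !data.is_empty() {
            let used = self.len % self.chunk_size;
            if used == 0 {
                // Either there is no chunk yet or the last one is full.
                self.chunks.push(Vec::with_capacity(self.chunk_size));
            }
            let n = (self.chunk_size - used).min(data.len());
            self.chunks.last_mut().unwrap().extend_from_slice(&data[..n]);
            self.len += n;
            data = &data[n..];
        }
        (start, total)
    }

    /// Copy `len` bytes starting at `offset` back out, e.g. to deserialize them.
    pub fn read(&self, offset: usize, len: usize) -> Vec<u8> {
        assert!(offset + len <= self.len);
        let mut out = Vec::with_capacity(len);
        let mut pos = offset;
        let end = offset + len;
        while pos < end {
            let chunk = &self.chunks[pos / self.chunk_size];
            let start = pos % self.chunk_size;
            let n = (chunk.len() - start).min(end - pos);
            out.extend_from_slice(&chunk[start..start + n]);
            pos += n;
        }
        out
    }

    /// Approximate memory held: whole chunks, whether full or not.
    pub fn memory_used(&self) -> usize {
        self.chunks.len() * self.chunk_size
    }
}
```

A caller would keep the returned (offset, length) pair in its index and call `read` on the GetPage path before deserializing.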
Heikki Linnakangas
6d742719a1 Fix infinite loop in looking up predecessor layer
Commit 960c7d69a8 changed the LSN returned in the Continue case in
InMemoryLayer::get_page_reconstruct_data(), but neglected to make the
same change in DeltaLayer.

Also add an escape hatch to the loop in materialize_page() to avoid
getting stuck in an infinite loop, if a bug like this reoccurs.
2021-11-04 16:07:12 +02:00
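
An escape hatch of the kind mentioned in 6d742719a1 could be sketched as follows. The enum, the hop limit, and the progress check are all assumptions made for illustration; the actual materialize_page() code may bound the loop differently.

```rust
/// Simplified, hypothetical outcome of asking one layer for page data.
pub enum Lookup {
    /// Enough data was collected to reconstruct the page.
    Complete,
    /// Keep looking in the predecessor layer at this LSN.
    Continue(u64),
}

/// Assumed upper bound on predecessor hops; not a value from the source.
pub const MAX_LAYER_HOPS: usize = 1000;

/// Walk predecessor layers, returning the number of layers visited.
/// Errors out instead of spinning forever if a layer fails to make progress.
pub fn walk_layers<F>(start_lsn: u64, mut lookup: F) -> Result<usize, String>
where
    F: FnMut(u64) -> Lookup,
{
    let mut lsn = start_lsn;
    for hops in 1..=MAX_LAYER_HOPS {
        match lookup(lsn) {
            Lookup::Complete => return Ok(hops),
            Lookup::Continue(next) => {
                // Assumption for this sketch: each hop must move to an older LSN.
                if next >= lsn {
                    return Err(format!("layer lookup made no progress at LSN {}", lsn));
                }
                lsn = next;
            }
        }
    }
    Err(format!("gave up after {} layer hops", MAX_LAYER_HOPS))
}
```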
Dmitry Rodionov
987833e0b9 Propagate git SHA to zenith binaries
Git commit sha is displayed when --version flag is used and is written
to logs during service startup. Uses git_version crate when git is
available, and the GIT_VERSION environment variable otherwise, which is the
case for docker builds.
2021-11-04 14:22:29 +03:00
Kirill Bulatov
f36acf00de Reduce "relish" word usages in remote storage 2021-11-04 12:53:42 +02:00
Kirill Bulatov
956fc3dec9 Tidy up and make consistent the remote storage API 2021-11-04 12:53:42 +02:00
Heikki Linnakangas
b38e841f2d Use poll() in communication with WAL redo process.
The tokio futures added some overhead, so switch to plain non-blocking
I/O with poll(). In a simple pgbench test on my laptop (select-only
queries, scale-factor 1 `pgbench -P1 -T50 -S`), this gives about 10%
improvement, from about 4300 TPS to 4800 TPS.
2021-11-04 10:39:04 +02:00
Heikki Linnakangas
3a0111c75e Refactor functions for constructing WAL redo messages.
Instead of building a separate Vec<u8> to hold each message, serialize all
the messages to one big Vec<u8>. This eliminates some Vec allocation and
memcpy() overhead. The downside is that if there are a lot of records to
replay, we have to serialize them all into one big chunk of memory.
That shouldn't be a problem in practice. If you need to replay millions
of records to reconstruct a page, we should've materialized a new image
of that page earlier already.
2021-11-04 10:39:00 +02:00
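
The idea in 3a0111c75e, appending every message to one shared Vec<u8> rather than allocating one per message, can be sketched as below. The framing (a one-byte tag plus a four-byte big-endian length) and the tag letters are assumptions for illustration only, not the real WAL redo protocol.

```rust
/// Append one message to the shared buffer.
/// Hypothetical framing: 1-byte tag, 4-byte big-endian payload length, payload.
fn put_message(buf: &mut Vec<u8>, tag: u8, payload: &[u8]) {
    buf.push(tag);
    buf.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    buf.extend_from_slice(payload);
}

/// Serialize a whole redo request into a single allocation.
pub fn build_request(base_image: &[u8], records: &[&[u8]]) -> Vec<u8> {
    // Each message carries 5 bytes of framing; size the buffer once up front.
    let total = 5
        + base_image.len()
        + records.iter().map(|r| 5 + r.len()).sum::<usize>()
        + 5;
    let mut buf = Vec::with_capacity(total);
    put_message(&mut buf, b'P', base_image); // assumed tag: push base page image
    for rec in records {
        put_message(&mut buf, b'A', rec); // assumed tag: apply one WAL record
    }
    put_message(&mut buf, b'G', &[]); // assumed tag: get the resulting page
    buf
}
```

The trade-off named in the commit message is visible here: the buffer grows with the total size of all records in the request.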
Heikki Linnakangas
4ba783d0af Remove a couple of unused functions.
We might want to have custom serialize/deserialize functions for
WALRecords and PageVersions for performance reasons, see github issue 832.
But that would probably look a bit different from this, and currently
these functions are just dead.
2021-11-03 19:10:23 +02:00
Patrick Insinger
0457fe81a9 pageserver - make PageVersion an enum 2021-11-03 09:28:49 -07:00
Heikki Linnakangas
fb524dd973 Put a global limit on memory used by in-memory layers.
Adds simple global tracking of memory used by the in-memory layers. It's
very approximate, it doesn't take into account allocator, memory
fragmentation or many other things, but it's a good first step.

After storing a WAL record in the repository, the WAL receiver checks
the global memory usage. If it's above a configurable threshold (hard
coded at 128 MB at the moment), it evicts a layer. The victim layer is
chosen by GClock algorithm, similar to that used in the Postgres buffer
cache.

This stops the page server from using an unbounded amount of memory. It's
pretty crude, the eviction and materializing and writing a layer to disk
happens now in the WAL receiver thread. It would be nice to move that
to a background thread, and it would be nice to have a smarter policy on
when to materialize a new image layer and when to just write out a delta
layer, and it would be nice to have more accurate accounting of memory.
But this should fix the most pressing OOM issues, and is a step in the
right direction.

Co-authored-by: Patrick Insinger <patrickinsinger@gmail.com>
2021-11-02 15:49:39 +02:00
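
A GClock-style victim selection, as named in fb524dd973, might be sketched like this. The struct names, the usage cap, and the use of plain integer layer IDs are hypothetical; only the 128 MB figure comes from the commit message. Each layer carries a small usage counter, a clock hand sweeps the slots decrementing counters, and the first slot found at zero is evicted.

```rust
/// Threshold mentioned in the commit message (hard coded at the time).
pub const MEMORY_LIMIT_BYTES: usize = 128 * 1024 * 1024;
/// Assumed cap on the per-layer usage counter.
pub const MAX_USAGE: u8 = 5;

pub struct Slot {
    layer_id: u32,
    usage: u8,
}

/// Hypothetical clock structure over the open in-memory layers.
pub struct GClock {
    slots: Vec<Slot>,
    hand: usize,
}

impl GClock {
    pub fn new() -> Self {
        GClock { slots: Vec::new(), hand: 0 }
    }

    pub fn insert(&mut self, layer_id: u32) {
        self.slots.push(Slot { layer_id, usage: 1 });
    }

    /// Record an access: bump the usage counter, saturating at MAX_USAGE.
    pub fn touch(&mut self, layer_id: u32) {
        if let Some(slot) = self.slots.iter_mut().find(|s| s.layer_id == layer_id) {
            slot.usage = (slot.usage + 1).min(MAX_USAGE);
        }
    }

    /// Sweep the hand, decrementing counters, and evict the first slot at zero.
    pub fn evict(&mut self) -> Option<u32> {
        if self.slots.is_empty() {
            return None;
        }
        loop {
            if self.hand >= self.slots.len() {
                self.hand = 0;
            }
            if self.slots[self.hand].usage == 0 {
                // After removal the hand already points at the next slot.
                return Some(self.slots.remove(self.hand).layer_id);
            }
            self.slots[self.hand].usage -= 1;
            self.hand += 1;
        }
    }

    /// Evict one layer only when the tracked usage exceeds the limit.
    pub fn evict_if_over(&mut self, used_bytes: usize) -> Option<u32> {
        if used_bytes > MEMORY_LIMIT_BYTES { self.evict() } else { None }
    }
}
```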
Heikki Linnakangas
8c6d2664c0 Support removing arbitrary open layers, not just the oldest one 2021-11-02 15:43:16 +02:00
Patrick Insinger
cdbbd15eb9 pageserver - add InMemoryLayer global map (#817) 2021-11-01 12:20:24 -07:00
anastasia
83ed930bc2 WIP. Launch and shutdown tenant threads together with walreceiver.
TODO: now walreceiver only disconnects if safekeeper was shut down. Implement proper walreceiver disconnection.
2021-11-01 18:04:00 +03:00
anastasia
071e30cc53 Expose TENANT_THREADS_COUNT metric to observe number of currently active checkpointer and GC threads 2021-11-01 18:04:00 +03:00
Kirill Bulatov
e6ef27637b Better API to handle timeline metadata properly 2021-10-29 23:51:40 +03:00
Patrick Insinger
b532470792 Set SO_REUSEADDR for all TCP listeners 2021-10-29 12:45:26 -07:00
Kirill Bulatov
edba2e9744 Use a proper extension for the readme file 2021-10-28 18:55:14 +03:00
anastasia
ea5900f155 Refactoring of checkpointer and GC.
Move them to a separate tenant_threads module to detangle thread management from LayeredRepository implementation.
2021-10-27 20:50:26 +03:00
anastasia
28ab40c8b7 fix init_repo() call in register_relish_download() 2021-10-27 20:50:26 +03:00
Dmitry Rodionov
1c0e85f9a0 review cleanups 2021-10-27 13:30:34 +03:00
Dmitry Rodionov
5bc09074ea add a flag to avoid non incremental size calculation in pageserver http api
This calculation is not that heavy but it is needed only in tests, and
in case the number of tenants/timelines is high the calculation can take
noticeable time.

Resolves https://github.com/zenithdb/zenith/issues/804
2021-10-27 13:30:34 +03:00
anastasia
f43f8401ee Don't wait for wal-redo process for non-relational records replay 2021-10-26 19:30:28 +03:00
Kirill Bulatov
f291ab2b87 Do not panic on missing tenant 2021-10-25 18:36:30 +03:00
Kirill Bulatov
04fb0a0342 Add core relish backup and restore functionality 2021-10-22 22:22:38 +03:00
Konstantin Knizhnik
c310932121 Implement backpressure for compute node to avoid WAL overflow
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
2021-10-21 18:15:50 +03:00
anastasia
7f9d2a7d05 Change 'zenith tenant list' API to return tenant state added in 0dc7a3fc 2021-10-21 11:04:22 +03:00
Heikki Linnakangas
feae7f39c1 Support read-only nodes
Change 'zenith.signal' file to a human-readable format, similar to
backup_label. It can contain a "PREV LSN: %X/%X" line, or a special
value to indicate that it's OK to start with invalid LSN ('none'), or
that it's a read-only node and generating WAL is forbidden
('invalid').

The 'zenith pg create' and 'zenith pg start' commands now take a node
name parameter, separate from the branch name. If the node name is not
given, it defaults to the branch name, so this doesn't break existing
scripts.

If you pass "foo@<lsn>" as the branch name, a read-only node anchored
at that LSN is created. The anchoring is performed by setting the
'recovery_target_lsn' option in the postgresql.conf file, and putting
the server into standby mode with 'standby.signal'.

We no longer store the synthetic checkpoint record in the WAL segment.
The postgres startup code has been changed to use the copy of the
checkpoint record in the pg_control file, when starting in zenith
mode.
2021-10-19 09:48:12 +03:00
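
Parsing the pieces described in feae7f39c1 could look roughly like the sketch below. It assumes the special values appear as "PREV LSN: none" and "PREV LSN: invalid", which the commit message implies but does not spell out; the function and enum names are hypothetical.

```rust
/// Possible contents of the PREV LSN line (hypothetical representation).
#[derive(Debug, PartialEq)]
pub enum PrevLsn {
    Lsn(u64),
    /// OK to start with an invalid LSN.
    None,
    /// Read-only node; generating WAL is forbidden.
    Invalid,
}

/// Parse an LSN printed in the "%X/%X" style, e.g. "0/16B9188".
pub fn parse_lsn(s: &str) -> Option<u64> {
    let (hi, lo) = s.split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some(((hi as u64) << 32) | lo as u64)
}

/// Find and parse the "PREV LSN:" line in the signal file contents.
/// Assumes the special values are spelled "none" and "invalid".
pub fn parse_prev_lsn(contents: &str) -> Option<PrevLsn> {
    for line in contents.lines() {
        if let Some(value) = line.strip_prefix("PREV LSN:") {
            return match value.trim() {
                "none" => Some(PrevLsn::None),
                "invalid" => Some(PrevLsn::Invalid),
                other => parse_lsn(other).map(PrevLsn::Lsn),
            };
        }
    }
    None
}

/// Split a "foo@<lsn>" branch argument into the name and an optional LSN.
/// In this sketch a malformed LSN after '@' simply yields `None`.
pub fn split_branch_spec(spec: &str) -> (&str, Option<u64>) {
    match spec.split_once('@') {
        Some((name, lsn)) => (name, parse_lsn(lsn)),
        None => (spec, None),
    }
}
```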
Heikki Linnakangas
e272a380b4 On new repo, start writing WAL only after the initial checkpoint record.
Previously, the first WAL record on the 'main' branch overwrote the
initial checkpoint record, with invalid 'xl_prev'. That's harmless, but
also pretty ugly. I bumped into this while I was trying to tighten up the
checks for when a valid 'prev_lsn' is required. With this patch, the
first WAL record gets a valid 'xl_prev' value. It doesn't matter much
currently, but let's be tidy.
2021-10-19 09:48:04 +03:00
anastasia
0dc7a3fc15 Change tenant_mgr to use TenantState.
This avoids locking the entire TENANTS list while one tenant is bootstrapping
and prepares the code for remote storage integration.
2021-10-18 15:40:06 +03:00
Kirill Bulatov
e9b5224a8a Fix toml serde gotchas 2021-10-18 14:14:27 +03:00
Heikki Linnakangas
bdd039a9ee S3 DELETE call returns 204, not 200.
According to the S3 API docs, the DELETE call returns code "204 No content"
on success.
2021-10-17 16:21:58 +03:00
Heikki Linnakangas
b405eef324 Avoid writing the metadata file when it hasn't changed. 2021-10-17 14:54:39 +03:00
Kirill Bulatov
ba557d126b React on sigint 2021-10-15 21:24:24 +03:00
Kirill Bulatov
4ade0bb41c Refactor upload/download_relish function signatures.
This makes them more generic, by taking any Read / Write trait
implementation, instead of operating directly on a file.
2021-10-15 11:34:15 +03:00
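
The signature change in 4ade0bb41c, taking any Read / Write implementation instead of a file, can be illustrated with a minimal sketch. The function name `upload` and its return value are assumptions; only the use of the standard `Read` and `Write` traits comes from the commit message.

```rust
use std::io::{self, Read, Write};

/// Hypothetical upload helper: copies everything from `source` to `destination`.
/// Because it is generic over the traits, a File, a socket, or an in-memory
/// buffer all work, which also makes it easy to test.
pub fn upload<R: Read, W: Write>(mut source: R, mut destination: W) -> io::Result<u64> {
    let copied = io::copy(&mut source, &mut destination)?;
    destination.flush()?;
    Ok(copied)
}
```

In a test, a byte slice can stand in for the source and a `Vec<u8>` for the destination, so no temporary files are needed.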
Arseny Sher
de744a44dd Add /timeline http request to safekeeper returning its status.
Which is mainly generational state (terms) and useful LSNs.

Also add /status basic healthcheck request which is now used in tests to
determine the safekeeper is up; this fixes #726.

ref #115
2021-10-14 19:02:38 +03:00
Heikki Linnakangas
0e026371ec Optimize WAL decoding slightly.
This adds a fast-path for the common case that the record doesn't
cross a page boundary. We now split off a new Bytes directly from the
original input buffer in that case, instead of copying the record to a
new BytesMut. Shaves about 5% of the page server's CPU time on my
laptop, in the 'test_bulk_insert' test.
2021-10-14 14:21:23 +03:00
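
The fast path from 0e026371ec can be approximated with the standard library alone. The commit itself works with Bytes and BytesMut; the sketch below swaps in `Cow<[u8]>` to show the same idea, borrowing from the input when the record sits inside one page and copying only when it crosses a boundary. The page and header sizes are tiny made-up values, and the layout is far simpler than real WAL.

```rust
use std::borrow::Cow;

/// Made-up sizes for illustration; real WAL pages and headers are larger.
pub const PAGE_SIZE: usize = 16;
pub const HEADER_SIZE: usize = 4;

/// Return the `len` record bytes starting at `offset`.
/// Assumes `offset` points past the header of its page.
pub fn read_record(buf: &[u8], offset: usize, len: usize) -> Cow<'_, [u8]> {
    let page_end = (offset / PAGE_SIZE + 1) * PAGE_SIZE;
    if offset + len <= page_end {
        // Fast path: the record does not cross a page boundary, so no copy.
        return Cow::Borrowed(&buf[offset..offset + len]);
    }
    // Slow path: stitch the pieces together, skipping each page header.
    let mut out = Vec::with_capacity(len);
    let mut pos = offset;
    while out.len() < len {
        let page_end = (pos / PAGE_SIZE + 1) * PAGE_SIZE;
        let n = (page_end - pos).min(len - out.len());
        out.extend_from_slice(&buf[pos..pos + n]);
        pos += n;
        if pos == page_end {
            pos += HEADER_SIZE;
        }
    }
    Cow::Owned(out)
}
```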
Heikki Linnakangas
8a4f092e82 Skip syncing the temp initdb installation.
Doesn't make much difference on my laptop with SSD, but every little
helps, and with a slower disk it might be noticeable.
2021-10-13 16:59:00 +03:00
Patrick Insinger
1c29de81de pageserver - remove lsn from WALRecord 2021-10-13 00:03:42 -07:00
Patrick Insinger
160c4aff61 pageserver - use write guard for checkpointing 2021-10-12 10:02:15 -07:00
Patrick Insinger
6e5ca5dc5c pageserver - create TimelineWriter 2021-10-12 10:02:15 -07:00
Heikki Linnakangas
95a85312f5 Simplify code to build walredo messages.
No need to use BytesMut in these functions. Plain Vec is simpler. And
should be marginally faster too; I saw BytesMut functions previously
in 'perf' profile, consuming around 5% of the overall pageserver CPU
time. That's gone with this patch, although I don't see any discernible
difference in the overall performance test results.
2021-10-12 10:16:26 +03:00
Heikki Linnakangas
934fb8592f Detect when a checkpoint is modified in a smarter way.
Previously, the WAL receiver would make a decoded copy of the current
Checkpoint before each WAL record, and compare it with the Checkpoint
after the record has been processed. If it has changed, the checkpoint
relish is updated in the repository. That's somewhat expensive, the
Checkpoint::encode() function is visible in 'perf' profile. Change that
so that we set a flag whenever the Checkpoint struct is modified, so that
we don't need to compare the whole struct anymore.
2021-10-12 09:09:10 +03:00
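
The flag-based approach in 934fb8592f might be sketched as a small wrapper. The wrapper type, its method names, and the two-field `Checkpoint` are hypothetical simplifications; the point is that every mutation goes through one place that sets the flag, so no before/after comparison or encoding is needed.

```rust
/// Heavily simplified stand-in for the real Checkpoint struct.
#[derive(Clone, Debug, PartialEq)]
pub struct Checkpoint {
    pub next_xid: u32,
    pub next_oid: u32,
}

/// Hypothetical wrapper that remembers whether the checkpoint was touched.
pub struct TrackedCheckpoint {
    value: Checkpoint,
    modified: bool,
}

impl TrackedCheckpoint {
    pub fn new(value: Checkpoint) -> Self {
        TrackedCheckpoint { value, modified: false }
    }

    pub fn get(&self) -> &Checkpoint {
        &self.value
    }

    /// All mutations go through here, so the flag is always set on change.
    pub fn update<F: FnOnce(&mut Checkpoint)>(&mut self, f: F) {
        f(&mut self.value);
        self.modified = true;
    }

    /// Report whether an update happened since the last call, and reset.
    pub fn take_modified(&mut self) -> bool {
        std::mem::replace(&mut self.modified, false)
    }
}
```

After processing each WAL record, the receiver would call `take_modified()` and write the checkpoint back only when it returns true.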
anastasia
d7c9dd06f4 Implement graceful shutdown at 'pageserver stop':
- perform checkpoint for each tenant repository.
- wait for the completion of all threads.

Add new option 'immediate' to 'pageserver stop' command to terminate the pageserver immediately.
2021-10-11 13:35:01 +03:00
Heikki Linnakangas
7216f22609 Use tracing crate to have more context in log messages.
Whenever we start processing a request, we now enter a tracing "span"
that includes context information like the tenant and timeline ID, and
the operation we're performing. That context information gets attached
to every log message we create within the span. That way, we don't need
to include basic context information like that in every log message, and
it also becomes easier to filter the logs programmatically.

This removes the explicit timeline and tenant IDs from most log messages,
as you get that information from the enclosing span now.

Also improve log messages in general, dialing down the level of some
messages that are not very useful, and adding information to others.

We now obey the RUST_LOG env variable, if it's set.

The 'tracing' crate allows for different log formatters, like JSON or
bunyan output. The one we use now is human-readable multi-line format,
which is nice when reading the log directly, but hard for
post-processing. For production, we'll probably want JSON output and
some tools for working with it, but that's left as a TODO. The log
format is easy to change.
2021-10-11 08:59:06 +03:00
Kirill Bulatov
bf58f7f649 Expose certain layered repository structs to reuse in relish storage (#688) 2021-10-09 19:23:57 +03:00
Patrick Insinger
3f0ebc6a40 pageserver - move early File::open call 2021-10-09 08:45:52 -07:00
Patrick Insinger
c356030660 pageserver - use VecMap for delta metadata & sizes 2021-10-08 15:05:22 -07:00
Patrick Insinger
c4bb6d78d4 pageserver - use VecMap for in memory segsizes 2021-10-08 14:37:32 -07:00