Commit Graph

132 Commits

Author SHA1 Message Date
Kirill Bulatov
8043612334 Remove Timeline trait, rename LayeredTimeline struct into Timeline 2022-08-19 16:40:37 +03:00
Kirill Bulatov
3b819ee159 Remove extra type aliases (#2280) 2022-08-17 17:51:53 +03:00
Kirill Bulatov
995a2de21e Share exponential backoff code and fix logic for delete task failure (#2252) 2022-08-11 23:21:06 +03:00
Ankur Srivastava
84d1bc06a9 refactor: replace lazy-static with once-cell (#2195)
- Replacing all occurrences of lazy-static with `once_cell::sync::Lazy`
- fixes #1147

Signed-off-by: Ankur Srivastava <best.ankur@gmail.com>
2022-08-05 19:34:04 +02:00
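
A minimal sketch of the migration pattern this commit applies (the static below is hypothetical, not one from the codebase):

```rust
use once_cell::sync::Lazy;
use std::collections::HashMap;

// Before, with the lazy-static crate:
//
//     lazy_static! {
//         static ref DEFAULTS: HashMap<&'static str, u64> = build_defaults();
//     }
//
// After, with once_cell: a plain static, no macro needed.
static DEFAULTS: Lazy<HashMap<&'static str, u64>> = Lazy::new(|| {
    let mut m = HashMap::new();
    m.insert("checkpoint_distance", 256 * 1024 * 1024u64);
    m
});

fn main() {
    // The initializer closure runs exactly once, on first dereference.
    println!("{:?}", DEFAULTS.get("checkpoint_distance"));
}
```
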
Heikki Linnakangas
d6f12cff8e Make DatadirTimeline a trait, implemented by LayeredTimeline.
Previously DatadirTimeline was a separate struct, and there was a 1:1
relationship between each DatadirTimeline and LayeredTimeline. That was
a bit awkward; whenever you created a timeline, you also needed to create
the DatadirTimeline wrapper around it, and if you only had a reference
to the LayeredTimeline, you would need to look up the corresponding
DatadirTimeline struct through tenant_mgr::get_local_timeline_with_load().
There were a couple of calls like that from LayeredTimeline itself.

Refactor DatadirTimeline, so that it's a trait, and mark LayeredTimeline
as implementing that trait. That way, there's only one object,
LayeredTimeline, and you can call both Timeline and DatadirTimeline
functions on that. You can now also call DatadirTimeline functions from
LayeredTimeline itself.

I considered just moving all the functions from DatadirTimeline directly
to Timeline/LayeredTimeline, but I still like to have some separation.
Timeline provides a simple key-value API and handles durably storing
key/value pairs and branching, whereas DatadirTimeline is stateless and
provides an abstraction over the key-value store, presenting an
interface in terms of Postgres concepts: relations, databases, and so on.

This simplified the logical size calculation fast-path for branch
creation, introduced in commit 28243d68e6. LayeredTimeline can now
access the ancestor's logical size directly, so it doesn't need the
caller to pass it in. I moved the fast-path to the init_logical_size()
function itself. It now checks if the ancestor's last LSN is the same
as the branch point, i.e. if there haven't been any changes on the
ancestor after the branch, and copies the size from there. An
additional bonus is that the optimization will now work any time you
have a branch of another branch, with no changes from the ancestor,
not only at a create-branch command.
2022-07-27 10:26:21 +03:00
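
The shape of this refactoring, boiled down to a self-contained sketch (stub types stand in for the real Key/Lsn/layer machinery; only the trait-over-primitive structure is the point):

```rust
use std::collections::BTreeMap;

// Stub types standing in for the real pageserver types (illustrative).
type Key = u64;
type Lsn = u64;

struct LayeredTimeline {
    // Simplified stand-in for the layered, LSN-versioned store.
    store: BTreeMap<(Key, Lsn), Vec<u8>>,
}

// Before: DatadirTimeline was a separate wrapper struct holding the
// timeline. After: it is a trait whose default methods build Postgres
// concepts on top of the key-value primitive, and LayeredTimeline
// implements it directly.
trait DatadirTimeline {
    fn get(&self, key: Key, lsn: Lsn) -> Option<&[u8]>;

    // A higher-level "Postgres concept" expressed via the primitive:
    // a relation's size, stored under a per-relation size key.
    fn get_rel_size(&self, rel_size_key: Key, lsn: Lsn) -> Option<u32> {
        let buf = self.get(rel_size_key, lsn)?;
        Some(u32::from_le_bytes(buf[..4].try_into().ok()?))
    }
}

impl DatadirTimeline for LayeredTimeline {
    fn get(&self, key: Key, lsn: Lsn) -> Option<&[u8]> {
        // Newest version of `key` at or before `lsn` (simplified).
        self.store
            .range((key, 0)..=(key, lsn))
            .next_back()
            .map(|(_, v)| v.as_slice())
    }
}
```
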
bojanserafimov
d29c545b5d Gc/compaction thread pool, take 2 (#1933)
Decrease the number of pageserver threads by running gc and compaction in a blocking tokio thread pool
2022-07-05 02:06:40 -04:00
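
The underlying pattern is roughly the following sketch (assuming a tokio runtime and the anyhow crate; not the actual pageserver code):

```rust
use anyhow::Result;

// Run gc/compaction on tokio's dedicated blocking pool instead of a
// long-lived OS thread per tenant, so the async executor threads are
// not stalled by heavy work.
async fn run_compaction_once() -> Result<()> {
    tokio::task::spawn_blocking(|| {
        // ... compaction body: pick L0 delta layers, reshuffle them ...
        Ok(())
    })
    .await? // a panic in the task surfaces as a JoinError here
}
```
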
Dmitry Rodionov
6e26588d17 Allow customizing the shutdown condition in PostgresBackend
Use it in PageServerHandler to check the per-thread shutdown condition
from thread_mgr, which takes into account tenants and timelines
2022-06-07 22:11:54 +03:00
Arthur Petukhovsky
ec8861b8cc Fix pageserver metrics names (#1682)
Try to follow the Prometheus style guide https://prometheus.io/docs/practices/naming/ for metric names (a code sketch of these rules follows this entry). More specifically:
- Use `pageserver_` prefix for all pageserver metrics
- Specify `_seconds` unit in time metrics
- Use unit as a suffix in other cases, such as `_hits`, `_bytes`, `_records`
- Use `_total` suffix for accumulating counters (note that Histograms append that suffix internally)
2022-05-12 19:53:07 +03:00
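
Applied with the prometheus crate, those rules might look like the following sketch (the metric names here are illustrative, not the renamed ones):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram, register_int_counter, Histogram, IntCounter};

// `pageserver_` prefix and `_seconds` unit; the library itself appends
// the `_bucket`/`_sum`/`_count` suffixes for histogram series.
static GETPAGE_TIME: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "pageserver_getpage_seconds",
        "Time spent serving get_page requests"
    )
    .unwrap()
});

// Accumulating counter: `_total` suffix added explicitly.
static WAL_RECORDS: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "pageserver_wal_records_received_total",
        "Number of WAL records received"
    )
    .unwrap()
});
```
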
Kirill Bulatov
de37f982db Share the remote storage as a crate 2022-05-07 00:30:36 +03:00
Dmitry Rodionov
ad25736f3a Exit pageserver process with correct error code
When we shut down the pageserver due to an error (e.g. one of the
important threads panicked), use exit code 1 so systemd can properly restart it
2022-05-02 19:04:45 +03:00
Konstantin Knizhnik
5f83c9290b Make it possible to specify per-tenant configuration parameters
Add tenant config API and 'zenith tenant config' CLI command.
Add 'show' query to pageserver protocol for tenant-specific config parameters

Refactoring: move tenant_config code to a separate module.
Save the tenant config file to the tenant's directory when the tenant is created, so it can be recovered on pageserver restart.
Ignore errors during tenant config loading while it is not yet supported by the console

Define PiTR interval for GC.

refer #1320
2022-04-22 11:24:29 +03:00
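
A common shape for such per-tenant overrides is optional fields that fall back to global defaults; a sketch under assumed names (not the actual TenantConf definition):

```rust
use std::time::Duration;

#[derive(Default, Clone)]
struct TenantConfOpt {
    // Each field is optional: `None` means "use the global default".
    checkpoint_distance: Option<u64>,
    gc_horizon: Option<u64>,
    pitr_interval: Option<Duration>,
}

#[derive(Clone)]
struct TenantConf {
    checkpoint_distance: u64,
    gc_horizon: u64,
    pitr_interval: Duration,
}

impl TenantConf {
    // Resolve a tenant's effective config from its overrides.
    fn from_overrides(global: &TenantConf, over: &TenantConfOpt) -> TenantConf {
        TenantConf {
            checkpoint_distance: over.checkpoint_distance.unwrap_or(global.checkpoint_distance),
            gc_horizon: over.gc_horizon.unwrap_or(global.gc_horizon),
            pitr_interval: over.pitr_interval.unwrap_or(global.pitr_interval),
        }
    }
}
```
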
Heikki Linnakangas
a4700c9bbe Use pprof to get flamegraph of get_page and get_relsize requests.
This depends on a hacked version of the 'pprof-rs' crate. Because of
that, it's under an optional 'profiling' feature. It is disabled by
default, but enabled for release builds in CircleCI config. It doesn't
currently work on macOS.

The flamegraph is written to 'flamegraph.svg' in the pageserver
workdir when the 'pageserver' process exits.

Add a performance test that runs the perf_pgbench test, with profiling
enabled.
2022-04-21 20:32:48 +03:00
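
With the stock pprof-rs crate (the commit relies on a hacked version of it), the flow is roughly:

```rust
// Cargo.toml (sketch): pprof = { version = "0.x", features = ["flamegraph"] }
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Sample the process ~100 times per second while the guard lives.
    let guard = pprof::ProfilerGuard::new(100)?;

    // ... serve get_page / get_relsize requests ...

    // On exit, render the collected samples as an SVG flamegraph.
    let report = guard.report().build()?;
    let file = std::fs::File::create("flamegraph.svg")?;
    report.flamegraph(file)?;
    Ok(())
}
```
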
Kirill Bulatov
81cad6277a Move library crates into a dedicated directory and rename them 2022-04-21 13:30:33 +03:00
Heikki Linnakangas
214567bf8f Use B-tree for the index in image and delta layers.
We now use a page cache for those, instead of slurping the whole index into
memory.

Fixes https://github.com/zenithdb/zenith/issues/1356

This is a backwards-incompatible change to the storage format, so
bump STORAGE_FORMAT_VERSION.
2022-04-07 20:58:55 +03:00
Heikki Linnakangas
5d9851f5d1 Refactor the I/O functions.
This introduces two new abstraction layers for I/O:

- Block I/O, and
- Blob I/O.

The BlockReader trait abstracts a file or something else that can be read
in 8kB pages. It is implemented by EphemeralFiles, and by a new
FileBlockReader struct that allows reading arbitrary VirtualFiles in that
manner, utilizing the page cache.

There is also a new BlockCursor struct that works as a cursor over a
BlockReader. When you create a BlockCursor and read the first page using
it, it keeps the reference to the page. If you access the same page again,
it avoids going to page cache and quickly returns the same page again.
That can save a lot of lookups in the page cache if you perform multiple
reads.

The Blob-oriented API allows reading and writing "blobs" of arbitrary
length. It is a layer on top of the block-oriented API. When you write
a blob with the write_blob() function, it writes a length field
followed by the actual data to the underlying block storage, and
returns the offset where the blob was stored. The blob can be
retrieved later using the offset.

Finally, this replaces the I/O code in image-, delta-, and in-memory
layers to use the new abstractions. These replace the 'bookfile'
crate.

This is a backwards-incompatible change to the storage format.
2022-04-07 20:58:54 +03:00
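
In miniature, the two layers look something like this sketch (the real traits are richer, are async-free but cache-aware, and go through the page cache):

```rust
const PAGE_SZ: usize = 8192;

// Block-oriented layer: anything that can serve fixed-size 8kB pages
// (an EphemeralFile, or a FileBlockReader over a VirtualFile).
trait BlockReader {
    fn read_blk(&self, blknum: u32) -> std::io::Result<[u8; PAGE_SZ]>;
}

// Blob-oriented layer on top: length-prefixed values at byte offsets.
// write_blob returns the offset, which the caller keeps in order to
// re-read the blob later; here the "block storage" is a byte buffer.
fn write_blob(buf: &mut Vec<u8>, blob: &[u8]) -> u64 {
    let offset = buf.len() as u64;
    buf.extend_from_slice(&(blob.len() as u32).to_le_bytes()); // length field
    buf.extend_from_slice(blob); // payload
    offset
}

fn read_blob(buf: &[u8], offset: u64) -> Vec<u8> {
    let off = offset as usize;
    let len = u32::from_le_bytes(buf[off..off + 4].try_into().unwrap()) as usize;
    buf[off + 4..off + 4 + len].to_vec()
}
```
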
Heikki Linnakangas
07342f7519 Major storage format rewrite.
This is a backwards-incompatible change. The new pageserver cannot
read repositories created with an old pageserver binary, or vice
versa.

Simplify Repository to a value-store
------------------------------------

Move the responsibility of tracking relation metadata, like which
relations exist and what their sizes are, from Repository to a new
module, pgdatadir_mapping.rs. The interface to Repository now consists
of simple key-value PUT/GET operations.

It's still not just any old key-value store, though. A Repository is
still responsible for handling branching, and every GET operation comes
with an LSN.

Mapping from Postgres data directory to keys/values
---------------------------------------------------

All the data is now stored in the key-value store. The
'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects
like relation pages and SLRUs, to key-value pairs.

The key to the Repository key-value store is a Key struct, which
consists of a few integer fields. It's wide enough to store a full
RelFileNode, fork and block number, and to distinguish those from
metadata keys.

'pgdatadir_mapping.rs' is also responsible for maintaining a
"partitioning" of the keyspace. Partitioning means splitting the
keyspace so that each partition holds a roughly equal number of keys.
The partitioning is used when new image layer files are created, so
that each image layer file is roughly the same size.

The partitioning is also responsible for reclaiming space used by
deleted keys. The Repository implementation doesn't have any explicit
support for deleting keys. Instead, the deleted keys are simply
omitted from the partitioning, and when a new image layer is created,
the omitted keys are not copied over to the new image layer. We might
want to implement tombstone keys in the future, to reclaim space
faster, but this will work for now.

Changes to low-level layer file code
------------------------------------

The concept of a "segment" is gone. Each layer file can now store an
arbitrary range of Keys.

Checkpointing, compaction
-------------------------

The background tasks are somewhat different now. Whenever
checkpoint_distance is reached, the WAL receiver thread "freezes" the
current in-memory layer, and creates a new one. This is a quick
operation and doesn't perform any I/O yet. It then launches a
background "layer flushing thread" to write the frozen layer to disk,
as a new L0 delta layer. This mechanism takes care of durability. It
replaces the checkpointing thread.

Compaction is a new background operation that takes a bunch of L0
delta layers, and reshuffles the data in them. It runs in a separate
compaction thread.

Deployment
----------

This also contains changes to the ansible scripts that enable having
multiple different pageservers running at the same time in the staging
environment. We will use that to keep an old version of the pageserver
running, for clusters created with the old version, alongside a new
pageserver with the new binary.

Author: Heikki Linnakangas
Author: Konstantin Knizhnik <knizhnik@zenith.tech>
Author: Andrey Taranik <andrey@zenith.tech>
Reviewed-by: Matthias Van De Meent <matthias@zenith.tech>
Reviewed-by: Bojan Serafimov <bojan@zenith.tech>
Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Reviewed-by: Anton Shyrabokau <antons@zenith.tech>
Reviewed-by: Dhammika Pathirana <dham@zenith.tech>
Reviewed-by: Kirill Bulatov <kirill@zenith.tech>
Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech>
Reviewed-by: Alexey Kondratov <alexey@zenith.tech>
2022-03-28 05:41:15 -05:00
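
A toy rendering of the resulting interface (the real Key is wider, and the store is layered and LSN-versioned):

```rust
// A wide, structured key: enough room for (RelFileNode, fork, block)
// plus a field that sets metadata keys apart from relation data.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Key {
    field1: u8,   // distinguishes relation data from metadata
    spcnode: u32, // tablespace
    dbnode: u32,  // database
    relnode: u32, // relation
    forknum: u8,  // fork (main, fsm, vm, ...)
    blknum: u32,  // block number
}

// Repository as a value store: every GET is qualified by an LSN.
trait Repository {
    fn put(&mut self, key: Key, lsn: u64, value: Vec<u8>);
    fn get(&self, key: Key, lsn: u64) -> Option<Vec<u8>>;
}
```
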
Kirill Bulatov
28bc8e3f5c Log pageserver threads better and shut down on errors in them 2022-03-25 11:58:54 +02:00
Kirill Bulatov
10f811e886 Use timeline instead of branch in pageserver's API 2022-03-10 19:38:58 +02:00
Heikki Linnakangas
dab30c27b6 Refactor thread management and shutdown
This introduces a new module to handle thread creation and shutdown.
All page server threads are now registered in a global hash map, and
there's a function to request individual threads to shut down gracefully.

Thread shutdown request is signalled to the thread with a flag, as well
as a Future that can be used to wake up async operations if shutdown is
requested. Use that facility to have the libpq listener thread respond
to pageserver shutdown, based on Kirill's earlier prototype
(https://github.com/zenithdb/zenith/pull/1088). That addresses
https://github.com/zenithdb/zenith/issues/1036: previously, the libpq
listener thread would not exit until one more connection arrived.

This also eliminates a resource leak in the accept() loop. Previously,
we added the JoinHandle of each new thread to a vector, but old handles
for threads that had already exited were never removed.
2022-01-14 18:36:10 +02:00
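
The registry mechanism, reduced to a sketch (the real thread_mgr also tracks tenant/timeline associations and exposes a shutdown Future for async code):

```rust
use once_cell::sync::Lazy;
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

// Global registry of pageserver threads, keyed by a thread id.
static THREADS: Lazy<Mutex<HashMap<u64, Arc<AtomicBool>>>> =
    Lazy::new(|| Mutex::new(HashMap::new()));

fn spawn_registered(id: u64, name: &str, body: impl FnOnce(&AtomicBool) + Send + 'static) {
    let flag = Arc::new(AtomicBool::new(false));
    THREADS.lock().unwrap().insert(id, Arc::clone(&flag));
    std::thread::Builder::new()
        .name(name.to_string())
        .spawn(move || {
            // The body polls the flag at convenient points.
            body(&flag);
            // Deregister on exit, so finished threads leak no handles.
            THREADS.lock().unwrap().remove(&id);
        })
        .unwrap();
}

// Ask one thread to wind down gracefully.
fn request_shutdown(id: u64) {
    if let Some(flag) = THREADS.lock().unwrap().get(&id) {
        flag.store(true, Ordering::Relaxed);
    }
}
```
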
Konstantin Knizhnik
1c47fbae81 Do not write image layers during enforced checkpoint (#1057)
* Do not write image layers during enforced checkpoint
refer #1056

* Add Flush option to CheckpointConfig

refer #1057
2022-01-01 19:08:09 +03:00
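
One plausible shape for that option (a sketch of the idea, not the exact definition):

```rust
// How a checkpoint request is qualified.
enum CheckpointConfig {
    // Flush in-memory data to disk as delta layers only; skip the more
    // expensive image-layer generation (the enforced-checkpoint case).
    Flush,
    // Full checkpoint, including image layers.
    Forced,
    // Checkpoint only once `distance` bytes of WAL have accumulated.
    Distance(u64),
}
```
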
Kirill Bulatov
114a757d1c Use generic config parameters in pageserver cli
Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
2021-12-23 18:58:28 +02:00
Heikki Linnakangas
f8f88154d5 Split restore_local_repo.rs into two files, with more descriptive names. 2021-12-14 20:24:03 +02:00
Kirill Bulatov
673c297949 Download timelines on demand 2021-12-10 17:23:35 +02:00
Heikki Linnakangas
c77e30116e Split waldecoder.rs into two source files.
Move the code for decoding a WAL stream into WAL records into
'postgres_ffi', and keep the code to parse the WAL records deeper in
'pageserver' crate, renamed to walrecord.rs.

This tidies up the dependencies a bit. 'walkeeper' reuses the same
waldecoder routines, and it used to depend on 'pageserver' because of
that. Now it only depends on 'postgres_ffi'.

(The comment in walkeeper/Cargo.toml that claimed that the dependency was
needed for ZTimelineId was obsolete. ZTimelineId is defined in
'zenith_utils', the dependency was actually needed for the waldecoder.)
2021-12-10 15:14:13 +02:00
Kirill Bulatov
670205e17a Evict excessively failing sync tasks, improve processing for the rest of the tasks
2021-11-30 13:58:49 +02:00
Dmitry Rodionov
130184fee9 Prohibit branch creation and basebackup at out of scope lsns
Out-of-scope LSNs include pre-initdb LSNs and LSNs prior to
latest_gc_cutoff.

Getting there also required two cleanups:
* Fix error handling in the Execute message handler. This fixes the
  behaviour when basebackup returned an error; previously the pageserver
  thread just died.
* Remove the "ancestor" file, which previously contained the ancestor id
  and branch lsn. The same data can now be obtained from the metadata
  file. Moreover, the way we handled the ancestor file meant that when
  branching failed, a timeline directory was created containing no data
  except the ancestor file, which confused gc because it scans
  directories. So it is better to remove the ancestor file and
  restructure timeline directory creation so that it happens only after
  all validity checks have passed (see the sketch after this entry).
2021-11-25 15:27:16 +03:00
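
The validate-then-create ordering from the second cleanup, as a sketch (function and parameter names are illustrative):

```rust
use std::path::Path;

// Branch creation, sketched: run every validity check before touching
// the filesystem, so a failed request leaves no half-created timeline
// directory behind for gc to stumble over.
fn create_branch(timeline_dir: &Path, start_lsn: u64, gc_cutoff: u64, initdb_lsn: u64) -> anyhow::Result<()> {
    // 1. All validity checks first.
    anyhow::ensure!(start_lsn >= initdb_lsn, "branch point is before initdb");
    anyhow::ensure!(start_lsn >= gc_cutoff, "branch point is older than latest_gc_cutoff");

    // 2. Only now create the directory and write metadata (the ancestor
    //    id and branch LSN live in the metadata file; no "ancestor" file).
    std::fs::create_dir_all(timeline_dir)?;
    // ... write metadata file ...
    Ok(())
}
```
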
Heikki Linnakangas
431d32756b Add a buffer cache, and use it to store materialized pages.
The buffer cache is shared across all tenants, allowing memory to be
dynamically allocated where it's needed the most. The cache works on 8 kB
pages, and uses the clock algorithm for replacement policy; same as the
PostgreSQL buffer cache.

One peculiarity is that the materialized page versions can be looked up
by an inexact LSN, to find the latest page version with an LSN >= the
search key.

The code is structured to support caching other kinds of pages in the same
cache in the future, but with a different mapping key.

Co-authored-by: Patrick Insinger <patrick@zenith.tech>
2021-11-12 11:02:12 -08:00
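
The inexact-LSN lookup can be sketched with an ordered map; this simplified variant returns the newest cached version at or before the requested LSN:

```rust
use std::collections::BTreeMap;

type Lsn = u64;
type PageKey = u64; // stand-in for the real (tenant, timeline, rel, block) key

struct MaterializedPageCache {
    // Ordered by (page, lsn), so all versions of a page are adjacent.
    pages: BTreeMap<(PageKey, Lsn), Vec<u8>>,
}

impl MaterializedPageCache {
    // Inexact lookup: the newest cached version at or before `lsn`.
    fn lookup(&self, page: PageKey, lsn: Lsn) -> Option<&Vec<u8>> {
        self.pages
            .range((page, Lsn::MIN)..=(page, lsn))
            .next_back()
            .map(|(_, img)| img)
    }
}
```
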
Heikki Linnakangas
9300107cdf Cache Book objects, use virtual files to avoid running out of fds.
Currently, whenever a page version is needed from an image or delta
layer, we open the file and read and parse the bookfile headers. That's
pretty expensive. To reduce the overhead, introduce a cache of open file
descriptors, and use that to cache the Book objects so that we don't need
to read the metadata on every access.
2021-11-10 17:19:37 +02:00
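
A minimal sketch of the idea (the real cache also bounds its size and wraps files in "virtual file" handles to stay under the fd limit):

```rust
use once_cell::sync::Lazy;
use std::collections::HashMap;
use std::fs::File;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};

// A crude cache of open files keyed by path, so that hot layer files
// are not reopened (and their headers re-parsed) on every access.
static OPEN_FILES: Lazy<Mutex<HashMap<PathBuf, Arc<File>>>> =
    Lazy::new(|| Mutex::new(HashMap::new()));

fn get_open_file(path: &Path) -> std::io::Result<Arc<File>> {
    let mut cache = OPEN_FILES.lock().unwrap();
    if let Some(f) = cache.get(path) {
        return Ok(Arc::clone(f));
    }
    let f = Arc::new(File::open(path)?);
    // A real implementation evicts old entries here to cap fd usage.
    cache.insert(path.to_path_buf(), Arc::clone(&f));
    Ok(f)
}
```
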
Kirill Bulatov
f36acf00de Reduce "relish" word usages in remote storage 2021-11-04 12:53:42 +02:00
Heikki Linnakangas
fb524dd973 Put a global limit on memory used by in-memory layers.
Adds simple global tracking of memory used by the in-memory layers. It's
very approximate: it doesn't take into account allocator overhead,
memory fragmentation, or many other things. But it's a good first step.

After storing a WAL record in the repository, the WAL receiver checks
the global memory usage. If it's above a configurable threshold
(hardcoded at 128 MB at the moment), it evicts a layer. The victim layer
is chosen by the GClock algorithm, similar to that used in the Postgres
buffer cache.

This stops the page server from using an unbounded amount of memory.
It's pretty crude: eviction, materializing, and writing a layer to disk
currently all happen in the WAL receiver thread. It would be nice to
move that to a background thread, to have a smarter policy on when to
materialize a new image layer versus just writing out a delta layer, and
to have more accurate accounting of memory. But this should fix the most
pressing OOM issues, and is a step in the right direction.

Co-authored-by: Patrick Insinger <patrickinsinger@gmail.com>
2021-11-02 15:49:39 +02:00
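
The tracking itself can be as small as a shared atomic counter; a sketch with an assumed eviction hook (the real code picks the victim via GClock):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static GLOBAL_INMEM_BYTES: AtomicUsize = AtomicUsize::new(0);
const MEMORY_LIMIT: usize = 128 * 1024 * 1024; // the hardcoded 128 MB threshold

// Called after each WAL record is stored in an in-memory layer.
fn account_and_maybe_evict(bytes_added: usize, evict_one_layer: impl Fn() -> usize) {
    let used = GLOBAL_INMEM_BYTES.fetch_add(bytes_added, Ordering::Relaxed) + bytes_added;
    if used > MEMORY_LIMIT {
        // Evict a victim layer (GClock-chosen in the real code) and
        // subtract whatever it freed; this currently runs in the WAL
        // receiver thread itself.
        let freed = evict_one_layer();
        GLOBAL_INMEM_BYTES.fetch_sub(freed, Ordering::Relaxed);
    }
}
```
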
anastasia
ea5900f155 Refactoring of checkpointer and GC.
Move them to a separate tenant_threads module to detangle thread management from LayeredRepository implementation.
2021-10-27 20:50:26 +03:00
Kirill Bulatov
04fb0a0342 Add core relish backup and restore functionality 2021-10-22 22:22:38 +03:00
Heikki Linnakangas
53b4dc944d Don't create unused "wal" directory
It hasn't been used since commit ca9af37478.
2021-10-07 10:36:26 +03:00
Kirill Bulatov
287ea2e5e3 Limit concurrent relish storage sync operations 2021-10-01 08:37:09 +03:00
Egor Suvorov
b0a7234759 pageserver: fix stale default listen addrs
* In command line help
* In dummy_conf
2021-09-28 20:57:51 +03:00
Egor Suvorov
ddf4b15ebc pageserver: use const_format crate to generate default listen addrs 2021-09-28 20:57:51 +03:00
Konstantin Knizhnik
ca9af37478 Do not write WAL at pageserver (#645)
* Do not write WAL at pageserver

* Remove import_timeline_wal function
2021-09-27 14:15:55 +03:00
Kirill Bulatov
1d5abf1253 Initial version of the relish storage 2021-09-17 15:30:22 +03:00
anastasia
8de41f1d70 Change checkpoint_distance type to u64 2021-09-16 12:33:50 +03:00
anastasia
6984d33b4e Run GC and checkpointer separate threads.
Add checkpoint_period configuration parameter
2021-09-16 12:33:50 +03:00
anastasia
98d4f9cea5 Add checkpoint_distance config parameter.
- Change hardcoded OLDEST_INMEM_DISTANCE value to pageserver config option checkpoint_distance.
- Get rid of 'force' flag in checkpoint_internal(). Use checkpoint_distance=0 instead.
2021-09-16 12:33:50 +03:00
Dmitry Rodionov
4ebe643d0c Support parallel test running for python tests
Support is implemented via the pytest-xdist plugin.
To use the feature, add -n<concurrency> to the pytest invocation,
e.g. pytest -n8 to run 8 tests in parallel.

Changes in the code are mostly about port assignment. Previously the
pageserver port was hardcoded, with no way to override it through the
zenith cli, and ports for started compute nodes were calculated twice:
in the zenith cli and in the test code. Now the zenith cli supports
passing port arguments for the pageserver and compute nodes explicitly.

Tests are modified so that each worker gets a non-overlapping port
range, which can be configured and currently contains 100 ports. These
ports are distributed to the test services (pageserver, wal acceptors,
compute nodes) so they can work independently.
2021-09-15 14:02:15 +03:00
Dmitry Rodionov
84008a2560 factor out common logging initialisation routine
This contains the lowest common denominator of the pageserver and
safekeeper log initialisation routines. It uses the daemonize flag to
decide where to stream log messages: when daemonize is true, log
messages are forwarded to a file; otherwise they are streamed to stdout.
Stdout is the default for log output in the docker world, which makes it
easier to browse our logs via builtin docker commands.
2021-09-14 18:09:14 +03:00
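
The sink choice reduces to the following sketch (a boxed writer; the real routine configures the logging framework accordingly):

```rust
use std::fs::File;
use std::io::{self, Write};

// Pick the log sink from the daemonize flag: a file when daemonized
// (no terminal), stdout otherwise so `docker logs` sees the output.
fn open_log_sink(daemonize: bool, log_path: &str) -> io::Result<Box<dyn Write + Send>> {
    if daemonize {
        Ok(Box::new(File::create(log_path)?))
    } else {
        Ok(Box::new(io::stdout()))
    }
}
```
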
Heikki Linnakangas
04ee1d5977 Add test for managing old open segments in binary heap.
I thought this test would trigger the bug fixed in the previous commit,
but it did not. More tests are nice in any case.
2021-09-07 18:10:07 +03:00
Heikki Linnakangas
b949127b06 Rename page_cache.rs to tenant_mgr.rs.
Once upon a time, 'page_cache.rs' contained an actual page cache, but
it hasn't for a very long time. Rename to reflect what it actually does
these days.
2021-08-30 15:17:30 +03:00
Heikki Linnakangas
4046530160 Remove remnants of choosing between repository formats.
Now that we only have one Repository implementation, no need for the
command-line options to choose it either. I'm removing these as a separate
commit to show what we will need to do if we add another Repository
implementation in the future (even though I don't foresee us doing that
any time soon).
2021-08-25 18:37:22 +03:00
Heikki Linnakangas
5998744bcc Remove rocksdb implementation.
The layered storage format is good enough that we don't need the rocksdb
implementation anymore. There are a lot of known issues but we'll keep
working on them.
2021-08-25 18:37:22 +03:00
Dmitry Rodionov
23b5249512 translate pageserver api to http 2021-08-24 19:05:00 +03:00
Heikki Linnakangas
2450f82de5 Introduce a new "layered" repository implementation.
This replaces the RocksDB based implementation with an approach using
"snapshot files" on disk, and in-memory btreemaps to hold the recent
changes.

This makes the repository implementation a configuration option. You can
choose 'layered' or 'rocksdb' with "zenith init --repository-format=<format>".
The unit tests have been refactored to exercise both implementations.
'layered' is now the default.

Push/pull is not implemented. The 'test_history_inmemory' test has been
commented out accordingly. It's not clear how we will implement that
functionality; probably by copying the snapshot files directly.
2021-08-16 10:06:48 +03:00
Dmitry Rodionov
ce5333656f Introduce authentication v0.1.
The current state of authentication:
The page server validates the JWT token passed as a password during the
connection phase, and later, when performing an action such as creating
a branch, the tenant parameter of the operation is validated to match
the one submitted in the token.
To allow access from the console there is a dedicated scope,
PageServerApi, which allows access to all tenants. See the access
validation code in PageServerHandler::check_permission.

Because we are in the middle of refactoring the communication layer
involving the wal proposer protocol and safekeeper<->pageserver, the
safekeeper doesn't check the token passed from compute yet, and uses a
"hardcoded" token passed via an environment variable to communicate
with the pageserver.

Compute postgres now takes the token from an environment variable and
passes it as the password field in the pageserver connection. It is not
passed through settings, because then the user would be able to
retrieve it using pg_settings or SHOW ..

I've added a basic test in test_auth.py. Once we add authentication to
the remaining network paths, we should probably enable it by default
and switch all existing tests to use it.
2021-08-11 20:05:54 +03:00
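
The per-tenant check described above, in essence (claim and scope names are illustrative; the real validation is PageServerHandler::check_permission):

```rust
// Claims carried in the JWT (names illustrative).
enum Scope {
    Tenant,        // token grants access to a single tenant
    PageServerApi, // console scope: access to all tenants
}

struct Claims {
    scope: Scope,
    tenant_id: Option<String>,
}

// An operation on tenant `requested` is allowed if the token is either
// console-scoped or scoped to that exact tenant.
fn check_permission(claims: &Claims, requested: &str) -> Result<(), String> {
    match claims.scope {
        Scope::PageServerApi => Ok(()),
        Scope::Tenant if claims.tenant_id.as_deref() == Some(requested) => Ok(()),
        _ => Err(format!("access denied for tenant {requested}")),
    }
}
```
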