650 Commits

Author SHA1 Message Date
Heikki Linnakangas
13ec0ce7b2 fix formatting 2022-03-17 19:40:08 +02:00
Heikki Linnakangas
3da14d56f2 Fix materialized page caching. 2022-03-17 17:12:30 +02:00
Heikki Linnakangas
b0b2093d00 Improve comments and tidy up the code in pgdatadir_mapping.rs. 2022-03-17 13:14:33 +02:00
Heikki Linnakangas
7560854370 Rename things in KeyPartition, per Bojan's suggestions. 2022-03-16 19:29:07 +02:00
Heikki Linnakangas
6a264aaca3 Stopgap "fix" for test_parallel_copy failure in debug mode. 2022-03-14 19:54:38 +02:00
Heikki Linnakangas
60ed6b3710 Shave some CPU cycles from reading blobs from files.
This shows up in 'perf' profile when running in debug mode. Not so
significant in release mode, but still.
2022-03-14 19:53:00 +02:00
Heikki Linnakangas
89690d7349 Prevent compaction from running at the same time as GC.
For the same reasons as we prohibited concurrent checkpointing and GC
previously.
2022-03-14 14:22:04 +02:00
Heikki Linnakangas
09f2dff537 Refactor the checkpoint and compaction functions.
The concept of a "checkpoint" had become quite muddled. This tries to
clarify it again.
2022-03-14 13:22:46 +02:00
Heikki Linnakangas
2d8587f67d Separate flushing in-memory layer to disk from checkpoints.
When 'checkpoint_distance' is reached, freeze the current in-memory
layer directly in the WAL receiver thread. And to flush the frozen
layer to disk, launch a separate "layer flushing thread". This leaves
only the compaction duty to the checkpoint thread.
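
A minimal sketch of the freeze-and-hand-off idea described above, not the
actual pageserver code. The names OpenLayer, FrozenLayer and spawn_flusher
are hypothetical, LSNs are plain u64 values, and "flushing" only counts
records instead of writing a layer file.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a frozen (read-only) in-memory layer.
pub struct FrozenLayer {
    pub records: Vec<(u64, Vec<u8>)>, // (LSN, record bytes)
}

/// Hypothetical open in-memory layer, owned by the WAL receiver thread.
pub struct OpenLayer {
    start_lsn: u64,
    records: Vec<(u64, Vec<u8>)>,
}

impl OpenLayer {
    pub fn new(start_lsn: u64) -> Self {
        OpenLayer { start_lsn, records: Vec::new() }
    }

    /// Append a record. Once the layer spans `checkpoint_distance`, freeze it
    /// in the calling thread and hand it to the flusher over the channel.
    /// Returns true if a layer was frozen and handed over.
    pub fn put(
        &mut self,
        lsn: u64,
        rec: Vec<u8>,
        checkpoint_distance: u64,
        flush_tx: &Sender<FrozenLayer>,
    ) -> bool {
        self.records.push((lsn, rec));
        if lsn.saturating_sub(self.start_lsn) >= checkpoint_distance {
            let frozen = FrozenLayer { records: std::mem::take(&mut self.records) };
            self.start_lsn = lsn;
            flush_tx.send(frozen).is_ok()
        } else {
            false
        }
    }
}

/// Spawn the separate "layer flushing thread". It exits when every sender
/// has been dropped and returns the number of records it flushed.
pub fn spawn_flusher() -> (Sender<FrozenLayer>, JoinHandle<usize>) {
    let (tx, rx) = channel::<FrozenLayer>();
    let handle = thread::spawn(move || {
        let mut flushed = 0;
        for layer in rx {
            // Real code would write the layer to disk here.
            flushed += layer.records.len();
        }
        flushed
    });
    (tx, handle)
}
```

With this split, the thread ingesting WAL never waits for disk I/O; it
only pays for the freeze and the channel send.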
2022-03-14 11:37:22 +02:00
Heikki Linnakangas
c559c72ede Merge remote-tracking branch 'origin/main' into HEAD 2022-03-14 10:26:05 +02:00
Heikki Linnakangas
f06707badc Bugfix: a few constant keys were missing from collect_keyspace
As a result, you got "could not find data for key" errors.
2022-03-13 01:15:32 +02:00
Heikki Linnakangas
64cdd6064d Don't emit ClearVisibilityMapFlags records for non-existent blocks.
We create a ClearVisibilityMapFlags record for the VM page when a heap
WAL record indicates that the VM bit needs to be cleared. However,
sometimes the VM block would not exist. It seems that PostgreSQL
sometimes sets the clear-VM bit on WAL records even though the
corresponding VM page hasn't been initialized yet. There's no point in
trying to clear a bit on a non-existent page, so just skip emitting the
record if the VM page doesn't exist.

I'm not entirely sure why we're only seeing this bug with this PR; I
think it existed before. Maybe we were sloppier and returned
an all-zeros page?
2022-03-13 01:14:58 +02:00
Heikki Linnakangas
ee40297758 Refactor keyspace code
Have separate classes for the KeySpace, a partitioning of the KeySpace
(KeyPartitioning), and a builder object used to construct the KeySpace.
Previously, KeyPartitioning did all those things, and it was a bit
confusing.
2022-03-11 16:24:13 +02:00
Heikki Linnakangas
d5b8380dae Improve comments on image layer.
Make it more explicit that if a key doesn't exist in an image layer, it
doesn't exist.
2022-03-11 09:47:09 +02:00
Heikki Linnakangas
bce2da4e55 Another 'tablespace' test fix. 2022-03-11 00:53:46 +02:00
Dhammika Pathirana
5d7bd8643a Fix page reconstruct time histo
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-03-10 14:42:28 -08:00
Dhammika Pathirana
27dadba52c Fix retain references to layer histograms
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-03-10 14:42:28 -08:00
Dhammika Pathirana
f67d010d1b Add ps smgr/storage metrics tenant tags
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>

Add tenant_id,timeline_id in smgr/storage metrics (#1234)
2022-03-10 14:42:28 -08:00
Heikki Linnakangas
3948956e87 Fix pg_table_size() on a view 2022-03-10 23:35:24 +02:00
Heikki Linnakangas
a726b555fb Handle tablespaces gracefully.
We don't really support tablespaces. But this makes the 'tablespace'
Postgres regression test pass, like it did previously.
2022-03-10 22:56:37 +02:00
Kirill Bulatov
093ad8ab59 Send 409 HTTP responses on timeline and tenant creation for existing entity 2022-03-10 19:38:58 +02:00
Kirill Bulatov
c51d545fd9 Serialize Lsn as strings in http api 2022-03-10 19:38:58 +02:00
Kirill Bulatov
fe6fccfdae Allow already existing repo when creating a tenant 2022-03-10 19:38:58 +02:00
Kirill Bulatov
dd74c66ef0 Do not create timeline along with tenant 2022-03-10 19:38:58 +02:00
Kirill Bulatov
a5e10c4f64 Tidy up pageserver's endpoints 2022-03-10 19:38:58 +02:00
Kirill Bulatov
7b5482bac0 Properly store the branch name mappings 2022-03-10 19:38:58 +02:00
Kirill Bulatov
c7569dce47 Allow passing initial timeline id into zenith CLI commands 2022-03-10 19:38:58 +02:00
Kirill Bulatov
4d0f7fd1e4 Update Zenith CLI config between runs 2022-03-10 19:38:58 +02:00
Kirill Bulatov
f49990ed43 Allow creating timelines by branching off ancestors 2022-03-10 19:38:58 +02:00
Kirill Bulatov
0c91091c63 Avoid point in time concept on pageserver level 2022-03-10 19:38:58 +02:00
Kirill Bulatov
10f811e886 Use timeline instead of branch in pageserver's API 2022-03-10 19:38:58 +02:00
Heikki Linnakangas
0e3512aad0 Crank down logging again 2022-03-10 18:50:12 +02:00
Heikki Linnakangas
dd56eeefbf Crank up logging 2022-03-10 15:45:50 +02:00
Heikki Linnakangas
d19a293e7e Add a test for branching 2022-03-10 14:56:13 +02:00
Heikki Linnakangas
be4aebd7e9 silence clippy 2022-03-10 13:36:28 +02:00
Heikki Linnakangas
dac73328ba Fix bug where reldir was not written to image layer. 2022-03-10 13:20:08 +02:00
Heikki Linnakangas
fb79c7f1f0 Make compaction more concurrent 2022-03-10 13:20:08 +02:00
Heikki Linnakangas
e7bd74d558 Tidy up 2022-03-10 13:20:08 +02:00
Heikki Linnakangas
da8beffc95 Fix logical timeline size tracking 2022-03-10 13:20:08 +02:00
Heikki Linnakangas
98ec8418c4 Fix bug with the partitioning and GC 2022-03-10 13:20:08 +02:00
Heikki Linnakangas
92d1322cd5 comments, other cleanup 2022-03-10 13:20:08 +02:00
Heikki Linnakangas
2896d35a8b rustfmt and clippy fixes 2022-03-10 13:20:08 +02:00
Heikki Linnakangas
e096c62494 Misc fixes and stuff 2022-03-09 11:36:39 +02:00
Heikki Linnakangas
356f716d39 Fixes 2022-03-09 11:36:39 +02:00
Heikki Linnakangas
798ff26fb0 More work on compaction, and resurrect some unit tests 2022-03-09 11:36:39 +02:00
Heikki Linnakangas
28045890eb Work on compaction. 2022-03-09 11:36:39 +02:00
Heikki Linnakangas
6127b6638b Major storage format rewrite
Major changes and new concepts:

Simplify Repository to a value-store
------------------------------------

Move the responsibility of tracking relation metadata, like which relations
exist and what their sizes are, from Repository to a new module,
pgdatadir_mapping.rs. The interface to Repository is now simple key-value
PUT/GET operations.

It's still not any old key-value store though. A Repository is still
responsible for handling branching, and every GET operation comes with
an LSN.
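
To illustrate what "every GET operation comes with an LSN" could mean, here
is a toy sketch, not the Repository interface itself. VersionedStore and
its methods are hypothetical, keys and LSNs are plain u64 values, and
branching is left out entirely.

```rust
use std::collections::BTreeMap;

pub type Lsn = u64;

/// Toy versioned store: every PUT records a value at an LSN, and a GET
/// returns the newest version written at or before the requested LSN.
#[derive(Default)]
pub struct VersionedStore {
    versions: BTreeMap<(u64, Lsn), Vec<u8>>,
}

impl VersionedStore {
    pub fn put(&mut self, key: u64, lsn: Lsn, value: Vec<u8>) {
        self.versions.insert((key, lsn), value);
    }

    pub fn get(&self, key: u64, lsn: Lsn) -> Option<&[u8]> {
        // All versions of `key` sort together; take the last one <= lsn.
        self.versions
            .range((key, 0)..=(key, lsn))
            .next_back()
            .map(|(_, v)| v.as_slice())
    }
}
```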

Key
---

The key to the Repository key-value store is a Key struct, which consists
of a few integer fields. It's wide enough to store a full RelFileNode,
fork and block number, and to distinguish those from metadata keys.

See pgdatadir_mapping.rs for how relation blocks and metadata keys are
mapped to the Key struct.
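
For illustration only, a key of this kind might look like the sketch
below. The field names, field widths and the `kind` tag values are
assumptions made for this example; the real layout is the one defined in
the code and described in pgdatadir_mapping.rs.

```rust
/// Hypothetical key layout. The derived ordering compares fields in
/// declaration order, so all blocks of one relation fork are contiguous.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Key {
    pub kind: u8,     // assumed tag: relation block vs. metadata
    pub spcnode: u32, // RelFileNode: tablespace OID
    pub dbnode: u32,  // RelFileNode: database OID
    pub relnode: u32, // RelFileNode: relation file number
    pub forknum: u8,  // fork number
    pub blknum: u32,  // block number within the fork
}

pub const KIND_REL_BLOCK: u8 = 0; // assumed value
pub const KIND_METADATA: u8 = 1; // assumed value

/// Map a relation block to a key.
pub fn rel_block_to_key(spcnode: u32, dbnode: u32, relnode: u32, forknum: u8, blknum: u32) -> Key {
    Key { kind: KIND_REL_BLOCK, spcnode, dbnode, relnode, forknum, blknum }
}

/// Distinguish relation-block keys from metadata keys.
pub fn is_rel_block(key: &Key) -> bool {
    key.kind == KIND_REL_BLOCK
}
```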

Store arbitrary key-ranges in the layer files
---------------------------------------------

The concept of a "segment" is gone. Each layer file can store an arbitrary
range of Keys.

TODO:

- Deleting keys, to reclaim space. This isn't visible to Postgres: dropping
  or truncating a relation works as you would expect if you look at it from
  the compute node. If you drop a relation, for example, the relation is
  removed from the metadata entry, so that it appears to be gone. However,
  the layered repository implementation never reclaims the storage.

- Tracking "logical database size", for disk space quotas. That ought to
  be reimplemented now in pgdatadir_mapping.rs, or perhaps in walingest.rs.

- LSM compaction. The logic for checkpointing and creating image layers is
  very dumb. AFAIK the *read* code could deal with a full-fledged LSM tree
  now consisting of the delta and image layers. But there's no code to
  take a bunch of delta layers and compact them, and the heuristics for
  when to create image layers are pretty dumb.

- The code to track the layers is inefficient. All layers are just stored in
  a vector, and whenever we need to find a layer, we do a linear search in
  it.
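
The linear search mentioned in the last item can be pictured roughly as
follows. This is a simplified sketch: LayerDesc and find_layer are
hypothetical names, and keys and LSNs are reduced to u64 values.

```rust
use std::ops::Range;

/// Hypothetical layer descriptor: the key range and LSN range a file covers.
pub struct LayerDesc {
    pub key_range: Range<u64>,
    pub lsn_range: Range<u64>,
    pub name: String,
}

/// Linear scan over all layers: O(n) per lookup, which is why the commit
/// message calls this inefficient.
pub fn find_layer<'a>(layers: &'a [LayerDesc], key: u64, lsn: u64) -> Option<&'a LayerDesc> {
    layers
        .iter()
        .find(|l| l.key_range.contains(&key) && l.lsn_range.contains(&lsn))
}
```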
2022-03-09 11:36:39 +02:00
Heikki Linnakangas
c7c1e19667 Use more generics, less dyn 2022-03-09 11:36:38 +02:00
Kirill Bulatov
9424bfae22 Use a separate newtype for ZId that (de)serialize as hex strings 2022-03-04 10:58:40 +02:00
Dmitry Rodionov
1d90b1b205 add node id to pageserver (#1310)
* Add --id argument to safekeeper setting its unique u64 id.

In preparation for storage node messaging. IDs are supposed to be monotonically
assigned by the console. In tests they are issued by ZenithEnv; at the zenith cli
level and in fixtures, the string name is completely replaced by an integer id.
Example TOML configs are adjusted accordingly.

Sequential ids are chosen over Zid mainly because they are compact and easy to
type/remember.

* add node id to pageserver

This adds a node id parameter to the pageserver configuration. Also, I use a
simple builder to construct the pageserver config struct to avoid setting
the node id to some temporary invalid value. Some of the changes in test
fixtures are needed to split init and start operations for the environment.

Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2022-03-04 01:10:42 +03:00