Compare commits

..

228 Commits

Author SHA1 Message Date
Arpad Müller
a9e5da9613 wip 2023-09-01 00:19:47 +02:00
Arpad Müller
a5acfdaa5a Use write_and_fsync in save_metadata 2023-09-01 00:19:47 +02:00
Arpad Müller
70052ae1ca Make save_metadata async fn 2023-09-01 00:19:47 +02:00
Arpad Müller
fe69dd9a40 Add write_and_fsync function 2023-09-01 00:19:47 +02:00
Arpad Müller
930eccfcaa Make write_blob and the things it calls async fn 2023-09-01 00:19:47 +02:00
Arpad Müller
29c2381fa5 Move generics on trait into macro
Monomorphization is basically like macro expansion, it just happens at a
later compiler phase.
2023-09-01 00:19:47 +02:00
Arpad Müller
7839cda66a Remove bounds 2023-09-01 00:19:47 +02:00
Arpad Müller
d565df25d6 Make write_all_at async 2023-09-01 00:19:47 +02:00
Christian Schwarz
92b7d7f466 FileBlockReader::fill_buffer make it obvious that we need to switch to async API 2023-09-01 00:19:47 +02:00
Arpad Müller
cfabd8b598 Remove Read impl that was only used in one place 2023-09-01 00:19:47 +02:00
Arpad Müller
4432094443 Move used FileExt functions to inherent impls 2023-09-01 00:19:47 +02:00
Christian Schwarz
83babcce30 FileBlockReader<File> is never used 2023-09-01 00:19:47 +02:00
Arpad Müller
735e20112a Move VirtualFile::seek to inherent function 2023-09-01 00:19:47 +02:00
John Spray
616e7046c7 s3_scrubber: import into the main neon repository (#5141)
## Problem

The S3 scrubber currently lives at
https://github.com/neondatabase/s3-scrubber

We don't have tests that use it, and it has copies of some data
structures that can get stale.

## Summary of changes

- Import the s3-scrubber as `s3_scrubber/
- Replace copied_definitions/ in the scrubber with direct access to the
`utils` and `pageserver` crates
- Modify visibility of a few definitions in `pageserver` to allow the
scrubber to use them
- Update scrubber code for recent changes to `IndexPart`
- Update `KNOWN_VERSIONS` for IndexPart and move the definition into
index.rs so that it is easier to keep up to date

As a future refinement, it would be good to pull the remote persistence
types (like IndexPart) out of `pageserver` into a separate library so
that the scrubber doesn't have to link against the whole pageserver, and
so that it's clearer which types need to be public.

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-08-31 19:01:39 +01:00
Conrad Ludgate
1b916a105a proxy: locked is not retriable (#5162)
## Problem

Management service returns Locked when quotas are exhausted. We cannot
retry on those

## Summary of changes

Makes Locked status unretriable
2023-08-31 15:50:15 +03:00
Conrad Ludgate
d11621d904 Proxy: proxy protocol v2 (#5028)
## Problem

We need to log the client IP, not the IP of the NLB.

## Summary of changes

Parse the proxy [protocol version
2](https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt) if
possible
2023-08-31 14:30:25 +03:00
John Spray
43bb8bfdbb pageserver: fix flake in test_timeline_deletion_with_files_stuck_in_upload_queue (#5149)
## Problem

Test failing on a different ERROR log than it anticipated.

Closes: https://github.com/neondatabase/neon/issues/5148

## Summary of changes

Add the "could not flush frozen layer" error log to the permitted
errors.
2023-08-31 10:42:32 +01:00
John Spray
300a5aa05e pageserver: fix test v4_indexpart_is_parsed (#5157)
## Problem

Two recent PRs raced:
- https://github.com/neondatabase/neon/pull/5153
- https://github.com/neondatabase/neon/pull/5140

## Summary of changes

Add missing `generation` argument to IndexLayerMetadata construction
2023-08-31 10:40:46 +01:00
Nikita Kalyanov
b9c111962f pass JWT to management API (#5151)
support authentication with JWT from env for proxy calls to mgmt API
2023-08-31 12:23:51 +03:00
John Spray
83ae2bd82c pageserver: generation number support in keys and indices (#5140)
## Problem

To implement split brain protection, we need tenants and timelines to be
aware of their current generation, and use it when composing S3 keys.


## Summary of changes

- A `Generation` type is introduced in the `utils` crate -- it is in
this broadly-visible location because it will later be used from
`control_plane/` as well as `pageserver/`. Generations can be a number,
None, or Broken, to support legacy content (None), and Tenants in the
broken state (Broken).
- Tenant, Timeline, and RemoteTimelineClient all get a generation
attribute
- IndexPart's IndexLayerMetadata has a new `generation` attribute.
Legacy layers' metadata will deserialize to Generation::none().
- Remote paths are composed with a trailing generation suffix. If a
generation is equal to Generation::none() (as it currently always is),
then this suffix is an empty string.
- Functions for composing remote storage paths added in
remote_timeline_client: these avoid the way that we currently always
compose a local path and then strip the prefix, and avoid requiring a
PageserverConf reference on functions that want to create remote paths
(the conf is only needed for local paths). These are less DRY than the
old functions, but remote storage paths are a very rarely changing
thing, so it's better to write out our paths clearly in the functions
than to compose timeline paths from tenant paths, etc.
- Code paths that construct a Tenant take a `generation` argument in
anticipation that we will soon load generations on startup before
constructing Tenant.

Until the whole feature is done, we don't want any generation-ful keys
though: so initially we will carry this everywhere with the special
Generation::none() value.

Closes: https://github.com/neondatabase/neon/issues/5135

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-08-31 09:19:34 +01:00
Alexey Kondratov
f2c21447ce [compute_ctl] Create check availability data during full configuration (#5084)
I've moved it to the API handler in the 589cf1ed2 to do not delay
compute start. Yet, we now skip full configuration and catalog updates
in the most hot path -- waking up suspended compute, and only do it at:

- first start
- start with applying new configuration
- start for availability check

so it doesn't really matter anymore.

The problem with creating the table and test record in the API handler
is that someone can fill up timeline till the logical limit. Then it's
suspended and availability check is scheduled, so it fails.

If table + test row are created at the very beginning, we reserve a 8 KB
page for future checks, which theoretically will last almost forever.
For example, my ~1y old branch still has 8 KB sized test table:
```sql
cloud_admin@postgres=# select pg_relation_size('health_check');
 pg_relation_size
------------------
             8192
(1 row)
```

---------

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
2023-08-30 17:44:28 +02:00
Conrad Ludgate
93dcdb293a proxy: password hack hack (#5126)
## Problem

fixes #4881 

## Summary of changes
2023-08-30 16:20:27 +01:00
John Spray
a93274b389 pageserver: remove vestigial timeline_layers attribute (#5153)
## Problem

`timeline_layers` was write-only since
b95addddd5

We deployed the version that no longer requires it for deserializing, so
now we can stop including it when serializing.

## Summary of changes

Fully remove `timeline_layers`.
2023-08-30 16:14:04 +01:00
Anastasia Lubennikova
a7c0e4dcd0 Check if custiom extension is enabled.
This check was lost in the latest refactoring.

If extension is not present in 'public_extensions' or 'custom_extensions' don't download it
2023-08-30 17:47:06 +03:00
Conrad Ludgate
3b81e0c86d chore: remove webpki (#5069)
## Problem

webpki is unmaintained

Closes https://github.com/neondatabase/neon/security/dependabot/33

## Summary of changes

Update all dependents of webpki.
2023-08-30 15:14:03 +01:00
Anastasia Lubennikova
e5a397cf96 Form archive_path for remote extensions on the fly 2023-08-30 13:56:51 +03:00
Joonas Koivunen
05773708d3 fix: add context for ancestor lsn wait (#5143)
In logs it is confusing to see seqwait timeouts which seemingly arise
from the branched lsn but actually are about the ancestor, leading to
questions like "has the last_record_lsn went back".

Noticed by @problame.
2023-08-30 12:21:41 +03:00
John Spray
382473d9a5 docs: add RFC for remote storage generation numbers (#4919)
## Summary

A scheme of logical "generation numbers" for pageservers and their
attachments is proposed, along with
changes to the remote storage format to include these generation numbers
in S3 keys.

Using the control plane as the issuer of these generation numbers
enables strong anti-split-brain
properties in the pageserver cluster without implementing a consensus
mechanism directly
in the pageservers.

## Motivation

Currently, the pageserver's remote storage format does not provide a
mechanism for addressing
split brain conditions that may happen when replacing a node during
failover or when migrating
a tenant from one pageserver to another. From a remote storage
perspective, a split brain condition
occurs whenever two nodes both think they have the same tenant attached,
and both can write to S3. This
can happen in the case of a network partition, pathologically long
delays (e.g. suspended VM), or software
bugs.

This blocks robust implementation of failover from unresponsive
pageservers, due to the risk that
the unresponsive pageserver is still writing to S3.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-08-30 09:49:55 +01:00
Arpad Müller
eb0a698adc Make page cache and read_blk async (#5023)
## Problem

`read_blk` does I/O and thus we would like to make it async. We can't
make the function async as long as the `PageReadGuard` returned by
`read_blk` isn't `Send`. The page cache is called by `read_blk`, and
thus it can't be async without `read_blk` being async. Thus, we have a
circular dependency.

## Summary of changes

Due to the circular dependency, we convert both the page cache and
`read_blk` to async at the same time:

We make the page cache use `tokio::sync` synchronization primitives as
those are `Send`. This makes all the places that acquire a lock require
async though, which we then also do. This includes also asyncification
of the `read_blk` function.

Builds upon #4994, #5015, #5056, and #5129.

Part of #4743.
2023-08-30 09:04:31 +02:00
Arseny Sher
81b6578c44 Allow walsender in recovery mode give WAL till dynamic flush_lsn.
Instead of fixed during the start of replication. To this end, create
term_flush_lsn watch channel similar to commit_lsn one. This allows to continue
recovery streaming if new data appears.
2023-08-29 23:19:40 +03:00
Arseny Sher
bc49c73fee Move wal_stream_connection_config to utils.
It will be used by safekeeper as well.
2023-08-29 23:19:40 +03:00
Arseny Sher
e98580b092 Add term and http endpoint to broker messaged SkTimelineInfo.
We need them for safekeeper peer recovery
https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
804ef23043 Rename TermSwitchEntry to TermLsn.
Add derive Ord for easy comparison of <term, lsn> pairs.

part of https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
87f7d6bce3 Start and stop per timeline recovery task.
Slightly refactors init: now load_tenant_timelines is also async to properly
init the timeline, but to keep global map lock sync we just acquire it anew for
each timeline.

Recovery task itself is just a stub here.

part of
https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
39e3fbbeb0 Add safekeeper peers to TimelineInfo.
Now available under GET /tenant/xxx/timeline/yyy for inspection.
2023-08-29 23:19:40 +03:00
Em Sharnoff
8d2a4aa5f8 vm-monitor: Add flag for when file cache on disk (#5130)
Part 1 of 2, for moving the file cache onto disk.

Because VMs are created by the control plane (and that's where the
filesystem for the file cache is defined), we can't rely on any kind of
synchronization between releases, so the change needs to be
feature-gated (kind of), with the default remaining the same for now.

See also: neondatabase/cloud#6593
2023-08-29 12:44:48 -07:00
Joonas Koivunen
d1fcdf75b3 test: enhanced logging for curious mock_s3 (#5134)
Possible flakyness with mock_s3. Add logging in hopes this will happen
again.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-08-29 14:48:50 +03:00
Alexander Bayandin
7e39a96441 scripts/flaky_tests.py: Improve flaky tests detection (#5094)
## Problem

We still need to rerun some builds manually because flaky tests weren't
detected automatically.
I found two reasons for it:
- If a test is flaky on a particular build type, on a particular
Postgres version, there's a high chance that this test is flaky on all
configurations, but we don't automatically detect such cases.
- We detect flaky tests only on the main branch, which requires manual
retrigger runs for freshly made flaky tests.
Both of them are fixed in the PR.

## Summary of changes
- Spread flakiness of a single test to all configurations
- Detect flaky tests in all branches (not only in the main)
- Look back only at  7 days of test history (instead of 10)
2023-08-29 11:53:24 +01:00
Vadim Kharitonov
babefdd3f9 Upgrade pgvector to 0.5.0 (#5132) 2023-08-29 12:53:50 +03:00
Arpad Müller
805fee1483 page cache: small code cleanups (#5125)
## Problem

I saw these things while working on #5111.

## Summary of changes

* Add a comment explaining why we use `Vec::leak` instead of
`Vec::into_boxed_slice` plus `Box::leak`.
* Add another comment explaining what `valid` is doing, it wasn't very
clear before.
* Add a function `set_usage_count` to not set it directly.
2023-08-29 11:49:04 +03:00
Felix Prasanna
85d6d9dc85 monitor/compute_ctl: remove references to the informant (#5115)
Also added some docs to the monitor :)

Co-authored-by: Em Sharnoff <sharnoff@neon.tech>
2023-08-29 02:59:27 +03:00
Em Sharnoff
e40ee7c3d1 remove unused file 'vm-cgconfig.conf' (#5127)
Honestly no clue why it's still here, should have been removed ages ago.
This is handled by vm-builder now.
2023-08-28 13:04:57 -07:00
Christian Schwarz
0fe3b3646a page cache: don't proactively evict EphemeralFile pages (#5129)
Before this patch, when dropping an EphemeralFile, we'd scan the entire
`slots` to proactively evict its pages (`drop_buffers_for_immutable`).

This was _necessary_ before #4994 because the page cache was a
write-back cache: we'd be deleting the EphemeralFile from disk after,
so, if we hadn't evicted its pages before that, write-back in
`find_victim` wouldhave failed.

But, since #4994, the page cache is a read-only cache, so, it's safe
to keep read-only data cached. It's never going to get accessed again
and eventually, `find_victim` will evict it.

The only remaining advantage of `drop_buffers_for_immutable` over
relying on `find_victim` is that `find_victim` has to do the clock
page replacement iterations until the count reaches 0,
whereas `drop_buffers_for_immutable` can kick the page out right away.

However, weigh that against the cost of `drop_buffers_for_immutable`,
which currently scans the entire `slots` array to find the
EphemeralFile's pages.

Alternatives have been proposed in #5122 and #5128, but, they come
with their own overheads & trade-offs.

Also, the real reason why we're looking into this piece of code is
that we want to make the slots rwlock async in #5023.
Since `drop_buffers_for_immutable` is called from drop, and there
is no async drop, it would be nice to not have to deal with this.

So, let's just stop doing `drop_buffers_for_immutable` and observe
the performance impact in benchmarks.
2023-08-28 20:42:18 +02:00
Em Sharnoff
529f8b5016 compute_ctl: Fix switched vm-monitor args (#5117)
Small switcheroo from #4946.
2023-08-28 14:55:41 +02:00
Joonas Koivunen
fbcd174489 load_layer_map: schedule deletions for any future layers (#5103)
Unrelated fixes noticed while integrating #4938.

- Stop leaking future layers in remote storage
- We schedule extra index_part uploads if layer name to be removed was
not actually present
2023-08-28 10:51:49 +03:00
Felix Prasanna
7b5489a0bb compute_ctl: start pg in cgroup for vms (#4920)
Starts `postgres` in cgroup directly from `compute_ctl` instead of from
`vm-builder`. This is required because the `vm-monitor` cannot be in the
cgroup it is managing. Otherwise, it itself would be frozen when
freezing the cgroup.

Requires https://github.com/neondatabase/cloud/pull/6331, which adds the
`AUTOSCALING` environment variable letting `compute_ctl` know to start
`postgres` in the cgroup.

Requires https://github.com/neondatabase/autoscaling/pull/468, which
prevents `vm-builder` from starting the monitor and putting postgres in
a cgroup. This will require a `VM_BUILDER_VERSION` bump.
2023-08-25 15:59:12 -04:00
Felix Prasanna
40268dcd8d monitor: fix filecache calculations (#5112)
## Problem
An underflow bug in the filecache calculations.

## Summary of changes
Fixed the bug, cleaned up calculations in general.
2023-08-25 13:29:10 -04:00
Vadim Kharitonov
4436c84751 Change codeowners (#5109) 2023-08-25 19:48:16 +03:00
John Spray
b758bf47ca pageserver: refactor TimelineMetadata serialization in IndexPart (#5091)
## Problem

The `metadata_bytes` field of IndexPart required explicit
deserialization & error checking everywhere it was used -- there isn't
anything special about this structure that should prevent it from being
serialized & deserialized along with the rest of the structure.

## Summary of changes

- Implement Serialize and Deserialize for TimelineMetadata
- Replace IndexPart::metadata_bytes with a simpler `metadata`, that can
be used directly.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-08-25 16:16:20 +01:00
Felix Prasanna
024e306f73 monitor: improve logging (#5099) 2023-08-25 10:09:53 -04:00
Alek Westover
f71c82e5de remove obsolete need dependency (#5087) 2023-08-25 09:10:26 -04:00
Conrad Ludgate
faf070f288 proxy: dont return connection pending (#5107)
## Problem

We were returning Pending when a connection had a notice/notification
(introduced recently in #5020). When returning pending, the runtime
assumes you will call `cx.waker().wake()` in order to continue
processing.

We weren't doing that, so the connection task would get stuck

## Summary of changes

Don't return pending. Loop instead
2023-08-25 15:08:45 +03:00
Arpad Müller
8c13296add Remove BlockReader::read_blk in favour of BlockCursor (#5015)
## Problem

We want to make `read_blk` an async function, but outside of
`async_trait`, which allocates, and nightly features, we can't use async
fn's in traits.

## Summary of changes

* Remove all uses of `BlockReader::read_blk` in favour of using block
  cursors, at least where the type of the `BlockReader` is behind a
  generic
* Introduce a `BlockReaderRef` enum that lists all implementors of
  `BlockReader::read_blk`.
* Remove `BlockReader::read_blk` and move its implementations into
  inherent functions on the types instead.

We don't turn `read_blk` into an async fn yet, for that we also need to
modify the page cache. So this is a preparatory PR, albeit an important
one.

Part of #4743.
2023-08-25 12:28:01 +02:00
Felix Prasanna
18537be298 monitor: listen on correct port to accept agent connections (#5100)
## Problem
The previous arguments have the monitor listen on `localhost`, which the
informant can connect to since it's also in the VM, but which the agent
cannot. Also, the port is wrong.

## Summary of changes
Listen on `0.0.0.0:10301`
2023-08-24 17:32:46 -04:00
Felix Prasanna
3128eeff01 compute_ctl: add vm-monitor (#4946)
Co-authored-by: Em Sharnoff <sharnoff@neon.tech>
2023-08-24 15:54:37 -04:00
Arpad Müller
227c87e333 Make EphemeralFile::write_blob function async (#5056)
## Problem

The `EphemeralFile::write_blob` function accesses the page cache
internally. We want to require `async` for these accesses in #5023.

## Summary of changes

This removes the implementaiton of the `BlobWriter` trait for
`EphemeralFile` and turns the `write_blob` function into an inherent
function. We can then make it async as well as the `push_bytes`
function. We move the `SER_BUFFER` thread-local into the
`InMemoryLayerInner` so that the same buffer can be accessed by
different threads as the async is (potentially) moved between threads.

Part of #4743, preparation for #5023.
2023-08-24 19:18:30 +02:00
Alek Westover
e8f9aaf78c Don't use non-existent docker tags (#5096) 2023-08-24 19:45:23 +03:00
Chengpeng Yan
fa74d5649e rename EphmeralFile::size to EphemeralFile::len (#5076)
## Problem
close https://github.com/neondatabase/neon/issues/5034

## Summary of changes
Based on the
[comment](https://github.com/neondatabase/neon/pull/4994#discussion_r1297277922).
Just rename the `EphmeralFile::size` to `EphemeralFile::len`.
2023-08-24 16:41:57 +02:00
Joonas Koivunen
f70871dfd0 internal-devx: pageserver future layers (#5092)
I've personally forgotten why/how can we have future layers during
reconciliation. Adds `#[cfg(feature = "testing")]` logging when we
upload such index_part.json, with a cross reference to where the cleanup
happens.

Latest private slack thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1692879032573809?thread_ts=1692792276.173979&cid=C033RQ5SPDH

Builds upon #5074. Should had been considered on #4837.
2023-08-24 17:22:36 +03:00
Alek Westover
99a1be6c4e remove upload step from neon, it is in private repo now (#5085) 2023-08-24 17:14:40 +03:00
Joonas Koivunen
76aa01c90f refactor: single phase Timeline::load_layer_map (#5074)
Current implementation first calls `load_layer_map`, which loads all
local layers, cleans up files, leave cleaning up stuff to "second
function". Then the "second function" is finally called, it does not do
the cleanup and some of the first functions setup can torn down. "Second
function" is actually both `reconcile_with_remote` and
`create_remote_layers`.

This change makes it a bit more verbose but in one phase with the
following sub-steps:
1. scan the timeline directory
2. delete extra files
    - now including on-demand download files
    - fixes #3660
3. recoincile the two sources of layers (directory, index_part)
4. rename_to_backup future layers, short layers
5. create the remaining as layers

Needed by #4938.

It was also noticed that this is blocking code in an `async fn` so just
do it in a `spawn_blocking`, which should be healthy for our startup
times. Other effects includes hopefully halving of `stat` calls; extra
calls which were not done previously are now done for the future layers.

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
2023-08-24 16:07:40 +03:00
John Spray
3e2f0ffb11 libs: make backoff::retry() take a cancellation token (#5065)
## Problem

Currently, anything that uses backoff::retry will delay the join of its
task by however long its backoff sleep is, multiplied by its max
retries.

Whenever we call a function that sleeps, we should be passing in a
CancellationToken.

## Summary of changes

- Add a `Cancel` type to backoff::retry that wraps a CancellationToken
and an error `Fn` to generate an error if the cancellation token fires.
- In call sites that already run in a `task_mgr` task, use
`shutdown_token()` to provide the token. In other locations, use a dead
`CancellationToken` to satisfy the interface, and leave a TODO to fix it
up when we broaden the use of explicit cancellation tokens.
2023-08-24 14:54:46 +03:00
Arseny Sher
d597e6d42b Track list of walreceivers and their voting/streaming state in shmem.
Also add both walsenders and walreceivers to TimelineStatus (available under
v1/tenant/xxx/timeline/yyy).

Prepares for
https://github.com/neondatabase/neon/pull/4875
2023-08-23 16:04:08 +03:00
Christian Schwarz
71ccb07a43 ci: fix upload-postgres-extensions-to-s3 job (#5063)
This is cherry-picked-then-improved version of release branch commit
4204960942 PR #4861)

The commit

	commit 5f8fd640bf
	Author: Alek Westover <alek.westover@gmail.com>
	Date:   Wed Jul 26 08:24:03 2023 -0400

	    Upload Test Remote Extensions (#4792)

switched to using the release tag instead of `latest`, but,
the `promote-images` job only uploads `latest` to the prod ECR.

The switch to using release tag was good in principle, but,
it broke the release pipeline. So, switch release pipeline
back to using `latest`.

Note that a proper fix should abandon use of `:latest` tag
at all: currently, if a `main` pipeline runs concurrently
with a `release` pipeline, the `release` pipeline may end
up using the `main` pipeline's images.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-08-22 22:45:25 +03:00
Joonas Koivunen
ad8d777c1c refactor: remove is_incremental=true for ImageLayers footgun (#5061)
Accidentially giving is_incremental=true for ImageLayers costs a lot of
debugging time. Removes all API which would allow to do that. They can
easily be restored later *when needed*.

Split off from #4938.
2023-08-22 22:12:05 +03:00
Joonas Koivunen
2f97b43315 build: update tar, get rid of duplicate xattr (#5071)
`tar` recently pushed to 0.4.40. No big changes, but less Cargo.lock and
one less nagging from `cargo-deny`.

The diff:
https://github.com/alexcrichton/tar-rs/compare/0.4.38...0.4.40.
2023-08-22 21:21:44 +03:00
Joonas Koivunen
533a92636c refactor: pre-cleanup Layer, PersistentLayer and impls (#5059)
Remove pub but dead code, move trait methods as inherent methods, remove
unnecessary. Split off from #4938.
2023-08-22 21:14:28 +03:00
Alek Westover
bf303a6575 Trigger workflow in remote (private) repo to build and upload private extensions (#4944) 2023-08-22 13:32:29 -04:00
Christian Schwarz
8cd20485f8 metrics: smgr query time: add a pre-aggregated histogram (#5064)
When doing global queries in VictoriaMetrics, the per-timeline
histograms make us run into cardinality limits.

We don't want to give them up just yet because we don't
have an alternative for drilling down on timeline-specific
performance issues.

So, add a pre-aggregated histogram and add observations to it
whenever we add observations to the per-timeline histogram.

While we're at it, switch to using a strummed enum for the operation
type names.
2023-08-22 20:08:31 +03:00
Joonas Koivunen
933a869f00 refactor: compaction becomes async again (#5058)
#4938 will make on-demand download of layers in compaction possible, so
it's not suitable for our "policy" of no `spawn_blocking(|| ...
Handle::block_on(async { spawn_blocking(...).await })` because this
poses a clear deadlock risk. Nested spawn_blockings are because of the
download using `tokio::fs::File`.

- Remove `spawn_blocking` from caller of `compact_level0_phase1`
- Remove `Handle::block_on` from `compact_level0_phase1` (indentation
change)
- Revert to `AsLayerDesc::layer_desc` usage temporarily (until it
becomes field access in #4938)
2023-08-22 20:03:14 +03:00
Conrad Ludgate
8c6541fea9 chore: add supported targets to deny (#5070)
## Problem

many duplicate windows crates pollute the cargo deny output

## Summary of changes

we don't build those crates, so remove those targets from being checked
2023-08-22 19:44:31 +03:00
Alek Westover
5cf75d92d8 Fix cargo deny errors (#5068)
## Problem
cargo deny lint broken

Links to the CVEs:

[rustsec.org/advisories/RUSTSEC-2023-0052](https://rustsec.org/advisories/RUSTSEC-2023-0052)

[rustsec.org/advisories/RUSTSEC-2023-0053](https://rustsec.org/advisories/RUSTSEC-2023-0053)
One is fixed, the other one isn't so we allow it (for now), to unbreak
CI. Then later we'll try to get rid of webpki in favour of the rustls
fork.

## Summary of changes
```
+ignore = ["RUSTSEC-2023-0052"]
```
2023-08-22 18:41:32 +03:00
Conrad Ludgate
0b001a0001 proxy: remove connections on shutdown (#5051)
## Problem

On shutdown, proxy connections are staying open.

## Summary of changes

Remove the connections on shutdown
2023-08-21 19:20:58 +01:00
Felix Prasanna
4a8bd866f6 bump vm-builder version to v0.16.3 (#5055)
This change to autoscaling allows agents to connect directly to the
monitor, completely removing the informant.
2023-08-21 13:29:16 -04:00
John Spray
615a490239 pageserver: refactor Tenant/Timeline args into structs (#5053)
## Problem

There are some common types that we pass into tenants and timelines as
we construct them, such as remote storage and the broker client.
Currently the list is small, but this is likely to grow -- the deletion
queue PR (#4960) pushed some methods to the point of clippy complaining
they had too many args, because of the extra deletion queue client being
passed around.

There are some shared objects that currently aren't passed around
explicitly because they use a static `once_cell` (e.g.
CONCURRENT_COMPACTIONS), but as we add more resource management and
concurreny control over time, it will be more readable & testable to
pass a type around in the respective Resources object, rather than to
coordinate via static objects. The `Resources` structures in this PR
will make it easier to add references to central coordination functions,
without having to rely on statics.

## Summary of changes

- For `Tenant`, the `broker_client` and `remote_storage` are bundled
into `TenantSharedResources`
- For `Timeline`, the `remote_client` is wrapped into
`TimelineResources`.

Both of these structures will get an additional deletion queue member in
#4960.
2023-08-21 17:30:28 +01:00
John Spray
b95addddd5 pageserver: do not read redundant timeline_layers from IndexPart, so that we can remove it later (#4972)
## Problem

IndexPart contains two redundant lists of layer names: a set of the
names, and then a map of name to metadata.

We already required that all the layers in `timeline_layers` are also in
`layers_metadata`, in `initialize_with_current_remote_index_part`, so if
there were any index_part.json files in the field that relied on these
sets being different, they would already be broken.

## Summary of changes

`timeline_layers` is made private and no longer read at runtime. It is
still serialized, but not deserialized.

`disk_consistent_lsn` is also made private, as this field only exists
for convenience of humans reading the serialized JSON.

This prepares us to entirely remove `timeline_layers` in a future
release, once this change is fully deployed, and therefore no
pageservers are trying to read the field.
2023-08-21 14:29:36 +03:00
Joonas Koivunen
130ccb4b67 Remove initial timeline id troubles (#5044)
I made a mistake when I adding `env.initial_timeline:
Optional[TimelineId]` in the #3839, should had just generated it and
used it to create a specific timeline. This PR fixes those mistakes, and
some extra calling into psql which must be slower than python field
access.
2023-08-20 12:33:19 +03:00
Dmitry Rodionov
9140a950f4 Resume tenant deletion on attach (#5039)
I'm still a bit nervous about attach -> crash case. But it should work.
(unlike case with timeline). Ideally would be cool to cover this with
test.

This continues tradition of adding bool flags for Tenant::set_stopping.
Probably lifecycle project will help with fixing it.
2023-08-20 12:28:50 +03:00
Arpad Müller
a23b0773f1 Fix DeltaLayer dumping (#5045)
## Problem

Before, DeltaLayer dumping (via `cargo run --release -p pagectl --
print-layer-file` ) would crash as one can't call `Handle::block_on` in
an async executor thread.

## Summary of changes

Avoid the problem by using `DeltaLayerInner::load_keys` to load the keys
into RAM (which we already do during compaction), and then load the
values one by one during dumping.
2023-08-19 00:56:03 +02:00
Joonas Koivunen
368ee6c8ca refactor: failpoint support (#5033)
- move them to pageserver which is the only dependant on the crate fail
- "move" the exported macro to the new module
- support at init time the same failpoints as runtime

Found while debugging test failures and making tests more repeatable by
allowing "exit" from pageserver start via environment variables. Made
those changes to `test_gc_cutoff.py`.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-08-19 01:01:44 +03:00
Felix Prasanna
5c6a692cf1 bump VM_BUILDER_VERSION to v0.16.2 (#5031)
A very slight change that allows us to configure the UID of the
neon-postgres cgroup owner. We start postgres in this cgroup so we can
scale it with the cgroups v2 api. Currently, the control plane
overwrites the entrypoint set by `vm-builder`, so `compute_ctl` (and
thus postgres), is not started in the neon-postgres cgroup. Having
`compute_ctl` start postgres in the cgroup should fix this. However, at
the moment appears like it does not have the correct permissions.
Configuring the neon-postgres UID to `postgres` (which is the UID
`compute_ctl` runs under) should hopefully fix this.


See #4920 - the PR to modify `compute_ctl` to start postgres in the
cgorup.
See: neondatabase/autoscaling#480, neondatabase/autoscaling#477. Both
these PR's are part of an effort to increase `vm-builder`'s
configurability and allow us to adjust it as we integrate in the
monitor.
2023-08-18 14:29:20 -04:00
Dmitry Rodionov
30888a24d9 Avoid flakiness in test_timeline_delete_fail_before_local_delete (#5032)
The problem was that timeline detail can return timelines in not only
active state. And by the time request comes timeline deletion can still
be in progress if we're unlucky (test execution happened to be slower
for some reason)

Reference for failed test run

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5022/5891420105/index.html#suites/f588e0a787c49e67b29490359c589fae/dab036e9bd673274

The error was `Exception: detail succeeded (it should return 404)`

reported by @koivunej
2023-08-18 20:49:11 +03:00
Dmitry Rodionov
f6c671c140 resume timeline deletions on attach (#5030)
closes [#5036](https://github.com/neondatabase/neon/issues/5036)
2023-08-18 20:48:33 +03:00
Christian Schwarz
ed5bce7cba rfcs: archive my MVCC S3 Notion Proposal (#5040)
This is a copy from the [original Notion
page](https://www.notion.so/neondatabase/Proposal-Pageserver-MVCC-S3-Storage-8a424c0c7ec5459e89d3e3f00e87657c?pvs=4),
taken on 2023-08-16.

This is for archival mostly.
The RFC that we're likely to go with is
https://github.com/neondatabase/neon/pull/4919.
2023-08-18 19:34:29 +02:00
Christian Schwarz
7a63685cde simplify page-caching of EphemeralFile (#4994)
(This PR is the successor of https://github.com/neondatabase/neon/pull/4984 )

## Summary

The current way in which `EphemeralFile` uses `PageCache` complicates
the Pageserver code base to a degree that isn't worth it.
This PR refactors how we cache `EphemeralFile` contents, by exploiting
the append-only nature of `EphemeralFile`.

The result is that `PageCache` only holds `ImmutableFilePage` and
`MaterializedPage`.
These types of pages are read-only and evictable without write-back.
This allows us to remove the writeback code from `PageCache`, also
eliminating an entire failure mode.

Futher, many great open-source libraries exist to solve the problem of a
read-only cache,
much better than our `page_cache.rs` (e.g., better replacement policy,
less global locking).
With this PR, we can now explore using them.

## Problem & Analysis

Before this PR, `PageCache` had three types of pages:

* `ImmutableFilePage`: caches Delta / Image layer file contents
* `MaterializedPage`: caches results of Timeline::get (page
materialization)
* `EphemeralPage`: caches `EphemeralFile` contents

`EphemeralPage` is quite different from `ImmutableFilePage` and
`MaterializedPage`:

* Immutable and materialized pages are for the acceleration of (future)
reads of the same data using `PAGE_CACHE_SIZE * PAGE_SIZE` bytes of
DRAM.
* Ephemeral pages are a write-back cache of `EphemeralFile` contents,
i.e., if there is pressure in the page cache, we spill `EphemeralFile`
contents to disk.

`EphemeralFile` is only used by `InMemoryLayer`, for the following
purposes:
* **write**: when filling up the `InMemoryLayer`, via `impl BlobWriter
for EphemeralFile`
* **read**: when doing **page reconstruction** for a page@lsn that isn't
written to disk
* **read**: when writing L0 layer files, we re-read the `InMemoryLayer`
and put the contents into the L0 delta writer
(**`create_delta_layer`**). This happens every 10min or when
InMemoryLayer reaches 256MB in size.

The access patterns of the `InMemoryLayer` use case are as follows:

* **write**: via `BlobWriter`, strictly append-only
* **read for page reconstruction**: via `BlobReader`, random
* **read for `create_delta_layer`**: via `BlobReader`, dependent on
data, but generally random. Why?
* in classical LSM terms, this function is what writes the
memory-resident `C0` tree into the disk-resident `C1` tree
* in our system, though, the values of InMemoryLayer are stored in an
EphemeralFile, and hence they are not guaranteed to be memory-resident
* the function reads `Value`s in `Key, LSN` order, which is `!=` insert
order

What do these `EphemeralFile`-level access patterns mean for the page
cache?

* **write**:
* the common case is that `Value` is a WAL record, and if it isn't a
full-page-image WAL record, then it's smaller than `PAGE_SIZE`
* So, the `EphemeralPage` pages act as a buffer for these `< PAGE_CACHE`
sized writes.
* If there's no page cache eviction between subsequent
`InMemoryLayer::put_value` calls, the `EphemeralPage` is still resident,
so the page cache avoids doing a `write` system call.
* In practice, a busy page server will have page cache evictions because
we only configure 64MB of page cache size.
* **reads for page reconstruction**: read acceleration, just as for the
other page types.
* **reads for `create_delta_layer`**:
* The `Value` reads happen through a `BlockCursor`, which optimizes the
case of repeated reads from the same page.
* So, the best case is that subsequent values are located on the same
page; hence `BlockCursor`s buffer is maximally effective.
* The worst case is that each `Value` is on a different page; hence the
`BlockCursor`'s 1-page-sized buffer is ineffective.
* The best case translates into `256MB/PAGE_SIZE` page cache accesses,
one per page.
    * the worst case translates into `#Values` page cache accesses
* again, the page cache accesses must be assumed to be random because
the `Value`s aren't accessed in insertion order but `Key, LSN` order.

## Summary of changes

Preliminaries for this PR were:

- #5003
- #5004 
- #5005 
  - uncommitted microbenchmark in #5011 

Based on the observations outlined above, this PR makes the following
changes:

* Rip out `EphemeralPage` from `page_cache.rs`
* Move the `block_io::FileId` to `page_cache::FileId`
* Add a `PAGE_SIZE`d buffer to the `EphemeralPage` struct.
  It's called `mutable_tail`.
* Change `write_blob` to use `mutable_tail` for the write buffering
instead of a page cache page.
* if `mutable_tail` is full, it writes it out to disk, zeroes it out,
and re-uses it.
* There is explicitly no double-buffering, so that memory allocation per
`EphemeralFile` instance is fixed.
* Change `read_blob` to return different `BlockLease` variants depending
on `blknum`
* for the `blknum` that corresponds to the `mutable_tail`, return a ref
to it
* Rust borrowing rules prevent `write_blob` calls while refs are
outstanding.
  * for all non-tail blocks, return a page-cached `ImmutablePage`
* It is safe to page-cache these as ImmutablePage because EphemeralFile
is append-only.

## Performance 

How doe the changes above affect performance?
M claim is: not significantly.

* **write path**:
* before this PR, the `EphemeralFile::write_blob` didn't issue its own
`write` system calls.
* If there were enough free pages, it didn't issue *any* `write` system
calls.
* If it had to evict other `EphemeralPage`s to get pages a page for its
writes (`get_buf_for_write`), the page cache code would implicitly issue
the writeback of victim pages as needed.
* With this PR, `EphemeralFile::write_blob` *always* issues *all* of its
*own* `write` system calls.
* Also, the writes are explicit instead of implicit through page cache
write back, which will help #4743
* The perf impact of always doing the writes is the CPU overhead and
syscall latency.
* Before this PR, we might have never issued them if there were enough
free pages.
* We don't issue `fsync` and can expect the writes to only hit the
kernel page cache.
* There is also an advantage in issuing the writes directly: the perf
impact is paid by the tenant that caused the writes, instead of whatever
tenant evicts the `EphemeralPage`.
* **reads for page reconstruction**: no impact.
* The `write_blob` function pre-warms the page cache when it writes the
`mutable_tail` to disk.
* So, the behavior is the same as with the EphemeralPages before this
PR.
* **reads for `create_delta_layer`**: no impact.
  * Same argument as for page reconstruction.
  * Note for the future:
* going through the page cache likely causes read amplification here.
Why?
* Due to the `Key,Lsn`-ordered access pattern, we don't read all the
values in the page before moving to the next page. In the worst case, we
might read the same page multiple times to read different `Values` from
it.
    * So, it might be better to bypass the page cache here.
    * Idea drafts:
      * bypass PS page cache + prefetch pipeline + iovec-based IO
* bypass PS page cache + use `copy_file_range` to copy from ephemeral
file into the L0 delta file, without going through user space
2023-08-18 20:31:03 +03:00
Joonas Koivunen
0a082aee77 test: allow race with flush and stopped queue (#5027)
A lucky race can happen with the shutdown order I guess right now. Seen
in [test_tenant_delete_smoke].

The message is not the greatest to match against.

[test_tenant_delete_smoke]:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/5892262320/index.html#suites/3556ed71f2d69272a7014df6dcb02317/189a0d1245fb5a8c
2023-08-18 19:36:25 +03:00
Arthur Petukhovsky
0b90411380 Fix safekeeper recovery with auth (#5035)
Fix missing a password in walrcv_connect for a safekeeper recovery. Add
a test which restarts endpoint and triggers a recovery.
2023-08-18 16:48:55 +01:00
Arpad Müller
f4da010aee Make the compaction warning more tolerant (#5024)
## Problem

The performance benchmark in `test_runner/performance/test_layer_map.py`
is currently failing due to the warning added in #4888.

## Summary of changes

The test mentioned has a `compaction_target_size` of 8192, which is just
one page size. This is an unattainable goal, as we generate at least
three pages: one for the header, one for the b-tree (minimally sized
ones have just the root node in a single page), one for the data.

Therefore, we add two pages to the warning limit. The warning text
becomes a bit less accurate but I think this is okay.
2023-08-18 16:36:31 +02:00
Conrad Ludgate
ec10838aa4 proxy: pool connection logs (#5020)
## Problem

Errors and notices that happen during a pooled connection lifecycle have
no session identifiers

## Summary of changes

Using a watch channel, we set the session ID whenever it changes. This
way we can see the status of a connection for that session

Also, adding a connection id to be able to search the entire connection
lifecycle
2023-08-18 11:44:08 +01:00
Joonas Koivunen
67af24191e test: cleanup remote_timeline_client tests (#5013)
I will have to change these as I change remote_timeline_client api in
#4938. So a bit of cleanup, handle my comments which were just resolved
during initial review.

Cleanup:
- use unwrap in tests instead of mixed `?` and `unwrap`
- use `Handle` instead of `&'static Reactor` to make the
RemoteTimelineClient more natural
- use arrays in tests
- use plain `#[tokio::test]`
2023-08-17 19:27:30 +03:00
Joonas Koivunen
6af5f9bfe0 fix: format context (#5022)
We return an error with unformatted `{timeline_id}`.
2023-08-17 14:30:25 +00:00
Dmitry Rodionov
64fc7eafcd Increase timeout once again. (#5021)
When failpoint is early in deletion process it takes longer to complete
after failpoint is removed.

Example was: https://neon-github-public-dev.s3.amazonaws.com/reports/main/5889544346/index.html#suites/3556ed71f2d69272a7014df6dcb02317/49826c68ce8492b1
2023-08-17 15:37:28 +03:00
Conrad Ludgate
3e4710c59e proxy: add more sasl logs (#5012)
## Problem

A customer is having trouble connecting to neon from their production
environment. The logs show a mix of "Internal error" and "authentication
protocol violation" but not the full error

## Summary of changes

Make sure we don't miss any logs during SASL/SCRAM
2023-08-17 12:05:54 +01:00
Dmitry Rodionov
d8b0a298b7 Do not attach deleted tenants (#5008)
Rather temporary solution before proper:
https://github.com/neondatabase/neon/issues/5006

It requires more plumbing so lets not attach deleted tenants first and
then implement resume.

Additionally fix `assert_prefix_empty`. It had a buggy prefix calculation,
and since we always asserted for absence of stuff it worked. Here I
started to assert for presence of stuff too and it failed. Added more
"presence" asserts to other places to be confident that it works.

Resolves [#5016](https://github.com/neondatabase/neon/issues/5016)
2023-08-17 13:46:49 +03:00
Alexander Bayandin
c8094ee51e test_compatibility: run amcheck unconditionally (#4985)
## Problem

The previous version of neon (that we use in the forward compatibility test)
has installed `amcheck` extension now. We can run `pg_amcheck`
unconditionally.

## Summary of changes
- Run `pg_amcheck` in compatibility tests unconditionally
2023-08-17 11:46:00 +01:00
Christian Schwarz
957af049c2 ephemeral file: refactor write_blob impl to concentrate mutable state (#5004)
Before this patch, we had the `off` and `blknum` as function-wide
mutable state. Now it's contained in the `Writer` struct.

The use of `push_bytes` instead of index-based filling of the buffer
also makes it easier to reason about what's going on.

This is prep for https://github.com/neondatabase/neon/pull/4994
2023-08-17 13:07:25 +03:00
Anastasia Lubennikova
786c7b3708 Refactor remote extensions index download.
Don't download ext_index.json from s3, but instead receive it as a part of spec from control plane.
This eliminates s3 access for most compute starts,
and also allows us to update extensions spec on the fly
2023-08-17 12:48:33 +03:00
Joonas Koivunen
d3612ce266 delta_layer: Restore generic from last week (#5014)
Restores #4937 work relating to the ability to use `ResidentDeltaLayer`
(which is an Arc wrapper) in #4938 for the ValueRef's by removing the
borrow from `ValueRef` and providing it from an upper layer.

This should not have any functional changes, most importantly, the
`main` will continue to use the borrowed `DeltaLayerInner`. It might be
that I can change #4938 to be like this. If that is so, I'll gladly rip
out the `Ref` and move the borrow back. But I'll first want to look at
the current test failures.
2023-08-17 11:47:31 +03:00
Christian Schwarz
994411f5c2 page cache: newtype the blob_io and ephemeral_file file ids (#5005)
This makes it more explicit that these are different u64-sized
namespaces.
Re-using one in place of the other would be catastrophic.

Prep for https://github.com/neondatabase/neon/pull/4994
which will eliminate the ephemeral_file::FileId and move the
blob_io::FileId into page_cache.
It makes sense to have this preliminary commit though,
to minimize amount of new concept in #4994 and other
preliminaries that depend on that work.
2023-08-16 18:33:47 +02:00
Conrad Ludgate
25934ec1ba proxy: reduce global conn pool contention (#4747)
## Problem

As documented, the global connection pool will be high contention.

## Summary of changes

Use DashMap rather than Mutex<HashMap>.

Of note, DashMap currently uses a RwLock internally, but it's partially
sharded to reduce contention by a factor of N. We could potentially use
flurry which is a port of Java's concurrent hashmap, but I have no good
understanding of it's performance characteristics. Dashmap is at least
equivalent to hashmap but less contention.

See the read heavy benchmark to analyse our expected performance
<https://github.com/xacrimon/conc-map-bench#ready-heavy>

I also spoke with the developer of dashmap recently, and they are
working on porting the implementation to use concurrent HAMT FWIW
2023-08-16 17:20:28 +01:00
Arpad Müller
0bdbc39cb1 Compaction: unify key and value reference vecs (#4888)
## Problem

PR #4839 has already reduced the number of b-tree traversals and vec
creations from 3 to 2, but as pointed out in
https://github.com/neondatabase/neon/pull/4839#discussion_r1279167815 ,
we would ideally just traverse the b-tree once during compaction.

Afer #4836, the two vecs created are one for the list of keys, lsns and
sizes, and one for the list of `(key, lsn, value reference)`. However,
they are not equal, as pointed out in
https://github.com/neondatabase/neon/pull/4839#issuecomment-1660418012
and the following comment: the key vec creation combines multiple
entries for which the lsn is changing but the key stays the same into
one, with the size being the sum of the sub-sizes. In SQL, this would
correspond to something like `SELECT key, lsn, SUM(size) FROM b_tree
GROUP BY key;` and `SELECT key, lsn, val_ref FROM b_tree;`. Therefore,
the join operation is non-trivial.

## Summary of changes

This PR merges the two lists of keys and value references into one. It's
not a trivial change and affects the size pattern of the resulting
files, which is why this is in a separate PR from #4839 .

The key vec is used in compaction for determining when to start a new
layer file. The loop uses various thresholds to come to this conclusion,
but the grouping via the key has led to the behaviour that regardless of
the threshold, it only starts a new file when either a new key is
encountered, or a new delta file.

The new code now does the combination after the merging and sorting of
the various keys from the delta files. This *mostly* does the same as
the old code, except for a detail: with the grouping done on a
per-delta-layer basis, the sorted and merged vec would still have
multiple entries for multiple delta files, but now, we don't have an
easy way to tell when a new input delta layer file is encountered, so we
cannot create multiple entries on that basis easily.

To prevent possibly infinite growth, our new grouping code compares the
combined size with the threshold, and if it is exceeded, it cuts a new
entry so that the downstream code can cut a new output file. Here, we
perform a tradeoff however, as if the threshold is too small, we risk
putting entries for the same key into multiple layer files, but if the
threshold is too big, we can in some instances exceed the target size.

Currently, we set the threshold to the target size, so in theory we
would stay below or roughly at double the `target_file_size`.

We also fix the way the size was calculated for the last key. The calculation
was wrong and accounted for the old layer's btree, even though we
already account for the overhead of the in-construction btree.

Builds on top of #4839 .
2023-08-16 18:27:18 +03:00
Dmitry Rodionov
96b84ace89 Correctly remove orphaned objects in RemoteTimelineClient::delete_all (#5000)
Previously list_prefixes was incorrectly used for that purpose. Change
to use list_files. Add a test.

Some drive by refactorings on python side to move helpers out of
specific test file to be widely accessible

resolves https://github.com/neondatabase/neon/issues/4499
2023-08-16 17:31:16 +03:00
Christian Schwarz
368b783ada ephemeral_file: remove FileExt impl (was only used by tests) (#5003)
Extracted from https://github.com/neondatabase/neon/pull/4994
2023-08-16 15:41:25 +02:00
Dmitry Rodionov
0f47bc03eb Fix delete_objects in UnreliableWrapper (#5002)
For `delete_objects` it was injecting failures for whole delete_objects operation
and then for every delete it contains. Make it fail once for the whole operation.
2023-08-16 14:08:53 +03:00
Arseny Sher
fdbe8dc8e0 Fix test_s3_wal_replay flakiness.
ref https://github.com/neondatabase/neon/issues/4466
2023-08-16 12:57:43 +03:00
Arthur Petukhovsky
1b97a3074c Disable neon-pool-opt-in (#4995) 2023-08-15 20:57:56 +03:00
John Spray
5c836ee5b4 tests: extend timeout in timeline deletion test (#4992)
## Problem

This was set to 5 seconds, which was very close to how long a compaction
took on my workstation, and when deletion is blocked on compaction the
test would fail.

We will fix this to make compactions drop out on deletion, but for the
moment let's stabilize the test.

## Summary of changes

Change timeout on timeline deletion in
`test_timeline_deletion_with_files_stuck_in_upload_queue` from 5 seconds
to 30 seconds.
2023-08-15 20:14:03 +03:00
Arseny Sher
4687b2e597 Test that auth on pg/http services can be enabled separately in sks.
To this end add
1) -e option to 'neon_local safekeeper start' command appending extra options
   to safekeeper invocation;
2) Allow multiple occurrences of the same option in safekeepers, the last
   value is taken.
3) Allow to specify empty string for *-auth-public-key-path opts, it
   disables auth for the service.
2023-08-15 19:31:20 +03:00
Arseny Sher
13adc83fc3 Allow to enable http/pg/pg tenant only auth separately in safekeeper.
The same option enables auth and specifies public key, so this allows to use
different public keys as well. The motivation is to
1) Allow to e.g. change pageserver key/token without replacing all compute
  tokens.
2) Enable auth gradually.
2023-08-15 19:31:20 +03:00
Dmitry Rodionov
52c2c69351 fsync directory before mark file removal (#4986)
## Problem

Deletions can be possibly reordered. Use fsync to avoid the case when
mark file doesnt exist but other tenant/timeline files do.

See added comments.

resolves #4987
2023-08-15 19:24:23 +03:00
Alexander Bayandin
207919f5eb Upload test results to DB right after generation (#4967)
## Problem

While adding new test results format, I've also changed the way we
upload Allure reports to S3
(722c7956bb)
to avoid duplicated results from previous runs. But it broke links at
earlier results (results are still available but on different URLs).

This PR fixes this (by reverting logic in
722c7956bb
changes), and moves the logic for storing test results into db to allure
generate step. It allows us to avoid test results duplicates in the db
and saves some time on extra s3 downloads that happened in a different
job before the PR.

Ref https://neondb.slack.com/archives/C059ZC138NR/p1691669522160229

## Summary of changes
- Move test results storing logic from a workflow to
`actions/allure-report-generate`
2023-08-15 15:32:30 +01:00
George MacKerron
218be9eb32 Added deferrable transaction option to http batch queries (#4993)
## Problem

HTTP batch queries currently allow us to set the isolation level and
read only, but not deferrable.

## Summary of changes

Add support for deferrable.

Echo deferrable status in response headers only if true.

Likewise, now echo read-only status in response headers only if true.
2023-08-15 14:52:00 +01:00
Joonas Koivunen
8198b865c3 Remote storage metrics follow-up (#4957)
#4942 left old metrics in place for migration purposes. It was noticed
that from new metrics the total number of deleted objects was forgotten,
add it.

While reviewing, it was noticed that the delete_object could just be
delete_objects of one.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-08-15 12:30:27 +03:00
Arpad Müller
baf395983f Turn BlockLease associated type into an enum (#4982)
## Problem

The `BlockReader` trait is not ready to be asyncified, as associated
types are not supported by asyncification strategies like via the
`async_trait` macro, or via adopting enums.

## Summary of changes

Remove the `BlockLease` associated type from the `BlockReader` trait and
turn it into an enum instead, bearing the same name. The enum has two
variants, one of which is gated by `#[cfg(test)]`. Therefore, outside of
test settings, the enum has zero overhead over just having the
`PageReadGuard`. Using the enum allows us to impl `BlockReader` without
needing the page cache.

Part of https://github.com/neondatabase/neon/issues/4743
2023-08-14 18:48:09 +02:00
Arpad Müller
ce7efbe48a Turn BlockCursor::{read_blob,read_blob_into_buf} async fn (#4905)
## Problem

The `BlockCursor::read_blob` and `BlockCursor::read_blob_into_buf`
functions are calling `read_blk` internally, so if we want to make that
function async fn, they need to be async themselves.

## Summary of changes

* We first turn `ValueRef::load` into an async fn.
* Then, we switch the `RwLock` implementation in `InMemoryLayer` to use
the one from `tokio`.
* Last, we convert the `read_blob` and `read_blob_into_buf` functions
into async fn.

In three instances we use `Handle::block_on`:

* one use is in compaction code, which currently isn't async. We put the
entire loop into an `async` block to prevent the potentially hot loop
from doing cross-thread operations.
* one use is in dumping code for `DeltaLayer`. The "proper" way to
address this would be to enable the visit function to take async
closures, but then we'd need to be generic over async fs non async,
which [isn't supported by rust right
now](https://blog.rust-lang.org/inside-rust/2022/07/27/keyword-generics.html).
The other alternative would be to do a first pass where we cache the
data into memory, and only then to dump it.
* the third use is in writing code, inside a loop that copies from one
file to another. It is is synchronous and we'd like to keep it that way
(for now?).

Part of #4743
2023-08-14 17:20:37 +02:00
Tristan Partin
ef4a76c01e Update Postgres to v15.4 and v14.9 (#4965) 2023-08-14 16:19:45 +01:00
George MacKerron
1ca08cc523 Changed batch query body to from [...] to { queries: [...] } (#4975)
## Problem

It's nice if `single query : single response :: batch query : batch
response`.

But at present, in the single case we send `{ query: '', params: [] }`
and get back a single `{ rows: [], ... }` object, while in the batch
case we send an array of `{ query: '', params: [] }` objects and get
back not an array of `{ rows: [], ... }` objects but a `{ results: [ {
rows: [] , ... }, { rows: [] , ... }, ... ] }` object instead.

## Summary of changes

With this change, the batch query body becomes `{ queries: [{ query: '',
params: [] }, ... ] }`, which restores a consistent relationship between
the request and response bodies.
2023-08-14 16:07:33 +01:00
Dmitry Rodionov
4626d89eda Harden retries on tenant/timeline deletion path. (#4973)
Originated from test failure where we got SlowDown error from s3.
The patch generalizes `download_retry` to not be download specific.
Resulting `retry` function is moved to utils crate. `download_retries`
is now a thin wrapper around this `retry` function.

To ensure that all needed retries are in place test code now uses
`test_remote_failures=1` setting.

Ref https://neondb.slack.com/archives/C059ZC138NR/p1691743624353009
2023-08-14 17:16:49 +03:00
Arseny Sher
49c57c0b13 Add neon_local to docker image.
People sometimes ask about this.

https://community.neon.tech/t/is-the-neon-local-binary-in-any-of-the-official-docker-images/360/2
2023-08-14 14:08:51 +03:00
John Spray
d3a97fdf88 pageserver: avoid incrementing access time when reading layers for compaction (#4971)
## Problem

Currently, image generation reads delta layers before writing out
subsequent image layers, which updates the access time of the delta
layers and effectively puts them at the back of the queue for eviction.
This is the opposite of what we want, because after a delta layer is
covered by a later image layer, it's likely that subsequent reads of
latest data will hit the image rather than the delta layer, so the delta
layer should be quite a good candidate for eviction.

## Summary of changes

`RequestContext` gets a new `ATimeBehavior` field, and a
`RequestContextBuilder` helper so that we can optionally add the new
field without growing `RequestContext::new` every time we add something
like this.

Request context is passed into the `record_access` function, and the
access time is not updated if `ATimeBehavior::Skip` is set.

The compaction background task constructs its request context with this
skip policy.

Closes: https://github.com/neondatabase/neon/issues/4969
2023-08-14 10:18:22 +01:00
Arthur Petukhovsky
763f5c0641 Remove dead code from walproposer_utils.c (#4525)
This code was mostly copied from walsender.c and the idea was to keep it
similar to walsender.c, so that we can easily copy-paste future upstream
changes to walsender.c to waproposer_utils.c, too. But right now I see that
deleting it doesn't break anything, so it's better to remove unused parts.
2023-08-14 09:49:51 +01:00
Arseny Sher
8173813584 Add term=n option to safekeeper START_REPLICATION command.
It allows term leader to ensure he pulls data from the correct term. Absense of
it wasn't very problematic due to CRC checks, but let's be strict.

walproposer still doesn't use it as we're going to remove recovery completely
from it.
2023-08-12 12:20:13 +03:00
Felix Prasanna
cc2d00fea4 bump vm-builder version to v0.15.4 (#4980)
Patches a bug in vm-builder where it did not include enough parameters
in the query string. These parameters are `host=localhost port=5432`.
These parameters were not necessary for the monitor because the `pq` go
postgres driver included them by default.
2023-08-11 14:26:53 -04:00
Arpad Müller
9ffccb55f1 InMemoryLayer: move end_lsn out of the lock (#4963)
## Problem

In some places, the lock on `InMemoryLayerInner` is only created to
obtain `end_lsn`. This is not needed however, if we move `end_lsn` to
`InMemoryLayer` instead.

## Summary of changes

Make `end_lsn` a member of `InMemoryLayer`, and do less locking of
`InMemoryLayerInner`. `end_lsn` is changed from `Option<Lsn>` into an
`OnceLock<Lsn>`. Thanks to this change, we don't need to lock any more
in three functions.

Part of #4743 . Suggested in
https://github.com/neondatabase/neon/pull/4905#issuecomment-1666458428 .
2023-08-11 18:01:02 +02:00
Arthur Petukhovsky
3a6b99f03c proxy: improve http logs (#4976)
Fix multiline logs on websocket errors and always print sql-over-http
errors sent to the user.
2023-08-11 18:18:07 +03:00
Dmitry Rodionov
d39fd66773 tests: remove redundant wait_while (#4952)
Remove redundant `wait_while` in tests. It had only one usage. Use
`wait_tenant_status404`.

Related:
https://github.com/neondatabase/neon/pull/4855#discussion_r1289610641
2023-08-11 10:18:13 +03:00
Arthur Petukhovsky
73d7a9bc6e proxy: propagate ws span (#4966)
Found this log on staging:
```
2023-08-10T17:42:58.573790Z  INFO handling interactive connection from client protocol="ws"
```

We seem to be losing websocket span in spawn, this patch fixes it.
2023-08-10 23:38:22 +03:00
Sasha Krassovsky
3a71cf38c1 Grant BypassRLS to new neon_superuser roles (#4935) 2023-08-10 21:04:45 +02:00
Conrad Ludgate
25c66dc635 proxy: http logging to 11 (#4950)
## Problem

Mysterious network issues

## Summary of changes

Log a lot more about HTTP/DNS in hopes of detecting more of the network
errors
2023-08-10 17:49:24 +01:00
George MacKerron
538373019a Increase max sql-over-http response size from 1MB to 10MB (#4961)
## Problem

1MB response limit is very small.

## Summary of changes

This data is not yet tracked, so we shoudn't raise the limit too high yet. 
But as discussed with @kelvich and @conradludgate, this PR lifts it to 
10MB, and adds also details of the limit to the error response.
2023-08-10 17:21:52 +01:00
Dmitry Rodionov
c58b22bacb Delete tenant's data from s3 (#4855)
## Summary of changes

For context see
https://github.com/neondatabase/neon/blob/main/docs/rfcs/022-pageserver-delete-from-s3.md

Create Flow to delete tenant's data from pageserver. The approach
heavily mimics previously implemented timeline deletion implemented
mostly in https://github.com/neondatabase/neon/pull/4384 and followed up
in https://github.com/neondatabase/neon/pull/4552

For remaining deletion related issues consult with deletion project
here: https://github.com/orgs/neondatabase/projects/33

resolves #4250
resolves https://github.com/neondatabase/neon/issues/3889

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-08-10 18:53:16 +03:00
Alek Westover
17aea78aa7 delete already present files from library index (#4955) 2023-08-10 16:51:16 +03:00
Joonas Koivunen
71f9d9e5a3 test: allow slow shutdown warning (#4953)
Introduced in #4886, did not consider that tests with real_s3 could
sometimes go over the limit. Do not fail tests because of that.
2023-08-10 15:55:41 +03:00
Alek Westover
119b86480f test: make pg_regress less flaky, hopefully (#4903)
`pg_regress` is flaky: https://github.com/neondatabase/neon/issues/559

Consolidated `CHECKPOINT` to `check_restored_datadir_content`, add a
wait for `wait_for_last_flush_lsn`.

Some recently introduced flakyness was fixed with #4948.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-08-10 15:24:43 +03:00
Arpad Müller
fa1f87b268 Make the implementation of DiskBtreeReader::visit non-recursive (#4884)
## Problem

The `DiskBtreeReader::visit` function calls `read_blk` internally, and
while #4863 converted the API of `visit` to async, the internal function
is still recursive. So, analogously to #4838, we turn the recursive
function into an iterative one.

## Summary of changes

First, we prepare the change by moving the for loop outside of the case
switch, so that we only have one loop that calls recursion. Then, we
switch from using recursion to an approach where we store the search
path inside the tree on a stack on the heap.

The caller of the `visit` function can control when the search over the
B-Tree ends, by returning `false` from the closure. This is often used
to either only find one specific entry (by always returning `false`),
but it is also used to iterate over all entries of the B-tree (by always
returning `true`), or to look for ranges (mostly in tests, but
`get_value_reconstruct_data` also has such a use).

Each stack entry contains two things: the block number (aka the block's
offset), and a children iterator. The children iterator is constructed
depending on the search direction, and with the results of a binary
search over node's children list. It is the only thing that survives a
spilling/push to the stack, everything else is reconstructed. In other
words, each stack spill, will, if the search is still ongoing, cause an
entire re-parsing of the node. Theoretically, this would be a linear
overhead in the number of leaves the search visits. However, one needs
to note:

* the workloads to look for a specific entry are just visiting one leaf,
ever, so this is mostly about workloads that visit larger ranges,
including ones that visit the entire B-tree.
* the requests first hit the page cache, so often the cost is just in
terms of node deserialization
* for nodes that only have leaf nodes as children, no spilling to the
stack-on-heap happens (outside of the initial request where the iterator
is `None`). In other words, for balanced trees, the spilling overhead is
$\Theta\left(\frac{n}{b^2}\right)$, where `b` is the branching factor
and `n` is the number of nodes in the tree. The B-Trees in the current
implementation have a branching factor of roughly `PAGE_SZ/L` where
`PAGE_SZ` is 8192, and `L` is `DELTA_KEY_SIZE = 26` or `KEY_SIZE = 18`
in production code, so this gives us an estimate that we'd be re-loading
an inner node for every 99000 leaves in the B-tree in the worst case.

Due to these points above, I'd say that not fully caching the inner
nodes with inner children is reasonable, especially as we also want to
be fast for the "find one specific entry" workloads, where the stack
content is never accessed: any action to make the spilling
computationally more complex would contribute to wasted cycles here,
even if these workloads "only" spill one node for each depth level of
the b-tree (which is practically always a low single-digit number,
Kleppmann points out on page 81 that for branching factor 500, a four
level B-tree with 4 KB pages can store 250 TB of data).

But disclaimer, this is all stuff I thought about in my head, I have not
confirmed it with any benchmarks or data.

Builds on top of #4863, part of #4743
2023-08-10 13:43:13 +02:00
Joonas Koivunen
db48f7e40d test: mark test_download_extensions.py skipped for now (#4948)
The test mutates a shared directory which does not work with multiple
concurrent tests. It is being fixed, so this should be a very temporary
band-aid.

Cc: #4949.
2023-08-10 11:05:27 +00:00
Alek Westover
e157b16c24 if control file already exists ignore the remote version of the extension (#4945) 2023-08-09 18:56:09 +00:00
bojanserafimov
94ad9204bb Measure compute-pageserver latency (#4901)
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-08-09 13:20:30 -04:00
Joonas Koivunen
c8aed107c5 refactor: make {Delta,Image}LayerInners usable without {Delta,Image}Layer (#4937)
On the quest of #4745, these are more related to the task at hand, but
still small. In addition to $subject, allow
`ValueRef<ResidentDeltaLayer>`.
2023-08-09 19:18:44 +03:00
Anastasia Lubennikova
da128a509a fix pkglibdir path for remote extensions 2023-08-09 19:13:11 +03:00
Alexander Bayandin
5993b2bedc test_runner: remove excessive timeouts (#4659)
## Problem

For some tests, we override the default timeout (300s / 5m) with a larger
values like 600s / 10m or even 1800s / 30m, even if it's not required.
I've collected some statistics (for the last 60 days) for tests
duration:

| test | max (s) | p99 (s) | p50 (s) | count |

|-----------------------------------|---------|---------|---------|-------|
| test_hot_standby | 9 | 2 | 2 | 5319 |
| test_import_from_vanilla | 16 | 9 | 6 | 5692 |
| test_import_from_pageserver_small | 37 | 7 | 5 | 5719 |
| test_pg_regress | 101 | 73 | 44 | 5642 |
| test_isolation | 65 | 56 | 39 | 5692 |

A couple of tests that I left with custom 600s / 10m timeout.

| test | max (s) | p99 (s) | p50 (s) | count |

|-----------------------------------|---------|---------|---------|-------|
| test_gc_cutoff | 456 | 224 | 109 | 5694 |
| test_pageserver_chaos | 528 | 267 | 121 | 5712 |

## Summary of changes
- Remove `@pytest.mark.timeout` annotation from several tests
2023-08-09 16:27:53 +01:00
Anastasia Lubennikova
4ce7aa9ffe Fix extensions download error handling (#4941)
Don't panic if library or extension is not found in remote extension storage 
or download has failed. Instead, log the error and proceed - if file is not 
present locally as well, postgres will fail with postgres error.  If it is a 
shared_preload_library, it won't start, because of bad config. Otherwise, it 
will just fail to run the SQL function/ command that needs the library. 

Also, don't try to download extensions if remote storage is not configured.
2023-08-09 15:37:51 +03:00
Joonas Koivunen
cbd04f5140 remove_remote_layer: uninteresting refactorings (#4936)
In the quest to solve #4745 by moving the download/evictedness to be
internally mutable factor of a Layer and get rid of `trait
PersistentLayer` at least for prod usage, `layer_removal_cs`, we present
some misc cleanups.

---------

Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
2023-08-09 14:35:56 +03:00
Arpad Müller
1037a8ddd9 Explain why VirtualFile stores tenant_id and timeline_id as strings (#4930)
## Problem

One might wonder why the code here doesn't use `TimelineId` or
`TenantId`. I originally had a refactor to use them, but then discarded
it, because converting to strings on each time there is a read or write
is wasteful.

## Summary of changes

We add some docs explaining why here no `TimelineId` or `TenantId` is
being used.
2023-08-08 23:41:09 +02:00
Felix Prasanna
6661f4fd44 bump vm-builder version to v0.15.0-alpha1 (#4934) 2023-08-08 15:22:10 -05:00
Alexander Bayandin
b9f84b9609 Improve test results format (#4549)
## Problem

The current test history format is a bit inconvenient:
- It stores all test results in one row, so all queries should include
subqueries which expand the test 
- It includes duplicated test results if the rerun is triggered manually
for one of the test jobs (for example, if we rerun `debug-pg14`, then
the report will include duplicates for other build types/postgres
versions)
- It doesn't have a reference to run_id, which we use to create a link
to allure report

Here's the proposed new format:
```
    id           BIGSERIAL PRIMARY KEY,
    parent_suite TEXT NOT NULL,
    suite        TEXT NOT NULL,
    name         TEXT NOT NULL,
    status       TEXT NOT NULL,
    started_at   TIMESTAMPTZ NOT NULL,
    stopped_at   TIMESTAMPTZ NOT NULL,
    duration     INT NOT NULL,
    flaky        BOOLEAN NOT NULL,
    build_type   TEXT NOT NULL,
    pg_version   INT NOT NULL,
    run_id       BIGINT NOT NULL,
    run_attempt  INT NOT NULL,
    reference    TEXT NOT NULL,
    revision     CHAR(40) NOT NULL,
    raw          JSONB COMPRESSION lz4 NOT NULL,
```

## Summary of changes
- Misc allure changes:
  - Update allure to 2.23.1
- Delete files from previous runs in HTML report (by using `sync
--delete` instead of `mv`)
- Use `test-cases/*.json` instead of `suites.json`, using this directory
allows us to catch all reruns.
- Until we migrated `scripts/flaky_tests.py` and
`scripts/benchmark_durations.py` store test results in 2 formats (in 2
different databases).
2023-08-08 20:09:38 +01:00
Felix Prasanna
459253879e Revert "bump vm-builder to v0.15.0-alpha1 (#4895)" (#4931)
This reverts commit 682dfb3a31.
2023-08-08 20:21:39 +03:00
Conrad Ludgate
0fa85aa08e proxy: delay auth on retry (#4929)
## Problem

When an endpoint is shutting down, it can take a few seconds. Currently
when starting a new compute, this causes an "endpoint is in transition"
error. We need to add delays before retrying to ensure that we allow
time for the endpoint to shutdown properly.

## Summary of changes

Adds a delay before retrying in auth. connect_to_compute already has
this delay
2023-08-08 17:19:24 +03:00
Cuong Nguyen
039017cb4b Add new flag for advertising pg address (#4898)
## Problem

The safekeeper advertises the same address specified in `--listen-pg`,
which is problematic when the listening address is different from the
address that the pageserver can use to connect to the safekeeper.

## Summary of changes

Add a new optional flag called `--advertise-pg` for the address to be
advertised. If this flag is not specified, the behavior is the same as
before.
2023-08-08 14:26:38 +03:00
John Spray
4dc644612b pageserver: expose prometheus metrics for startup time (#4893)
## Problem

Currently to know how long pageserver startup took requires inspecting
logs.

## Summary of changes

`pageserver_startup_duration_ms` metric is added, with label `phase` for
different phases of startup.

These are broken down by phase, where the phases correspond to the
existing wait points in the code:
- Start of doing I/O
- When tenant load is done
- When initial size calculation is done
- When background jobs start
- Then "complete" when everything is done.

`pageserver_startup_is_loading` is a 0/1 gauge that indicates whether we are in the initial load of tenants.

`pageserver_tenant_activation_seconds` is a histogram of time in seconds taken to activate a tenant.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-08-08 12:41:37 +03:00
Anastasia Lubennikova
6d17d6c775 Use WebIdentityTokenCredentialsProvider to access remote extensions (#4921)
Fixes access to s3 buckets that use IAM roles for service accounts
access control method

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-08-08 12:37:22 +03:00
John Spray
4892a5c5b7 pageserver: avoid logging the "ERROR" part of DbErrors that are successes (#4902)
## Problem

The pageserver<->safekeeper protocol uses error messages to indicate end
of stream. pageserver already logs these at INFO level, but the inner
error message includes the word "ERROR", which interferes with log
searching.
   
Example:
```
  walreceiver connection handling ended: db error: ERROR: ending streaming to Some("pageserver") at 0/4031CA8
```
    
The inner DbError has a severity of ERROR so DbError's Display
implementation includes that ERROR, even though we are actually
logging the error at INFO level.

## Summary of changes

Introduce an explicit WalReceiverError type, and in its From<>
for postgres errors, apply the logic from ExpectedError, for
expected errors, and a new condition for successes.
    
The new output looks like:
```
    walreceiver connection handling ended: Successful completion: ending streaming to Some("pageserver") at 0/154E9C0, receiver is caughtup and there is no computes
 ```
2023-08-08 12:35:24 +03:00
John Spray
33cb1e9c0c tests: enable higher concurrency and adjust tests with outlier runtime (#4904)
## Problem

I spent a few minutes seeing how fast I could get our regression test
suite to run on my workstation, for when I want to run a "did I break
anything?" smoke test before pushing to CI.

- Test runtime was dominated by a couple of tests that run for longer
than all the others take together
- Test concurrency was limited to <16 by the ports-per-worker setting

There's no "right answer" for how long a test should
be, but as a rule of thumb, no one test should run
for much longer than the time it takes to run all the
other tests together.

## Summary of changes

- Make the ports per worker setting dynamic depending on worker count
- Modify the longest running tests to run for a shorter time
(`test_duplicate_layers` which uses a pgbench runtime) or fewer
iterations (`test_restarts_frequent_checkpoints`).
2023-08-08 09:16:21 +01:00
Arpad Müller
9559ef6f3b Sort by (key, lsn), not just key (#4918)
## Problem

PR #4839 didn't output the keys/values in lsn order, but for a given
key, the lsns were kept in incoming file order.

I think the ordering by lsn is expected.

## Summary of changes

We now also sort by `(key, lsn)`, like we did before #4839.
2023-08-07 18:14:15 +03:00
John Spray
64a4fb35c9 pagectl: skip metadata file in pagectl draw-timeline (#4872)
## Problem

Running `pagectl draw-timeline` on a pageserver directory wasn't working
out of the box because it trips up on the `metadata` file.

## Summary of changes

Just ignore the `metadata` file in the list of input files passed to
`draw-timeline`.
2023-08-07 08:24:50 +01:00
MMeent
95ec42f2b8 Change log levels on various operations (#4914)
Cache changes are now DEBUG2
Logs that indicate disabled caches now explicitly call out that the file cache is disabled on WARNING level instead of LOG/INFO
2023-08-06 20:37:09 +02:00
Joonas Koivunen
ba9df27e78 fix: silence not found error when removing ephmeral (#4900)
We currently cannot drop tenant before removing it's directory, or use
Tenant::drop for this. This creates unnecessary or inactionable warnings
during detach at least. Silence the most typical, file not found. Log
remaining at `error!`.

Cc: #2442
2023-08-04 21:03:17 +03:00
Joonas Koivunen
ea3e1b51ec Remote storage metrics (#4892)
We don't know how our s3 remote_storage is performing, or if it's
blocking the shutdown. Well, for sampling reasons, we will not really
know even after this PR.

Add metrics:
- align remote_storage metrics towards #4813 goals
- histogram
`remote_storage_s3_request_seconds{request_type=(get_object|put_object|delete_object|list_objects),
result=(ok|err|cancelled)}`
- histogram `remote_storage_s3_wait_seconds{request_type=(same kinds)}`
- counter `remote_storage_s3_cancelled_waits_total{request_type=(same
kinds)}`

Follow-up work:
- After release, remove the old metrics, migrate dashboards

Histogram buckets are rough guesses, need to be tuned. In pageserver we
have a download timeout of 120s, so I think the 100s bucket is quite
nice.
2023-08-04 21:01:29 +03:00
John Spray
e3e739ee71 pageserver: remove no-op attempt to report fail/failpoint feature (#4879)
## Problem

The current output from a prod binary at startup is:
```
git-env:765455bca22700e49c053d47f44f58a6df7c321f failpoints: true, features: [] launch_timestamp: 2023-08-02 10:30:35.545217477 UTC
```

It's confusing to read that line, then read the code and think "if
failpoints is true, but not in the features list, what does that mean?".
As far as I can tell, the check of `fail/failpoints` is just always
false because cargo doesn't expose features across crates like this: the
`fail/failpoints` syntax works in the cargo CLI but not from a macro in
some crate other than `fail`.

## Summary of changes

Remove the lines that try to check `fail/failpoints` from the pageserver
entrypoint module. This has no functional impact but makes the code
slightly easier to understand when trying to make sense of the line
printed on startup.
2023-08-04 17:56:31 +01:00
Conrad Ludgate
606caa0c5d proxy: update logs and span data to be consistent and have more info (#4878)
## Problem

Pre-requisites for #4852 and #4853

## Summary of changes

1. Includes the client's IP address (which we already log) with the span
info so we can have it on all associated logs. This makes making
dashboards based on IP addresses easier.
2. Switch to a consistent error/warning log for errors during
connection. This includes error, num_retries, retriable=true/false and a
consistent log message that we can grep for.
2023-08-04 12:37:18 +03:00
Arpad Müller
6a906c68c9 Make {DeltaLayer,ImageLayer}::{load,load_inner} async (#4883)
## Problem

The functions `DeltaLayer::load_inner` and `ImageLayer::load_inner` are
calling `read_blk` internally, which we would like to turn into an async
fn.

## Summary of changes

We switch from `once_cell`'s `OnceCell` implementation to the one in
`tokio` in order to be able to call an async `get_or_try_init` function.

Builds on top of #4839, part of #4743
2023-08-04 12:35:45 +03:00
Felix Prasanna
682dfb3a31 bump vm-builder to v0.15.0-alpha1 (#4895) 2023-08-03 14:26:14 -04:00
Joonas Koivunen
5263b39e2c fix: shutdown logging again (#4886)
During deploys of 2023-08-03 we logged too much on shutdown. Fix the
logging by timing each top level shutdown step, and possibly warn on it
taking more than a rough threshold, based on how long I think it
possibly should be taking. Also remove all shutdown logging from
background tasks since there is already "shutdown is taking a long time"
logging.

Co-authored-by: John Spray <john@neon.tech>
2023-08-03 20:34:05 +03:00
Arpad Müller
a241c8b2a4 Make DiskBtreeReader::{visit, get} async (#4863)
## Problem

`DiskBtreeReader::get` and `DiskBtreeReader::visit` both call `read_blk`
internally, which we would like to make async in the future. This PR
focuses on making the interface of these two functions `async`. There is
further work to be done in forms of making `visit` to not be recursive
any more, similar to #4838. For that, see
https://github.com/neondatabase/neon/pull/4884.

Builds on top of https://github.com/neondatabase/neon/pull/4839, part of
https://github.com/neondatabase/neon/issues/4743

## Summary of changes

Make `DiskBtreeReader::get` and `DiskBtreeReader::visit` async functions
and `await` in the places that call these functions.
2023-08-03 17:36:46 +02:00
John Spray
e71d8095b9 README: make it a bit clearer how to get regression tests running (#4885)
## Problem

When setting up for the first time I hit a couple of nits running tests:
- It wasn't obvious that `openssl` and `poetry` were needed (poetry is
mentioned kind of obliquely via "dependency installation notes" rather
than being in the list of rpm/deb packages to install.
- It wasn't obvious how to get the tests to run for just particular
parameters (e.g. just release mode)

## Summary of changes

Add openssl and poetry to the package lists.

Add an example of how to run pytest for just a particular build type and
postgres version.
2023-08-03 15:23:23 +01:00
Dmitry Rodionov
1497a42296 tests: split neon_fixtures.py (#4871)
## Problem

neon_fixtures.py has grown to unmanageable size. It attracts conflicts.

When adding specific utils under for example `fixtures/pageserver`
things sometimes need to import stuff from `neon_fixtures.py` which
creates circular import. This is usually only needed for type
annotations, so `typing.TYPE_CHECKING` flag can mask the issue.
Nevertheless I believe that splitting neon_fixtures.py into smaller
parts is a better approach.

Currently the PR contains small things, but I plan to continue and move
NeonEnv to its own `fixtures.env` module. To keep the diff small I think
this PR can already be merged to cause less conflicts.

UPD: it looks like currently its not really possible to fully avoid
usage of `typing.TYPE_CHECKING`, because some components directly depend
on each other. I e Env -> Cli -> Env cycle. But its still worth it to
avoid it in as many places as possible. And decreasing neon_fixture's
size still makes sense.
2023-08-03 17:20:24 +03:00
Alexander Bayandin
cd33089a66 test_runner: set AWS credentials for endpoints (#4887)
## Problem

If AWS credentials are not set locally (via
AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars)
`test_remote_library[release-pg15-mock_s3]` test fails with the
following error:

```
ERROR could not start the compute node: Failed to download a remote file: Failed to download S3 object: failed to construct request
```

## Summary of changes
- set AWS credentials for endpoints programmatically
2023-08-03 16:44:48 +03:00
Arpad Müller
416c14b353 Compaction: sort on slices directly instead of kmerge (#4839)
## Problem

The k-merge in pageserver compaction currently relies on iterators over
the keys and also over the values. This approach does not support async
code because we are using iterators and those don't support async in
general. Also, the k-merge implementation we use doesn't support async
either. Instead, as we already load all the keys into memory, just do
sorting in-memory.

## Summary of changes

The PR can be read commit-by-commit, but most importantly, it:

* Stops using kmerge in compaction, using slice sorting instead.
* Makes `load_keys` and `load_val_refs` async, using `Handle::block_on`
in the compaction code as we don't want to turn the compaction function,
called inside `spawn_blocking`, into an async fn.

Builds on top of #4836, part of
https://github.com/neondatabase/neon/issues/4743
2023-08-03 15:30:41 +02:00
John Spray
df49a9b7aa pagekeeper: suppress error logs in shutdown/detach (#4876)
## Problem

Error messages like this coming up during normal operations:
```
        Compaction failed, retrying in 2s: timeline is Stopping

       Compaction failed, retrying in 2s: Cannot run compaction iteration on inactive tenant
```

## Summary of changes

Add explicit handling for the shutdown case in these locations, to
suppress error logs.
2023-08-02 19:31:09 +01:00
bojanserafimov
4ad0c8f960 compute_ctl: Prewarm before starting http server (#4867) 2023-08-02 14:19:06 -04:00
Joonas Koivunen
e0b05ecafb build: ca-certificates need to be present (#4880)
as needed since #4715 or this will happen:

```
ERROR panic{thread=main location=.../hyper-rustls-0.23.2/src/config.rs:48:9}: no CA certificates found
```
2023-08-02 20:34:21 +03:00
Vadim Kharitonov
ca4d71a954 Upgrade pg_embedding to 0.3.5 (#4873) 2023-08-02 18:18:33 +03:00
Alexander Bayandin
381f41e685 Bump cryptography from 41.0.2 to 41.0.3 (#4870) 2023-08-02 14:10:36 +03:00
Alek Westover
d005c77ea3 Tar Remote Extensions (#4715)
Add infrastructure to dynamically load postgres extensions and shared libraries from remote extension storage.

Before postgres start downloads list of available remote extensions and libraries, and also downloads 'shared_preload_libraries'. After postgres is running, 'compute_ctl' listens for HTTP requests to load files.

Postgres has new GUC 'extension_server_port' to specify port on which 'compute_ctl' listens for requests.

When PostgreSQL requests a file, 'compute_ctl' downloads it.

See more details about feature design and remote extension storage layout in docs/rfcs/024-extension-loading.md

---------

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Alek Westover <alek.westover@gmail.com>
2023-08-02 12:38:12 +03:00
Joonas Koivunen
04776ade6c fix(consumption): rename _size_ => _data_ (#4866)
I failed at renaming the metric middle part while managing to do a great
job with the suffix. Fix the middle part as well.
2023-08-01 19:18:25 +03:00
Dmitry Rodionov
c3fe335eaf wait for tenant to be active before polling for timeline absence (#4856)
## Problem

https://neon-github-public-dev.s3.amazonaws.com/reports/main/5692829577/index.html#suites/f588e0a787c49e67b29490359c589fae/4c50937643d68a66

## Summary of changes

wait for tenant to be active after restart before polling for timeline
absence
2023-08-01 18:28:18 +03:00
Joonas Koivunen
3a00a5deb2 refactor: tidy consumption metrics (#4860)
Tidying up I've been wanting to do for some time.

Follow-up to #4857.
2023-08-01 18:14:16 +03:00
Joonas Koivunen
78fa2b13e5 test: written_size_bytes_delta (#4857)
Two stabs at this, by mocking a http receiver and the globals out (now
reverted) and then by separating the timeline dependency and just
testing what kind of events certain timelines produce. I think this
pattern could work for some of our problems.

Follow-up to #4822.
2023-08-01 15:30:36 +03:00
John Spray
7c076edeea pageserver: tweak period of imitate_layer_accesses (#4859)
## Problem

When the eviction threshold is an integer multiple of the eviction
period, it is unreliable to skip imitating accesses based on whether the
last imitation was more recent than the threshold.

This is because as finite time passes
between the time used for the periodic execution, and the 'now' time
used for updating last_layer_access_imitation. When this is just a few
milliseconds, and everything else is on-time, then a 5 second threshold
with a 1 second period will end up entering its 5th iteration slightly
_less than_ 5 second since last_layer_access_imitation, and thereby
skipping instead of running the imitation. If a few milliseconds then
pass before we check the access time of a file that _should_ have been
bumped by the imitation pass, then we end up evicting something we
shouldn't have evicted.

## Summary of changes

We can make this race far less likely by using the threshold minus one
interval as the period for re-executing the imitate_layer_accesses: that
way we're not vulnerable to racing by just a few millis, and there would
have to be a delay of the order `period` to cause us to wrongly evict a
layer.

This is not a complete solution: it would be good to revisit this and
use a non-walltime mechanism for pinning these layers into local
storage, rather than relying on bumping access times.
2023-08-01 13:17:49 +01:00
Arpad Müller
69528b7c30 Prepare k-merge in compaction for async I/O (#4836)
## Problem

The k-merge in pageserver compaction currently relies on iterators over
the keys and also over the values. This approach does not support async
code because we are using iterators and those don't support async in
general. Also, the k-merge implementation we use doesn't support async
either. Instead, as we already load all the keys into memory, the plan
is to just do the sorting in-memory for now, switch to async, and then
once we want to support workloads that don't have all keys stored in
memory, we can look into switching to a k-merge implementation that
supports async instead.

## Summary of changes

The core of this PR is the move from functions on the `PersistentLayer`
trait to return custom iterator types to inherent functions on `DeltaLayer`
that return buffers with all keys or value references.
Value references are a type we created in this PR, containing a
`BlobRef` as well as an `Arc` pointer to the `DeltaLayerInner`, so that
we can lazily load the values during compaction. This preserves the
property of the current code.

This PR does not switch us to doing the k-merge via sort on slices, but
with this PR, doing such a switch is relatively easy and only requires
changes of the compaction code itself.

Part of https://github.com/neondatabase/neon/issues/4743
2023-08-01 13:38:35 +02:00
Konstantin Knizhnik
a98a80abc2 Deffine NEON_SMGR to make it possible for extensions to use Neon SMG API (#4840)
## Problem

See https://neondb.slack.com/archives/C036U0GRMRB/p1689148023067319

## Summary of changes

Define NEON_SMGR in smgr.h

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-08-01 10:04:45 +03:00
Alex Chi Z
7b6c849456 support isolation level + read only for http batch sql (#4830)
We will retrieve `neon-batch-isolation-level` and `neon-batch-read-only`
from the http header, which sets the txn properties.
https://github.com/neondatabase/serverless/pull/38#issuecomment-1653130981

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2023-08-01 02:59:11 +03:00
Joonas Koivunen
326189d950 consumption_metrics: send timeline_written_size_delta (#4822)
We want to have timeline_written_size_delta which is defined as
difference to the previously sent `timeline_written_size` from the
current `timeline_written_size`.

Solution is to send it. On the first round `disk_consistent_lsn` is used
which is captured during `load` time. After that an incremental "event"
is sent on every collection. Incremental "events" are not part of
deduplication.

I've added some infrastructure to allow somewhat typesafe
`EventType::Absolute` and `EventType::Incremental` factories per
metrics, now that we have our first `EventType::Incremental` usage.
2023-07-31 22:10:19 +03:00
bojanserafimov
ddbe170454 Prewarm compute nodes (#4828) 2023-07-31 14:13:32 -04:00
Alexander Bayandin
39e458f049 test_compatibility: fix pg_tenant_only_port port collision (#4850)
## Problem

Compatibility tests fail from time to time due to `pg_tenant_only_port`
port collision (added in https://github.com/neondatabase/neon/pull/4731)

## Summary of changes
- replace `pg_tenant_only_port` value in config with new port
- remove old logic, than we don't need anymore
- unify config overrides
2023-07-31 20:49:46 +03:00
Vadim Kharitonov
e1424647a0 Update pg_embedding to 0.3.1 version (#4811) 2023-07-31 20:23:18 +03:00
Yinnan Yao
705ae2dce9 Fix error message for listen_pg_addr_tenant_only binding (#4787)
## Problem

Wrong use of `conf.listen_pg_addr` in `error!()`.

## Summary of changes

Use `listen_pg_addr_tenant_only` instead of `conf.listen_pg_addr`.

Signed-off-by: yaoyinnan <35447132+yaoyinnan@users.noreply.github.com>
2023-07-31 14:40:52 +01:00
Conrad Ludgate
eb78603121 proxy: div by zero (#4845)
## Problem

1. In the CacheInvalid state loop, we weren't checking the
`num_retries`. If this managed to get up to `32`, the retry_after
procedure would compute 2^32 which would overflow to 0 and trigger a div
by zero
2. When fixing the above, I started working on a flow diagram for the
state machine logic and realised it was more complex than it had to be:
  a. We start in a `Cached` state
b. `Cached`: call `connect_once`. After the first connect_once error, we
always move to the `CacheInvalid` state, otherwise, we return the
connection.
c. `CacheInvalid`: we attempt to `wake_compute` and we either switch to
Cached or we retry this step (or we error).
d. `Cached`: call `connect_once`. We either retry this step or we have a
connection (or we error) - After num_retries > 1 we never switch back to
`CacheInvalid`.

## Summary of changes

1. Insert a `num_retries` check in the `handle_try_wake` procedure. Also
using floats in the retry_after procedure to prevent the overflow
entirely
2. Refactor connect_to_compute to be more linear in design.
2023-07-31 09:30:24 -04:00
John Spray
f0ad603693 pageserver: add unit test for deleted_at in IndexPart (#4844)
## Problem

Existing IndexPart unit tests only exercised the version 1 format (i.e.
without deleted_at set).

## Summary of changes

Add a test that sets version to 2, and sets a value for deleted_at.

Closes https://github.com/neondatabase/neon/issues/4162
2023-07-31 12:51:18 +01:00
Arpad Müller
e5183f85dc Make DiskBtreeReader::dump async (#4838)
## Problem

`DiskBtreeReader::dump` calls `read_blk` internally, which we want to
make async in the future. As it is currently relying on recursion, and
async doesn't like recursion, we want to find an alternative to that and
instead traverse the tree using a loop and a manual stack.

## Summary of changes

* Make `DiskBtreeReader::dump` and all the places calling it async
* Make `DiskBtreeReader::dump` non-recursive internally and use a stack
instead. It now deparses the node in each iteration, which isn't
optimal, but on the other hand it's hard to store the node as it is
referencing the buffer. Self referential data are hard in Rust. For a
dumping function, speed isn't a priority so we deparse the node multiple
times now (up to branching factor many times).

Part of https://github.com/neondatabase/neon/issues/4743

I have verified that output is unchanged by comparing the output of this
command both before and after this patch:
```
cargo test -p pageserver -- particular_data --nocapture 
```
2023-07-31 12:52:29 +02:00
Joonas Koivunen
89ee8f2028 fix: demote warnings, fix flakyness (#4837)
`WARN ... found future (image|delta) layer` are not actionable log
lines. They don't need to be warnings. `info!` is enough.

This also fixes some known but not tracked flakyness in
[`test_remote_timeline_client_calls_started_metric`][evidence].

[evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4829/5683495367/index.html#/testresult/34fe79e24729618b

Closes #3369.
Closes #4473.
2023-07-31 07:43:12 +00:00
Alex Chi Z
a8f3540f3d proxy: add unit test for wake_compute (#4819)
## Problem

ref https://github.com/neondatabase/neon/pull/4721, ref
https://github.com/neondatabase/neon/issues/4709

## Summary of changes

This PR adds unit tests for wake_compute.

The patch adds a new variant `Test` to auth backends. When
`wake_compute` is called, we will verify if it is the exact operation
sequence we are expecting. The operation sequence now contains 3 more
operations: `Wake`, `WakeRetry`, and `WakeFail`.

The unit tests for proxy connects are now complete and I'll continue
work on WebSocket e2e test in future PRs.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2023-07-28 19:10:55 -04:00
Konstantin Knizhnik
4338eed8c4 Make it possible to grant self perfmissions to self created roles (#4821)
## Problem

See: https://neondb.slack.com/archives/C04USJQNLD6/p1689973957908869

## Summary of changes

Bump Postgres version 

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-07-28 22:06:03 +03:00
Joonas Koivunen
2fbdf26094 test: raise timeout to avoid flakyness (#4832)
2s timeout was too tight for our CI,
[evidence](https://neon-github-public-dev.s3.amazonaws.com/reports/main/5669956577/index.html#/testresult/6388e31182cc2d6e).
15s might be better.

Also cleanup code no longer needed after #4204.
2023-07-28 14:32:01 -04:00
Alexander Bayandin
7374634845 test_runner: clean up test_compatibility (#4770)
## Problem

We have some amount of outdated logic in test_compatibility, that we
don't need anymore.

## Summary of changes
- Remove `PR4425_ALLOWED_DIFF` and tune `dump_differs` method to accept
allowed diffs in the future (a cleanup after
https://github.com/neondatabase/neon/pull/4425)
- Remote etcd related code (a cleanup after
https://github.com/neondatabase/neon/pull/2733)
- Don't set `preserve_database_files`
2023-07-28 16:15:31 +01:00
Alexander Bayandin
9fdd3a4a1e test_runner: add amcheck to test_compatibility (#4772)
Run `pg_amcheck` in forward and backward compatibility tests to catch
some data corruption.

## Summary of changes
- Add amcheck compiling to Makefile
- Add `pg_amcheck` to test_compatibility
2023-07-28 16:00:55 +01:00
Alek Westover
3681fc39fd modify relative_path_to_s3_object logic for prefix=None (#4795)
see added unit tests for more description
2023-07-28 10:03:18 -04:00
Joonas Koivunen
67d2fa6dec test: fix test_neon_cli_basics flakyness without making it better for future (#4827)
The test was starting two endpoints on the same branch as discovered by
@petuhovskiy.

The fix is to allow passing branch-name from the python side over to
neon_local, which already accepted it.

Split from #4824, which will handle making this more misuse resistant.
2023-07-27 19:13:58 +03:00
Dmitry Rodionov
cafbe8237e Move tenant/delete.rs to tenant/timeline/delete.rs (#4825)
move tenant/delete.rs to tenant/timeline/delete.rs to prepare for
appearance of tenant deletion routines in tenant/delete.rs
2023-07-27 15:52:36 +03:00
Joonas Koivunen
3e425c40c0 fix(compute_ctl): remove stray variable in error message (#4823)
error is not needed because anyhow will have the cause chain reported
anyways.

related to test_neon_cli_basics being flaky, but doesn't actually fix
any flakyness, just the obvious stray `{e}`.
2023-07-27 15:40:53 +03:00
Joonas Koivunen
395bd9174e test: allow future image layer warning (#4818)
https://neon-github-public-dev.s3.amazonaws.com/reports/main/5670795960/index.html#suites/837740b64a53e769572c4ed7b7a7eeeb/5a73fa4a69399123/retries

Allow it because we are doing immediate stop.
2023-07-27 10:22:44 +03:00
Alek Westover
b9a7a661d0 add list of public extensions and lookup table for libraries (#4807) 2023-07-26 15:55:55 -04:00
Joonas Koivunen
48ce95533c test: allow normal warnings in test_threshold_based_eviction (#4801)
See:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/5654328815/index.html#suites/3fc871d9ee8127d8501d607e03205abb/3482458eba88c021
2023-07-26 20:20:12 +03:00
Dmitry Rodionov
874c31976e dedup cleanup fs traces (#4778)
This is a follow up for discussion:
https://github.com/neondatabase/neon/pull/4552#discussion_r1253417777
see context there
2023-07-26 18:39:32 +03:00
Conrad Ludgate
231d7a7616 proxy: retry compute wake in auth (#4817)
## Problem

wake_compute can fail sometimes but is eligible for retries. We retry
during the main connect, but not during auth.

## Summary of changes

retry wake_compute during auth flow if there was an error talking to
control plane, or if there was a temporary error in waking the compute
node
2023-07-26 16:34:46 +01:00
arpad-m
5705413d90 Use OnceLock instead of manually implementing it (#4805)
## Problem

In https://github.com/neondatabase/neon/issues/4743 , I'm trying to make
more of the pageserver async, but in order for that to happen, I need to
be able to persist the result of `ImageLayer::load` across await points.
For that to happen, the return value needs to be `Send`.

## Summary of changes

Use `OnceLock` in the image layer instead of manually implementing it
with booleans, locks and `Option`.

Part of #4743
2023-07-26 17:20:09 +02:00
Conrad Ludgate
35370f967f proxy: add some connection init logs (#4812)
## Problem

The first session event we emit is after we receive the first startup
packet from the client. This means we can't detect any issues between
TCP open and handling of the first PG packet

## Summary of changes

Add some new logs for websocket upgrade and connection handling
2023-07-26 15:03:51 +00:00
Alexander Bayandin
b98419ee56 Fix allure report overwriting for different Postgres versions (#4806)
## Problem

We've got an example of Allure reports from 2 different runners for the
same build that started to upload at the exact second, making one
overwrite another

## Summary of changes
- Use the Postgres version to distinguish artifacts (along with the
build type)
2023-07-26 15:19:18 +01:00
Alexander Bayandin
86a61b318b Bump certifi from 2022.12.7 to 2023.7.22 (#4815) 2023-07-26 16:32:56 +03:00
Alek Westover
5f8fd640bf Upload Test Remote Extensions (#4792)
We need some real extensions in S3 to accurately test the code for
handling remote extensions.
In this PR we just upload three extensions (anon, kq_imcx and postgis), which is
enough for testing purposes for now. In addition to creating and
uploading the extension archives, we must generate a file
`ext_index.json` which specifies important metadata about the
extensions.

---------

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-07-26 15:24:03 +03:00
bojanserafimov
916a5871a6 compute_ctl: Parse sk connstring (#4809) 2023-07-26 08:10:49 -04:00
Dmitry Rodionov
700d929529 Init Timeline in Stopping state in create_timeline_struct when Cause::Delete (#4780)
See
https://github.com/neondatabase/neon/pull/4552#discussion_r1258368127
for context.

TLDR: use CreateTimelineCause to infer desired state instead of using
.set_stopping after initialization
2023-07-26 14:05:18 +03:00
bojanserafimov
520046f5bd cold starts: Add sync-safekeepers fast path (#4804) 2023-07-25 19:44:18 -04:00
Conrad Ludgate
2ebd2ce2b6 proxy: record connection type (#4802)
## Problem

We want to measure how many users are using TCP/WS connections.
We also want to measure how long it takes to establish a connection with
the compute node.

I plan to also add a separate counter for HTTP requests, but because of
pooling this needs to be disambiguated against new HTTP compute
connections

## Summary of changes

* record connection type (ws/tcp) in the connection counters.
* record connection latency including retry latency
2023-07-25 18:57:42 +03:00
Alex Chi Z
bcc2aee704 proxy: add tests for batch http sql (#4793)
This PR adds an integration test case for batch HTTP SQL endpoint.
https://github.com/neondatabase/neon/pull/4654/ should be merged first
before we land this PR.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2023-07-25 15:08:24 +00:00
Dmitry Rodionov
6d023484ed Use mark file to allow for deletion operations to continue through restarts (#4552)
## Problem

Currently we delete local files first, so if pageserver restarts after
local files deletion then remote deletion is not continued. This can be
solved with inversion of these steps.

But even if these steps are inverted when index_part.json is deleted
there is no way to distinguish between "this timeline is good, we just
didnt upload it to remote" and "this timeline is deleted we should
continue with removal of local state". So to solve it we use another
mark file. After index part is deleted presence of this mark file
indentifies that it was a deletion intention.

Alternative approach that was discussed was to delete all except
metadata first, and then delete metadata and index part. In this case we
still do not support local only configs making them rather unsafe
(deletion in them is already unsafe, but this direction solidifies this
direction instead of fixing it). Another downside is that if we crash
after local metadata gets removed we may leave dangling index part on
the remote which in theory shouldnt be a big deal because the file is
small.

It is not a big change to choose another approach at this point.

## Summary of changes

Timeline deletion sequence:
1. Set deleted_at in remote index part.
2. Create local mark file.
3. Delete local files except metadata (it is simpler this way, to be
able to reuse timeline initialization code that expects metadata)
4. Delete remote layers
5. Delete index part
6. Delete meta, timeline directory.
7. Delete mark file.

This works for local only configuration without remote storage.
Sequence is resumable from any point.

resolves #4453
resolves https://github.com/neondatabase/neon/pull/4552 (the issue was
created with async cancellation in mind, but we can still have issues
with retries if metadata is deleted among the first by remove_dir_all
(which doesnt have any ordering guarantees))

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-07-25 16:25:27 +03:00
Nick Randall
062159ac17 support non-interactive transactions in sql-over-http (#4654)
This PR adds support for non-interactive transaction query endpoint.
It accepts an array of queries and parameters and returns an array of
query results. The queries will be run in a single transaction one
after another on the proxy side.
2023-07-25 13:03:55 +01:00
cui fliter
f2e2b8a7f4 fix some typos (#4662)
Typos fix

Signed-off-by: cui fliter <imcusg@gmail.com>
2023-07-25 14:39:29 +03:00
Joonas Koivunen
f9214771b4 fix: count broken tenant more correct (#4800)
count only once; on startup create the counter right away because we
will not observe any changes later.

small, probably never reachable from outside fix for #4796.
2023-07-25 12:31:24 +03:00
Joonas Koivunen
77a68326c5 Thin out TenantState metric, keep set of broken tenants (#4796)
We currently have a timeseries for each of the tenants in different
states. We only want this for Broken. Other states could be counters.

Fix this by making the `pageserver_tenant_states_count` a counter
without a `tenant_id` and
add a `pageserver_broken_tenants_count` which has a `tenant_id` label,
each broken tenant being 1.
2023-07-25 11:15:54 +03:00
Joonas Koivunen
a25504deae Limit concurrent compactions (#4777)
Compactions can create a lot of concurrent work right now with #4265.

Limit compactions to use at most 6/8 background runtime threads.
2023-07-25 10:19:04 +03:00
Joonas Koivunen
294b8a8fde Convert per timeline metrics to global (#4769)
Cut down the per-(tenant, timeline) histograms by making them global:

- `pageserver_getpage_get_reconstruct_data_seconds`
- `pageserver_read_num_fs_layers`
- `pageserver_remote_operation_seconds`
- `pageserver_remote_timeline_client_calls_started`
- `pageserver_wait_lsn_seconds`
- `pageserver_io_operations_seconds`

---------

Co-authored-by: Shany Pozin <shany@neon.tech>
2023-07-25 00:43:27 +03:00
Alex Chi Z
407a20ceae add proxy unit tests for retry connections (#4721)
Given now we've refactored `connect_to_compute` as a generic, we can
test it with mock backends. In this PR, we mock the error API and
connect_once API to test the retry behavior of `connect_to_compute`. In
the next PR, I'll add mock for credentials so that we can also test
behavior with `wake_compute`.

ref https://github.com/neondatabase/neon/issues/4709

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2023-07-24 20:41:42 +03:00
arpad-m
e5b7ddfeee Preparatory pageserver async conversions (#4773)
In #4743, we'd like to convert the read path to use `async` rust. In
preparation of that, this PR switches some functions that are calling
lower level functions like `BlockReader::read_blk`,
`BlockCursor::read_blob`, etc into `async`. The PR does not switch all
functions however, and only focuses on the ones which are easy to
switch.

This leaves around some async functions that are (currently)
unnecessarily `async`, but on the other hand it makes future changes
smaller in diff.

Part of #4743 (but does not completely address it).
2023-07-24 14:01:54 +02:00
Alek Westover
7feb0d1a80 unwrap instead of passing anyhow::Error on failure to spawn a thread (#4779) 2023-07-21 15:17:16 -04:00
Konstantin Knizhnik
457e3a3ebc Mx offset bug (#4775)
Fix mx_offset_to_flags_offset() function

Fixes issue #4774

Postgres `MXOffsetToFlagsOffset` was not correctly converted to Rust
because cast to u16 is done before division by modulo. It is possible
only if divider is power of two.

Add a small rust unit test to check that the function produces same results
as the PostgreSQL macro, and extend the existing python test to cover
this bug.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-07-21 21:20:53 +03:00
Joonas Koivunen
25d2f4b669 metrics: chunked responses (#4768)
Metrics can get really large in the order of hundreds of megabytes,
which we used to buffer completly (after a few rounds of growing the
buffer).
2023-07-21 15:10:55 +00:00
Alex Chi Z
1685593f38 stable merge and sort in compaction (#4573)
Per discussion at
https://github.com/neondatabase/neon/pull/4537#discussion_r1242086217,
it looks like a better idea to use `<` instead of `<=` for all these
comparisons.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2023-07-21 10:15:44 -04:00
236 changed files with 20809 additions and 6745 deletions

View File

@@ -14,6 +14,7 @@
!pgxn/
!proxy/
!safekeeper/
!s3_scrubber/
!storage_broker/
!trace/
!vendor/postgres-v14/
@@ -21,4 +22,5 @@
!workspace_hack/
!neon_local/
!scripts/ninstall.sh
!scripts/combine_control_files.py
!vm-cgconfig.conf

View File

@@ -1,7 +1,20 @@
name: 'Create Allure report'
description: 'Generate Allure report from uploaded by actions/allure-report-store tests results'
inputs:
store-test-results-into-db:
description: 'Whether to store test results into the database. TEST_RESULT_CONNSTR/TEST_RESULT_CONNSTR_NEW should be set'
type: boolean
required: false
default: false
outputs:
base-url:
description: 'Base URL for Allure report'
value: ${{ steps.generate-report.outputs.base-url }}
base-s3-url:
description: 'Base S3 URL for Allure report'
value: ${{ steps.generate-report.outputs.base-s3-url }}
report-url:
description: 'Allure report URL'
value: ${{ steps.generate-report.outputs.report-url }}
@@ -63,8 +76,8 @@ runs:
rm -f ${ALLURE_ZIP}
fi
env:
ALLURE_VERSION: 2.22.1
ALLURE_ZIP_SHA256: fdc7a62d94b14c5e0bf25198ae1feded6b005fdbed864b4d3cb4e5e901720b0b
ALLURE_VERSION: 2.23.1
ALLURE_ZIP_SHA256: 11141bfe727504b3fd80c0f9801eb317407fd0ac983ebb57e671f14bac4bcd86
# Potentially we could have several running build for the same key (for example, for the main branch), so we use improvised lock for this
- name: Acquire lock
@@ -102,6 +115,11 @@ runs:
REPORT_PREFIX=reports/${BRANCH_OR_PR}
RAW_PREFIX=reports-raw/${BRANCH_OR_PR}/${GITHUB_RUN_ID}
BASE_URL=https://${BUCKET}.s3.amazonaws.com/${REPORT_PREFIX}/${GITHUB_RUN_ID}
BASE_S3_URL=s3://${BUCKET}/${REPORT_PREFIX}/${GITHUB_RUN_ID}
REPORT_URL=${BASE_URL}/index.html
REPORT_JSON_URL=${BASE_URL}/data/suites.json
# Get previously uploaded data for this run
ZSTD_NBTHREADS=0
@@ -110,10 +128,9 @@ runs:
# There's no previously uploaded data for this $GITHUB_RUN_ID
exit 0
fi
for S3_FILEPATH in ${S3_FILEPATHS}; do
time aws s3 cp --only-show-errors "s3://${BUCKET}/${S3_FILEPATH}" "${WORKDIR}"
archive=${WORKDIR}/$(basename $S3_FILEPATH)
time aws s3 cp --recursive --only-show-errors "s3://${BUCKET}/${RAW_PREFIX}/" "${WORKDIR}/"
for archive in $(find ${WORKDIR} -name "*.tar.zst"); do
mkdir -p ${archive%.tar.zst}
time tar -xf ${archive} -C ${archive%.tar.zst}
rm -f ${archive}
@@ -130,9 +147,10 @@ runs:
# Upload a history and the final report (in this particular order to not to have duplicated history in 2 places)
time aws s3 mv --recursive --only-show-errors "${WORKDIR}/report/history" "s3://${BUCKET}/${REPORT_PREFIX}/latest/history"
time aws s3 mv --recursive --only-show-errors "${WORKDIR}/report" "s3://${BUCKET}/${REPORT_PREFIX}/${GITHUB_RUN_ID}"
REPORT_URL=https://${BUCKET}.s3.amazonaws.com/${REPORT_PREFIX}/${GITHUB_RUN_ID}/index.html
# Use aws s3 cp (instead of aws s3 sync) to keep files from previous runs to make old URLs work,
# and to keep files on the host to upload them to the database
time aws s3 cp --recursive --only-show-errors "${WORKDIR}/report" "s3://${BUCKET}/${REPORT_PREFIX}/${GITHUB_RUN_ID}"
# Generate redirect
cat <<EOF > ${WORKDIR}/index.html
@@ -144,8 +162,10 @@ runs:
EOF
time aws s3 cp --only-show-errors ${WORKDIR}/index.html "s3://${BUCKET}/${REPORT_PREFIX}/latest/index.html"
echo "report-url=${REPORT_URL}" >> $GITHUB_OUTPUT
echo "report-json-url=${REPORT_URL%/index.html}/data/suites.json" >> $GITHUB_OUTPUT
echo "base-url=${BASE_URL}" >> $GITHUB_OUTPUT
echo "base-s3-url=${BASE_S3_URL}" >> $GITHUB_OUTPUT
echo "report-url=${REPORT_URL}" >> $GITHUB_OUTPUT
echo "report-json-url=${REPORT_JSON_URL}" >> $GITHUB_OUTPUT
echo "[Allure Report](${REPORT_URL})" >> ${GITHUB_STEP_SUMMARY}
@@ -159,6 +179,41 @@ runs:
aws s3 rm "s3://${BUCKET}/${LOCK_FILE}"
fi
- name: Store Allure test stat in the DB
if: ${{ !cancelled() && inputs.store-test-results-into-db == 'true' }}
shell: bash -euxo pipefail {0}
env:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
REPORT_JSON_URL: ${{ steps.generate-report.outputs.report-json-url }}
run: |
export DATABASE_URL=${REGRESS_TEST_RESULT_CONNSTR}
./scripts/pysync
poetry run python3 scripts/ingest_regress_test_result.py \
--revision ${COMMIT_SHA} \
--reference ${GITHUB_REF} \
--build-type unified \
--ingest ${WORKDIR}/report/data/suites.json
- name: Store Allure test stat in the DB (new)
if: ${{ !cancelled() && inputs.store-test-results-into-db == 'true' }}
shell: bash -euxo pipefail {0}
env:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
BASE_S3_URL: ${{ steps.generate-report.outputs.base-s3-url }}
run: |
export DATABASE_URL=${REGRESS_TEST_RESULT_CONNSTR_NEW}
./scripts/pysync
poetry run python3 scripts/ingest_regress_test_result-new-format.py \
--reference ${GITHUB_REF} \
--revision ${COMMIT_SHA} \
--run-id ${GITHUB_RUN_ID} \
--run-attempt ${GITHUB_RUN_ATTEMPT} \
--test-cases-dir ${WORKDIR}/report/data/test-cases
- name: Cleanup
if: always()
shell: bash -euxo pipefail {0}

View File

@@ -31,7 +31,7 @@ runs:
BUCKET=neon-github-public-dev
FILENAME=$(basename $ARCHIVE)
S3_KEY=$(aws s3api list-objects-v2 --bucket ${BUCKET} --prefix ${PREFIX%$GITHUB_RUN_ATTEMPT} | jq -r '.Contents[].Key' | grep ${FILENAME} | sort --version-sort | tail -1 || true)
S3_KEY=$(aws s3api list-objects-v2 --bucket ${BUCKET} --prefix ${PREFIX%$GITHUB_RUN_ATTEMPT} | jq -r '.Contents[]?.Key' | grep ${FILENAME} | sort --version-sort | tail -1 || true)
if [ -z "${S3_KEY}" ]; then
if [ "${SKIP_IF_DOES_NOT_EXIST}" = "true" ]; then
echo 'SKIPPED=true' >> $GITHUB_OUTPUT

View File

@@ -145,7 +145,11 @@ runs:
if [ "${RERUN_FLAKY}" == "true" ]; then
mkdir -p $TEST_OUTPUT
poetry run ./scripts/flaky_tests.py "${TEST_RESULT_CONNSTR}" --days 10 --output "$TEST_OUTPUT/flaky.json"
poetry run ./scripts/flaky_tests.py "${TEST_RESULT_CONNSTR}" \
--days 7 \
--output "$TEST_OUTPUT/flaky.json" \
--pg-version "${DEFAULT_PG_VERSION}" \
--build-type "${BUILD_TYPE}"
EXTRA_PARAMS="--flaky-tests-json $TEST_OUTPUT/flaky.json $EXTRA_PARAMS"
fi
@@ -209,4 +213,4 @@ runs:
uses: ./.github/actions/allure-report-store
with:
report-dir: /tmp/test_output/allure/results
unique-key: ${{ inputs.build_type }}
unique-key: ${{ inputs.build_type }}-${{ inputs.pg_version }}

View File

@@ -432,6 +432,11 @@ jobs:
if: ${{ !cancelled() }}
id: create-allure-report
uses: ./.github/actions/allure-report-generate
with:
store-test-results-into-db: true
env:
REGRESS_TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR }}
REGRESS_TEST_RESULT_CONNSTR_NEW: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
- uses: actions/github-script@v6
if: ${{ !cancelled() }}
@@ -452,25 +457,6 @@ jobs:
report,
})
- name: Store Allure test stat in the DB
if: ${{ !cancelled() && steps.create-allure-report.outputs.report-json-url }}
env:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
REPORT_JSON_URL: ${{ steps.create-allure-report.outputs.report-json-url }}
TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR }}
run: |
./scripts/pysync
curl --fail --output suites.json "${REPORT_JSON_URL}"
export BUILD_TYPE=unified
export DATABASE_URL="$TEST_RESULT_CONNSTR"
poetry run python3 scripts/ingest_regress_test_result.py \
--revision ${COMMIT_SHA} \
--reference ${GITHUB_REF} \
--build-type ${BUILD_TYPE} \
--ingest suites.json
coverage-report:
runs-on: [ self-hosted, gen3, small ]
container:
@@ -751,34 +737,6 @@ jobs:
--destination neondatabase/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
--cleanup
# Due to a kaniko bug, we can't use cache for extensions image, thus it takes about the same amount of time as compute-node image to build (~10 min)
# During the transition period we need to have extensions in both places (in S3 and in compute-node image),
# so we won't build extension twice, but extract them from compute-node.
#
# For now we use extensions image only for new custom extensitons
- name: Kaniko build extensions only
run: |
# Kaniko is suposed to clean up after itself if --cleanup flag is set, but it doesn't.
# Despite some fixes were made in https://github.com/GoogleContainerTools/kaniko/pull/2504 (in kaniko v1.11.0),
# it still fails with error:
# error building image: could not save file: copying file: symlink postgres /kaniko/1/usr/local/pgsql/bin/postmaster: file exists
#
# Ref https://github.com/GoogleContainerTools/kaniko/issues/1406
find /kaniko -maxdepth 1 -mindepth 1 -type d -regex "/kaniko/[0-9]*" -exec rm -rv {} \;
/kaniko/executor --reproducible --snapshot-mode=redo --skip-unused-stages --cache=true \
--cache-repo 369495373322.dkr.ecr.eu-central-1.amazonaws.com/cache \
--context . \
--build-arg GIT_VERSION=${{ github.event.pull_request.head.sha || github.sha }} \
--build-arg PG_VERSION=${{ matrix.version }} \
--build-arg BUILD_TAG=${{needs.tag.outputs.build-tag}} \
--build-arg REPOSITORY=369495373322.dkr.ecr.eu-central-1.amazonaws.com \
--dockerfile Dockerfile.compute-node \
--destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/extensions-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
--destination neondatabase/extensions-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
--cleanup \
--target postgres-extensions
# Cleanup script fails otherwise - rm: cannot remove '/nvme/actions-runner/_work/_temp/_github_home/.ecr': Permission denied
- name: Cleanup ECR folder
run: rm -rf ~/.ecr
@@ -794,7 +752,7 @@ jobs:
run:
shell: sh -eu {0}
env:
VM_BUILDER_VERSION: v0.13.1
VM_BUILDER_VERSION: v0.17.5
steps:
- name: Checkout
@@ -815,7 +773,11 @@ jobs:
- name: Build vm image
run: |
./vm-builder -enable-file-cache -src=369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} -dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
./vm-builder \
-enable-file-cache \
-cgroup-uid=postgres \
-src=369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
-dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
- name: Pushing vm-compute-node image
run: |
@@ -896,10 +858,8 @@ jobs:
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-tools:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v14:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v14:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/extensions-v14:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v15:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v15:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/extensions-v15:${{needs.tag.outputs.build-tag}} latest
- name: Push images to production ECR
if: |
@@ -910,10 +870,8 @@ jobs:
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-tools:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/compute-tools:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v14:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v14:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v14:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v14:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/extensions-v14:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/extensions-v14:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v15:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v15:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v15:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v15:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/extensions-v15:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/extensions-v15:latest
- name: Configure Docker Hub login
run: |
@@ -935,95 +893,56 @@ jobs:
crane tag neondatabase/compute-tools:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/compute-node-v14:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/vm-compute-node-v14:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/extensions-v14:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/compute-node-v15:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/vm-compute-node-v15:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/extensions-v15:${{needs.tag.outputs.build-tag}} latest
- name: Cleanup ECR folder
run: rm -rf ~/.ecr
upload-postgres-extensions-to-s3:
if: |
(github.ref_name == 'main' || github.ref_name == 'release') &&
github.event_name != 'workflow_dispatch'
runs-on: ${{ github.ref_name == 'release' && fromJSON('["self-hosted", "prod", "x64"]') || fromJSON('["self-hosted", "gen3", "small"]') }}
needs: [ tag, promote-images ]
strategy:
fail-fast: false
matrix:
version: [ v14, v15 ]
env:
# While on transition period we extract public extensions from compute-node image and custom extensions from extensions image.
# Later all the extensions will be moved to extensions image.
EXTENSIONS_IMAGE: ${{ github.ref_name == 'release' && '093970136003' || '369495373322'}}.dkr.ecr.eu-central-1.amazonaws.com/extensions-${{ matrix.version }}:latest
COMPUTE_NODE_IMAGE: ${{ github.ref_name == 'release' && '093970136003' || '369495373322'}}.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:latest
AWS_ACCESS_KEY_ID: ${{ github.ref_name == 'release' && secrets.AWS_ACCESS_KEY_PROD || secrets.AWS_ACCESS_KEY_DEV }}
AWS_SECRET_ACCESS_KEY: ${{ github.ref_name == 'release' && secrets.AWS_SECRET_KEY_PROD || secrets.AWS_SECRET_KEY_DEV }}
S3_BUCKETS: |
${{ github.ref_name == 'release' &&
'neon-prod-extensions-ap-southeast-1 neon-prod-extensions-eu-central-1 neon-prod-extensions-us-east-1 neon-prod-extensions-us-east-2 neon-prod-extensions-us-west-2' ||
'neon-dev-extensions-eu-central-1 neon-dev-extensions-eu-west-1 neon-dev-extensions-us-east-2' }}
build-private-extensions:
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/base:pinned
options: --init
needs: [ tag ]
steps:
- name: Pull postgres-extensions image
- name: Set PR's status to pending and request a remote CI test
run: |
docker pull ${EXTENSIONS_IMAGE}
docker pull ${COMPUTE_NODE_IMAGE}
COMMIT_SHA=${{ github.event.pull_request.head.sha }}
COMMIT_SHA=${COMMIT_SHA:-${{ github.sha }}}
REMOTE_REPO="${{ github.repository_owner }}/build-custom-extensions"
- name: Create postgres-extensions container
id: create-container
run: |
EID=$(docker create ${EXTENSIONS_IMAGE} true)
echo "EID=${EID}" >> $GITHUB_OUTPUT
curl -f -X POST \
https://api.github.com/repos/${{ github.repository }}/statuses/$COMMIT_SHA \
-H "Accept: application/vnd.github.v3+json" \
--user "${{ secrets.CI_ACCESS_TOKEN }}" \
--data \
"{
\"state\": \"pending\",
\"context\": \"build-and-upload-extensions\",
\"description\": \"[$REMOTE_REPO] Remote CI job is about to start\"
}"
CID=$(docker create ${COMPUTE_NODE_IMAGE} true)
echo "CID=${CID}" >> $GITHUB_OUTPUT
- name: Extract postgres-extensions from container
run: |
rm -rf ./extensions-to-upload ./custom-extensions # Just in case
# In compute image we have a bit different directory layout
mkdir -p extensions-to-upload/share
docker cp ${{ steps.create-container.outputs.CID }}:/usr/local/share/extension ./extensions-to-upload/share/extension
docker cp ${{ steps.create-container.outputs.CID }}:/usr/local/lib ./extensions-to-upload/lib
# Delete Neon extensitons (they always present on compute-node image)
rm -rf ./extensions-to-upload/share/extension/neon*
rm -rf ./extensions-to-upload/lib/neon*
# Delete leftovers from the extension build step
rm -rf ./extensions-to-upload/lib/pgxs
rm -rf ./extensions-to-upload/lib/pkgconfig
docker cp ${{ steps.create-container.outputs.EID }}:/extensions ./custom-extensions
for EXT_NAME in $(ls ./custom-extensions); do
mkdir -p ./extensions-to-upload/${EXT_NAME}/share
mv ./custom-extensions/${EXT_NAME}/share/extension ./extensions-to-upload/${EXT_NAME}/share/extension
mv ./custom-extensions/${EXT_NAME}/lib ./extensions-to-upload/${EXT_NAME}/lib
done
- name: Upload postgres-extensions to S3
# TODO: Reenable step after switching to the new extensions format (tar-gzipped + index.json)
if: false
run: |
for BUCKET in $(echo ${S3_BUCKETS}); do
aws s3 cp --recursive --only-show-errors ./extensions-to-upload s3://${BUCKET}/${{ needs.tag.outputs.build-tag }}/${{ matrix.version }}
done
- name: Cleanup
if: ${{ always() && (steps.create-container.outputs.CID || steps.create-container.outputs.EID) }}
run: |
docker rm ${{ steps.create-container.outputs.CID }} || true
docker rm ${{ steps.create-container.outputs.EID }} || true
curl -f -X POST \
https://api.github.com/repos/$REMOTE_REPO/actions/workflows/build_and_upload_extensions.yml/dispatches \
-H "Accept: application/vnd.github.v3+json" \
--user "${{ secrets.CI_ACCESS_TOKEN }}" \
--data \
"{
\"ref\": \"main\",
\"inputs\": {
\"ci_job_name\": \"build-and-upload-extensions\",
\"commit_hash\": \"$COMMIT_SHA\",
\"remote_repo\": \"${{ github.repository }}\",
\"compute_image_tag\": \"${{ needs.tag.outputs.build-tag }}\",
\"remote_branch_name\": \"${{ github.ref_name }}\"
}
}"
deploy:
runs-on: [ self-hosted, gen3, small ]
container: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/ansible:latest
needs: [ upload-postgres-extensions-to-s3, promote-images, tag, regress-tests ]
needs: [ promote-images, tag, regress-tests ]
if: ( github.ref_name == 'main' || github.ref_name == 'release' ) && github.event_name != 'workflow_dispatch'
steps:
- name: Fix git ownership
@@ -1097,7 +1016,7 @@ jobs:
OLD_PREFIX=artifacts/${GITHUB_RUN_ID}
FILENAME=neon-${{ runner.os }}-${build_type}-artifact.tar.zst
S3_KEY=$(aws s3api list-objects-v2 --bucket ${BUCKET} --prefix ${OLD_PREFIX} | jq -r '.Contents[].Key' | grep ${FILENAME} | sort --version-sort | tail -1 || true)
S3_KEY=$(aws s3api list-objects-v2 --bucket ${BUCKET} --prefix ${OLD_PREFIX} | jq -r '.Contents[]?.Key' | grep ${FILENAME} | sort --version-sort | tail -1 || true)
if [ -z "${S3_KEY}" ]; then
echo >&2 "Neither s3://${BUCKET}/${OLD_PREFIX}/${FILENAME} nor its version from previous attempts exist"
exit 1

View File

@@ -1,11 +1,12 @@
/compute_tools/ @neondatabase/control-plane
/compute_tools/ @neondatabase/control-plane @neondatabase/compute
/control_plane/ @neondatabase/compute @neondatabase/storage
/libs/pageserver_api/ @neondatabase/compute @neondatabase/storage
/libs/postgres_ffi/ @neondatabase/compute
/libs/remote_storage/ @neondatabase/storage
/libs/safekeeper_api/ @neondatabase/safekeepers
/pageserver/ @neondatabase/compute @neondatabase/storage
/libs/postgres_ffi/ @neondatabase/compute
/libs/remote_storage/ @neondatabase/storage
/libs/safekeeper_api/ @neondatabase/safekeepers
/libs/vm_monitor/ @neondatabase/autoscaling @neondatabase/compute
/pageserver/ @neondatabase/compute @neondatabase/storage
/pgxn/ @neondatabase/compute
/proxy/ @neondatabase/control-plane
/proxy/ @neondatabase/proxy
/safekeeper/ @neondatabase/safekeepers
/vendor/ @neondatabase/compute

872
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -7,6 +7,7 @@ members = [
"proxy",
"safekeeper",
"storage_broker",
"s3_scrubber",
"workspace_hack",
"trace",
"libs/compute_api",
@@ -23,6 +24,7 @@ members = [
"libs/remote_storage",
"libs/tracing-utils",
"libs/postgres_ffi/wal_craft",
"libs/vm_monitor",
]
[workspace.package]
@@ -36,17 +38,19 @@ async-compression = { version = "0.4.0", features = ["tokio", "gzip"] }
flate2 = "1.0.26"
async-stream = "0.3"
async-trait = "0.1"
aws-config = { version = "0.55", default-features = false, features=["rustls"] }
aws-sdk-s3 = "0.27"
aws-smithy-http = "0.55"
aws-credential-types = "0.55"
aws-types = "0.55"
aws-config = { version = "0.56", default-features = false, features=["rustls"] }
aws-sdk-s3 = "0.29"
aws-smithy-http = "0.56"
aws-credential-types = "0.56"
aws-types = "0.56"
axum = { version = "0.6.20", features = ["ws"] }
base64 = "0.13.0"
bincode = "1.3"
bindgen = "0.65"
bstr = "1.0"
byteorder = "1.4"
bytes = "1.0"
cfg-if = "1.0.0"
chrono = { version = "0.4", default-features = false, features = ["clock"] }
clap = { version = "4.0", features = ["derive"] }
close_fds = "0.3.2"
@@ -54,6 +58,7 @@ comfy-table = "6.1"
const_format = "0.2"
crc32c = "0.6"
crossbeam-utils = "0.8.5"
dashmap = "5.5.0"
either = "1.8"
enum-map = "2.4.2"
enumset = "1.0.12"
@@ -73,6 +78,7 @@ humantime = "2.1"
humantime-serde = "1.1.1"
hyper = "0.14"
hyper-tungstenite = "0.9"
inotify = "0.10.2"
itertools = "0.10"
jsonwebtoken = "8"
libc = "0.2"
@@ -88,7 +94,7 @@ opentelemetry = "0.19.0"
opentelemetry-otlp = { version = "0.12.0", default_features=false, features = ["http-proto", "trace", "http", "reqwest-client"] }
opentelemetry-semantic-conventions = "0.11.0"
parking_lot = "0.12"
pbkdf2 = "0.12.1"
pbkdf2 = { version = "0.12.1", features = ["simple", "std"] }
pin-project-lite = "0.2"
prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
prost = "0.11"
@@ -100,16 +106,18 @@ reqwest-middleware = "0.2.0"
reqwest-retry = "0.2.2"
routerify = "3"
rpds = "0.13"
rustls = "0.20"
rustls = "0.21"
rustls-pemfile = "1"
rustls-split = "0.3"
scopeguard = "1.1"
sentry = { version = "0.30", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
sysinfo = "0.29.2"
sentry = { version = "0.31", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_with = "2.0"
sha2 = "0.10.2"
signal-hook = "0.3"
smallvec = "1.11"
socket2 = "0.5"
strum = "0.24"
strum_macros = "0.24"
@@ -118,11 +126,11 @@ sync_wrapper = "0.1.2"
tar = "0.4"
test-context = "0.1"
thiserror = "1.0"
tls-listener = { version = "0.6", features = ["rustls", "hyper-h1"] }
tls-listener = { version = "0.7", features = ["rustls", "hyper-h1"] }
tokio = { version = "1.17", features = ["macros"] }
tokio-io-timeout = "1.2.0"
tokio-postgres-rustls = "0.9.0"
tokio-rustls = "0.23"
tokio-postgres-rustls = "0.10.0"
tokio-rustls = "0.24"
tokio-stream = "0.1"
tokio-tar = "0.3"
tokio-util = { version = "0.7", features = ["io"] }
@@ -132,11 +140,11 @@ tonic = {version = "0.9", features = ["tls", "tls-roots"]}
tracing = "0.1"
tracing-error = "0.2.0"
tracing-opentelemetry = "0.19.0"
tracing-subscriber = { version = "0.3", default_features = false, features = ["smallvec", "fmt", "tracing-log", "std", "env-filter"] }
tracing-subscriber = { version = "0.3", default_features = false, features = ["smallvec", "fmt", "tracing-log", "std", "env-filter", "json"] }
url = "2.2"
uuid = { version = "1.2", features = ["v4", "serde"] }
walkdir = "2.3.2"
webpki-roots = "0.23"
webpki-roots = "0.25"
x509-parser = "0.15"
## TODO replace this with tracing
@@ -144,11 +152,11 @@ env_logger = "0.10"
log = "0.4"
## Libraries from neondatabase/ git forks, ideally with changes to be upstreamed
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="1aaedab101b23f7612042850d8f2036810fa7c7f" }
postgres-native-tls = { git = "https://github.com/neondatabase/rust-postgres.git", rev="1aaedab101b23f7612042850d8f2036810fa7c7f" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="1aaedab101b23f7612042850d8f2036810fa7c7f" }
postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", rev="1aaedab101b23f7612042850d8f2036810fa7c7f" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="1aaedab101b23f7612042850d8f2036810fa7c7f" }
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="9011f7110db12b5e15afaf98f8ac834501d50ddc" }
postgres-native-tls = { git = "https://github.com/neondatabase/rust-postgres.git", rev="9011f7110db12b5e15afaf98f8ac834501d50ddc" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="9011f7110db12b5e15afaf98f8ac834501d50ddc" }
postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", rev="9011f7110db12b5e15afaf98f8ac834501d50ddc" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="9011f7110db12b5e15afaf98f8ac834501d50ddc" }
## Other git libraries
heapless = { default-features=false, features=[], git = "https://github.com/japaric/heapless.git", rev = "644653bf3b831c6bb4963be2de24804acf5e5001" } # upstream release pending
@@ -168,14 +176,15 @@ storage_broker = { version = "0.1", path = "./storage_broker/" } # Note: main br
tenant_size_model = { version = "0.1", path = "./libs/tenant_size_model/" }
tracing-utils = { version = "0.1", path = "./libs/tracing-utils/" }
utils = { version = "0.1", path = "./libs/utils/" }
vm_monitor = { version = "0.1", path = "./libs/vm_monitor/" }
## Common library dependency
workspace_hack = { version = "0.1", path = "./workspace_hack/" }
## Build dependencies
criterion = "0.5.1"
rcgen = "0.10"
rstest = "0.17"
rcgen = "0.11"
rstest = "0.18"
tempfile = "3.4"
tonic-build = "0.9"
@@ -183,7 +192,7 @@ tonic-build = "0.9"
# This is only needed for proxy's tests.
# TODO: we should probably fork `tokio-postgres-rustls` instead.
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="1aaedab101b23f7612042850d8f2036810fa7c7f" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="9011f7110db12b5e15afaf98f8ac834501d50ddc" }
################# Binary contents sections

View File

@@ -51,6 +51,7 @@ RUN set -e \
--bin safekeeper \
--bin storage_broker \
--bin proxy \
--bin neon_local \
--locked --release \
&& cachepot -s
@@ -76,6 +77,7 @@ COPY --from=build --chown=neon:neon /home/nonroot/target/release/pagectl
COPY --from=build --chown=neon:neon /home/nonroot/target/release/safekeeper /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_broker /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/proxy /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/neon_local /usr/local/bin
COPY --from=pg-build /home/nonroot/pg_install/v14 /usr/local/v14/
COPY --from=pg-build /home/nonroot/pg_install/v15 /usr/local/v15/

View File

@@ -13,7 +13,7 @@ FROM debian:bullseye-slim AS build-deps
RUN apt update && \
apt install -y git autoconf automake libtool build-essential bison flex libreadline-dev \
zlib1g-dev libxml2-dev libcurl4-openssl-dev libossp-uuid-dev wget pkg-config libssl-dev \
libicu-dev libxslt1-dev liblz4-dev libzstd-dev
libicu-dev libxslt1-dev liblz4-dev libzstd-dev zstd
#########################################################################################
#
@@ -77,6 +77,7 @@ ENV PATH "/usr/local/pgsql/bin:$PATH"
RUN wget https://download.osgeo.org/postgis/source/postgis-3.3.2.tar.gz -O postgis.tar.gz && \
echo "9a2a219da005a1730a39d1959a1c7cec619b1efb009b65be80ffc25bad299068 postgis.tar.gz" | sha256sum --check && \
mkdir postgis-src && cd postgis-src && tar xvzf ../postgis.tar.gz --strip-components=1 -C . && \
find /usr/local/pgsql -type f | sed 's|^/usr/local/pgsql/||' > /before.txt &&\
./autogen.sh && \
./configure --with-sfcgal=/usr/local/bin/sfcgal-config && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
@@ -89,17 +90,28 @@ RUN wget https://download.osgeo.org/postgis/source/postgis-3.3.2.tar.gz -O postg
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_tiger_geocoder.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_topology.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer_data_us.control
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer_data_us.control && \
mkdir -p /extensions/postgis && \
cp /usr/local/pgsql/share/extension/postgis.control /extensions/postgis && \
cp /usr/local/pgsql/share/extension/postgis_raster.control /extensions/postgis && \
cp /usr/local/pgsql/share/extension/postgis_sfcgal.control /extensions/postgis && \
cp /usr/local/pgsql/share/extension/postgis_tiger_geocoder.control /extensions/postgis && \
cp /usr/local/pgsql/share/extension/postgis_topology.control /extensions/postgis && \
cp /usr/local/pgsql/share/extension/address_standardizer.control /extensions/postgis && \
cp /usr/local/pgsql/share/extension/address_standardizer_data_us.control /extensions/postgis
RUN wget https://github.com/pgRouting/pgrouting/archive/v3.4.2.tar.gz -O pgrouting.tar.gz && \
echo "cac297c07d34460887c4f3b522b35c470138760fe358e351ad1db4edb6ee306e pgrouting.tar.gz" | sha256sum --check && \
mkdir pgrouting-src && cd pgrouting-src && tar xvzf ../pgrouting.tar.gz --strip-components=1 -C . && \
mkdir build && \
cd build && \
mkdir build && cd build && \
cmake -DCMAKE_BUILD_TYPE=Release .. && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgrouting.control
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgrouting.control && \
find /usr/local/pgsql -type f | sed 's|^/usr/local/pgsql/||' > /after.txt &&\
cp /usr/local/pgsql/share/extension/pgrouting.control /extensions/postgis && \
sort -o /before.txt /before.txt && sort -o /after.txt /after.txt && \
comm -13 /before.txt /after.txt | tar --directory=/usr/local/pgsql --zstd -cf /extensions/postgis.tar.zst -T -
#########################################################################################
#
@@ -199,8 +211,8 @@ RUN wget https://github.com/df7cb/postgresql-unit/archive/refs/tags/7.7.tar.gz -
FROM build-deps AS vector-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.4.4.tar.gz -O pgvector.tar.gz && \
echo "1cb70a63f8928e396474796c22a20be9f7285a8a013009deb8152445b61b72e6 pgvector.tar.gz" | sha256sum --check && \
RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.5.0.tar.gz -O pgvector.tar.gz && \
echo "d8aa3504b215467ca528525a6de12c3f85f9891b091ce0e5864dd8a9b757f77b pgvector.tar.gz" | sha256sum --check && \
mkdir pgvector-src && cd pgvector-src && tar xvzf ../pgvector.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
@@ -419,12 +431,16 @@ RUN apt-get update && \
wget https://github.com/ketteq-neon/postgres-exts/archive/e0bd1a9d9313d7120c1b9c7bb15c48c0dede4c4e.tar.gz -O kq_imcx.tar.gz && \
echo "dc93a97ff32d152d32737ba7e196d9687041cda15e58ab31344c2f2de8855336 kq_imcx.tar.gz" | sha256sum --check && \
mkdir kq_imcx-src && cd kq_imcx-src && tar xvzf ../kq_imcx.tar.gz --strip-components=1 -C . && \
mkdir build && \
cd build && \
find /usr/local/pgsql -type f | sed 's|^/usr/local/pgsql/||' > /before.txt &&\
mkdir build && cd build && \
cmake -DCMAKE_BUILD_TYPE=Release .. && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/kq_imcx.control
echo 'trusted = true' >> /usr/local/pgsql/share/extension/kq_imcx.control && \
find /usr/local/pgsql -type f | sed 's|^/usr/local/pgsql/||' > /after.txt &&\
mkdir -p /extensions/kq_imcx && cp /usr/local/pgsql/share/extension/kq_imcx.control /extensions/kq_imcx && \
sort -o /before.txt /before.txt && sort -o /after.txt /after.txt && \
comm -13 /before.txt /after.txt | tar --directory=/usr/local/pgsql --zstd -cf /extensions/kq_imcx.tar.zst -T -
#########################################################################################
#
@@ -535,10 +551,8 @@ FROM build-deps AS pg-embedding-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
# eeb3ba7c3a60c95b2604dd543c64b2f1bb4a3703 made on 15/07/2023
# There is no release tag yet
RUN wget https://github.com/neondatabase/pg_embedding/archive/eeb3ba7c3a60c95b2604dd543c64b2f1bb4a3703.tar.gz -O pg_embedding.tar.gz && \
echo "030846df723652f99a8689ce63b66fa0c23477a7fd723533ab8a6b28ab70730f pg_embedding.tar.gz" | sha256sum --check && \
RUN wget https://github.com/neondatabase/pg_embedding/archive/refs/tags/0.3.5.tar.gz -O pg_embedding.tar.gz && \
echo "0e95b27b8b6196e2cf0a0c9ec143fe2219b82e54c5bb4ee064e76398cbe69ae9 pg_embedding.tar.gz" | sha256sum --check && \
mkdir pg_embedding-src && cd pg_embedding-src && tar xvzf ../pg_embedding.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
@@ -553,16 +567,17 @@ RUN wget https://github.com/neondatabase/pg_embedding/archive/eeb3ba7c3a60c95b26
FROM build-deps AS pg-anon-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
# Kaniko doesn't allow to do `${from#/usr/local/pgsql/}`, so we use `${from:17}` instead
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://gitlab.com/dalibo/postgresql_anonymizer/-/archive/1.1.0/postgresql_anonymizer-1.1.0.tar.gz -O pg_anon.tar.gz && \
echo "08b09d2ff9b962f96c60db7e6f8e79cf7253eb8772516998fc35ece08633d3ad pg_anon.tar.gz" | sha256sum --check && \
mkdir pg_anon-src && cd pg_anon-src && tar xvzf ../pg_anon.tar.gz --strip-components=1 -C . && \
find /usr/local/pgsql -type f | sort > /before.txt && \
find /usr/local/pgsql -type f | sed 's|^/usr/local/pgsql/||' > /before.txt &&\
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/anon.control && \
find /usr/local/pgsql -type f | sort > /after.txt && \
/bin/bash -c 'for from in $(comm -13 /before.txt /after.txt); do to=/extensions/anon/${from:17} && mkdir -p $(dirname ${to}) && cp -a ${from} ${to}; done'
find /usr/local/pgsql -type f | sed 's|^/usr/local/pgsql/||' > /after.txt &&\
mkdir -p /extensions/anon && cp /usr/local/pgsql/share/extension/anon.control /extensions/anon && \
sort -o /before.txt /before.txt && sort -o /after.txt /after.txt && \
comm -13 /before.txt /after.txt | tar --directory=/usr/local/pgsql --zstd -cf /extensions/anon.tar.zst -T -
#########################################################################################
#
@@ -749,22 +764,6 @@ RUN rm -r /usr/local/pgsql/include
# if they were to be used by other libraries.
RUN rm /usr/local/pgsql/lib/lib*.a
#########################################################################################
#
# Extenstion only
#
#########################################################################################
FROM scratch AS postgres-extensions
# After the transition this layer will include all extensitons.
# As for now, it's only for new custom ones
#
# # Default extensions
# COPY --from=postgres-cleanup-layer /usr/local/pgsql/share/extension /usr/local/pgsql/share/extension
# COPY --from=postgres-cleanup-layer /usr/local/pgsql/lib /usr/local/pgsql/lib
# Custom extensions
COPY --from=pg-anon-pg-build /extensions/anon/lib/ /extensions/anon/lib
COPY --from=pg-anon-pg-build /extensions/anon/share/extension /extensions/anon/share/extension
#########################################################################################
#
# Final layer
@@ -794,6 +793,7 @@ COPY --from=compute-tools --chown=postgres /home/nonroot/target/release-line-deb
# libxml2, libxslt1.1 for xml2
# libzstd1 for zstd
# libboost*, libfreetype6, and zlib1g for rdkit
# ca-certificates for communicating with s3 by compute_ctl
RUN apt update && \
apt install --no-install-recommends -y \
gdb \
@@ -817,7 +817,8 @@ RUN apt update && \
libcurl4-openssl-dev \
locales \
procps \
zlib1g && \
zlib1g \
ca-certificates && \
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* && \
localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8

View File

@@ -108,6 +108,8 @@ postgres-%: postgres-configure-% \
$(MAKE) -C $(POSTGRES_INSTALL_DIR)/build/$*/contrib/pg_buffercache install
+@echo "Compiling pageinspect $*"
$(MAKE) -C $(POSTGRES_INSTALL_DIR)/build/$*/contrib/pageinspect install
+@echo "Compiling amcheck $*"
$(MAKE) -C $(POSTGRES_INSTALL_DIR)/build/$*/contrib/amcheck install
.PHONY: postgres-clean-%
postgres-clean-%:

View File

@@ -29,13 +29,13 @@ See developer documentation in [SUMMARY.md](/docs/SUMMARY.md) for more informati
```bash
apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev cmake postgresql-client protobuf-compiler \
libcurl4-openssl-dev
libcurl4-openssl-dev openssl python-poetry
```
* On Fedora, these packages are needed:
```bash
dnf install flex bison readline-devel zlib-devel openssl-devel \
libseccomp-devel perl clang cmake postgresql postgresql-contrib protobuf-compiler \
protobuf-devel libcurl-devel
protobuf-devel libcurl-devel openssl poetry
```
* On Arch based systems, these packages are needed:
```bash
@@ -235,6 +235,13 @@ CARGO_BUILD_FLAGS="--features=testing" make
./scripts/pytest
```
By default, this runs both debug and release modes, and all supported postgres versions. When
testing locally, it is convenient to run just run one set of permutations, like this:
```sh
DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest
```
## Documentation
[docs](/docs) Contains a top-level overview of all available markdown documentation.

View File

@@ -8,6 +8,7 @@ license.workspace = true
anyhow.workspace = true
async-compression.workspace = true
chrono.workspace = true
cfg-if.workspace = true
clap.workspace = true
flate2.workspace = true
futures.workspace = true
@@ -23,6 +24,7 @@ tar.workspace = true
reqwest = { workspace = true, features = ["json"] }
tokio = { workspace = true, features = ["rt", "rt-multi-thread"] }
tokio-postgres.workspace = true
tokio-util.workspace = true
tracing.workspace = true
tracing-opentelemetry.workspace = true
tracing-subscriber.workspace = true
@@ -32,3 +34,7 @@ url.workspace = true
compute_api.workspace = true
utils.workspace = true
workspace_hack.workspace = true
toml_edit.workspace = true
remote_storage = { version = "0.1", path = "../libs/remote_storage/" }
vm_monitor = { version = "0.1", path = "../libs/vm_monitor/" }
zstd = "0.12.4"

View File

@@ -19,9 +19,10 @@ Also `compute_ctl` spawns two separate service threads:
- `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
last activity requests.
If the `vm-informant` binary is present at `/bin/vm-informant`, it will also be started. For VM
compute nodes, `vm-informant` communicates with the VM autoscaling system. It coordinates
downscaling and (eventually) will request immediate upscaling under resource pressure.
If `AUTOSCALING` environment variable is set, `compute_ctl` will start the
`vm-monitor` located in [`neon/libs/vm_monitor`]. For VM compute nodes,
`vm-monitor` communicates with the VM autoscaling system. It coordinates
downscaling and requests immediate upscaling under resource pressure.
Usage example:
```sh

View File

@@ -5,6 +5,8 @@
//! - `compute_ctl` accepts cluster (compute node) specification as a JSON file.
//! - Every start is a fresh start, so the data directory is removed and
//! initialized again on each run.
//! - If remote_extension_config is provided, it will be used to fetch extensions list
//! and download `shared_preload_libraries` from the remote storage.
//! - Next it will put configuration files into the `PGDATA` directory.
//! - Sync safekeepers and get commit LSN.
//! - Get `basebackup` from pageserver using the returned on the previous step LSN.
@@ -18,24 +20,25 @@
//! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
//! last activity requests.
//!
//! If the `vm-informant` binary is present at `/bin/vm-informant`, it will also be started. For VM
//! compute nodes, `vm-informant` communicates with the VM autoscaling system. It coordinates
//! downscaling and (eventually) will request immediate upscaling under resource pressure.
//! If `AUTOSCALING` environment variable is set, `compute_ctl` will start the
//! `vm-monitor` located in [`neon/libs/vm_monitor`]. For VM compute nodes,
//! `vm-monitor` communicates with the VM autoscaling system. It coordinates
//! downscaling and requests immediate upscaling under resource pressure.
//!
//! Usage example:
//! ```sh
//! compute_ctl -D /var/db/postgres/compute \
//! -C 'postgresql://cloud_admin@localhost/postgres' \
//! -S /var/db/postgres/specs/current.json \
//! -b /usr/local/bin/postgres
//! -b /usr/local/bin/postgres \
//! -r {"bucket": "neon-dev-extensions-eu-central-1", "region": "eu-central-1"}
//! ```
//!
use std::collections::HashMap;
use std::fs::File;
use std::panic;
use std::path::Path;
use std::process::exit;
use std::sync::{mpsc, Arc, Condvar, Mutex};
use std::sync::{mpsc, Arc, Condvar, Mutex, RwLock};
use std::{thread, time::Duration};
use anyhow::{Context, Result};
@@ -48,22 +51,33 @@ use compute_api::responses::ComputeStatus;
use compute_tools::compute::{ComputeNode, ComputeState, ParsedSpec};
use compute_tools::configurator::launch_configurator;
use compute_tools::extension_server::{get_pg_version, init_remote_storage};
use compute_tools::http::api::launch_http_server;
use compute_tools::logger::*;
use compute_tools::monitor::launch_monitor;
use compute_tools::params::*;
use compute_tools::spec::*;
const BUILD_TAG_DEFAULT: &str = "local";
// this is an arbitrary build tag. Fine as a default / for testing purposes
// in-case of not-set environment var
const BUILD_TAG_DEFAULT: &str = "5670669815";
fn main() -> Result<()> {
init_tracing_and_logging(DEFAULT_LOG_LEVEL)?;
let build_tag = option_env!("BUILD_TAG").unwrap_or(BUILD_TAG_DEFAULT);
let build_tag = option_env!("BUILD_TAG")
.unwrap_or(BUILD_TAG_DEFAULT)
.to_string();
info!("build_tag: {build_tag}");
let matches = cli().get_matches();
let pgbin_default = String::from("postgres");
let pgbin = matches.get_one::<String>("pgbin").unwrap_or(&pgbin_default);
let remote_ext_config = matches.get_one::<String>("remote-ext-config");
let ext_remote_storage = remote_ext_config.map(|x| {
init_remote_storage(x).expect("cannot initialize remote extension storage from config")
});
let http_port = *matches
.get_one::<u16>("http-port")
@@ -128,14 +142,12 @@ fn main() -> Result<()> {
let compute_id = matches.get_one::<String>("compute-id");
let control_plane_uri = matches.get_one::<String>("control-plane-uri");
// Try to use just 'postgres' if no path is provided
let pgbin = matches.get_one::<String>("pgbin").unwrap();
let spec;
let mut live_config_allowed = false;
match spec_json {
// First, try to get cluster spec from the cli argument
Some(json) => {
info!("got spec from cli argument {}", json);
spec = Some(serde_json::from_str(json)?);
}
None => {
@@ -168,8 +180,10 @@ fn main() -> Result<()> {
let mut new_state = ComputeState::new();
let spec_set;
if let Some(spec) = spec {
let pspec = ParsedSpec::try_from(spec).map_err(|msg| anyhow::anyhow!(msg))?;
info!("new pspec.spec: {:?}", pspec.spec);
new_state.pspec = Some(pspec);
spec_set = true;
} else {
@@ -179,20 +193,35 @@ fn main() -> Result<()> {
connstr: Url::parse(connstr).context("cannot parse connstr as a URL")?,
pgdata: pgdata.to_string(),
pgbin: pgbin.to_string(),
pgversion: get_pg_version(pgbin),
live_config_allowed,
state: Mutex::new(new_state),
state_changed: Condvar::new(),
ext_remote_storage,
ext_download_progress: RwLock::new(HashMap::new()),
build_tag,
};
let compute = Arc::new(compute_node);
// If this is a pooled VM, prewarm before starting HTTP server and becoming
// available for binding. Prewarming helps postgres start quicker later,
// because QEMU will already have it's memory allocated from the host, and
// the necessary binaries will alreaady be cached.
if !spec_set {
compute.prewarm_postgres()?;
}
// Launch http service first, so we were able to serve control-plane
// requests, while configuration is still in progress.
let _http_handle =
launch_http_server(http_port, &compute).expect("cannot launch http endpoint thread");
let extension_server_port: u16 = http_port;
if !spec_set {
// No spec provided, hang waiting for it.
info!("no compute spec provided, waiting");
let mut state = compute.state.lock().unwrap();
while state.status != ComputeStatus::ConfigurationPending {
state = compute.state_changed.wait(state).unwrap();
@@ -223,14 +252,13 @@ fn main() -> Result<()> {
drop(state);
// Launch remaining service threads
let _monitor_handle = launch_monitor(&compute).expect("cannot launch compute monitor thread");
let _configurator_handle =
launch_configurator(&compute).expect("cannot launch configurator thread");
let _monitor_handle = launch_monitor(&compute);
let _configurator_handle = launch_configurator(&compute);
// Start Postgres
let mut delay_exit = false;
let mut exit_code = None;
let pg = match compute.start_compute() {
let pg = match compute.start_compute(extension_server_port) {
Ok(pg) => Some(pg),
Err(err) => {
error!("could not start the compute node: {:?}", err);
@@ -243,6 +271,57 @@ fn main() -> Result<()> {
}
};
// Start the vm-monitor if directed to. The vm-monitor only runs on linux
// because it requires cgroups.
cfg_if::cfg_if! {
if #[cfg(target_os = "linux")] {
use std::env;
use tokio_util::sync::CancellationToken;
use tracing::warn;
let vm_monitor_addr = matches.get_one::<String>("vm-monitor-addr");
let file_cache_connstr = matches.get_one::<String>("filecache-connstr");
let cgroup = matches.get_one::<String>("cgroup");
let file_cache_on_disk = matches.get_flag("file-cache-on-disk");
// Only make a runtime if we need to.
// Note: it seems like you can make a runtime in an inner scope and
// if you start a task in it it won't be dropped. However, make it
// in the outermost scope just to be safe.
let rt = match (env::var_os("AUTOSCALING"), vm_monitor_addr) {
(None, None) => None,
(None, Some(_)) => {
warn!("--vm-monitor-addr option set but AUTOSCALING env var not present");
None
}
(Some(_), None) => {
panic!("AUTOSCALING env var present but --vm-monitor-addr option not set")
}
(Some(_), Some(_)) => Some(
tokio::runtime::Builder::new_multi_thread()
.worker_threads(4)
.enable_all()
.build()
.expect("failed to create tokio runtime for monitor"),
),
};
// This token is used internally by the monitor to clean up all threads
let token = CancellationToken::new();
let vm_monitor = &rt.as_ref().map(|rt| {
rt.spawn(vm_monitor::start(
Box::leak(Box::new(vm_monitor::Args {
cgroup: cgroup.cloned(),
pgconnstr: file_cache_connstr.cloned(),
addr: vm_monitor_addr.cloned().unwrap(),
file_cache_on_disk,
})),
token.clone(),
))
});
}
}
// Wait for the child Postgres process forever. In this state Ctrl+C will
// propagate to Postgres and it will be shut down as well.
if let Some(mut pg) = pg {
@@ -256,6 +335,24 @@ fn main() -> Result<()> {
exit_code = ecode.code()
}
// Terminate the vm_monitor so it releases the file watcher on
// /sys/fs/cgroup/neon-postgres.
// Note: the vm-monitor only runs on linux because it requires cgroups.
cfg_if::cfg_if! {
if #[cfg(target_os = "linux")] {
if let Some(handle) = vm_monitor {
// Kills all threads spawned by the monitor
token.cancel();
// Kills the actual task running the monitor
handle.abort();
// If handle is some, rt must have been used to produce it, and
// hence is also some
rt.unwrap().shutdown_timeout(Duration::from_secs(2));
}
}
}
// Maybe sync safekeepers again, to speed up next startup
let compute_state = compute.state.lock().unwrap().clone();
let pspec = compute_state.pspec.as_ref().expect("spec must be set");
@@ -359,6 +456,40 @@ fn cli() -> clap::Command {
.long("control-plane-uri")
.value_name("CONTROL_PLANE_API_BASE_URI"),
)
.arg(
Arg::new("remote-ext-config")
.short('r')
.long("remote-ext-config")
.value_name("REMOTE_EXT_CONFIG"),
)
// TODO(fprasx): we currently have default arguments because the cloud PR
// to pass them in hasn't been merged yet. We should get rid of them once
// the PR is merged.
.arg(
Arg::new("vm-monitor-addr")
.long("vm-monitor-addr")
.default_value("0.0.0.0:10301")
.value_name("VM_MONITOR_ADDR"),
)
.arg(
Arg::new("cgroup")
.long("cgroup")
.default_value("neon-postgres")
.value_name("CGROUP"),
)
.arg(
Arg::new("filecache-connstr")
.long("filecache-connstr")
.default_value(
"host=localhost port=5432 dbname=postgres user=cloud_admin sslmode=disable",
)
.value_name("FILECACHE_CONNSTR"),
)
.arg(
Arg::new("file-cache-on-disk")
.long("file-cache-on-disk")
.action(clap::ArgAction::SetTrue),
)
}
#[test]

View File

@@ -1,12 +1,39 @@
use anyhow::{anyhow, Result};
use anyhow::{anyhow, Ok, Result};
use postgres::Client;
use tokio_postgres::NoTls;
use tracing::{error, instrument};
use crate::compute::ComputeNode;
/// Create a special service table for availability checks
/// only if it does not exist already.
pub fn create_availability_check_data(client: &mut Client) -> Result<()> {
let query = "
DO $$
BEGIN
IF NOT EXISTS(
SELECT 1
FROM pg_catalog.pg_tables
WHERE tablename = 'health_check'
)
THEN
CREATE TABLE health_check (
id serial primary key,
updated_at timestamptz default now()
);
INSERT INTO health_check VALUES (1, now())
ON CONFLICT (id) DO UPDATE
SET updated_at = now();
END IF;
END
$$;";
client.execute(query, &[])?;
Ok(())
}
/// Update timestamp in a row in a special service table to check
/// that we can actually write some data in this particular timeline.
/// Create table if it's missing.
#[instrument(skip_all)]
pub async fn check_writability(compute: &ComputeNode) -> Result<()> {
// Connect to the database.
@@ -24,19 +51,15 @@ pub async fn check_writability(compute: &ComputeNode) -> Result<()> {
});
let query = "
CREATE TABLE IF NOT EXISTS health_check (
id serial primary key,
updated_at timestamptz default now()
);
INSERT INTO health_check VALUES (1, now())
ON CONFLICT (id) DO UPDATE
SET updated_at = now();";
let result = client.simple_query(query).await?;
if result.len() != 2 {
if result.len() != 1 {
return Err(anyhow::format_err!(
"expected 2 query results, but got {}",
"expected 1 query result, but got {}",
result.len()
));
}

View File

@@ -1,16 +1,23 @@
use std::collections::HashMap;
use std::env;
use std::fs;
use std::io::BufRead;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::{Command, Stdio};
use std::str::FromStr;
use std::sync::{Condvar, Mutex};
use std::sync::{Condvar, Mutex, RwLock};
use std::time::Instant;
use anyhow::{Context, Result};
use chrono::{DateTime, Utc};
use futures::future::join_all;
use futures::stream::FuturesUnordered;
use futures::StreamExt;
use postgres::{Client, NoTls};
use tokio;
use tokio_postgres;
use tracing::{info, instrument, warn};
use tracing::{error, info, instrument, warn};
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
@@ -18,9 +25,13 @@ use compute_api::responses::{ComputeMetrics, ComputeStatus};
use compute_api::spec::{ComputeMode, ComputeSpec};
use utils::measured_stream::MeasuredReader;
use crate::config;
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
use crate::checker::create_availability_check_data;
use crate::pg_helpers::*;
use crate::spec::*;
use crate::sync_sk::{check_if_synced, ping_safekeeper};
use crate::{config, extension_server};
/// Compute node info shared across several `compute_ctl` threads.
pub struct ComputeNode {
@@ -28,6 +39,7 @@ pub struct ComputeNode {
pub connstr: url::Url,
pub pgdata: String,
pub pgbin: String,
pub pgversion: String,
/// We should only allow live re- / configuration of the compute node if
/// it uses 'pull model', i.e. it can go to control-plane and fetch
/// the latest configuration. Otherwise, there could be a case:
@@ -47,6 +59,19 @@ pub struct ComputeNode {
pub state: Mutex<ComputeState>,
/// `Condvar` to allow notifying waiters about state changes.
pub state_changed: Condvar,
/// the S3 bucket that we search for extensions in
pub ext_remote_storage: Option<GenericRemoteStorage>,
// key: ext_archive_name, value: started download time, download_completed?
pub ext_download_progress: RwLock<HashMap<String, (DateTime<Utc>, bool)>>,
pub build_tag: String,
}
// store some metrics about download size that might impact startup time
#[derive(Clone, Debug)]
pub struct RemoteExtensionMetrics {
num_ext_downloaded: u64,
largest_ext_size: u64,
total_ext_download_size: u64,
}
#[derive(Clone, Debug)]
@@ -86,6 +111,7 @@ pub struct ParsedSpec {
pub tenant_id: TenantId,
pub timeline_id: TimelineId,
pub pageserver_connstr: String,
pub safekeeper_connstrings: Vec<String>,
pub storage_auth_token: Option<String>,
}
@@ -103,6 +129,21 @@ impl TryFrom<ComputeSpec> for ParsedSpec {
.clone()
.or_else(|| spec.cluster.settings.find("neon.pageserver_connstring"))
.ok_or("pageserver connstr should be provided")?;
let safekeeper_connstrings = if spec.safekeeper_connstrings.is_empty() {
if matches!(spec.mode, ComputeMode::Primary) {
spec.cluster
.settings
.find("neon.safekeepers")
.ok_or("safekeeper connstrings should be provided")?
.split(',')
.map(|str| str.to_string())
.collect()
} else {
vec![]
}
} else {
spec.safekeeper_connstrings.clone()
};
let storage_auth_token = spec.storage_auth_token.clone();
let tenant_id: TenantId = if let Some(tenant_id) = spec.tenant_id {
tenant_id
@@ -128,6 +169,7 @@ impl TryFrom<ComputeSpec> for ParsedSpec {
Ok(ParsedSpec {
spec,
pageserver_connstr,
safekeeper_connstrings,
storage_auth_token,
tenant_id,
timeline_id,
@@ -135,6 +177,27 @@ impl TryFrom<ComputeSpec> for ParsedSpec {
}
}
/// If we are a VM, returns a [`Command`] that will run in the `neon-postgres`
/// cgroup. Otherwise returns the default `Command::new(cmd)`
///
/// This function should be used to start postgres, as it will start it in the
/// neon-postgres cgroup if we are a VM. This allows autoscaling to control
/// postgres' resource usage. The cgroup will exist in VMs because vm-builder
/// creates it during the sysinit phase of its inittab.
fn maybe_cgexec(cmd: &str) -> Command {
// The cplane sets this env var for autoscaling computes.
// use `var_os` so we don't have to worry about the variable being valid
// unicode. Should never be an concern . . . but just in case
if env::var_os("AUTOSCALING").is_some() {
let mut command = Command::new("cgexec");
command.args(["-g", "memory:neon-postgres"]);
command.arg(cmd);
command
} else {
Command::new(cmd)
}
}
/// Create special neon_superuser role, that's a slightly nerfed version of a real superuser
/// that we give to customers
fn create_neon_superuser(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
@@ -240,7 +303,7 @@ impl ComputeNode {
#[instrument(skip_all, fields(%lsn))]
fn get_basebackup(&self, compute_state: &ComputeState, lsn: Lsn) -> Result<()> {
let spec = compute_state.pspec.as_ref().expect("spec must be set");
let start_time = Utc::now();
let start_time = Instant::now();
let mut config = postgres::Config::from_str(&spec.pageserver_connstr)?;
@@ -253,7 +316,10 @@ impl ComputeNode {
info!("Storage auth token not set");
}
// Connect to pageserver
let mut client = config.connect(NoTls)?;
let pageserver_connect_micros = start_time.elapsed().as_micros() as u64;
let basebackup_cmd = match lsn {
// HACK We don't use compression on first start (Lsn(0)) because there's no API for it
Lsn(0) => format!("basebackup {} {}", spec.tenant_id, spec.timeline_id),
@@ -299,14 +365,107 @@ impl ComputeNode {
};
// Report metrics
self.state.lock().unwrap().metrics.basebackup_bytes =
measured_reader.get_byte_count() as u64;
self.state.lock().unwrap().metrics.basebackup_ms = Utc::now()
let mut state = self.state.lock().unwrap();
state.metrics.pageserver_connect_micros = pageserver_connect_micros;
state.metrics.basebackup_bytes = measured_reader.get_byte_count() as u64;
state.metrics.basebackup_ms = start_time.elapsed().as_millis() as u64;
Ok(())
}
pub async fn check_safekeepers_synced_async(
&self,
compute_state: &ComputeState,
) -> Result<Option<Lsn>> {
// Construct a connection config for each safekeeper
let pspec: ParsedSpec = compute_state
.pspec
.as_ref()
.expect("spec must be set")
.clone();
let sk_connstrs: Vec<String> = pspec.safekeeper_connstrings.clone();
let sk_configs = sk_connstrs.into_iter().map(|connstr| {
// Format connstr
let id = connstr.clone();
let connstr = format!("postgresql://no_user@{}", connstr);
let options = format!(
"-c timeline_id={} tenant_id={}",
pspec.timeline_id, pspec.tenant_id
);
// Construct client
let mut config = tokio_postgres::Config::from_str(&connstr).unwrap();
config.options(&options);
if let Some(storage_auth_token) = pspec.storage_auth_token.clone() {
config.password(storage_auth_token);
}
(id, config)
});
// Create task set to query all safekeepers
let mut tasks = FuturesUnordered::new();
let quorum = sk_configs.len() / 2 + 1;
for (id, config) in sk_configs {
let timeout = tokio::time::Duration::from_millis(100);
let task = tokio::time::timeout(timeout, ping_safekeeper(id, config));
tasks.push(tokio::spawn(task));
}
// Get a quorum of responses or errors
let mut responses = Vec::new();
let mut join_errors = Vec::new();
let mut task_errors = Vec::new();
let mut timeout_errors = Vec::new();
while let Some(response) = tasks.next().await {
match response {
Ok(Ok(Ok(r))) => responses.push(r),
Ok(Ok(Err(e))) => task_errors.push(e),
Ok(Err(e)) => timeout_errors.push(e),
Err(e) => join_errors.push(e),
};
if responses.len() >= quorum {
break;
}
if join_errors.len() + task_errors.len() + timeout_errors.len() >= quorum {
break;
}
}
// In case of error, log and fail the check, but don't crash.
// We're playing it safe because these errors could be transient
// and we don't yet retry. Also being careful here allows us to
// be backwards compatible with safekeepers that don't have the
// TIMELINE_STATUS API yet.
if responses.len() < quorum {
error!(
"failed sync safekeepers check {:?} {:?} {:?}",
join_errors, task_errors, timeout_errors
);
return Ok(None);
}
Ok(check_if_synced(responses))
}
// Fast path for sync_safekeepers. If they're already synced we get the lsn
// in one roundtrip. If not, we should do a full sync_safekeepers.
pub fn check_safekeepers_synced(&self, compute_state: &ComputeState) -> Result<Option<Lsn>> {
let start_time = Utc::now();
// Run actual work with new tokio runtime
let rt = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()
.expect("failed to create rt");
let result = rt.block_on(self.check_safekeepers_synced_async(compute_state));
// Record runtime
self.state.lock().unwrap().metrics.sync_sk_check_ms = Utc::now()
.signed_duration_since(start_time)
.to_std()
.unwrap()
.as_millis() as u64;
Ok(())
result
}
// Run `postgres` in a special mode with `--sync-safekeepers` argument
@@ -315,7 +474,7 @@ impl ComputeNode {
pub fn sync_safekeepers(&self, storage_auth_token: Option<String>) -> Result<Lsn> {
let start_time = Utc::now();
let sync_handle = Command::new(&self.pgbin)
let sync_handle = maybe_cgexec(&self.pgbin)
.args(["--sync-safekeepers"])
.env("PGDATA", &self.pgdata) // we cannot use -D in this mode
.envs(if let Some(storage_auth_token) = &storage_auth_token {
@@ -357,27 +516,35 @@ impl ComputeNode {
/// Do all the preparations like PGDATA directory creation, configuration,
/// safekeepers sync, basebackup, etc.
#[instrument(skip_all)]
pub fn prepare_pgdata(&self, compute_state: &ComputeState) -> Result<()> {
pub fn prepare_pgdata(
&self,
compute_state: &ComputeState,
extension_server_port: u16,
) -> Result<()> {
let pspec = compute_state.pspec.as_ref().expect("spec must be set");
let spec = &pspec.spec;
let pgdata_path = Path::new(&self.pgdata);
// Remove/create an empty pgdata directory and put configuration there.
self.create_pgdata()?;
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), &pspec.spec)?;
config::write_postgres_conf(
&pgdata_path.join("postgresql.conf"),
&pspec.spec,
Some(extension_server_port),
)?;
// Syncing safekeepers is only safe with primary nodes: if a primary
// is already connected it will be kicked out, so a secondary (standby)
// cannot sync safekeepers.
let lsn = match spec.mode {
ComputeMode::Primary => {
info!("starting safekeepers syncing");
let lsn = if let Some(lsn) = self.check_safekeepers_synced() {
info!("checking if safekeepers are synced");
let lsn = if let Ok(Some(lsn)) = self.check_safekeepers_synced(compute_state) {
lsn
} else {
self
.sync_safekeepers(pspec.storage_auth_token.clone())
.with_context(|| "failed to sync safekeepers")?;
info!("starting safekeepers syncing");
self.sync_safekeepers(pspec.storage_auth_token.clone())
.with_context(|| "failed to sync safekeepers")?
};
info!("safekeepers synced at LSN {}", lsn);
lsn
@@ -416,6 +583,50 @@ impl ComputeNode {
Ok(())
}
/// Start and stop a postgres process to warm up the VM for startup.
pub fn prewarm_postgres(&self) -> Result<()> {
info!("prewarming");
// Create pgdata
let pgdata = &format!("{}.warmup", self.pgdata);
create_pgdata(pgdata)?;
// Run initdb to completion
info!("running initdb");
let initdb_bin = Path::new(&self.pgbin).parent().unwrap().join("initdb");
Command::new(initdb_bin)
.args(["-D", pgdata])
.output()
.expect("cannot start initdb process");
// Write conf
use std::io::Write;
let conf_path = Path::new(pgdata).join("postgresql.conf");
let mut file = std::fs::File::create(conf_path)?;
writeln!(file, "shared_buffers=65536")?;
writeln!(file, "port=51055")?; // Nobody should be connecting
writeln!(file, "shared_preload_libraries = 'neon'")?;
// Start postgres
info!("starting postgres");
let mut pg = maybe_cgexec(&self.pgbin)
.args(["-D", pgdata])
.spawn()
.expect("cannot start postgres process");
// Stop it when it's ready
info!("waiting for postgres");
wait_for_postgres(&mut pg, Path::new(pgdata))?;
pg.kill()?;
info!("sent kill signal");
pg.wait()?;
info!("done prewarming");
// clean up
let _ok = fs::remove_dir_all(pgdata);
Ok(())
}
/// Start Postgres as a child process and manage DBs/roles.
/// After that this will hang waiting on the postmaster process to exit.
#[instrument(skip_all)]
@@ -426,7 +637,7 @@ impl ComputeNode {
let pgdata_path = Path::new(&self.pgdata);
// Run postgres as a child process.
let mut pg = Command::new(&self.pgbin)
let mut pg = maybe_cgexec(&self.pgbin)
.args(["-D", &self.pgdata])
.envs(if let Some(storage_auth_token) = &storage_auth_token {
vec![("NEON_AUTH_TOKEN", storage_auth_token)]
@@ -486,6 +697,7 @@ impl ComputeNode {
handle_role_deletions(spec, self.connstr.as_str(), &mut client)?;
handle_grants(spec, self.connstr.as_str())?;
handle_extensions(spec, &mut client)?;
create_availability_check_data(&mut client)?;
// 'Close' connection
drop(client);
@@ -510,7 +722,7 @@ impl ComputeNode {
// Write new config
let pgdata_path = Path::new(&self.pgdata);
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), &spec)?;
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), &spec, None)?;
let mut client = Client::connect(self.connstr.as_str(), NoTls)?;
self.pg_reload_conf(&mut client)?;
@@ -540,7 +752,7 @@ impl ComputeNode {
}
#[instrument(skip_all)]
pub fn start_compute(&self) -> Result<std::process::Child> {
pub fn start_compute(&self, extension_server_port: u16) -> Result<std::process::Child> {
let compute_state = self.state.lock().unwrap().clone();
let pspec = compute_state.pspec.as_ref().expect("spec must be set");
info!(
@@ -551,7 +763,38 @@ impl ComputeNode {
pspec.timeline_id,
);
self.prepare_pgdata(&compute_state)?;
info!(
"start_compute spec.remote_extensions {:?}",
pspec.spec.remote_extensions
);
// This part is sync, because we need to download
// remote shared_preload_libraries before postgres start (if any)
if let Some(remote_extensions) = &pspec.spec.remote_extensions {
// First, create control files for all availale extensions
extension_server::create_control_files(remote_extensions, &self.pgbin);
let library_load_start_time = Utc::now();
let remote_ext_metrics = self.prepare_preload_libraries(&pspec.spec)?;
let library_load_time = Utc::now()
.signed_duration_since(library_load_start_time)
.to_std()
.unwrap()
.as_millis() as u64;
let mut state = self.state.lock().unwrap();
state.metrics.load_ext_ms = library_load_time;
state.metrics.num_ext_downloaded = remote_ext_metrics.num_ext_downloaded;
state.metrics.largest_ext_size = remote_ext_metrics.largest_ext_size;
state.metrics.total_ext_download_size = remote_ext_metrics.total_ext_download_size;
info!(
"Loading shared_preload_libraries took {:?}ms",
library_load_time
);
info!("{:?}", remote_ext_metrics);
}
self.prepare_pgdata(&compute_state, extension_server_port)?;
let start_time = Utc::now();
let pg = self.start_postgres(pspec.storage_auth_token.clone())?;
@@ -699,4 +942,173 @@ LIMIT 100",
"{{\"pg_stat_statements\": []}}".to_string()
}
}
// download an archive, unzip and place files in correct locations
pub async fn download_extension(
&self,
real_ext_name: String,
ext_path: RemotePath,
) -> Result<u64, DownloadError> {
let remote_storage = self
.ext_remote_storage
.as_ref()
.ok_or(DownloadError::BadInput(anyhow::anyhow!(
"Remote extensions storage is not configured",
)))?;
let ext_archive_name = ext_path.object_name().expect("bad path");
let mut first_try = false;
if !self
.ext_download_progress
.read()
.expect("lock err")
.contains_key(ext_archive_name)
{
self.ext_download_progress
.write()
.expect("lock err")
.insert(ext_archive_name.to_string(), (Utc::now(), false));
first_try = true;
}
let (download_start, download_completed) =
self.ext_download_progress.read().expect("lock err")[ext_archive_name];
let start_time_delta = Utc::now()
.signed_duration_since(download_start)
.to_std()
.unwrap()
.as_millis() as u64;
// how long to wait for extension download if it was started by another process
const HANG_TIMEOUT: u64 = 3000; // milliseconds
if download_completed {
info!("extension already downloaded, skipping re-download");
return Ok(0);
} else if start_time_delta < HANG_TIMEOUT && !first_try {
info!("download {ext_archive_name} already started by another process, hanging untill completion or timeout");
let mut interval = tokio::time::interval(tokio::time::Duration::from_millis(500));
loop {
info!("waiting for download");
interval.tick().await;
let (_, download_completed_now) =
self.ext_download_progress.read().expect("lock")[ext_archive_name];
if download_completed_now {
info!("download finished by whoever else downloaded it");
return Ok(0);
}
}
// NOTE: the above loop will get terminated
// based on the timeout of the download function
}
// if extension hasn't been downloaded before or the previous
// attempt to download was at least HANG_TIMEOUT ms ago
// then we try to download it here
info!("downloading new extension {ext_archive_name}");
let download_size = extension_server::download_extension(
&real_ext_name,
&ext_path,
remote_storage,
&self.pgbin,
)
.await
.map_err(DownloadError::Other);
self.ext_download_progress
.write()
.expect("bad lock")
.insert(ext_archive_name.to_string(), (download_start, true));
download_size
}
#[tokio::main]
pub async fn prepare_preload_libraries(
&self,
spec: &ComputeSpec,
) -> Result<RemoteExtensionMetrics> {
if self.ext_remote_storage.is_none() {
return Ok(RemoteExtensionMetrics {
num_ext_downloaded: 0,
largest_ext_size: 0,
total_ext_download_size: 0,
});
}
let remote_extensions = spec
.remote_extensions
.as_ref()
.ok_or(anyhow::anyhow!("Remote extensions are not configured",))?;
info!("parse shared_preload_libraries from spec.cluster.settings");
let mut libs_vec = Vec::new();
if let Some(libs) = spec.cluster.settings.find("shared_preload_libraries") {
libs_vec = libs
.split(&[',', '\'', ' '])
.filter(|s| *s != "neon" && !s.is_empty())
.map(str::to_string)
.collect();
}
info!("parse shared_preload_libraries from provided postgresql.conf");
// that is used in neon_local and python tests
if let Some(conf) = &spec.cluster.postgresql_conf {
let conf_lines = conf.split('\n').collect::<Vec<&str>>();
let mut shared_preload_libraries_line = "";
for line in conf_lines {
if line.starts_with("shared_preload_libraries") {
shared_preload_libraries_line = line;
}
}
let mut preload_libs_vec = Vec::new();
if let Some(libs) = shared_preload_libraries_line.split("='").nth(1) {
preload_libs_vec = libs
.split(&[',', '\'', ' '])
.filter(|s| *s != "neon" && !s.is_empty())
.map(str::to_string)
.collect();
}
libs_vec.extend(preload_libs_vec);
}
// Don't try to download libraries that are not in the index.
// Assume that they are already present locally.
libs_vec.retain(|lib| remote_extensions.library_index.contains_key(lib));
info!("Downloading to shared preload libraries: {:?}", &libs_vec);
let mut download_tasks = Vec::new();
for library in &libs_vec {
let (ext_name, ext_path) =
remote_extensions.get_ext(library, true, &self.build_tag, &self.pgversion)?;
download_tasks.push(self.download_extension(ext_name, ext_path));
}
let results = join_all(download_tasks).await;
let mut remote_ext_metrics = RemoteExtensionMetrics {
num_ext_downloaded: 0,
largest_ext_size: 0,
total_ext_download_size: 0,
};
for result in results {
let download_size = match result {
Ok(res) => {
remote_ext_metrics.num_ext_downloaded += 1;
res
}
Err(err) => {
// if we failed to download an extension, we don't want to fail the whole
// process, but we do want to log the error
error!("Failed to download extension: {}", err);
0
}
};
remote_ext_metrics.largest_ext_size =
std::cmp::max(remote_ext_metrics.largest_ext_size, download_size);
remote_ext_metrics.total_ext_download_size += download_size;
}
Ok(remote_ext_metrics)
}
}

View File

@@ -33,7 +33,11 @@ pub fn line_in_file(path: &Path, line: &str) -> Result<bool> {
}
/// Create or completely rewrite configuration file specified by `path`
pub fn write_postgres_conf(path: &Path, spec: &ComputeSpec) -> Result<()> {
pub fn write_postgres_conf(
path: &Path,
spec: &ComputeSpec,
extension_server_port: Option<u16>,
) -> Result<()> {
// File::create() destroys the file content if it exists.
let mut file = File::create(path)?;
@@ -87,5 +91,9 @@ pub fn write_postgres_conf(path: &Path, spec: &ComputeSpec) -> Result<()> {
writeln!(file, "# Managed by compute_ctl: end")?;
}
if let Some(port) = extension_server_port {
writeln!(file, "neon.extension_server_port={}", port)?;
}
Ok(())
}

View File

@@ -1,7 +1,6 @@
use std::sync::Arc;
use std::thread;
use anyhow::Result;
use tracing::{error, info, instrument};
use compute_api::responses::ComputeStatus;
@@ -42,13 +41,14 @@ fn configurator_main_loop(compute: &Arc<ComputeNode>) {
}
}
pub fn launch_configurator(compute: &Arc<ComputeNode>) -> Result<thread::JoinHandle<()>> {
pub fn launch_configurator(compute: &Arc<ComputeNode>) -> thread::JoinHandle<()> {
let compute = Arc::clone(compute);
Ok(thread::Builder::new()
thread::Builder::new()
.name("compute-configurator".into())
.spawn(move || {
configurator_main_loop(&compute);
info!("configurator thread is exited");
})?)
})
.expect("cannot launch configurator thread")
}

View File

@@ -0,0 +1,233 @@
// Download extension files from the extension store
// and put them in the right place in the postgres directory (share / lib)
/*
The layout of the S3 bucket is as follows:
5615610098 // this is an extension build number
├── v14
│   ├── extensions
│   │   ├── anon.tar.zst
│   │   └── embedding.tar.zst
│   └── ext_index.json
└── v15
├── extensions
│   ├── anon.tar.zst
│   └── embedding.tar.zst
└── ext_index.json
5615261079
├── v14
│   ├── extensions
│   │   └── anon.tar.zst
│   └── ext_index.json
└── v15
├── extensions
│   └── anon.tar.zst
└── ext_index.json
5623261088
├── v14
│   ├── extensions
│   │   └── embedding.tar.zst
│   └── ext_index.json
└── v15
├── extensions
│   └── embedding.tar.zst
└── ext_index.json
Note that build number cannot be part of prefix because we might need extensions
from other build numbers.
ext_index.json stores the control files and location of extension archives
It also stores a list of public extensions and a library_index
We don't need to duplicate extension.tar.zst files.
We only need to upload a new one if it is updated.
(Although currently we just upload every time anyways, hopefully will change
this sometime)
*access* is controlled by spec
More specifically, here is an example ext_index.json
{
"public_extensions": [
"anon",
"pg_buffercache"
],
"library_index": {
"anon": "anon",
"pg_buffercache": "pg_buffercache"
},
"extension_data": {
"pg_buffercache": {
"control_data": {
"pg_buffercache.control": "# pg_buffercache extension \ncomment = 'examine the shared buffer cache' \ndefault_version = '1.3' \nmodule_pathname = '$libdir/pg_buffercache' \nrelocatable = true \ntrusted=true"
},
"archive_path": "5670669815/v14/extensions/pg_buffercache.tar.zst"
},
"anon": {
"control_data": {
"anon.control": "# PostgreSQL Anonymizer (anon) extension \ncomment = 'Data anonymization tools' \ndefault_version = '1.1.0' \ndirectory='extension/anon' \nrelocatable = false \nrequires = 'pgcrypto' \nsuperuser = false \nmodule_pathname = '$libdir/anon' \ntrusted = true \n"
},
"archive_path": "5670669815/v14/extensions/anon.tar.zst"
}
}
}
*/
use anyhow::Context;
use anyhow::{self, Result};
use compute_api::spec::RemoteExtSpec;
use remote_storage::*;
use serde_json;
use std::io::Read;
use std::num::{NonZeroU32, NonZeroUsize};
use std::path::Path;
use std::str;
use tar::Archive;
use tokio::io::AsyncReadExt;
use tracing::info;
use tracing::log::warn;
use zstd::stream::read::Decoder;
fn get_pg_config(argument: &str, pgbin: &str) -> String {
// gives the result of `pg_config [argument]`
// where argument is a flag like `--version` or `--sharedir`
let pgconfig = pgbin
.strip_suffix("postgres")
.expect("bad pgbin")
.to_owned()
+ "/pg_config";
let config_output = std::process::Command::new(pgconfig)
.arg(argument)
.output()
.expect("pg_config error");
std::str::from_utf8(&config_output.stdout)
.expect("pg_config error")
.trim()
.to_string()
}
pub fn get_pg_version(pgbin: &str) -> String {
// pg_config --version returns a (platform specific) human readable string
// such as "PostgreSQL 15.4". We parse this to v14/v15
let human_version = get_pg_config("--version", pgbin);
if human_version.contains("15") {
return "v15".to_string();
} else if human_version.contains("14") {
return "v14".to_string();
}
panic!("Unsuported postgres version {human_version}");
}
// download the archive for a given extension,
// unzip it, and place files in the appropriate locations (share/lib)
pub async fn download_extension(
ext_name: &str,
ext_path: &RemotePath,
remote_storage: &GenericRemoteStorage,
pgbin: &str,
) -> Result<u64> {
info!("Download extension {:?} from {:?}", ext_name, ext_path);
let mut download = remote_storage.download(ext_path).await?;
let mut download_buffer = Vec::new();
download
.download_stream
.read_to_end(&mut download_buffer)
.await?;
let download_size = download_buffer.len() as u64;
// it's unclear whether it is more performant to decompress into memory or not
// TODO: decompressing into memory can be avoided
let mut decoder = Decoder::new(download_buffer.as_slice())?;
let mut decompress_buffer = Vec::new();
decoder.read_to_end(&mut decompress_buffer)?;
let mut archive = Archive::new(decompress_buffer.as_slice());
let unzip_dest = pgbin
.strip_suffix("/bin/postgres")
.expect("bad pgbin")
.to_string()
+ "/download_extensions";
archive.unpack(&unzip_dest)?;
info!("Download + unzip {:?} completed successfully", &ext_path);
let sharedir_paths = (
unzip_dest.to_string() + "/share/extension",
Path::new(&get_pg_config("--sharedir", pgbin)).join("extension"),
);
let libdir_paths = (
unzip_dest.to_string() + "/lib",
Path::new(&get_pg_config("--pkglibdir", pgbin)).to_path_buf(),
);
// move contents of the libdir / sharedir in unzipped archive to the correct local paths
for paths in [sharedir_paths, libdir_paths] {
let (zip_dir, real_dir) = paths;
info!("mv {zip_dir:?}/* {real_dir:?}");
for file in std::fs::read_dir(zip_dir)? {
let old_file = file?.path();
let new_file =
Path::new(&real_dir).join(old_file.file_name().context("error parsing file")?);
info!("moving {old_file:?} to {new_file:?}");
// extension download failed: Directory not empty (os error 39)
match std::fs::rename(old_file, new_file) {
Ok(()) => info!("move succeeded"),
Err(e) => {
warn!("move failed, probably because the extension already exists: {e}")
}
}
}
}
info!("done moving extension {ext_name}");
Ok(download_size)
}
// Create extension control files from spec
pub fn create_control_files(remote_extensions: &RemoteExtSpec, pgbin: &str) {
let local_sharedir = Path::new(&get_pg_config("--sharedir", pgbin)).join("extension");
for (ext_name, ext_data) in remote_extensions.extension_data.iter() {
// Check if extension is present in public or custom.
// If not, then it is not allowed to be used by this compute.
if let Some(public_extensions) = &remote_extensions.public_extensions {
if !public_extensions.contains(ext_name) {
if let Some(custom_extensions) = &remote_extensions.custom_extensions {
if !custom_extensions.contains(ext_name) {
continue; // skip this extension, it is not allowed
}
}
}
}
for (control_name, control_content) in &ext_data.control_data {
let control_path = local_sharedir.join(control_name);
if !control_path.exists() {
info!("writing file {:?}{:?}", control_path, control_content);
std::fs::write(control_path, control_content).unwrap();
} else {
warn!("control file {:?} exists both locally and remotely. ignoring the remote version.", control_path);
}
}
}
}
// This function initializes the necessary structs to use remote storage
pub fn init_remote_storage(remote_ext_config: &str) -> anyhow::Result<GenericRemoteStorage> {
#[derive(Debug, serde::Deserialize)]
struct RemoteExtJson {
bucket: String,
region: String,
endpoint: Option<String>,
prefix: Option<String>,
}
let remote_ext_json = serde_json::from_str::<RemoteExtJson>(remote_ext_config)?;
let config = S3Config {
bucket_name: remote_ext_json.bucket,
bucket_region: remote_ext_json.region,
prefix_in_bucket: remote_ext_json.prefix,
endpoint: remote_ext_json.endpoint,
concurrency_limit: NonZeroUsize::new(100).expect("100 != 0"),
max_keys_per_list_response: None,
};
let config = RemoteStorageConfig {
max_concurrent_syncs: NonZeroUsize::new(100).expect("100 != 0"),
max_sync_errors: NonZeroU32::new(100).expect("100 != 0"),
storage: RemoteStorageKind::AwsS3(config),
};
GenericRemoteStorage::from_config(&config)
}

View File

@@ -13,7 +13,7 @@ use hyper::{Body, Method, Request, Response, Server, StatusCode};
use num_cpus;
use serde_json;
use tokio::task;
use tracing::{error, info};
use tracing::{error, info, warn};
use tracing_utils::http::OtelName;
fn status_response_from_state(state: &ComputeState) -> ComputeStatusResponse {
@@ -121,6 +121,83 @@ async fn routes(req: Request<Body>, compute: &Arc<ComputeNode>) -> Response<Body
}
}
// download extension files from S3 on demand
(&Method::POST, route) if route.starts_with("/extension_server/") => {
info!("serving {:?} POST request", route);
info!("req.uri {:?}", req.uri());
// don't even try to download extensions
// if no remote storage is configured
if compute.ext_remote_storage.is_none() {
info!("no extensions remote storage configured");
let mut resp = Response::new(Body::from("no remote storage configured"));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
return resp;
}
let mut is_library = false;
if let Some(params) = req.uri().query() {
info!("serving {:?} POST request with params: {}", route, params);
if params == "is_library=true" {
is_library = true;
} else {
let mut resp = Response::new(Body::from("Wrong request parameters"));
*resp.status_mut() = StatusCode::BAD_REQUEST;
return resp;
}
}
let filename = route.split('/').last().unwrap().to_string();
info!("serving /extension_server POST request, filename: {filename:?} is_library: {is_library}");
// get ext_name and path from spec
// don't lock compute_state for too long
let ext = {
let compute_state = compute.state.lock().unwrap();
let pspec = compute_state.pspec.as_ref().expect("spec must be set");
let spec = &pspec.spec;
// debug only
info!("spec: {:?}", spec);
let remote_extensions = match spec.remote_extensions.as_ref() {
Some(r) => r,
None => {
info!("no remote extensions spec was provided");
let mut resp = Response::new(Body::from("no remote storage configured"));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
return resp;
}
};
remote_extensions.get_ext(
&filename,
is_library,
&compute.build_tag,
&compute.pgversion,
)
};
match ext {
Ok((ext_name, ext_path)) => {
match compute.download_extension(ext_name, ext_path).await {
Ok(_) => Response::new(Body::from("OK")),
Err(e) => {
error!("extension download failed: {}", e);
let mut resp = Response::new(Body::from(e.to_string()));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
resp
}
}
}
Err(e) => {
warn!("extension download failed to find extension: {}", e);
let mut resp = Response::new(Body::from("failed to find file"));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
resp
}
}
}
// Return the `404 Not Found` for any other routes.
_ => {
let mut not_found = Response::new(Body::from("404 Not Found"));

View File

@@ -139,6 +139,34 @@ paths:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
/extension_server:
post:
tags:
- Extension
summary: Download extension from S3 to local folder.
description: ""
operationId: downloadExtension
responses:
200:
description: Extension downloaded
content:
text/plain:
schema:
type: string
description: Error text or 'OK' if download succeeded.
example: "OK"
400:
description: Request is invalid.
content:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
500:
description: Extension download request failed.
content:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
components:
securitySchemes:

View File

@@ -9,7 +9,9 @@ pub mod http;
#[macro_use]
pub mod logger;
pub mod compute;
pub mod extension_server;
pub mod monitor;
pub mod params;
pub mod pg_helpers;
pub mod spec;
pub mod sync_sk;

View File

@@ -1,7 +1,6 @@
use std::sync::Arc;
use std::{thread, time};
use anyhow::Result;
use chrono::{DateTime, Utc};
use postgres::{Client, NoTls};
use tracing::{debug, info};
@@ -105,10 +104,11 @@ fn watch_compute_activity(compute: &ComputeNode) {
}
/// Launch a separate compute monitor thread and return its `JoinHandle`.
pub fn launch_monitor(state: &Arc<ComputeNode>) -> Result<thread::JoinHandle<()>> {
pub fn launch_monitor(state: &Arc<ComputeNode>) -> thread::JoinHandle<()> {
let state = Arc::clone(state);
Ok(thread::Builder::new()
thread::Builder::new()
.name("compute-monitor".into())
.spawn(move || watch_compute_activity(&state))?)
.spawn(move || watch_compute_activity(&state))
.expect("cannot launch compute monitor thread")
}

View File

@@ -124,7 +124,7 @@ pub fn get_spec_from_control_plane(
pub fn handle_configuration(spec: &ComputeSpec, pgdata_path: &Path) -> Result<()> {
// File `postgresql.conf` is no longer included into `basebackup`, so just
// always write all config into it creating new file.
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec)?;
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec, None)?;
update_pg_hba(pgdata_path)?;
@@ -270,7 +270,7 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
}
RoleAction::Create => {
let mut query: String = format!(
"CREATE ROLE {} CREATEROLE CREATEDB IN ROLE neon_superuser",
"CREATE ROLE {} CREATEROLE CREATEDB BYPASSRLS IN ROLE neon_superuser",
name.pg_quote()
);
info!("role create query: '{}'", &query);

View File

@@ -0,0 +1,98 @@
// Utils for running sync_safekeepers
use anyhow::Result;
use tracing::info;
use utils::lsn::Lsn;
#[derive(Copy, Clone, Debug)]
pub enum TimelineStatusResponse {
NotFound,
Ok(TimelineStatusOkResponse),
}
#[derive(Copy, Clone, Debug)]
pub struct TimelineStatusOkResponse {
flush_lsn: Lsn,
commit_lsn: Lsn,
}
/// Get a safekeeper's metadata for our timeline. The id is only used for logging
pub async fn ping_safekeeper(
id: String,
config: tokio_postgres::Config,
) -> Result<TimelineStatusResponse> {
// TODO add retries
// Connect
info!("connecting to {}", id);
let (client, conn) = config.connect(tokio_postgres::NoTls).await?;
tokio::spawn(async move {
if let Err(e) = conn.await {
eprintln!("connection error: {}", e);
}
});
// Query
info!("querying {}", id);
let result = client.simple_query("TIMELINE_STATUS").await?;
// Parse result
info!("done with {}", id);
if let postgres::SimpleQueryMessage::Row(row) = &result[0] {
use std::str::FromStr;
let response = TimelineStatusResponse::Ok(TimelineStatusOkResponse {
flush_lsn: Lsn::from_str(row.get("flush_lsn").unwrap())?,
commit_lsn: Lsn::from_str(row.get("commit_lsn").unwrap())?,
});
Ok(response)
} else {
// Timeline doesn't exist
Ok(TimelineStatusResponse::NotFound)
}
}
/// Given a quorum of responses, check if safekeepers are synced at some Lsn
pub fn check_if_synced(responses: Vec<TimelineStatusResponse>) -> Option<Lsn> {
// Check if all responses are ok
let ok_responses: Vec<TimelineStatusOkResponse> = responses
.iter()
.filter_map(|r| match r {
TimelineStatusResponse::Ok(ok_response) => Some(ok_response),
_ => None,
})
.cloned()
.collect();
if ok_responses.len() < responses.len() {
info!(
"not synced. Only {} out of {} know about this timeline",
ok_responses.len(),
responses.len()
);
return None;
}
// Get the min and the max of everything
let commit: Vec<Lsn> = ok_responses.iter().map(|r| r.commit_lsn).collect();
let flush: Vec<Lsn> = ok_responses.iter().map(|r| r.flush_lsn).collect();
let commit_max = commit.iter().max().unwrap();
let commit_min = commit.iter().min().unwrap();
let flush_max = flush.iter().max().unwrap();
let flush_min = flush.iter().min().unwrap();
// Check that all values are equal
if commit_min != commit_max {
info!("not synced. {:?} {:?}", commit_min, commit_max);
return None;
}
if flush_min != flush_max {
info!("not synced. {:?} {:?}", flush_min, flush_max);
return None;
}
// Check that commit == flush
if commit_max != flush_max {
info!("not synced. {:?} {:?}", commit_max, flush_max);
return None;
}
Some(*commit_max)
}

View File

@@ -32,3 +32,4 @@ utils.workspace = true
compute_api.workspace = true
workspace_hack.workspace = true
tracing.workspace = true

View File

@@ -658,6 +658,8 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
.get_one::<String>("endpoint_id")
.ok_or_else(|| anyhow!("No endpoint ID was provided to start"))?;
let remote_ext_config = sub_args.get_one::<String>("remote-ext-config");
// If --safekeepers argument is given, use only the listed safekeeper nodes.
let safekeepers =
if let Some(safekeepers_str) = sub_args.get_one::<String>("safekeepers") {
@@ -699,7 +701,7 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
_ => {}
}
println!("Starting existing endpoint {endpoint_id}...");
endpoint.start(&auth_token, safekeepers)?;
endpoint.start(&auth_token, safekeepers, remote_ext_config)?;
} else {
let branch_name = sub_args
.get_one::<String>("branch-name")
@@ -743,7 +745,7 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
pg_version,
mode,
)?;
ep.start(&auth_token, safekeepers)?;
ep.start(&auth_token, safekeepers, remote_ext_config)?;
}
}
"stop" => {
@@ -823,6 +825,16 @@ fn get_safekeeper(env: &local_env::LocalEnv, id: NodeId) -> Result<SafekeeperNod
}
}
// Get list of options to append to safekeeper command invocation.
fn safekeeper_extra_opts(init_match: &ArgMatches) -> Vec<String> {
init_match
.get_many::<String>("safekeeper-extra-opt")
.into_iter()
.flatten()
.map(|s| s.to_owned())
.collect()
}
fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
let (sub_name, sub_args) = match sub_match.subcommand() {
Some(safekeeper_command_data) => safekeeper_command_data,
@@ -839,7 +851,9 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
match sub_name {
"start" => {
if let Err(e) = safekeeper.start() {
let extra_opts = safekeeper_extra_opts(sub_args);
if let Err(e) = safekeeper.start(extra_opts) {
eprintln!("safekeeper start failed: {}", e);
exit(1);
}
@@ -864,7 +878,8 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
exit(1);
}
if let Err(e) = safekeeper.start() {
let extra_opts = safekeeper_extra_opts(sub_args);
if let Err(e) = safekeeper.start(extra_opts) {
eprintln!("safekeeper start failed: {}", e);
exit(1);
}
@@ -891,7 +906,7 @@ fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow
for node in env.safekeepers.iter() {
let safekeeper = SafekeeperNode::from_env(env, node);
if let Err(e) = safekeeper.start() {
if let Err(e) = safekeeper.start(vec![]) {
eprintln!("safekeeper {} start failed: {:#}", safekeeper.id, e);
try_stop_all(env, false);
exit(1);
@@ -954,6 +969,14 @@ fn cli() -> Command {
let safekeeper_id_arg = Arg::new("id").help("safekeeper id").required(false);
let safekeeper_extra_opt_arg = Arg::new("safekeeper-extra-opt")
.short('e')
.long("safekeeper-extra-opt")
.num_args(1)
.action(ArgAction::Append)
.help("Additional safekeeper invocation options, e.g. -e=--http-auth-public-key-path=foo")
.required(false);
let tenant_id_arg = Arg::new("tenant-id")
.long("tenant-id")
.help("Tenant id. Represented as a hexadecimal string 32 symbols length")
@@ -1003,6 +1026,12 @@ fn cli() -> Command {
.help("Additional pageserver's configuration options or overrides, refer to pageserver's 'config-override' CLI parameter docs for more")
.required(false);
let remote_ext_config_args = Arg::new("remote-ext-config")
.long("remote-ext-config")
.num_args(1)
.help("Configure the S3 bucket that we search for extensions in.")
.required(false);
let lsn_arg = Arg::new("lsn")
.long("lsn")
.help("Specify Lsn on the timeline to start from. By default, end of the timeline would be used.")
@@ -1116,6 +1145,7 @@ fn cli() -> Command {
.subcommand(Command::new("start")
.about("Start local safekeeper")
.arg(safekeeper_id_arg.clone())
.arg(safekeeper_extra_opt_arg.clone())
)
.subcommand(Command::new("stop")
.about("Stop local safekeeper")
@@ -1126,6 +1156,7 @@ fn cli() -> Command {
.about("Restart local safekeeper")
.arg(safekeeper_id_arg)
.arg(stop_mode_arg.clone())
.arg(safekeeper_extra_opt_arg)
)
)
.subcommand(
@@ -1161,6 +1192,7 @@ fn cli() -> Command {
.arg(pg_version_arg)
.arg(hot_standby_arg)
.arg(safekeepers_arg)
.arg(remote_ext_config_args)
)
.subcommand(
Command::new("stop")

View File

@@ -138,7 +138,13 @@ impl ComputeControlPlane {
mode,
tenant_id,
pg_version,
skip_pg_catalog_updates: false,
// We don't setup roles and databases in the spec locally, so we don't need to
// do catalog updates. Catalog updates also include check availability
// data creation. Yet, we have tests that check that size and db dump
// before and after start are the same. So, skip catalog updates,
// with this we basically test a case of waking up an idle compute, where
// we also skip catalog updates in the cloud.
skip_pg_catalog_updates: true,
});
ep.create_endpoint_dir()?;
@@ -152,7 +158,7 @@ impl ComputeControlPlane {
http_port,
pg_port,
pg_version,
skip_pg_catalog_updates: false,
skip_pg_catalog_updates: true,
})?,
)?;
std::fs::write(
@@ -313,7 +319,7 @@ impl Endpoint {
// TODO: use future host field from safekeeper spec
// Pass the list of safekeepers to the replica so that it can connect to any of them,
// whichever is availiable.
// whichever is available.
let sk_ports = self
.env
.safekeepers
@@ -420,7 +426,12 @@ impl Endpoint {
Ok(())
}
pub fn start(&self, auth_token: &Option<String>, safekeepers: Vec<NodeId>) -> Result<()> {
pub fn start(
&self,
auth_token: &Option<String>,
safekeepers: Vec<NodeId>,
remote_ext_config: Option<&String>,
) -> Result<()> {
if self.status() == "running" {
anyhow::bail!("The endpoint is already running");
}
@@ -488,6 +499,7 @@ impl Endpoint {
pageserver_connstring: Some(pageserver_connstring),
safekeeper_connstrings,
storage_auth_token: auth_token.clone(),
remote_extensions: None,
};
let spec_path = self.endpoint_path().join("spec.json");
std::fs::write(spec_path, serde_json::to_string_pretty(&spec)?)?;
@@ -519,6 +531,11 @@ impl Endpoint {
.stdin(std::process::Stdio::null())
.stderr(logfile.try_clone()?)
.stdout(logfile);
if let Some(remote_ext_config) = remote_ext_config {
cmd.args(["--remote-ext-config", remote_ext_config]);
}
let child = cmd.spawn()?;
// Write down the pid so we can wait for it when we want to stop
@@ -564,9 +581,7 @@ impl Endpoint {
}
Err(e) => {
if attempt == MAX_ATTEMPTS {
return Err(e).context(
"timed out waiting to connect to compute_ctl HTTP; last error: {e}",
);
return Err(e).context("timed out waiting to connect to compute_ctl HTTP");
}
}
}

View File

@@ -101,7 +101,7 @@ impl SafekeeperNode {
self.datadir_path().join("safekeeper.pid")
}
pub fn start(&self) -> anyhow::Result<Child> {
pub fn start(&self, extra_opts: Vec<String>) -> anyhow::Result<Child> {
print!(
"Starting safekeeper at '{}' in '{}'",
self.pg_connection_config.raw_address(),
@@ -161,17 +161,28 @@ impl SafekeeperNode {
let key_path = self.env.base_data_dir.join("auth_public_key.pem");
if self.conf.auth_enabled {
let key_path_string = key_path
.to_str()
.with_context(|| {
format!("Key path {key_path:?} cannot be represented as a unicode string")
})?
.to_owned();
args.extend([
"--auth-validation-public-key-path".to_owned(),
key_path
.to_str()
.with_context(|| {
format!("Key path {key_path:?} cannot be represented as a unicode string")
})?
.to_owned(),
"--pg-auth-public-key-path".to_owned(),
key_path_string.clone(),
]);
args.extend([
"--pg-tenant-only-auth-public-key-path".to_owned(),
key_path_string.clone(),
]);
args.extend([
"--http-auth-public-key-path".to_owned(),
key_path_string.clone(),
]);
}
args.extend(extra_opts);
background_process::start_process(
&format!("safekeeper-{id}"),
&datadir,

View File

@@ -4,7 +4,12 @@
# to your expectations and requirements.
# Root options
targets = []
targets = [
{ triple = "x86_64-unknown-linux-gnu" },
{ triple = "aarch64-unknown-linux-gnu" },
{ triple = "aarch64-apple-darwin" },
{ triple = "x86_64-apple-darwin" },
]
all-features = false
no-default-features = false
feature-depth = 1
@@ -18,7 +23,7 @@ vulnerability = "deny"
unmaintained = "warn"
yanked = "warn"
notice = "warn"
ignore = []
ignore = ["RUSTSEC-2023-0052"]
# This section is considered when running `cargo deny check licenses`
# More documentation for the licenses section can be found here:

View File

@@ -0,0 +1,236 @@
# Supporting custom user Extensions (Dynamic Extension Loading)
Created 2023-05-03
## Motivation
There are many extensions in the PostgreSQL ecosystem, and not all extensions
are of a quality that we can confidently support them. Additionally, our
current extension inclusion mechanism has several problems because we build all
extensions into the primary Compute image: We build the extensions every time
we build the compute image regardless of whether we actually need to rebuild
the image, and the inclusion of these extensions in the image adds a hard
dependency on all supported extensions - thus increasing the image size, and
with it the time it takes to download that image - increasing first start
latency.
This RFC proposes a dynamic loading mechanism that solves most of these
problems.
## Summary
`compute_ctl` is made responsible for loading extensions on-demand into
the container's file system for dynamically loaded extensions, and will also
make sure that the extensions in `shared_preload_libraries` are downloaded
before the compute node starts.
## Components
compute_ctl, PostgreSQL, neon (extension), Compute Host Node, Extension Store
## Requirements
Compute nodes with no extra extensions should not be negatively impacted by
the existence of support for many extensions.
Installing an extension into PostgreSQL should be easy.
Non-preloaded extensions shouldn't impact startup latency.
Uninstalled extensions shouldn't impact query latency.
A small latency penalty for dynamically loaded extensions is acceptable in
the first seconds of compute startup, but not in steady-state operations.
## Proposed implementation
### On-demand, JIT-loading of extensions
Before postgres starts we download
- control files for all extensions available to that compute node;
- all `shared_preload_libraries`;
After postgres is running, `compute_ctl` listens for requests to load files.
When PostgreSQL requests a file, `compute_ctl` downloads it.
PostgreSQL requests files in the following cases:
- When loading a preload library set in `local_preload_libraries`
- When explicitly loading a library with `LOAD`
- Wnen creating extension with `CREATE EXTENSION` (download sql scripts, (optional) extension data files and (optional) library files)))
#### Summary
Pros:
- Startup is only as slow as it takes to load all (shared_)preload_libraries
- Supports BYO Extension
Cons:
- O(sizeof(extensions)) IO requirement for loading all extensions.
### Alternative solutions
1. Allow users to add their extensions to the base image
Pros:
- Easy to deploy
Cons:
- Doesn't scale - first start size is dependent on image size;
- All extensions are shared across all users: It doesn't allow users to
bring their own restrictive-licensed extensions
2. Bring Your Own compute image
Pros:
- Still easy to deploy
- User can bring own patched version of PostgreSQL
Cons:
- First start latency is O(sizeof(extensions image))
- Warm instance pool for skipping pod schedule latency is not feasible with
O(n) custom images
- Support channels are difficult to manage
3. Download all user extensions in bulk on compute start
Pros:
- Easy to deploy
- No startup latency issues for "clean" users.
- Warm instance pool for skipping pod schedule latency is possible
Cons:
- Downloading all extensions in advance takes a lot of time, thus startup
latency issues
4. Store user's extensions in persistent storage
Pros:
- Easy to deploy
- No startup latency issues
- Warm instance pool for skipping pod schedule latency is possible
Cons:
- EC2 instances have only limited number of attachments shared between EBS
volumes, direct-attached NVMe drives, and ENIs.
- Compute instance migration isn't trivially solved for EBS mounts (e.g.
the device is unavailable whilst moving the mount between instances).
- EBS can only mount on one instance at a time (except the expensive IO2
device type).
5. Store user's extensions in network drive
Pros:
- Easy to deploy
- Few startup latency issues
- Warm instance pool for skipping pod schedule latency is possible
Cons:
- We'd need networked drives, and a lot of them, which would store many
duplicate extensions.
- **UNCHECKED:** Compute instance migration may not work nicely with
networked IOs
### Idea extensions
The extension store does not have to be S3 directly, but could be a Node-local
caching service on top of S3. This would reduce the load on the network for
popular extensions.
## Extension Storage implementation
The layout of the S3 bucket is as follows:
```
5615610098 // this is an extension build number
├── v14
│   ├── extensions
│   │   ├── anon.tar.zst
│   │   └── embedding.tar.zst
│   └── ext_index.json
└── v15
├── extensions
│   ├── anon.tar.zst
│   └── embedding.tar.zst
└── ext_index.json
5615261079
├── v14
│   ├── extensions
│   │   └── anon.tar.zst
│   └── ext_index.json
└── v15
├── extensions
│   └── anon.tar.zst
└── ext_index.json
5623261088
├── v14
│   ├── extensions
│   │   └── embedding.tar.zst
│   └── ext_index.json
└── v15
├── extensions
│   └── embedding.tar.zst
└── ext_index.json
```
Note that build number cannot be part of prefix because we might need extensions
from other build numbers.
`ext_index.json` stores the control files and location of extension archives.
It also stores a list of public extensions and a library_index
We don't need to duplicate `extension.tar.zst`` files.
We only need to upload a new one if it is updated.
(Although currently we just upload every time anyways, hopefully will change
this sometime)
*access* is controlled by spec
More specifically, here is an example ext_index.json
```
{
"public_extensions": [
"anon",
"pg_buffercache"
],
"library_index": {
"anon": "anon",
"pg_buffercache": "pg_buffercache"
// for more complex extensions like postgis
// we might have something like:
// address_standardizer: postgis
// postgis_tiger: postgis
},
"extension_data": {
"pg_buffercache": {
"control_data": {
"pg_buffercache.control": "# pg_buffercache extension \ncomment = 'examine the shared buffer cache' \ndefault_version = '1.3' \nmodule_pathname = '$libdir/pg_buffercache' \nrelocatable = true \ntrusted=true"
},
"archive_path": "5670669815/v14/extensions/pg_buffercache.tar.zst"
},
"anon": {
"control_data": {
"anon.control": "# PostgreSQL Anonymizer (anon) extension \ncomment = 'Data anonymization tools' \ndefault_version = '1.1.0' \ndirectory='extension/anon' \nrelocatable = false \nrequires = 'pgcrypto' \nsuperuser = false \nmodule_pathname = '$libdir/anon' \ntrusted = true \n"
},
"archive_path": "5670669815/v14/extensions/anon.tar.zst"
}
}
}
```
### How to add new extension to the Extension Storage?
Simply upload build artifacts to the S3 bucket.
Implement a CI step for that. Splitting it from compute-node-image build.
### How do we deal with extension versions and updates?
Currently, we rebuild extensions on every compute-node-image build and store them in the <build-version> prefix.
This is needed to ensure that `/share` and `/lib` files are in sync.
For extension updates, we rely on the PostgreSQL extension versioning mechanism (sql update scripts) and extension authors to not break backwards compatibility within one major version of PostgreSQL.
### Alternatives
For extensions written on trusted languages we can also adopt
`dbdev` PostgreSQL Package Manager based on `pg_tle` by Supabase.
This will increase the amount supported extensions and decrease the amount of work required to support them.

View File

@@ -0,0 +1,957 @@
# Pageserver: split-brain safety for remote storage through generation numbers
## Summary
A scheme of logical "generation numbers" for tenant attachment to pageservers is proposed, along with
changes to the remote storage format to include these generation numbers in S3 keys.
Using the control plane as the issuer of these generation numbers enables strong anti-split-brain
properties in the pageserver cluster without implementing a consensus mechanism directly
in the pageservers.
## Motivation
Currently, the pageserver's remote storage format does not provide a mechanism for addressing
split brain conditions that may happen when replacing a node or when migrating
a tenant from one pageserver to another.
From a remote storage perspective, a split brain condition occurs whenever two nodes both think
they have the same tenant attached, and both can write to S3. This can happen in the case of a
network partition, pathologically long delays (e.g. suspended VM), or software bugs.
In the current deployment model, control plane guarantees that a tenant is attached to one
pageserver at a time, thereby ruling out split-brain conditions resulting from dual
attachment (however, there is always the risk of a control plane bug). This control
plane guarantee prevents robust response to failures, as if a pageserver is unresponsive
we may not detach from it. The mechanism in this RFC fixes this, by making it safe to
attach to a new, different pageserver even if an unresponsive pageserver may be running.
Futher, lack of safety during split-brain conditions blocks two important features where occasional
split-brain conditions are part of the design assumptions:
- seamless tenant migration ([RFC PR](https://github.com/neondatabase/neon/pull/5029))
- automatic pageserver instance failure handling (aka "failover") (RFC TBD)
### Prior art
- 020-pageserver-s3-coordination.md
- 023-the-state-of-pageserver-tenant-relocation.md
- 026-pageserver-s3-mvcc.md
This RFC has broad similarities to the proposal to implement a MVCC scheme in
S3 object names, but this RFC avoids a general purpose transaction scheme in
favour of more specialized "generations" that work like a transaction ID that
always has the same lifetime as a pageserver process or tenant attachment, whichever
is shorter.
## Requirements
- Accommodate storage backends with no atomic or fencing capability (i.e. work within
S3's limitation that there are no atomics and clients can't be fenced)
- Don't depend on any STONITH or node fencing in the compute layer (i.e. we will not
assume that we can reliably kill and EC2 instance and have it die)
- Scoped per-tenant, not per-pageserver; for _seamless tenant migration_, we need
per-tenant granularity, and for _failover_, we likely want to spread the workload
of the failed pageserver instance to a number of peers, rather than monolithically
moving the entire workload to another machine.
We do not rule out the latter case, but should not constrain ourselves to it.
## Design Tenets
These are not requirements, but are ideas that guide the following design:
- Avoid implementing another consensus system: we already have a strongly consistent
database in the control plane that can do atomic operations where needed, and we also
have a Paxos implementation in the safekeeper.
- Avoiding locking in to specific models of how failover will work (e.g. do not assume that
all the tenants on a pageserver will fail over as a unit).
- Be strictly correct when it comes to data integrity. Occasional failures of availability
are tolerable, occasional data loss is not.
## Non Goals
The changes in this RFC intentionally isolate the design decision of how to define
logical generations numbers and object storage format in a way that is somewhat flexible with
respect to how actual orchestration of failover works.
This RFC intentionally does not cover:
- Failure detection
- Orchestration of failover
- Standby modes to keep data ready for fast migration
- Intentional multi-writer operation on tenants (multi-writer scenarios are assumed to be transient split-brain situations).
- Sharding.
The interaction between this RFC and those features is discussed in [Appendix B](#appendix-b-interoperability-with-other-features)
## Impacted Components
pageserver, control plane, safekeeper (optional)
## Implementation Part 1: Correctness
### Summary
- A per-tenant **generation number** is introduced to uniquely identifying tenant attachments to pageserver processes.
- This generation number increments each time the control plane modifies a tenant (`Project`)'s assigned pageserver, or when the assigned pageserver restarts.
- the control plane is the authority for generation numbers: only it may
increment a generation number.
- **Object keys are suffixed** with the generation number
- **Safety for multiply-attached tenants** is provided by the
generation number in the object key: the competing pageservers will not
try to write to the same keys.
- **Safety in split brain for multiple nodes running with
the same node ID** is provided by the pageserver calling out to the control plane
on startup, to re-attach and thereby increment the generations of any attached tenants
- **Safety for deletions** is achieved by deferring the DELETE from S3 to a point in time where the deleting node has validated with control plane that no attachment with a higher generation has a reference to the to-be-DELETEd key.
- **The control plane is used to issue generation numbers** to avoid the need for
a built-in consensus system in the pageserver, although this could in principle
be changed without changing the storage format.
### Generation numbers
A generation number is associated with each tenant in the control plane,
and each time the attachment status of the tenant changes, this is incremented.
Changes in attachment status include:
- Attaching the tenant to a different pageserver
- A pageserver restarting, and "re-attaching" its tenants on startup
These increments of attachment generation provide invariants we need to avoid
split-brain issues in storage:
- If two pageservers have the same tenant attached, the attachments are guaranteed to have different generation numbers, because the generation would increment
while attaching the second one.
- If there are multiple pageservers running with the same node ID, all the attachments on all pageservers are guaranteed to have different generation numbers, because the generation would increment
when the second node started and re-attached its tenants.
As long as the infrastructure does not transparently replace an underlying
physical machine, we are totally safe. See the later [unsafe case](#unsafe-case-on-badly-behaved-infrastructure) section for details.
### Object Key Changes
#### Generation suffix
All object keys (layer objects and index objects) will contain the attachment
generation as a [suffix](#why-a-generation-suffix-rather-than-prefix).
This suffix is the primary mechanism for protecting against split-brain situations, and
enabling safe multi-attachment of tenants:
- Two pageservers running with the same node ID (e.g. after a failure, where there is
some rogue pageserver still running) will not try to write to the same objects, because at startup they will have re-attached tenants and thereby incremented
generation numbers.
- Multiple attachments (to different pageservers) of the same tenant will not try to write to the same objects, as each attachment would have a distinct generation.
The generation is appended in hex format (8 byte string representing
u32), to all our existing key names. A u32's range limit would permit
27 restarts _per second_ over a 5 year system lifetime: orders of magnitude more than
is realistic.
The exact meaning of the generation suffix can evolve over time if necessary, for
example if we chose to implement a failover mechanism internally to the pageservers
rather than going via the control plane. The storage format just sees it as a number,
with the only semantic property being that the highest numbered index is the latest.
#### Index changes
Since object keys now include a generation suffix, the index of these keys must also be updated. IndexPart currently stores keys and LSNs sufficient to reconstruct key names: this would be extended to store the generation as well.
This will increase the size of the file, but only modestly: layers are already encoded as
their string-ized form, so the overhead is about 10 bytes per layer. This will be less if/when
the index storage format is migrated to a binary format from JSON.
#### Visibility
_This section doesn't describe code changes, but extends on the consequences of the
object key changes given above_
##### Visibility of objects to pageservers
Pageservers can of course list objects in S3 at any time, but in practice their
visible set is based on the contents of their LayerMap, which is initialized
from the `index_part.json.???` that they load.
Starting with the `index_part` from the most recent previous generation
(see [loading index_part](#finding-the-remote-indices-for-timelines)), a pageserver
initially has visibility of all the objects that were referenced in the loaded index.
These objects are guaranteed to remain visible until the current generation is
superseded, via pageservers in older generations avoiding deletions (see [deletion](#deletion)).
The "most recent previous generation" is _not_ necessarily the most recent
in terms of walltime, it is the one that is readable at the time a new generation
starts. Consider the following sequence of a tenant being re-attached to different
pageserver nodes:
- Create + attach on PS1 in generation 1
- PS1 Do some work, write out index_part.json-0001
- Attach to PS2 in generation 2
- Read index_part.json-0001
- PS2 starts doing some work...
- Attach to PS3 in generation 3
- Read index_part.json-0001
- **...PS2 finishes its work: now it writes index_part.json-0002**
- PS3 writes out index_part.json-0003
In the above sequence, the ancestry of indices is:
```
0001 -> 0002
|
-> 0003
```
This is not an issue for safety: if the 0002 references some object that is
not in 0001, then 0003 simply does not see it, and will re-do whatever
work was required (e.g. ingesting WAL or doing compaction). Objects referenced
by only the 0002 index will never be read by future attachment generations, and
will eventually be cleaned up by a scrub (see [scrubbing](#cleaning-up-orphan-objects-scrubbing)).
##### Visibility of LSNs to clients
Because index_part.json is now written with a generation suffix, which data
is visible depends on which generation the reader is operating in:
- If one was passively reading from S3 from outside of a pageserver, the
visibility of data would depend on which index_part.json-<generation> file
one had chosen to read from.
- If two pageservers have the same tenant attached, they may have different
data visible as they're independently replaying the WAL, and maintaining
independent LayerMaps that are written to independent index_part.json files.
Data does not have to be remotely committed to be visible.
- For a pageserver writing with a stale generation, historic LSNs
remain readable until another pageserver (with a higher generation suffix)
decides to execute GC deletions. At this point, we may think of the stale
attachment's generation as having logically ended: during its existence
the generation had a consistent view of the world.
- For a newly attached pageserver, its highest visible LSN may appears to
go backwards with respect to an earlier attachment, if that earlier
attachment had not uploaded all data to S3 before the new attachment.
### Deletion
#### Generation number validation
While writes are de-conflicted by writers always using their own generation number in the key,
deletions are slightly more challenging: if a pageserver A is isolated, and the true active node is
pageserver B, then it is dangerous for A to do any object deletions, even of objects that it wrote
itself, because pageserver's B metadata might reference those objects.
We solve this by inserting a "generation validation" step between the write of a remote index
that un-links a particular object from the index, and the actual deletion of the object, such
that deletions strictly obey the following ordering:
1. Write out index_part.json: this guarantees that any subsequent reader of the metadata will
not try and read the object we unlinked.
2. Call out to control plane to validate that the generation which we use for our attachment is still the latest.
3. If step 2 passes, it is safe to delete the object. Why? The check-in with control plane
together with our visibility rules guarantees that any later generation
will use either the exact `index_part.json` that we uploaded in step 1, or a successor
of it; not an earlier one. In both cases, the `index_part.json` doesn't reference the
key we are deleting anymore, so, the key is invisible to any later attachment generation.
Hence it's safe to delete it.
Note that at step 2 we are only confirming that deletions of objects _no longer referenced
by the specific `index_part.json` written in step 1_ are safe. If we were attempting other deletions concurrently,
these would need their own generation validation step.
If step 2 fails, we may leak the object. This is safe, but has a cost: see [scrubbing](#cleaning-up-orphan-objects-scrubbing). We may avoid this entirely outside of node
failures, if we do proper flushing of deletions on clean shutdown and clean migration.
To avoid doing a huge number of control plane requests to perform generation validation,
validation of many tenants will be done in a single request, and deletions will be queued up
prior to validation: see [Persistent deletion queue](#persistent-deletion-queue) for more.
#### `remote_consistent_lsn` updates
Remote objects are not the only kind of deletion the pageserver does: it also indirectly deletes
WAL data, by feeding back remote_consistent_lsn to safekeepers, as a signal to the safekeepers that
they may drop data below this LSN.
For the same reasons that deletion of objects must be guarded by an attachment generation number
validation step, updates to `remote_consistent_lsn` are subject to the same rules, using
an ordering as follows:
1. upload the index_part that covers data up to LSN `L0` to S3
2. Call out to control plane to validate that the generation which we use for our attachment is still the latest.
3. advance the `remote_consistent_lsn` that we advertise to the safekeepers to `L0`
If step 2 fails, then the `remote_consistent_lsn` advertised
to safekeepers will not advance again until a pageserver
with the latest generation is ready to do so.
**Note:** at step 3 we are not advertising the _latest_ remote_consistent_lsn, we are
advertising the value in the index_part that we uploaded in step 1. This provides
a strong ordering guarantee.
Internally to the pageserver, each timeline will have two remote_consistent_lsn values: the one that
reflects its latest write to remote storage, and the one that reflects the most
recent validation of generation number. It is only the latter value that may
be advertised to the outside world (i.e. to the safekeeper).
The control plane remains unaware of `remote_consistent_lsn`: it only has to validate
the freshness of generation numbers, thereby granting the pageserver permission to
share the information with the safekeeper.
For convenience, in subsequent sections and RFCs we will use "deletion" to mean both deletion
of objects in S3, and updates to the `remote_consistent_lsn`, as updates to the remote consistent
LSN are de-facto deletions done via the safekeeper, and both kinds of deletion are subject to
the same generation validation requirement.
### Pageserver attach/startup changes
#### Attachment
Calls to `/v1/tenant/{tenant_id}/attach` are augmented with an additional
`generation` field in the body.
The pageserver does not persist this: a generation is only good for the lifetime
of a process.
#### Finding the remote indices for timelines
Because index files are now suffixed with generation numbers, the pageserver
cannot always GET the remote index in one request, because it can't always
know a-priori what the latest remote index is.
Typically, the most recent generation to write an index would be our own
generation minus 1. However, this might not be the case: the previous
node might have started and acquired a generation number, and then crashed
before writing out a remote index.
In the general case and as a fallback, the pageserver may list all the `index_part.json`
files for a timeline, sort them by generation, and pick the highest that is `<=`
its current generation for this attachment. The tenant should never load an index
with an attachment generation _newer_ than its own.
These two rules combined ensure that objects written by later generations are never visible to earlier generations.
Note that if a given attachment picks an index part from an earlier generation (say n-2), but crashes & restarts before it writes its own generation's index part, next time it tries to pick an index part there may be an index part from generation n-1.
It would pick the n-1 index part in that case, because it's sorted higher than the previous one from generation n-2.
So, above rules guarantee no determinism in selecting the index part.
are allowed to be attached with stale attachment generations during a multiply-attached
phase in a migration, and in this instance if the old location's pageserver restarts,
it should not try and load the newer generation's index.
To summarize, on starting a timeline, the pageserver will:
1. Issue a GET for index_part.json-<my generation - 1>
2. If 1 failed, issue a ListObjectsv2 request for index_part.json\* and
pick the newest.
One could optimize this further by using the control plane to record specifically
which generation most recently wrote an index_part.json, if necessary, to increase
the probability of finding the index_part.json in one GET. One could also improve
the chances by having pageservers proactively write out index_part.json after they
get a new generation ID.
#### Re-attachment on startup
On startup, the pageserver will call out to an new control plane `/re-attach`
API (see [Generation API](#generation-api)). This returns a list of
tenants that should be attached to the pageserver, and their generation numbers, which
the control plane will increment before returning.
The pageserver should still scan its local disk on startup, but should _delete_
any local content for tenants not indicated in the `/re-attach` response: their
absence is an implicit detach operation.
**Note** if a tenant is omitted from the re-attach response, its local disk content
will be deleted. This will change in subsequent work, when the control plane gains
the concept of a secondary/standby location: a node with local content may revert
to this status and retain some local content.
#### Cleaning up previous generations' remote indices
Deletion of old indices is not necessary for correctness, although it is necessary
to avoid the ListObjects fallback in the previous section becoming ever more expensive.
Once the new attachment has written out its index_part.json, it may asynchronously clean up historic index_part.json
objects that were found.
We may choose to implement this deletion either as an explicit step after we
write out index_part for the first time in a pageserver's lifetime, or for
simplicity just do it periodically as part of the background scrub (see [scrubbing](#cleaning-up-orphan-objects-scrubbing));
### Control Plane Changes
#### Store generations for attaching tenants
- The `Project` table must store the generation number for use when
attaching the tenant to a new pageserver.
- The `/v1/tenant/:tenant_id/attach` pageserver API will require the generation number,
which the control plane can supply by simply incrementing the `Project`'s
generation number each time the tenant is attached to a different server: the same database
transaction that changes the assigned pageserver should also change the generation number.
#### Generation API
This section describes an API that could be provided directly by the control plane,
or built as a separate microservice. In earlier parts of the RFC, when we
discuss the control plane providing generation numbers, we are referring to this API.
The API endpoints used by the pageserver to acquire and validate generation
numbers are quite simple, and only require access to some persistent and
linerizable storage (such as a database).
Building this into the control plane is proposed as a least-effort option to exploit existing infrastructure and implement generation number issuance in the same transaction that mandates it (i.e., the transaction that updates the `Project` assignment to another pageserver).
However, this is not mandatory: this "Generation Number Issuer" could
be built as a microservice. In practice, we will write such a miniature service
anyway, to enable E2E pageserver/compute testing without control plane.
The endpoints required by pageservers are:
##### `/re-attach`
- Request: `{node_id: <u32>}`
- Response:
- 200 `{tenants: [{id: <TenantId>, gen: <u32>}]}`
- 404: unknown node_id
- (Future: 429: flapping detected, perhaps nodes are fighting for the same node ID,
or perhaps this node was in a retry loop)
- (On unknown tenants, omit tenant from `tenants` array)
- Server behavior: query database for which tenants should be attached to this pageserver.
- for each tenant that should be attached, increment the attachment generation and
include the new generation in the response
- Client behavior:
- for all tenants in the response, activate with the new generation number
- for any local disk content _not_ referenced in the response, act as if we
had been asked to detach it (i.e. delete local files)
**Note** the `node_id` in this request will change in future if we move to ephemeral
node IDs, to be replaced with some correlation ID that helps the control plane realize
if a process is running with the same storage as a previous pageserver process (e.g.
we might use EC instance ID, or we might just write some UUID to the disk the first
time we use it)
##### `/validate`
- Request: `{'tenants': [{tenant: <tenant id>, attach_gen: <gen>}, ...]}'`
- Response:
- 200 `{'tenants': [{tenant: <tenant id>, status: <bool>}...]}`
- (On unknown tenants, omit tenant from `tenants` array)
- Purpose: enable the pageserver to discover for the given attachments whether they are still the latest.
- Server behavior: this is a read-only operation: simply compare the generations in the request with
the generations known to the server, and set status to `true` if they match.
- Client behavior: clients must not do deletions within a tenant's remote data until they have
received a response indicating the generation they hold for the attachment is current.
#### Use of `/load` and `/ignore` APIs
Because the pageserver will be changed to only attach tenants on startup
based on the control plane's response to a `/re-attach` request, the load/ignore
APIs no longer make sense in their current form.
The `/load` API becomes functionally equivalent to attach, and will be removed:
any location that used `/load` before should just attach instead.
The `/ignore` API is equivalent to detaching, but without deleting local files.
### Timeline/Branch creation & deletion
All of the previous arguments for safety have described operations within
a timeline, where we may describe a sequence that includes updates to
index_part.json, and where reads and writes are coming from a postgres
endpoint (writes via the safekeeper).
Creating or destroying timeline is a bit different, because writes
are coming from the control plane.
We must be safe against scenarios such as:
- A tenant is attached to pageserver B while pageserver A is
in the middle of servicing an RPC from the control plane to
create or delete a tenant.
- A pageserver A has been sent a timeline creation request
but becomes unresponsive. The tenant is attached to a
different pageserver B, and the timeline creation request
is sent there too.
#### Timeline Creation
If some very slow node tries to do a timeline creation _after_
a more recent generation node has already created the timeline
and written some data into it, that must not cause harm. This
is provided in timeline creations by the way all the objects
within the timeline's remote path include a generation suffix:
a slow node in an old generation that attempts to "create" a timeline
that already exists will just emit an index_part.json with
an old generation suffix.
Timeline IDs are never reused, so we don't have
to worry about the case of create/delete/create cycles. If they
were re-used during a disaster recovery "un-delete" of a timeline,
that special case can be handled by calling out to all available pageservers
to check that they return 404 for the timeline, and to flush their
deletion queues in case they had any deletions pending from the
timeline.
The above makes it safe for control plane to change the assignment of
tenant to pageserver in control plane while a timeline creation is ongoing.
The reason is that the creation request against the new assigned pageserver
uses a new generation number. However, care must be taken by control plane
to ensure that a "timeline creation successul" response from some pageserver
is checked for the pageserver's generation for that timeline's tenant still being the latest.
If it is not the latest, the response does not constitute a successful timeline creation.
It is acceptable to discard such responses, the scrubber will clean up the S3 state.
It is better to issue a timelien deletion request to the stale attachment.
#### Timeline Deletion
Tenant/timeline deletion operations are exempt from generation validation
on deletes, and therefore don't have to go through the same deletion
queue as GC/compaction layer deletions. This is because once a
delete is issued by the control plane, it is a promise that the
control plane will keep trying until the deletion is done, so even stale
pageservers are permitted to go ahead and delete the objects.
The implications of this for control plane are:
- During timeline/tenant deletion, the control plane must wait for the deletion to
be truly complete (status 404) and also handle the case where the pageserver
becomes unavailable, either by waiting for a replacement with the same node_id,
or by *re-attaching the tenant elsewhere.
- The control plane must persist its intent to delete
a timeline/tenant before issuing any RPCs, and then once it starts, it must
keep retrying until the tenant/timeline is gone. This is already handled
by using a persistent `Operation` record that is retried indefinitely.
Timeline deletion may result in a special kind of object leak, where
the latest generation attachment completes a deletion (including erasing
all objects in the timeline path), but some slow/partitioned node is
writing into the timeline path with a stale generation number. This would
not be caught by any per-timeline scrubbing (see [scrubbing](#cleaning-up-orphan-objects-scrubbing)), since scrubbing happens on the
attached pageserver, and once the timeline is deleted it isn't attached anywhere.
This scenario should be pretty rare, and the control plane can make it even
rarer by ensuring that if a tenant is in a multi-attached state (e.g. during
migration), we wait for that to complete before processing the deletion. Beyond
that, we may implement some other top-level scrub of timelines in
an external tool, to identify any tenant/timeline paths that are not found
in the control plane database.
#### Examples
- Deletion, node restarts partway through:
- By the time we returned 202, we have written a remote delete marker
- Any subsequent incarnation of the same node_id will see the remote
delete marker and continue to process the deletion
- If the original pageserver is lost permanently and no replacement
with the same node_id is available, then the control plane must recover
by re-attaching the tenant to a different node.
- Creation, node becomes unresponsive partway through.
- Control plane will see HTTP request timeout, keep re-issuing
request to whoever is the latest attachment point for the tenant
until it succeeds.
- Stale nodes may be trying to execute timeline creation: they will
write out index_part.json files with
stale attachment generation: these will be eventually cleaned up
by the same mechanism as other old indices.
### Unsafe case on badly behaved infrastructure
This section is only relevant if running on a different environment
than EC2 machines with ephemeral disks.
If we ever run pageservers on infrastructure that might transparently restart
a pageserver while leaving an old process running (e.g. a VM gets rescheduled
without the old one being fenced), then there is a risk of corruption, when
the control plane attaches the tenant, as follows:
- If the control plane sends an `/attach` request to node A, then node A dies
and is replaced, and the control plane's retries the request without
incrementing that attachment ID, then it could end up with two physical nodes
both using the same generation number.
- This is not an issue when using EC2 instances with ephemeral storage, as long
as the control plane never re-uses a node ID, but it would need re-examining
if running on different infrastructure.
- To robustly protect against this class of issue, we would either:
- add a "node generation" to distinguish between different processes holding the
same node_id.
- or, dispense with static node_id entirely and issue an ephemeral ID to each
pageserver process when it starts.
## Implementation Part 2: Optimizations
### Persistent deletion queue
Between writing our a new index_part.json that doesn't reference an object,
and executing the deletion, an object passes through a window where it is
only referenced in memory, and could be leaked if the pageserver is stopped
uncleanly. That introduces conflicting incentives: on the one hand, we would
like to delay and batch deletions to
1. minimize the cost of the mandatory validations calls to control plane, and
2. minimize cost for DeleteObjects requests.
On the other hand we would also like to minimize leakage by executing
deletions promptly.
To resolve this, we may make the deletion queue persistent
and then executing these in the background at a later time.
_Note: The deletion queue's reason for existence is optimization rather than correctness,
so there is a lot of flexibility in exactly how the it should work,
as long as it obeys the rule to validate generations before executing deletions,
so the following details are not essential to the overall RFC._
#### Scope
The deletion queue will be global per pageserver, not per-tenant. There
are several reasons for this choice:
- Use the queue as a central point to coalesce validation requests to the
control plane: this avoids individual `Timeline` objects ever touching
the control plane API, and avoids them having to know the rules about
validating deletions. This separation of concerns will avoid burdening
the already many-LoC `Timeline` type with even more responsibility.
- Decouple the deletion queue from Tenant attachment lifetime: we may
"hibernate" an inactive tenant by tearing down its `Tenant`/`Timeline`
objects in the pageserver, without having to wait for deletions to be done.
- Amortize the cost of I/O for the persistent queue, instead of having many
tiny queues.
- Coalesce deletions into a smaller number of larger DeleteObjects calls
Because of the cost of doing I/O for persistence, and the desire to coalesce
generation validation requests across tenants, and coalesce deletions into
larger DeleteObjects requests, there will be one deletion queue per pageserver
rather than one per tenant. This has the added benefit that when deactivating
a tenant, we do not have to drain their deletion queue: deletions can proceed
for a tenant whose main `Tenant` object has been torn down.
#### Flow of deletion
The flow of a deletion is becomes:
1. Need for deletion of an object (=> layer file) is identified.
2. Unlink the object from all the places that reference it (=> `index_part.json`).
3. Enqueue the deletion to a persistent queue.
Each entry is `tenant_id, attachment_generation, S3 key`.
4. Validate & execute in batches:
4.1 For a batch of entries, call into control plane.
4.2 For the subset of entries that passed validation, execute a `DeleteObjects` S3 DELETE request for their S3 keys.
As outlined in the Part 1 on correctness, it is critical that deletions are only
executed once the key is not referenced anywhere in S3.
This property is obviously upheld by the scheme above.
#### We Accept Object Leakage In Acceptable Circumcstances
If we crash in the flow above between (2) and (3), we lose track of unreferenced object.
Further, enqueuing a single to the persistent queue may not be durable immediately to amortize cost of flush to disk.
This is acceptable for now, it can be caught by [the scrubber](#cleaning-up-orphan-objects-scrubbing).
There are various measures we can take to improve this in the future.
1. Cap amount of time until enqueued entry becomes durable (timeout for flush-to-tisk)
2. Proactively flush:
- On graceful shutdown, as we anticipate that some or
all of our attachments may be re-assigned while we are offline.
- On tenant detach.
3. For each entry, keep track of whether it has passed (2).
Only admit entries to (4) one they have passed (2).
This requires re-writing / two queue entries (intent, commit) per deletion.
The important take-away with any of the above is that it's not
disastrous to leak objects in exceptional circumstances.
#### Operations that may skip the queue
Deletions of an entire timeline are [exempt](#Timeline-Deletion) from generation number validation. Once the
control plane sends the deletion request, there is no requirement to retain the readability
of any data within the timeline, and all objects within the timeline path may be deleted
at any time from the control plane's deletion request onwards.
Since deletions of smaller timelines won't have enough objects to compose a full sized
DeleteObjects request, it is still useful to send these through the last part of the
deletion pipeline to coalesce with other executing deletions: to enable this, the
deletion queue should expose two input channels: one for deletions that must be
processed in a generation-aware way, and a fast path for timeline deletions, where
that fast path may skip validation and the persistent queue.
### Cleaning up orphan objects (scrubbing)
An orphan object is any object which is no longer referenced by a running node or by metadata.
Examples of how orphan objects arise:
- A node PUTs a layer object, then crashes before it writes the
index_part.json that references that layer.
- A stale node carries on running for some time, and writes out an unbounded number of
objects while it believes itself to be the rightful writer for a tenant.
- A pageserver crashes between un-linking an object from the index, and persisting
the object to its deletion queue.
Orphan objects are functionally harmless, but have a small cost due to S3 capacity consumed. We
may clean them up at some time in the future, but doing a ListObjectsv2 operation and cross
referencing with the latest metadata to identify objects which are not referenced.
Scrubbing will be done only by an attached pageserver (not some third party process), and deletions requested during scrub will go through the same
validation as all other deletions: the attachment generation must be
fresh. This avoids the possibility of a stale pageserver incorrectly
thinking than an object written by a newer generation is stale, and deleting
it.
It is not strictly necessary that scrubbing be done by an attached
pageserver: it could also be done externally. However, an external
scrubber would still require the same validation procedure that
a pageserver's deletion queue performs, before actually erasing
objects.
## Operational impact
### Availability
Coordination of generation numbers via the control plane introduce a dependency for certain
operations:
1. Starting new pageservers (or activating pageservers after a restart)
2. Executing enqueued deletions
3. Advertising updated `remote_consistent_lsn` to enable WAL trimming
Item 1. would mean that some in-place restarts that previously would have resumed service even if the control plane were
unavailable, will now not resume service to users until the control plane is available. We could
avoid this by having a timeout on communication with the control plane, and after some timeout,
resume service with the previous generation numbers (assuming this was persisted to disk). However,
this is unlikely to be needed as the control plane is already an essential & highly available component. Also, having a node re-use an old generation number would complicate
reasoning about the system, as it would break the invariant that a generation number uniquely identifies
a tenant's attachment to a given pageserver _process_: it would merely identify the tenant's attachment
to the pageserver _machine_ or its _on-disk-state_.
Item 2. is a non-issue operationally: it's harmless to delay deletions, the only impact of objects pending deletion is
the S3 capacity cost.
Item 3. could be an issue if safekeepers are low on disk space and the control plane is unavailable for a long time. If this became an issue,
we could adjust the safekeeper to delete segments from local disk sooner, as soon as they're uploaded to S3, rather than waiting for
remote_consistent_lsn to advance.
For a managed service, the general approach should be to make sure we are monitoring & respond fast enough
that control plane outages are bounded in time.
There is also the fact that control plane runs in a single region.
The latency for distant regions is not a big concern for us because all request types added by this RFC are either infrequent or not in the way of the data path.
However, we lose region isolation for the operations listed above.
The ongoing work to split console and control will give us per-region control plane, and all operations in this RFC can be handled by these per-region control planes.
With that in mind, we accept the trade-offs outlined in this paragraph.
We will also implement an "escape hatch" config generation numbers, where in a major disaster outage,
we may manually run pageservers with a hand-selected generation number, so that we can bring them online
independently of a control plane.
### Rollout
Although there is coupling between components, we may deploy most of the new data plane components
independently of the control plane: initially they can just use a static generation number.
#### Phase 1
The pageserver is deployed with some special config to:
- Always act like everything is generation 1 and do not wait for a control plane issued generation on attach
- Skip the places in deletion and remote_consistent_lsn updates where we would call into control plane
#### Phase 2
The control plane changes are deployed: control plane will now track and increment generation numbers.
#### Phase 3
The pageserver is deployed with its control-plane-dependent changes enabled: it will now require
the control plane to service re-attach requests on startup, and handle generation
validation requests.
### On-disk backward compatibility
Backward compatibility with existing data is straightforward:
- When reading the index, we may assume that any layer whose metadata doesn't include
generations will have a path without generation suffix.
- When locating the index file on attachment, we may use the "fallback" listing path
and if there is only an index without generation suffix, that is the one we load.
It is not necessary to re-write existing layers: even new index files will be able
to represent generation-less layers.
### On-disk forward compatibility
We will do a two phase rollout, probably over multiple releases because we will naturally
have some of the read-side code ready before the overall functionality is ready:
1. Deploy pageservers which understand the new index format and generation suffixes
in keys, but do not write objects with generation numbers in the keys.
2. Deploy pageservers that write objects with generation numbers in the keys.
Old pageservers will be oblivious to generation numbers. That means that they can't
read objects with generation numbers in the name. This is why we must
first step must deploy the ability to read, before the second step
starts writing them.
# Frequently Asked Questions
## Why a generation _suffix_ rather than _prefix_?
The choice is motivated by object listing, since one can list by prefix but not
suffix.
In [finding remote indices](#finding-the-remote-indices-for-timelines), we rely
on being able to do a prefix listing for `<tenant>/<timeline>/index_part.json*`.
That relies on the prefix listing.
The converse case of using a generation prefix and listing by generation is
not needed: one could imagine listing by generation while scrubbing (so that
a particular generation's layers could be scrubbed), but this is not part
of normal operations, and the [scrubber](#cleaning-up-orphan-objects-scrubbing) probably won't work that way anyway.
## Wouldn't it be simpler to have a separate deletion queue per timeline?
Functionally speaking, we could. That's how RemoteTimelineClient currently works,
but this approach does not map well to a long-lived persistent queue with
generation validation.
Anything we do per-timeline generates tiny random I/O, on a pageserver with
tens of thousands of timelines operating: to be ready for high scale, we should:
- A) Amortize costs where we can (e.g. a shared deletion queue)
- B) Expect to put tenants into a quiescent state while they're not
busy: i.e. we shouldn't keep a tenant alive to service its deletion queue.
This was discussed in the [scope](#scope) part of the deletion queue section.
# Appendix A: Examples of use in high availability/failover
The generation numbers proposed in this RFC are adaptable to a variety of different
failover scenarios and models. The sections below sketch how they would work in practice.
### In-place restart of a pageserver
"In-place" here means that the restart is done before any other element in the system
has taken action in response to the node being down.
- After restart, the node issues a re-attach request to the control plane, and
receives new generation numbers for all its attached tenants.
- Tenants may be activated with the generation number in the re-attach response.
- If any of its attachments were in fact stale (i.e. had be reassigned to another
node while this node was offline), then
- the re-attach response will inform the tenant about this by not including
the tenant of this by _not_ incrementing the generation for that attachment.
- This will implicitly block deletions in the tenant, but as an optimization
the pageserver should also proactively stop doing S3 uploads when it notices this stale-generation state.
- The control plane is expected to eventually detach this tenant from the
pageserver.
If the control plane does not include a tenant in the re-attach response,
but there is still local state for the tenant in the filesystem, the pageserver
deletes the local state in response and does not load/active the tenant.
See the [earlier section on pageserver startup](#pageserver-attachstartup-changes) for details.
Control plane can use this mechanism to clean up a pageserver that has been
down for so long that all its tenants were migrated away before it came back
up again and asked for re-attach.
### Failure of a pageserver
In this context, read "failure" as the most ambiguous possible case, where
a pageserver is unavailable to clients and control plane, but may still be executing and talking
to S3.
#### Case A: re-attachment to other nodes
1. Let's say node 0 becomes unresponsive in a cluster of three nodes 0, 1, 2.
2. Some external mechanism notices that the node is unavailable and initiates
movement of all tenants attached to that node to a different node according
to some distribution rule.
In this example, it would mean incrementing the generation
of all tenants that were attached to node 0, as each tenant's assigned pageserver changes.
3. A tenant which is now attached to node 1 will _also_ still be attached to node
0, from the perspective of node 0. Node 0 will still be using its old generation,
node 1 will be using a newer generation.
4. S3 writes will continue from nodes 0 and 1: there will be an index_part.json-00000001
\_and\* an index_part.json-00000002. Objects written under the old suffix
after the new attachment was created do not matter from the rest of the system's
perspective: the endpoints are reading from the new attachment location. Objects
written by node 0 are just garbage that can be cleaned up at leisure. Node 0 will
not do any deletions because it can't synchronize with control plane, or if it could,
its deletion queue processing would get errors for the validation requests.
#### Case B: direct node replacement with same node_id and drive
This is the scenario we would experience if running pageservers in some dynamic
VM/container environment that would auto-replace a given node_id when it became
unresponsive, with the node's storage supplied by some network block device
that is attached to the replacement VM/container.
1. Let's say node 0 fails, and there may be some other peers but they aren't relevant.
2. Some external mechanism notices that the node is unavailable, and creates
a "new node 0" (Node 0b) which is a physically separate server. The original node 0
(Node 0a) may still be running, because we do not assume the environment fences nodes.
3. On startup, node 0b re-attaches and gets higher generation numbers for
all tenants.
4. S3 writes continue from nodes 0a and 0b, but the writes do not collide due to different
generation in the suffix, and the writes from node 0a are not visible to the rest
of the system because endpoints are reading only from node 0b.
# Appendix B: interoperability with other features
## Sharded Keyspace
The design in this RFC maps neatly to a sharded keyspace design where subsets of the key space
for a tenant are assigned to different pageservers:
- the "unit of work" for attachments becomes something like a TenantShard rather than a Tenant
- TenantShards get generation numbers just as Tenants do.
- Write workload (ingest, compaction) for a tenant is spread out across pageservers via
TenantShards, but each TenantShard still has exactly one valid writer at a time.
## Read replicas
_This section is about a passive reader of S3 pageserver state, not a postgres
read replica_
For historical reads to LSNs below the remote persistent LSN, any node may act as a reader at any
time: remote data is logically immutable data, and the use of deferred deletion in this RFC helps
mitigate the fact that remote data is not _physically_ immutable (i.e. the actual data for a given
page moves around as compaction happens).
A read replica needs to be aware of generations in remote data in order to read the latest
metadata (find the index_part.json with the latest suffix). It may either query this
from the control plane, or find it with ListObjectsv2 request
## Seamless migration
To make tenant migration totally seamless, we will probably want to intentionally double-attach
a tenant briefly, serving reads from the old node while waiting for the new node to be ready.
This RFC enables that double-attachment: two nodes may be attached at the same time, with the migration destination
having a higher generation number. The old node will be able to ingest and serve reads, but not
do any deletes. The new node's attachment must also avoid deleting layers that the old node may
still use. A new piece of state
will be needed for this in the control plane's definition of an attachment.
## Warm secondary locations
To enable faster tenant movement after a pageserver is lost, we will probably want to spend some
disk capacity on keeping standby locations populated with local disk data.
There's no conflict between this RFC and that: implementing warm secondary locations on a per-tenant basis
would be a separate change to the control plane to store standby location(s) for a tenant. Because
the standbys do not write to S3, they do not need to be assigned generation numbers. When a tenant is
re-attached to a standby location, that would increment the tenant attachment generation and this
would work the same as any other attachment change, but with a warm cache.
## Ephemeral node IDs
This RFC intentionally avoids changing anything fundamental about how pageservers are identified
and registered with the control plane, to avoid coupling the implementation of pageserver split
brain protection with more fundamental changes in the management of the pageservers.
Moving to ephemeral node IDs would provide an extra layer of
resilience in the system, as it would prevent the control plane
accidentally attaching to two physical nodes with the same
generation, if somehow there were two physical nodes with
the same node IDs (currently we rely on EC2 guarantees to
eliminate this scenario). With ephemeral node IDs, there would be
no possibility of that happening, no matter the behavior of
underlying infrastructure.
Nothing fundamental in the pageserver's handling of generations needs to change to handle ephemeral node IDs, since we hardly use the
`node_id` anywhere. The `/re-attach` API would be extended
to enable the pageserver to obtain its ephemeral ID, and provide
some correlation identifier (e.g. EC instance ID), to help the
control plane re-attach tenants to the same physical server that
previously had them attached.

View File

@@ -0,0 +1,316 @@
This is a copy from the [original Notion page](https://www.notion.so/neondatabase/Proposal-Pageserver-MVCC-S3-Storage-8a424c0c7ec5459e89d3e3f00e87657c?pvs=4), taken on 2023-08-16.
This is for archival mostly.
The RFC that we're likely to go with is https://github.com/neondatabase/neon/pull/4919.
---
# Proposal: Pageserver MVCC S3 Storage
tl;dr: this proposal enables Control Plane to attach a tenant to a new pageserver without being 100% certain that it has been detached from the old pageserver. This enables us to automate failover if a pageserver dies (no human in the loop).
# Problem Statement
The current Neon architecture requires the Control Plane to guarantee that a tenant is only attached to one pageserver at a time. If a tenant is attached to multiple pageservers simultaneously, the pageservers will overwrite each others changes in S3 for that tenant, resulting in data loss for that tenant.
The above imposes limitations on tenant relocation and future designs for high availability. For instance, Control Plane cannot relocate a tenant to another pageserver before it is 100% certain that the tenant is detached from the source pageserver. If the source pageserver is unresponsive, the tenant detach procedure cannot proceed, and Control Plane has no choice but to wait for either the source to become responsive again, or rely on a node failure detection mechanism to detect that the source pageserver is dead, and give permission to skip the detachment step. Either way, the tenant is unavailable for an extended period, and we have no means to improve it in the current architecture.
Note that there is no 100% correct node failure detection mechanism, and even techniques to accelerate failure detection, such as ********************************shoot-the-other-node-in-the-head,******************************** have their limits. So, we currently rely on humans as node failure detectors: they get alerted via PagerDuty, assess the situation under high stress, and make the decision. If they make the wrong call, or the apparent dead pageserver somehow resurrects later, well have data loss.
Also, by relying on humans, were [incurring needless unscalable toil](https://sre.google/sre-book/eliminating-toil/): as Neon grows, pageserver failures will become more and more frequent because our fleet grows. Each instance will need quick response time to minimize downtime for the affected tenants, which implies higher toil, higher resulting attrition, and/or higher personnel cost.
Lastly, there are foreseeable needs by operation and product such as zero-downtime relocation and automatic failover/HA. For such features, the ability to have a tenant purposefully or accidentally attached to more than one pageserver will greatly reduce risk of data loss, and improve availability.
# High-Level Idea
The core idea is to evolve the per-Tenant S3 state to an MVCC-like scheme, allowing multiple pageservers to operate on the same tenant S3 state without interference. To make changes to S3, pageservers acquire long-running transactions from Control Plane. After opening a transaction, Pageservers make PUTs directly against S3, but they keys include the transaction ID, so overwrites never happen. Periodically, pageservers talk back to Control Plane to commit their transaction. This is where Control Plane enforces strict linearizability, favoring availability over work-conservation: commit is only granted if no transaction started after the one thats requesting commit. Garbage collection is done through deadlists, and its simplified tremendously by above commit grant/reject policy.
Minimal changes are required for safekeepers to allow WAL for a single timeline be consumed by more than one pageserver without premature truncation.
**Above scheme makes it safe to attach tenants without a 100% correct node failure detection mechanism. Further, it makes it safe to interleave tenant-attachment to pageservers, unlocking new capabilities for (internal) product features:**
- **Fast, Zero-Toil Failover on Network Partitions or Instance Failure**: if a pageserver is not reachable (network partition, hardware failure, overload) we want to spread its attached tenants to new pageservers to restore availability, within the range of *seconds*. We cannot afford gracious timeouts to maximize the probability that the unreachable pageserver has ceased writing to S3. This proposal enables us to attach the tenants to the replacement pageservers, and redirect their computes, without having to wait for confirmation that the unreachable pageserver has ceased writing to S3.
- **************************************Zero-Downtime Relocation:************************************** we want to be able to relocate tenants to different pageservers with minimized availability or a latency impact. This proposal enables us to attach the relocating Tenant to the destination Pageserver before detaching it from the source Pageserver. This can help minimize downtime because we can wait for the destination to catch up on WAL processing before redirecting Computes.
# Design
The core idea is to evolve the per-Tenant S3 state to a per-tenant MVCC-like scheme.
To make S3 changes for a given tenant, Pageserver requests a transaction ID from control plane for that tenant. Without a transaction ID, Pageserver does not write to S3.
Once Pageserver received a transaction ID it is allowed to produce new objects and overwrite objects created in this transaction. Pageserver is not allowed to delete any objects; instead, it marks the object as deleted by appending the key to the transactions deadlist for later deletion. Commits of transactions are serialized through Control Plane: when Pageserver wants to commit a transaction, it sends an RPC to Control Plane. Control Plane responds with a commit grant or commit reject message. Commit grant means that the transactions changes are now visible to subsequent transactions. Commit reject means that the transactions changes are not and never will be visible to another Pageserver instance, and the rejected Pageserver is to cease further activity on that tenant.
## ****************************************************Commit grant/reject policy****************************************************
For the purposes of Pageserver, we want **linearizability** of a tenants S3 state. Since our transactions are scoped per tenant, it is sufficient for linearizability to grant commit if and only if no other transaction has been started since the commit-requesting transaction started.
For example, consider the case of a single tenant, attached to Pageserver A. Pageserver A has an open transaction but becomes unresponsive. Control Plane decides to relocate the tenant to another Pageserver B. It need *not* wait for A to be 100%-certainly down before B can start uploading to S3 for that tenant. Instead, B can start a new transaction right away, make progress, and get commit grants; What about A? The transaction is RejectPending in Control Plane until A eventually becomes responsive again, tries to commit, gets a rejection, acknowledges it, and thus its transaction becomes RejectAcknowledge. If A is definitively dead, operator can also force-transition from state RejectPending to RejectAcknowledged. But critically, Control Plane doesnt have for As transaction to become RejectAcknowledge before attaching the tenant to B.
```mermaid
sequenceDiagram
participant CP
participant A
participant S3
participant B
CP -->> A: attach tenant
activate A
A -->> CP: start txn
CP -->> A: txn=23, last_committed_txn=22
Note over CP,A: network partition
CP --x A: heartbeat
CP --x A: heartbeat
Note over CP: relocate tenant to avoid downtime
CP -->> B: attach tenant
activate B
B -->> CP: start txn
Note over CP: mark A's txn 23 as RejectPending
CP -->> B: txn=24, last-committed txn is 22
B -->> S3: PUT X.layer.24<br>PUT index_part.json.24 referencing X.layer.24
B -->> CP: request commit
CP -->> B: granted
B -->> CP: start txn
CP -->> B: txn=25, last_committed_txn=22
A -->> S3: PUT Y.layer.23 <br> PUT index_part.json.23 referencing Y.layer.23
A --x CP: request commit
A --x CP: request commit
Note over CP,A: partition is over
A -->> CP: request commit
Note over CP: most recently started txn is 25, not 23, reject
CP -->> A: reject
A -->> CP: acknowledge reject
Note over CP: mark A's txn 23 as RejectAcknowledged
deactivate A
B -->> S3: PUT 000-FFF_X-Y.layer.**************25**************<br>...
deactivate B
```
If a Pageserver gets a rejection to a commit request, it acknowledges rejection and cedes further S3 uploads for the tenant, until it receives a `/detach` request for the tenant (control plane has most likely attached the tenant to another pageserver in the meantime).
In practice, Control Plane will probably extend the commit grant/reject schema above, taking into account the pageserver to which it last attached the tenant. In the above example, Control Plane could remember that the pageserver that is supposed to host the tenant is pageserver B, and reject start-txn and commit requests from pageserver A. It would also use such requests from A as a signal that A is reachable again, and retry the `/detach` .
<aside>
💡 A commit failure causes the tenant to become effectively `Broken`. Pageserver should persist this locally so it doesnt bother ControlPlane for a new txn when Pageserver is restarted.
</aside>
## ********************Visibility********************
We mentioned earlier that once a transaction commits, its changes are visible to subsequent transactions. But how does a given transaction know where to look for the data? There is no longer a single `index_part.json` per timeline, or a single `timelines/:timeline_id` prefix to look for; theyre all multi-versioned, suffixed by the txn number.
The solution is: at transaction start, Pageserver receives the last-committed transaction ID from Control Plane (`last_committed_txn` in the diagram). last_commited_txn is the upper bound for what is visible for the current transaction. Control Plane keeps track of each open transactions last_committed_txn for purposes of garbage collection (see later paragraph).
Equipped with last_committed_txn, Pageserver then discovers
- the current index part of a timeline at `tenants/:tenant_id/timelines/:timeline_id/index_part.json.$last_committed_txn`. The `index_part.json.$last_committed_txn` has the exact same contents as the current architectures index_part.json, i.e. full list of layers.
- the list of existent timelines as part of the `attach` RPC from CP;
There is no other S3 state per tenant, so, thats all the visibility required.
An alternative to receiving the list of existent timelines from CP is to introduce a proper **********SetOfTimelines********** object in S3, and multi-version it just like above. For example, we could have a `tenants/:tenant_id/timelines.json.$txn` file that references `index_part.json.$last_committed_txn` . It can be added later if more separation between CP and PS is desired.
So, the only MVCCed object types in this proposal are LayerFile and IndexPart (=individual timeline), but not the SetOfTimelines in a given tenant. Is this a problem? For example, the Pageservers garbage collection code needs to know the full set of timelines of a tenant. Otherwise itll make incorrect decisions. What if Pageserver A knows about timelines {R,S}, but another Pageserver B created an additional branch T, so, its set of timelines is {R,S,T}. Both pageservers will run GC code, and so, PS A may decide to delete a layer thats still needed for branch T. Not a problem with this propsoal, because the effect of GC (i.e., layer deletion) is properly MVCCed.
## Longevity Of Transactions & Availability
Pageserver depends on Control Plane to start a new transaction. If ControlPlane is down, no new transactions can be started.
Pageservers commit transactions based on a maximum amount of uncommitted changes that have accumulated in S3. A lower maximum increases dependence and load on ControlPlane which decreases availability. A higher maximum risks losing more work in the event of failover; the work will have to be re-done in a new transaction on the new node.
Pageservers are persist the open txn id in local storage, so that they can resume the transaction after restart, without dependence on Control Plane.
## **Operations**
********PUTs:********
- **layer files**
- current architecture: layer files are supposed to be write-once, but actually, there are edge-cases where we PUT the same layer file name twice; namely if we PUT the file to S3 but crash before uploading the index part that references it; then detach + attach, and re-run compaction, which is non-deterministic.
- this proposal: with transactions, we can now upload layers and index_part.json concurrently, just need to make sure layer file upload is done before we request txn commit.
- **index part** upload: `index_part.json.$txn` may be created and subsequently overwritten multiple times in a transaction; it is an availability/work-loss trade-off how often to request a commit from CP.
**************DELETEs**************: for deletion, we maintain a deadlist per transaction. It is located at `tenants/:tenant_id/deadlist/deadlist.json.$txn`. It is PUT once before the pageserver requests requests commit, and not changed after sending request to commit. An object created in the current txn need not (but can) be on the deadlist — it can be DELETEd immediately because its not visible to other transactions. An example use case would be an L0 layer that gets compacted within one transaction; or, if we ever start MVCCing the set of timelines of a tenant, a short-lived branch that is created & destroyed within one transaction.
<aside>
**Deadlist Invariant:** if a an object is on a deadlist of transaction T, it is not referenced from anywhere else in the full state visible to T or any later started transaction > T.
</aside>
### Rationale For Deadlist.json
Given that this proposal only MVCCs layers and indexparts, one may ask why the deadlist isnt part of indexpart. The reason is to not lose generality: the deadlist is just a list of keys; it is not necessary to understand the data format of the versioned object to process the deadlist. This is important for garbage collection / vacuuming, which well come to in the next section.
## Garbage Collection / Vacuuming
After a transaction has reached reject-acknowledged state, Control Plane initiates a garbage collection procedure for the aborted transaction.
Control Plane is in the unique position about transaction states. Here is a sketch of the exact transaction states and what Control Plane keeps track of.
```
struct Tenant {
...
txns: HashMap<TxnId, Transaction>,
// the most recently started txn's id; only most recently sarted can win
next_winner_txn: Option<TxnId>,
}
struct Transaction {
id: TxnId, // immutable
last_committed_txn: TxnId, // immutable; the most recent txn in state `Committed`
// when self was started
pageserver_id: PageserverId,
state: enum {
Open,
Committed,
RejectPending,
RejectAcknowledged, // invariant: we know all S3 activity has ceded
GarbageCollected,
}
}
```
Object creations & deletions by a rejected transaction have never been visible to other transactions. That is true for both RejectPending and RejectAcknowledged states. The difference is that, in RejectPending, the pageserver may still be uploading to S3, whereas in RejectAcknowledged, Control Plane can be certain that all S3 activity in the name of that transaction has ceded. So, once a transaction reaches state RejectAcknowledged state, it is safe to DELETE all objects created by that transaction, and discard the transactions deadlists.
A transaction T in state Committed has subsequent transactions that may or may not reference the objects it created. None of the subsequent transaction can reference the objects on Ts deadlist, though, as per the Deadlist Invariant (see previous section).
So, for garbage collection, we need to assess transactions in state Committed and RejectAcknowledged:
- Commited: delete objects on the deadlist.
- We dont need a LIST request here, the deadlist is sufficient. So, its really cheap.
- This is **not true MVCC garbage collection**; by deleting the objects on Committed transaction T s deadlist, we might delete data referenced by other transactions that were concurrent with T, i.e., they started while T was still open. However, the fact that T is committed means that the other transactions are RejectPending or RejectAcknowledged, so, they dont matter. Pageservers executing these doomed RejectPending transactions must handle 404 for GETs gracefully, e.g., by trying to commit txn so they observe the rejection theyre destined to get anyways. 404s for RejectAcknowledged is handled below.
- RejectAcknowledged: delete all objects created in that txn, and discard deadlists.
- 404s / object-already-deleted type messages must be expected because of Committed garbage collection (see above)
- How to get this list of objects created in a txn? Open but solvable design question; Ideas:
- **Brute force**: within tenant prefix, search for all keys ending in `.$txn` and delete them.
- **WAL for PUTs**: before a txn PUTs an object, it logs to S3, or some other equivalently durable storage, that its going to do it. If we log to S3, this means we have to do an additional WAL PUT per “readl” PUT.
- ******************************LIST with reorged S3 layout (preferred one right now):****************************** layout S3 key space such that `$txn` comes first, i.e., `tenants/:tenant_id/$txn/timelines/:timeline_id/*.json.$txn` . That way, when we need to GC a RejectAcknowledged txn, we just LIST the entire `tenants/:tenant_id/$txn` prefix and delete it. The cost of GC for RejectAcknowledged transactions is thus proportional to the number of objects created in that transaction.
## Branches
This proposal only MVCCs layer files and and index_part.json, but leaves the tenant object not-MVCCed. We argued earlier that its fine to ignore this for now, because
1. Control Plane can act as source-of-truth for the set of timelines, and
2. The only operation that makes decision based on “set of timelines” is GC, which in turn only does layer deletions, and layer deletions ***are*** properly MVCCed.
Now that weve introduced garbage collection, lets elaborate a little more on (2). Recall our example from earlier: Pageserver A knows about timelines {R,S}, but another Pageserver B created an additional branch T, so, its set of timelines is {R,S,T}. Both pageservers will run GC code, and so, PS A may decide to delete a layer thats still needed for branch T.
How does the MVCCing of layer files protect us here? If A decides to delete that layer, its just on As transactions deadlist, but still present in S3 and usable by B. If A commits first, B wont be able to commit and the layers in timeline T will be vacuumed. If B commits first, As deadlist is discarded and the layer continues to exist.
## Safekeeper Changes
We need to teach the safekeepers that there can be multiple pageservers requesting WAL for the same timeline, in order to prevent premature WAL truncation.
In the current architecture, the Safekeeper service currently assumes only one Pageserver and is allowed to prune WAL older than that Pageservers `remote_consistent_lsn`. Safekeeper currently learns the `remote_consistent_lsn` through the walreceiver protocol.
So, if we have a tenant attached to two pageservers at the same time, they will both try to stream WAL and the Safekeeper will get confused about which connections `remote_consistent_lsn` to use as a basis for WAL pruning.
What do we need to change to make it work? We need to make sure that the Safekeepers only prune WAL up to the `remote_consistent_lsn` of the last-committed transaction.
The straight-forward way to get it is to re-design WAL pruning as follows:
1. Pageserver reports remote_consistent_lsn as part of transaction commit to Control Plane.
2. Control Plane makes sure transaction state update is persisted.
3. Control Plane (asynchronous to transaction commit) reconciles with Safekeepers to ensure WAL pruning happens.
The above requires non-trivial changes, but, in the light of other planned projects such as restore-tenant-from-safekeeper-wal-backups, I think Control Plane will need to get involved in WAL pruning anyways.
# How This Proposal Unlocks Future Features
Let us revisit the example from the introduction where we were thinking about handling network partitions. Network partitions need to be solved first, because theyre unavoidable in distributed systems. We did that. Now lets see how we can solve actual product problems:
## **Fast, Zero-Toil Failover on Network Partitions or Instance Failure**
The “Problem Statement” section outlined the current architectures problems with regards to network partitions or instance failure: it requires a 100% correct node-dead detector to make decisions, which doesnt exist in reality. We rely instead on human toil: an oncall engineer has to inspect the situation and make a decision, which may be incorrect and in any case take time in the order of minutes, which means equivalent downtime for users.
With this proposal, automatic failover for pageservers is trivial:
If a pageserver is unresponsive from Control Planes / Computes perspective, Control Plane does the following:
- attach all tenants of the unresponsive pageserver to new pageservers
- switch over these tenants computes immediately;
At this point, availability is restored and user pain relieved.
Whats left is to somehow close the doomed transaction of the unresponsive pageserver, so that it beomes RejectAcknowledged, and GC can make progress. Since S3 is cheap, we can afford to wait a really long time here, especially if we put a soft bound on the amount of data a transaction may produce before it must commit. Procedure:
1. Ensure the unresponsive pageserver is taken out of rotation for new attachments. That probably should happen as part of the routine above.
2. Make a human operator investigate decide what to do (next morning, NO ONCALL ALERT):
1. Inspect the instance, investigate logs, understand root cause.
2. Try to re-establish connectivity between pageserver and Control Plane so that pageserver can retry commits, get rejected, ack rejection ⇒ enable GC.
3. Use below procedure to decomission pageserver.
### Decomissioning A Pageserver (Dead or Alive-but-Unrespsonive)
The solution, enabled by this proposal:
1. Ensure that pageservers S3 credentials are revoked so that it cannot make new uploads, which wouldnt be tracked anywhere.
2. Let enough time pass for the S3 credential revocation to propagate. Amazon doesnt give a guarantee here. As stated earlier, we can easily afford to wait here.
3. Mark all Open and RejectPending transactions of that pageserver as RejectAcknowledge.
Revocation of the S3 credentials is required so that, once we transition all the transactions of that pageserver to RejectAcknowledge, once garbage-collection pass is guaranteed to delete all objects that will ever exist for that pageserver. That way, we need not check *****GarbageCollected***** transactions every again.
## Workflow: Zero-Downtime Relocation
With zero-downtime relocation, the goal is to have the target pageserver warmed up, i.e., at the same `last_record_lsn` as the source pageserver, before switching over Computes from source to target pageserver.
With this proposal, it works like so:
1. Grant source pageserver its last open transaction. This one is doomed to be rejected later, unless the relocation fails.
2. Grant target pageserver its first open transaction.
3. Have target pageserver catch up on WAL, streaming from last-committed-txns remote_consistent_lsn onwards.
4. Once target pageserver reports `last_record_lsn` close enough to source pageserver, target pageserver requests commit.
5. Drain compute traffic from source to target pageserver. (Source can still answer requests until it tries to commit and gets reject, so, this will be quite smooth).
Note that as soon as we complete step (4), the source pageservers transaction is doomed to be rejected later. Conversely, if the target cant catch up fast enough, the source will make a transaction commit earlier. This will generally happen if there is a lot of write traffic coming in. The design space to make thing smooth here is large, but well explored in other areas of computing, e.g., VM live migration. We have all the important policy levers at hand, e.g.,
- delaying source commits if we see target making progress
- slowing down source consumption (need some signalling mechanism for it)
- slowing down compute wal generation
-
It doesnt really matter, whats important is that two pageservers can overlap.
# Additional Trade-Offs / Remarks Brought Up During Peer Review
This proposal was read by and discussed @Stas and @Dmitry Rodionov prior to publishing it with the broader team. (This does not mean they endorse this proposal!).
Issues that we discussed:
1. **Frequency of transactions:** If even idle tenants commit every 10min or so, thats quite a lot of load on Control Plane. Can we minimize it by Equating Transaction Commit Period to Attachment Period? I.e. start txn on attach, commit on detach?
1. Would be nice, but, if a tenant is attached for 1 month, then PS dies, we lose 1 month of work.
2. ⇒ my solution to this problem: Adjusted this proposal to make transaction commit frequency proportional to amount of uncommitted data.
1. Its ok to spend resources on active users, they pay us money to do it!
2. The amount of work per transaction is minimal.
1. In current Control Plane, its a small database transaction that is super unlikely to conflict with other transactions.
2. I have very little concerns about scalability of the commit workload on CP side because it's trivially horizontally scalable by sharding by tenant.
3. There's no super stringent availability requirement on control plane; if a txn can't commit because it can't reach the CP, PS can continue & retry in the background, speculating that it's CP downtime and not PS-partitioned-off scenario.
4. Without stringent availability requirement, there's flexibility for future changes to CP-side-implementation.
2. ************************************************Does this proposal address mirroring / no-performance-degradation failover ?************************************************
1. No it doesnt. It only provides the building block for attaching a tenant to a new pageserver without having to worry that the tenant is detached on the old pageserver.
2. A simple scheme to build no-performance-degradation failover on top of this proposal is to have an asynchronous read-only replica of a tenant on another pageserver in the same region.
3. Another more ambitious scheme to get no-performance-degradation would be [One-Pager: Layer File Spreading (Christian)](https://www.notion.so/One-Pager-Layer-File-Spreading-Christian-eb6b64182a214e11b3fceceee688d843?pvs=21); this proposal would be used in layer file spreading for risk-free automation of TenantLeader failover, which hasnt been addressed Ithere.
4. In any way, failover would restart from an older S3 state, and need to re-ingest WAL before being able to server recently written pages.
1. Is that a show-stopper? I think not.
2. Is it suboptimal? Absolutely: if a pageserver instance fails, all its tenants will be distributed among the remaining pageservers (OK), and all these tenants will ask the safekeepers for WAL at the same time (BAD). So, pageserver instance failure will cause a load spike in safekeepers.
1. Personally I think thats an OK trade-off to make.
2. There are countless options to avoid / mitigate the load spike. E.g., pro-actively streaming WAL to the standby read-only replica.
3. ********************************************Does this proposal allow multiple writers for a tenant?********************************************
1. In abstract terms, this proposal provides a linearized history for a given S3 prefix.
2. In concrete terms, this proposal provides a linearized history per tenant.
3. There can be multiple writers at a given time, but only one of them will win to become part of the linearized history.
4. ************************************************************************************Alternative ideas mentioned during meetings that should be turned into a written prospoal like this one:************************************************************************************
1. @Dmitry Rodionov : having linearized storage of index_part.json in some database that allows serializable transactions / atomic compare-and-swap PUT
2. @Dmitry Rodionov :
3. @Stas : something like this scheme, but somehow find a way to equate attachment duration with transaction duration, without losing work if pageserver dies months after attachment.

View File

@@ -10,6 +10,9 @@ chrono.workspace = true
serde.workspace = true
serde_with.workspace = true
serde_json.workspace = true
regex.workspace = true
utils = { path = "../utils" }
remote_storage = { version = "0.1", path = "../remote_storage/" }
workspace_hack.workspace = true

View File

@@ -68,13 +68,45 @@ where
/// Response of the /metrics.json API
#[derive(Clone, Debug, Default, Serialize)]
pub struct ComputeMetrics {
/// Time spent waiting in pool
pub wait_for_spec_ms: u64,
/// Time spent checking if safekeepers are synced
pub sync_sk_check_ms: u64,
/// Time spent syncing safekeepers (walproposer.c).
/// In most cases this should be zero.
pub sync_safekeepers_ms: u64,
/// Time it took to establish a pg connection to the pageserver.
/// This is two roundtrips, so it's a good proxy for compute-pageserver
/// latency. The latency is usually 0.2ms, but it's not safe to assume
/// that.
pub pageserver_connect_micros: u64,
/// Time to get basebackup from pageserver and write it to disk.
pub basebackup_ms: u64,
/// Compressed size of basebackup received.
pub basebackup_bytes: u64,
/// Time spent starting potgres. This includes initialization of shared
/// buffers, preloading extensions, and other pg operations.
pub start_postgres_ms: u64,
/// Time spent applying pg catalog updates that were made in the console
/// UI. This should be 0 when startup time matters, since cplane tries
/// to do these updates eagerly, and passes the skip_pg_catalog_updates
/// when it's safe to skip this step.
pub config_ms: u64,
/// Total time, from when we receive the spec to when we're ready to take
/// pg connections.
pub total_startup_ms: u64,
pub load_ext_ms: u64,
pub num_ext_downloaded: u64,
pub largest_ext_size: u64, // these are measured in bytes
pub total_ext_download_size: u64,
}
/// Response of the `/computes/{compute_id}/spec` control-plane API.

View File

@@ -3,11 +3,16 @@
//! The spec.json file is used to pass information to 'compute_ctl'. It contains
//! all the information needed to start up the right version of PostgreSQL,
//! and connect it to the storage nodes.
use std::collections::HashMap;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
use regex::Regex;
use remote_storage::RemotePath;
/// String type alias representing Postgres identifier and
/// intended to be used for DB / role names.
pub type PgIdent = String;
@@ -60,6 +65,79 @@ pub struct ComputeSpec {
/// If set, 'storage_auth_token' is used as the password to authenticate to
/// the pageserver and safekeepers.
pub storage_auth_token: Option<String>,
// information about available remote extensions
pub remote_extensions: Option<RemoteExtSpec>,
}
#[derive(Clone, Debug, Default, Deserialize, Serialize)]
pub struct RemoteExtSpec {
pub public_extensions: Option<Vec<String>>,
pub custom_extensions: Option<Vec<String>>,
pub library_index: HashMap<String, String>,
pub extension_data: HashMap<String, ExtensionData>,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct ExtensionData {
pub control_data: HashMap<String, String>,
pub archive_path: String,
}
impl RemoteExtSpec {
pub fn get_ext(
&self,
ext_name: &str,
is_library: bool,
build_tag: &str,
pg_major_version: &str,
) -> anyhow::Result<(String, RemotePath)> {
let mut real_ext_name = ext_name;
if is_library {
// sometimes library names might have a suffix like
// library.so or library.so.3. We strip this off
// because library_index is based on the name without the file extension
let strip_lib_suffix = Regex::new(r"\.so.*").unwrap();
let lib_raw_name = strip_lib_suffix.replace(real_ext_name, "").to_string();
real_ext_name = self
.library_index
.get(&lib_raw_name)
.ok_or(anyhow::anyhow!("library {} is not found", lib_raw_name))?;
}
// Check if extension is present in public or custom.
// If not, then it is not allowed to be used by this compute.
if let Some(public_extensions) = &self.public_extensions {
if !public_extensions.contains(&real_ext_name.to_string()) {
if let Some(custom_extensions) = &self.custom_extensions {
if !custom_extensions.contains(&real_ext_name.to_string()) {
return Err(anyhow::anyhow!("extension {} is not found", real_ext_name));
}
}
}
}
match self.extension_data.get(real_ext_name) {
Some(_ext_data) => {
// Construct the path to the extension archive
// BUILD_TAG/PG_MAJOR_VERSION/extensions/EXTENSION_NAME.tar.zst
//
// Keep it in sync with path generation in
// https://github.com/neondatabase/build-custom-extensions/tree/main
let archive_path_str =
format!("{build_tag}/{pg_major_version}/extensions/{real_ext_name}.tar.zst");
Ok((
real_ext_name.to_string(),
RemotePath::from_string(&archive_path_str)?,
))
}
None => Err(anyhow::anyhow!(
"real_ext_name {} is not found",
real_ext_name
)),
}
}
}
#[serde_as]

View File

@@ -205,5 +205,43 @@
"name": "zenith new",
"new_name": "zenith \"new\""
}
]
],
"remote_extensions": {
"library_index": {
"anon": "anon",
"postgis-3": "postgis",
"libpgrouting-3.4": "postgis",
"postgis_raster-3": "postgis",
"postgis_sfcgal-3": "postgis",
"postgis_topology-3": "postgis",
"address_standardizer-3": "postgis"
},
"extension_data": {
"anon": {
"archive_path": "5834329303/v15/extensions/anon.tar.zst",
"control_data": {
"anon.control": "# PostgreSQL Anonymizer (anon) extension\ncomment = ''Data anonymization tools''\ndefault_version = ''1.1.0''\ndirectory=''extension/anon''\nrelocatable = false\nrequires = ''pgcrypto''\nsuperuser = false\nmodule_pathname = ''$libdir/anon''\ntrusted = true\n"
}
},
"postgis": {
"archive_path": "5834329303/v15/extensions/postgis.tar.zst",
"control_data": {
"postgis.control": "# postgis extension\ncomment = ''PostGIS geometry and geography spatial types and functions''\ndefault_version = ''3.3.2''\nmodule_pathname = ''$libdir/postgis-3''\nrelocatable = false\ntrusted = true\n",
"pgrouting.control": "# pgRouting Extension\ncomment = ''pgRouting Extension''\ndefault_version = ''3.4.2''\nmodule_pathname = ''$libdir/libpgrouting-3.4''\nrelocatable = true\nrequires = ''plpgsql''\nrequires = ''postgis''\ntrusted = true\n",
"postgis_raster.control": "# postgis_raster extension\ncomment = ''PostGIS raster types and functions''\ndefault_version = ''3.3.2''\nmodule_pathname = ''$libdir/postgis_raster-3''\nrelocatable = false\nrequires = postgis\ntrusted = true\n",
"postgis_sfcgal.control": "# postgis topology extension\ncomment = ''PostGIS SFCGAL functions''\ndefault_version = ''3.3.2''\nrelocatable = true\nrequires = postgis\ntrusted = true\n",
"postgis_topology.control": "# postgis topology extension\ncomment = ''PostGIS topology spatial types and functions''\ndefault_version = ''3.3.2''\nrelocatable = false\nschema = topology\nrequires = postgis\ntrusted = true\n",
"address_standardizer.control": "# address_standardizer extension\ncomment = ''Used to parse an address into constituent elements. Generally used to support geocoding address normalization step.''\ndefault_version = ''3.3.2''\nrelocatable = true\ntrusted = true\n",
"postgis_tiger_geocoder.control": "# postgis tiger geocoder extension\ncomment = ''PostGIS tiger geocoder and reverse geocoder''\ndefault_version = ''3.3.2''\nrelocatable = false\nschema = tiger\nrequires = ''postgis,fuzzystrmatch''\nsuperuser= false\ntrusted = true\n",
"address_standardizer_data_us.control": "# address standardizer us dataset\ncomment = ''Address Standardizer US dataset example''\ndefault_version = ''3.3.2''\nrelocatable = true\ntrusted = true\n"
}
}
},
"custom_extensions": [
"anon"
],
"public_extensions": [
"postgis"
]
}
}

View File

@@ -5,7 +5,7 @@ use chrono::{DateTime, Utc};
use rand::Rng;
use serde::Serialize;
#[derive(Serialize, Debug, Clone, Eq, PartialEq, Ord, PartialOrd)]
#[derive(Serialize, Debug, Clone, Copy, Eq, PartialEq, Ord, PartialOrd)]
#[serde(tag = "type")]
pub enum EventType {
#[serde(rename = "absolute")]
@@ -17,6 +17,32 @@ pub enum EventType {
},
}
impl EventType {
pub fn absolute_time(&self) -> Option<&DateTime<Utc>> {
use EventType::*;
match self {
Absolute { time } => Some(time),
_ => None,
}
}
pub fn incremental_timerange(&self) -> Option<std::ops::Range<&DateTime<Utc>>> {
// these can most likely be thought of as Range or RangeFull
use EventType::*;
match self {
Incremental {
start_time,
stop_time,
} => Some(start_time..stop_time),
_ => None,
}
}
pub fn is_incremental(&self) -> bool {
matches!(self, EventType::Incremental { .. })
}
}
#[derive(Serialize, Debug, Clone, Eq, PartialEq, Ord, PartialOrd)]
pub struct Event<Extra> {
#[serde(flatten)]
@@ -31,7 +57,7 @@ pub struct Event<Extra> {
pub extra: Extra,
}
pub fn idempotency_key(node_id: String) -> String {
pub fn idempotency_key(node_id: &str) -> String {
format!(
"{}-{}-{:04}",
Utc::now(),
@@ -45,6 +71,6 @@ pub const CHUNK_SIZE: usize = 1000;
// Just a wrapper around a slice of events
// to serialize it as `{"events" : [ ] }
#[derive(serde::Serialize)]
pub struct EventChunk<'a, T> {
pub events: &'a [T],
pub struct EventChunk<'a, T: Clone> {
pub events: std::borrow::Cow<'a, [T]>,
}

View File

@@ -6,6 +6,7 @@ use once_cell::sync::Lazy;
use prometheus::core::{AtomicU64, Collector, GenericGauge, GenericGaugeVec};
pub use prometheus::opts;
pub use prometheus::register;
pub use prometheus::Error;
pub use prometheus::{core, default_registry, proto};
pub use prometheus::{exponential_buckets, linear_buckets};
pub use prometheus::{register_counter_vec, Counter, CounterVec};

View File

@@ -57,9 +57,9 @@ pub fn slru_may_delete_clogsegment(segpage: u32, cutoff_page: u32) -> bool {
// Multixact utils
pub fn mx_offset_to_flags_offset(xid: MultiXactId) -> usize {
((xid / pg_constants::MULTIXACT_MEMBERS_PER_MEMBERGROUP as u32) as u16
% pg_constants::MULTIXACT_MEMBERGROUPS_PER_PAGE
* pg_constants::MULTIXACT_MEMBERGROUP_SIZE) as usize
((xid / pg_constants::MULTIXACT_MEMBERS_PER_MEMBERGROUP as u32)
% pg_constants::MULTIXACT_MEMBERGROUPS_PER_PAGE as u32
* pg_constants::MULTIXACT_MEMBERGROUP_SIZE as u32) as usize
}
pub fn mx_offset_to_flags_bitshift(xid: MultiXactId) -> u16 {
@@ -81,3 +81,41 @@ fn mx_offset_to_member_page(xid: u32) -> u32 {
pub fn mx_offset_to_member_segment(xid: u32) -> i32 {
(mx_offset_to_member_page(xid) / pg_constants::SLRU_PAGES_PER_SEGMENT) as i32
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_multixid_calc() {
// Check that the mx_offset_* functions produce the same values as the
// corresponding PostgreSQL C macros (MXOffsetTo*). These test values
// were generated by calling the PostgreSQL macros with a little C
// program.
assert_eq!(mx_offset_to_member_segment(0), 0);
assert_eq!(mx_offset_to_member_page(0), 0);
assert_eq!(mx_offset_to_flags_offset(0), 0);
assert_eq!(mx_offset_to_flags_bitshift(0), 0);
assert_eq!(mx_offset_to_member_offset(0), 4);
assert_eq!(mx_offset_to_member_segment(1), 0);
assert_eq!(mx_offset_to_member_page(1), 0);
assert_eq!(mx_offset_to_flags_offset(1), 0);
assert_eq!(mx_offset_to_flags_bitshift(1), 8);
assert_eq!(mx_offset_to_member_offset(1), 8);
assert_eq!(mx_offset_to_member_segment(123456789), 2358);
assert_eq!(mx_offset_to_member_page(123456789), 75462);
assert_eq!(mx_offset_to_flags_offset(123456789), 4780);
assert_eq!(mx_offset_to_flags_bitshift(123456789), 8);
assert_eq!(mx_offset_to_member_offset(123456789), 4788);
assert_eq!(mx_offset_to_member_segment(u32::MAX - 1), 82040);
assert_eq!(mx_offset_to_member_page(u32::MAX - 1), 2625285);
assert_eq!(mx_offset_to_flags_offset(u32::MAX - 1), 5160);
assert_eq!(mx_offset_to_flags_bitshift(u32::MAX - 1), 16);
assert_eq!(mx_offset_to_member_offset(u32::MAX - 1), 5172);
assert_eq!(mx_offset_to_member_segment(u32::MAX), 82040);
assert_eq!(mx_offset_to_member_page(u32::MAX), 2625285);
assert_eq!(mx_offset_to_flags_offset(u32::MAX), 5160);
assert_eq!(mx_offset_to_flags_bitshift(u32::MAX), 24);
assert_eq!(mx_offset_to_member_offset(u32::MAX), 5176);
}
}

View File

@@ -179,7 +179,7 @@ pub struct FeExecuteMessage {
#[derive(Debug)]
pub struct FeCloseMessage;
/// An error occured while parsing or serializing raw stream into Postgres
/// An error occurred while parsing or serializing raw stream into Postgres
/// messages.
#[derive(thiserror::Error, Debug)]
pub enum ProtocolError {

View File

@@ -20,6 +20,7 @@ tokio = { workspace = true, features = ["sync", "fs", "io-util"] }
tokio-util.workspace = true
toml_edit.workspace = true
tracing.workspace = true
scopeguard.workspace = true
metrics.workspace = true
utils.workspace = true
pin-project-lite.workspace = true

View File

@@ -65,6 +65,10 @@ impl RemotePath {
Ok(Self(relative_path.to_path_buf()))
}
pub fn from_string(relative_path: &str) -> anyhow::Result<Self> {
Self::new(Path::new(relative_path))
}
pub fn with_base(&self, base_path: &Path) -> PathBuf {
base_path.join(&self.0)
}
@@ -190,6 +194,20 @@ pub enum GenericRemoteStorage {
}
impl GenericRemoteStorage {
// A function for listing all the files in a "directory"
// Example:
// list_files("foo/bar") = ["foo/bar/a.txt", "foo/bar/b.txt"]
pub async fn list_files(&self, folder: Option<&RemotePath>) -> anyhow::Result<Vec<RemotePath>> {
match self {
Self::LocalFs(s) => s.list_files(folder).await,
Self::AwsS3(s) => s.list_files(folder).await,
Self::Unreliable(s) => s.list_files(folder).await,
}
}
// lists common *prefixes*, if any of files
// Example:
// list_prefixes("foo123","foo567","bar123","bar432") = ["foo", "bar"]
pub async fn list_prefixes(
&self,
prefix: Option<&RemotePath>,
@@ -201,14 +219,6 @@ impl GenericRemoteStorage {
}
}
pub async fn list_files(&self, folder: Option<&RemotePath>) -> anyhow::Result<Vec<RemotePath>> {
match self {
Self::LocalFs(s) => s.list_files(folder).await,
Self::AwsS3(s) => s.list_files(folder).await,
Self::Unreliable(s) => s.list_files(folder).await,
}
}
pub async fn upload(
&self,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static,

View File

@@ -10,6 +10,7 @@ use anyhow::Context;
use aws_config::{
environment::credentials::EnvironmentVariableCredentialsProvider,
imds::credentials::ImdsCredentialsProvider, meta::credentials::CredentialsProviderChain,
provider_config::ProviderConfig, web_identity_token::WebIdentityTokenCredentialsProvider,
};
use aws_credential_types::cache::CredentialsCache;
use aws_sdk_s3::{
@@ -22,6 +23,7 @@ use aws_sdk_s3::{
};
use aws_smithy_http::body::SdkBody;
use hyper::Body;
use scopeguard::ScopeGuard;
use tokio::{
io::{self, AsyncRead},
sync::Semaphore,
@@ -36,82 +38,9 @@ use crate::{
const MAX_DELETE_OBJECTS_REQUEST_SIZE: usize = 1000;
pub(super) mod metrics {
use metrics::{register_int_counter_vec, IntCounterVec};
use once_cell::sync::Lazy;
pub(super) mod metrics;
static S3_REQUESTS_COUNT: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"remote_storage_s3_requests_count",
"Number of s3 requests of particular type",
&["request_type"],
)
.expect("failed to define a metric")
});
static S3_REQUESTS_FAIL_COUNT: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"remote_storage_s3_failures_count",
"Number of failed s3 requests of particular type",
&["request_type"],
)
.expect("failed to define a metric")
});
pub fn inc_get_object() {
S3_REQUESTS_COUNT.with_label_values(&["get_object"]).inc();
}
pub fn inc_get_object_fail() {
S3_REQUESTS_FAIL_COUNT
.with_label_values(&["get_object"])
.inc();
}
pub fn inc_put_object() {
S3_REQUESTS_COUNT.with_label_values(&["put_object"]).inc();
}
pub fn inc_put_object_fail() {
S3_REQUESTS_FAIL_COUNT
.with_label_values(&["put_object"])
.inc();
}
pub fn inc_delete_object() {
S3_REQUESTS_COUNT
.with_label_values(&["delete_object"])
.inc();
}
pub fn inc_delete_objects(count: u64) {
S3_REQUESTS_COUNT
.with_label_values(&["delete_object"])
.inc_by(count);
}
pub fn inc_delete_object_fail() {
S3_REQUESTS_FAIL_COUNT
.with_label_values(&["delete_object"])
.inc();
}
pub fn inc_delete_objects_fail(count: u64) {
S3_REQUESTS_FAIL_COUNT
.with_label_values(&["delete_object"])
.inc_by(count);
}
pub fn inc_list_objects() {
S3_REQUESTS_COUNT.with_label_values(&["list_objects"]).inc();
}
pub fn inc_list_objects_fail() {
S3_REQUESTS_FAIL_COUNT
.with_label_values(&["list_objects"])
.inc();
}
}
use self::metrics::{AttemptOutcome, RequestKind};
/// AWS S3 storage.
pub struct S3Bucket {
@@ -139,18 +68,29 @@ impl S3Bucket {
aws_config.bucket_name
);
let region = Some(Region::new(aws_config.bucket_region.clone()));
let credentials_provider = {
// uses "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"
CredentialsProviderChain::first_try(
"env",
EnvironmentVariableCredentialsProvider::new(),
)
// uses "AWS_WEB_IDENTITY_TOKEN_FILE", "AWS_ROLE_ARN", "AWS_ROLE_SESSION_NAME"
// needed to access remote extensions bucket
.or_else("token", {
let provider_conf = ProviderConfig::without_region().with_region(region.clone());
WebIdentityTokenCredentialsProvider::builder()
.configure(&provider_conf)
.build()
})
// uses imds v2
.or_else("imds", ImdsCredentialsProvider::builder().build())
};
let mut config_builder = Config::builder()
.region(Region::new(aws_config.bucket_region.clone()))
.region(region)
.credentials_cache(CredentialsCache::lazy())
.credentials_provider(credentials_provider);
@@ -200,25 +140,56 @@ impl S3Bucket {
)
}
fn relative_path_to_s3_object(&self, path: &RemotePath) -> String {
let mut full_path = self.prefix_in_bucket.clone().unwrap_or_default();
for segment in path.0.iter() {
full_path.push(REMOTE_STORAGE_PREFIX_SEPARATOR);
full_path.push_str(segment.to_str().unwrap_or_default());
pub fn relative_path_to_s3_object(&self, path: &RemotePath) -> String {
assert_eq!(std::path::MAIN_SEPARATOR, REMOTE_STORAGE_PREFIX_SEPARATOR);
let path_string = path
.get_path()
.to_string_lossy()
.trim_end_matches(REMOTE_STORAGE_PREFIX_SEPARATOR)
.to_string();
match &self.prefix_in_bucket {
Some(prefix) => prefix.clone() + "/" + &path_string,
None => path_string,
}
full_path
}
async fn download_object(&self, request: GetObjectRequest) -> Result<Download, DownloadError> {
async fn permit(&self, kind: RequestKind) -> tokio::sync::SemaphorePermit<'_> {
let started_at = start_counting_cancelled_wait(kind);
let permit = self
.concurrency_limiter
.acquire()
.await
.expect("semaphore is never closed");
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.wait_seconds
.observe_elapsed(kind, started_at);
permit
}
async fn owned_permit(&self, kind: RequestKind) -> tokio::sync::OwnedSemaphorePermit {
let started_at = start_counting_cancelled_wait(kind);
let permit = self
.concurrency_limiter
.clone()
.acquire_owned()
.await
.context("Concurrency limiter semaphore got closed during S3 download")
.map_err(DownloadError::Other)?;
.expect("semaphore is never closed");
metrics::inc_get_object();
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.wait_seconds
.observe_elapsed(kind, started_at);
permit
}
async fn download_object(&self, request: GetObjectRequest) -> Result<Download, DownloadError> {
let kind = RequestKind::Get;
let permit = self.owned_permit(kind).await;
let started_at = start_measuring_requests(kind);
let get_object = self
.client
@@ -229,26 +200,33 @@ impl S3Bucket {
.send()
.await;
let started_at = ScopeGuard::into_inner(started_at);
if get_object.is_err() {
metrics::BUCKET_METRICS.req_seconds.observe_elapsed(
kind,
AttemptOutcome::Err,
started_at,
);
}
match get_object {
Ok(object_output) => {
let metadata = object_output.metadata().cloned().map(StorageMetadata);
Ok(Download {
metadata,
download_stream: Box::pin(io::BufReader::new(RatelimitedAsyncRead::new(
permit,
object_output.body.into_async_read(),
download_stream: Box::pin(io::BufReader::new(TimedDownload::new(
started_at,
RatelimitedAsyncRead::new(permit, object_output.body.into_async_read()),
))),
})
}
Err(SdkError::ServiceError(e)) if matches!(e.err(), GetObjectError::NoSuchKey(_)) => {
Err(DownloadError::NotFound)
}
Err(e) => {
metrics::inc_get_object_fail();
Err(DownloadError::Other(anyhow::anyhow!(
"Failed to download S3 object: {e}"
)))
}
Err(e) => Err(DownloadError::Other(
anyhow::Error::new(e).context("download s3 object"),
)),
}
}
}
@@ -279,6 +257,54 @@ impl<S: AsyncRead> AsyncRead for RatelimitedAsyncRead<S> {
}
}
pin_project_lite::pin_project! {
/// Times and tracks the outcome of the request.
struct TimedDownload<S> {
started_at: std::time::Instant,
outcome: metrics::AttemptOutcome,
#[pin]
inner: S
}
impl<S> PinnedDrop for TimedDownload<S> {
fn drop(mut this: Pin<&mut Self>) {
metrics::BUCKET_METRICS.req_seconds.observe_elapsed(RequestKind::Get, this.outcome, this.started_at);
}
}
}
impl<S: AsyncRead> TimedDownload<S> {
fn new(started_at: std::time::Instant, inner: S) -> Self {
TimedDownload {
started_at,
outcome: metrics::AttemptOutcome::Cancelled,
inner,
}
}
}
impl<S: AsyncRead> AsyncRead for TimedDownload<S> {
fn poll_read(
self: std::pin::Pin<&mut Self>,
cx: &mut std::task::Context<'_>,
buf: &mut io::ReadBuf<'_>,
) -> std::task::Poll<std::io::Result<()>> {
let this = self.project();
let before = buf.filled().len();
let read = std::task::ready!(this.inner.poll_read(cx, buf));
let read_eof = buf.filled().len() == before;
match read {
Ok(()) if read_eof => *this.outcome = AttemptOutcome::Ok,
Ok(()) => { /* still in progress */ }
Err(_) => *this.outcome = AttemptOutcome::Err,
}
std::task::Poll::Ready(read)
}
}
#[async_trait::async_trait]
impl RemoteStorage for S3Bucket {
/// See the doc for `RemoteStorage::list_prefixes`
@@ -287,6 +313,8 @@ impl RemoteStorage for S3Bucket {
&self,
prefix: Option<&RemotePath>,
) -> Result<Vec<RemotePath>, DownloadError> {
let kind = RequestKind::List;
// get the passed prefix or if it is not set use prefix_in_bucket value
let list_prefix = prefix
.map(|p| self.relative_path_to_s3_object(p))
@@ -303,15 +331,10 @@ impl RemoteStorage for S3Bucket {
let mut document_keys = Vec::new();
let mut continuation_token = None;
loop {
let _guard = self
.concurrency_limiter
.acquire()
.await
.context("Concurrency limiter semaphore got closed during S3 list")
.map_err(DownloadError::Other)?;
metrics::inc_list_objects();
loop {
let _guard = self.permit(kind).await;
let started_at = start_measuring_requests(kind);
let fetch_response = self
.client
@@ -323,12 +346,16 @@ impl RemoteStorage for S3Bucket {
.set_max_keys(self.max_keys_per_list_response)
.send()
.await
.map_err(|e| {
metrics::inc_list_objects_fail();
e
})
.context("Failed to list S3 prefixes")
.map_err(DownloadError::Other)?;
.map_err(DownloadError::Other);
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &fetch_response, started_at);
let fetch_response = fetch_response?;
document_keys.extend(
fetch_response
@@ -338,10 +365,10 @@ impl RemoteStorage for S3Bucket {
.filter_map(|o| Some(self.s3_object_to_relative_path(o.prefix()?))),
);
match fetch_response.next_continuation_token {
Some(new_token) => continuation_token = Some(new_token),
continuation_token = match fetch_response.next_continuation_token {
Some(new_token) => Some(new_token),
None => break,
}
};
}
Ok(document_keys)
@@ -349,6 +376,8 @@ impl RemoteStorage for S3Bucket {
/// See the doc for `RemoteStorage::list_files`
async fn list_files(&self, folder: Option<&RemotePath>) -> anyhow::Result<Vec<RemotePath>> {
let kind = RequestKind::List;
let folder_name = folder
.map(|p| self.relative_path_to_s3_object(p))
.or_else(|| self.prefix_in_bucket.clone());
@@ -357,12 +386,8 @@ impl RemoteStorage for S3Bucket {
let mut continuation_token = None;
let mut all_files = vec![];
loop {
let _guard = self
.concurrency_limiter
.acquire()
.await
.context("Concurrency limiter semaphore got closed during S3 list_files")?;
metrics::inc_list_objects();
let _guard = self.permit(kind).await;
let started_at = start_measuring_requests(kind);
let response = self
.client
@@ -373,11 +398,14 @@ impl RemoteStorage for S3Bucket {
.set_max_keys(self.max_keys_per_list_response)
.send()
.await
.map_err(|e| {
metrics::inc_list_objects_fail();
e
})
.context("Failed to list files in S3 bucket")?;
.context("Failed to list files in S3 bucket");
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &response, started_at);
let response = response?;
for object in response.contents().unwrap_or_default() {
let object_path = object.key().expect("response does not contain a key");
@@ -399,18 +427,16 @@ impl RemoteStorage for S3Bucket {
to: &RemotePath,
metadata: Option<StorageMetadata>,
) -> anyhow::Result<()> {
let _guard = self
.concurrency_limiter
.acquire()
.await
.context("Concurrency limiter semaphore got closed during S3 upload")?;
let kind = RequestKind::Put;
let _guard = self.permit(kind).await;
metrics::inc_put_object();
let started_at = start_measuring_requests(kind);
let body = Body::wrap_stream(ReaderStream::new(from));
let bytes_stream = ByteStream::new(SdkBody::from(body));
self.client
let res = self
.client
.put_object()
.bucket(self.bucket_name.clone())
.key(self.relative_path_to_s3_object(to))
@@ -418,19 +444,25 @@ impl RemoteStorage for S3Bucket {
.content_length(from_size_bytes.try_into()?)
.body(bytes_stream)
.send()
.await
.map_err(|e| {
metrics::inc_put_object_fail();
e
})?;
.await;
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &res, started_at);
res?;
Ok(())
}
async fn download(&self, from: &RemotePath) -> Result<Download, DownloadError> {
// if prefix is not none then download file `prefix/from`
// if prefix is none then download file `from`
self.download_object(GetObjectRequest {
bucket: self.bucket_name.clone(),
key: self.relative_path_to_s3_object(from),
..GetObjectRequest::default()
range: None,
})
.await
}
@@ -457,11 +489,8 @@ impl RemoteStorage for S3Bucket {
.await
}
async fn delete_objects<'a>(&self, paths: &'a [RemotePath]) -> anyhow::Result<()> {
let _guard = self
.concurrency_limiter
.acquire()
.await
.context("Concurrency limiter semaphore got closed during S3 delete")?;
let kind = RequestKind::Delete;
let _guard = self.permit(kind).await;
let mut delete_objects = Vec::with_capacity(paths.len());
for path in paths {
@@ -472,7 +501,7 @@ impl RemoteStorage for S3Bucket {
}
for chunk in delete_objects.chunks(MAX_DELETE_OBJECTS_REQUEST_SIZE) {
metrics::inc_delete_objects(chunk.len() as u64);
let started_at = start_measuring_requests(kind);
let resp = self
.client
@@ -482,10 +511,17 @@ impl RemoteStorage for S3Bucket {
.send()
.await;
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &resp, started_at);
match resp {
Ok(resp) => {
metrics::BUCKET_METRICS
.deleted_objects_total
.inc_by(chunk.len() as u64);
if let Some(errors) = resp.errors {
metrics::inc_delete_objects_fail(errors.len() as u64);
return Err(anyhow::format_err!(
"Failed to delete {} objects",
errors.len()
@@ -493,7 +529,6 @@ impl RemoteStorage for S3Bucket {
}
}
Err(e) => {
metrics::inc_delete_objects_fail(chunk.len() as u64);
return Err(e.into());
}
}
@@ -502,24 +537,89 @@ impl RemoteStorage for S3Bucket {
}
async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {
let _guard = self
.concurrency_limiter
.acquire()
.await
.context("Concurrency limiter semaphore got closed during S3 delete")?;
metrics::inc_delete_object();
self.client
.delete_object()
.bucket(self.bucket_name.clone())
.key(self.relative_path_to_s3_object(path))
.send()
.await
.map_err(|e| {
metrics::inc_delete_object_fail();
e
})?;
Ok(())
let paths = std::array::from_ref(path);
self.delete_objects(paths).await
}
}
/// On drop (cancellation) count towards [`metrics::BucketMetrics::cancelled_waits`].
fn start_counting_cancelled_wait(
kind: RequestKind,
) -> ScopeGuard<std::time::Instant, impl FnOnce(std::time::Instant), scopeguard::OnSuccess> {
scopeguard::guard_on_success(std::time::Instant::now(), move |_| {
metrics::BUCKET_METRICS.cancelled_waits.get(kind).inc()
})
}
/// On drop (cancellation) add time to [`metrics::BucketMetrics::req_seconds`].
fn start_measuring_requests(
kind: RequestKind,
) -> ScopeGuard<std::time::Instant, impl FnOnce(std::time::Instant), scopeguard::OnSuccess> {
scopeguard::guard_on_success(std::time::Instant::now(), move |started_at| {
metrics::BUCKET_METRICS.req_seconds.observe_elapsed(
kind,
AttemptOutcome::Cancelled,
started_at,
)
})
}
#[cfg(test)]
mod tests {
use std::num::NonZeroUsize;
use std::path::Path;
use crate::{RemotePath, S3Bucket, S3Config};
#[test]
fn relative_path() {
let all_paths = vec!["", "some/path", "some/path/"];
let all_paths: Vec<RemotePath> = all_paths
.iter()
.map(|x| RemotePath::new(Path::new(x)).expect("bad path"))
.collect();
let prefixes = [
None,
Some(""),
Some("test/prefix"),
Some("test/prefix/"),
Some("/test/prefix/"),
];
let expected_outputs = vec![
vec!["", "some/path", "some/path"],
vec!["/", "/some/path", "/some/path"],
vec![
"test/prefix/",
"test/prefix/some/path",
"test/prefix/some/path",
],
vec![
"test/prefix/",
"test/prefix/some/path",
"test/prefix/some/path",
],
vec![
"test/prefix/",
"test/prefix/some/path",
"test/prefix/some/path",
],
];
for (prefix_idx, prefix) in prefixes.iter().enumerate() {
let config = S3Config {
bucket_name: "bucket".to_owned(),
bucket_region: "region".to_owned(),
prefix_in_bucket: prefix.map(str::to_string),
endpoint: None,
concurrency_limit: NonZeroUsize::new(100).unwrap(),
max_keys_per_list_response: Some(5),
};
let storage = S3Bucket::new(&config).expect("remote storage init");
for (test_path_idx, test_path) in all_paths.iter().enumerate() {
let result = storage.relative_path_to_s3_object(test_path);
let expected = expected_outputs[prefix_idx][test_path_idx];
assert_eq!(result, expected);
}
}
}
}

View File

@@ -0,0 +1,191 @@
use metrics::{
register_histogram_vec, register_int_counter, register_int_counter_vec, Histogram, IntCounter,
};
use once_cell::sync::Lazy;
pub(super) static BUCKET_METRICS: Lazy<BucketMetrics> = Lazy::new(Default::default);
#[derive(Clone, Copy, Debug)]
pub(super) enum RequestKind {
Get = 0,
Put = 1,
Delete = 2,
List = 3,
}
use RequestKind::*;
impl RequestKind {
const fn as_str(&self) -> &'static str {
match self {
Get => "get_object",
Put => "put_object",
Delete => "delete_object",
List => "list_objects",
}
}
const fn as_index(&self) -> usize {
*self as usize
}
}
pub(super) struct RequestTyped<C>([C; 4]);
impl<C> RequestTyped<C> {
pub(super) fn get(&self, kind: RequestKind) -> &C {
&self.0[kind.as_index()]
}
fn build_with(mut f: impl FnMut(RequestKind) -> C) -> Self {
use RequestKind::*;
let mut it = [Get, Put, Delete, List].into_iter();
let arr = std::array::from_fn::<C, 4, _>(|index| {
let next = it.next().unwrap();
assert_eq!(index, next.as_index());
f(next)
});
if let Some(next) = it.next() {
panic!("unexpected {next:?}");
}
RequestTyped(arr)
}
}
impl RequestTyped<Histogram> {
pub(super) fn observe_elapsed(&self, kind: RequestKind, started_at: std::time::Instant) {
self.get(kind).observe(started_at.elapsed().as_secs_f64())
}
}
pub(super) struct PassFailCancelledRequestTyped<C> {
success: RequestTyped<C>,
fail: RequestTyped<C>,
cancelled: RequestTyped<C>,
}
#[derive(Debug, Clone, Copy)]
pub(super) enum AttemptOutcome {
Ok,
Err,
Cancelled,
}
impl<T, E> From<&Result<T, E>> for AttemptOutcome {
fn from(value: &Result<T, E>) -> Self {
match value {
Ok(_) => AttemptOutcome::Ok,
Err(_) => AttemptOutcome::Err,
}
}
}
impl AttemptOutcome {
pub(super) fn as_str(&self) -> &'static str {
match self {
AttemptOutcome::Ok => "ok",
AttemptOutcome::Err => "err",
AttemptOutcome::Cancelled => "cancelled",
}
}
}
impl<C> PassFailCancelledRequestTyped<C> {
pub(super) fn get(&self, kind: RequestKind, outcome: AttemptOutcome) -> &C {
let target = match outcome {
AttemptOutcome::Ok => &self.success,
AttemptOutcome::Err => &self.fail,
AttemptOutcome::Cancelled => &self.cancelled,
};
target.get(kind)
}
fn build_with(mut f: impl FnMut(RequestKind, AttemptOutcome) -> C) -> Self {
let success = RequestTyped::build_with(|kind| f(kind, AttemptOutcome::Ok));
let fail = RequestTyped::build_with(|kind| f(kind, AttemptOutcome::Err));
let cancelled = RequestTyped::build_with(|kind| f(kind, AttemptOutcome::Cancelled));
PassFailCancelledRequestTyped {
success,
fail,
cancelled,
}
}
}
impl PassFailCancelledRequestTyped<Histogram> {
pub(super) fn observe_elapsed(
&self,
kind: RequestKind,
outcome: impl Into<AttemptOutcome>,
started_at: std::time::Instant,
) {
self.get(kind, outcome.into())
.observe(started_at.elapsed().as_secs_f64())
}
}
pub(super) struct BucketMetrics {
/// Full request duration until successful completion, error or cancellation.
pub(super) req_seconds: PassFailCancelledRequestTyped<Histogram>,
/// Total amount of seconds waited on queue.
pub(super) wait_seconds: RequestTyped<Histogram>,
/// Track how many semaphore awaits were cancelled per request type.
///
/// This is in case cancellations are happening more than expected.
pub(super) cancelled_waits: RequestTyped<IntCounter>,
/// Total amount of deleted objects in batches or single requests.
pub(super) deleted_objects_total: IntCounter,
}
impl Default for BucketMetrics {
fn default() -> Self {
let buckets = [0.01, 0.10, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0];
let req_seconds = register_histogram_vec!(
"remote_storage_s3_request_seconds",
"Seconds to complete a request",
&["request_type", "result"],
buckets.to_vec(),
)
.unwrap();
let req_seconds = PassFailCancelledRequestTyped::build_with(|kind, outcome| {
req_seconds.with_label_values(&[kind.as_str(), outcome.as_str()])
});
let wait_seconds = register_histogram_vec!(
"remote_storage_s3_wait_seconds",
"Seconds rate limited",
&["request_type"],
buckets.to_vec(),
)
.unwrap();
let wait_seconds =
RequestTyped::build_with(|kind| wait_seconds.with_label_values(&[kind.as_str()]));
let cancelled_waits = register_int_counter_vec!(
"remote_storage_s3_cancelled_waits_total",
"Times a semaphore wait has been cancelled per request type",
&["request_type"],
)
.unwrap();
let cancelled_waits =
RequestTyped::build_with(|kind| cancelled_waits.with_label_values(&[kind.as_str()]));
let deleted_objects_total = register_int_counter!(
"remote_storage_s3_deleted_objects_total",
"Amount of deleted objects in total",
)
.unwrap();
Self {
req_seconds,
wait_seconds,
cancelled_waits,
deleted_objects_total,
}
}
}

View File

@@ -71,6 +71,13 @@ impl UnreliableWrapper {
}
}
}
async fn delete_inner(&self, path: &RemotePath, attempt: bool) -> anyhow::Result<()> {
if attempt {
self.attempt(RemoteOp::Delete(path.clone()))?;
}
self.inner.delete(path).await
}
}
#[async_trait::async_trait]
@@ -122,15 +129,15 @@ impl RemoteStorage for UnreliableWrapper {
}
async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {
self.attempt(RemoteOp::Delete(path.clone()))?;
self.inner.delete(path).await
self.delete_inner(path, true).await
}
async fn delete_objects<'a>(&self, paths: &'a [RemotePath]) -> anyhow::Result<()> {
self.attempt(RemoteOp::DeleteObjects(paths.to_vec()))?;
let mut error_counter = 0;
for path in paths {
if (self.delete(path).await).is_err() {
// Dont record attempt because it was already recorded above
if (self.delete_inner(path, false).await).is_err() {
error_counter += 1;
}
}

View File

@@ -19,7 +19,7 @@ static LOGGING_DONE: OnceCell<()> = OnceCell::new();
const ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME: &str = "ENABLE_REAL_S3_REMOTE_STORAGE";
const BASE_PREFIX: &str = "test/";
const BASE_PREFIX: &str = "test";
/// Tests that S3 client can list all prefixes, even if the response come paginated and requires multiple S3 queries.
/// Uses real S3 and requires [`ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME`] and related S3 cred env vars specified.

View File

@@ -31,6 +31,8 @@ fn lsn_invalid() -> Lsn {
#[serde_as]
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct SkTimelineInfo {
/// Term.
pub term: Option<u64>,
/// Term of the last entry.
pub last_log_term: Option<u64>,
/// LSN of the last record.
@@ -58,4 +60,6 @@ pub struct SkTimelineInfo {
/// A connection string to use for WAL receiving.
#[serde(default)]
pub safekeeper_connstr: Option<String>,
#[serde(default)]
pub http_connstr: Option<String>,
}

View File

@@ -26,6 +26,7 @@ serde_json.workspace = true
signal-hook.workspace = true
thiserror.workspace = true
tokio.workspace = true
tokio-util.workspace = true
tracing.workspace = true
tracing-error.workspace = true
tracing-subscriber = { workspace = true, features = ["json", "registry"] }
@@ -37,11 +38,16 @@ url.workspace = true
uuid.workspace = true
pq_proto.workspace = true
postgres_connection.workspace = true
metrics.workspace = true
workspace_hack.workspace = true
const_format.workspace = true
# to use tokio channels as streams, this is faster to compile than async_stream
# why is it only here? no other crate should use it, streams are rarely needed.
tokio-stream = { version = "0.1.14" }
[dev-dependencies]
byteorder.workspace = true
bytes.workspace = true

234
libs/utils/src/backoff.rs Normal file
View File

@@ -0,0 +1,234 @@
use std::fmt::{Debug, Display};
use futures::Future;
use tokio_util::sync::CancellationToken;
pub const DEFAULT_BASE_BACKOFF_SECONDS: f64 = 0.1;
pub const DEFAULT_MAX_BACKOFF_SECONDS: f64 = 3.0;
pub async fn exponential_backoff(
n: u32,
base_increment: f64,
max_seconds: f64,
cancel: &CancellationToken,
) {
let backoff_duration_seconds =
exponential_backoff_duration_seconds(n, base_increment, max_seconds);
if backoff_duration_seconds > 0.0 {
tracing::info!(
"Backoff: waiting {backoff_duration_seconds} seconds before processing with the task",
);
drop(
tokio::time::timeout(
std::time::Duration::from_secs_f64(backoff_duration_seconds),
cancel.cancelled(),
)
.await,
)
}
}
pub fn exponential_backoff_duration_seconds(n: u32, base_increment: f64, max_seconds: f64) -> f64 {
if n == 0 {
0.0
} else {
(1.0 + base_increment).powf(f64::from(n)).min(max_seconds)
}
}
/// Configure cancellation for a retried operation: when to cancel (the token), and
/// what kind of error to return on cancellation
pub struct Cancel<E, CF>
where
E: Display + Debug + 'static,
CF: Fn() -> E,
{
token: CancellationToken,
on_cancel: CF,
}
impl<E, CF> Cancel<E, CF>
where
E: Display + Debug + 'static,
CF: Fn() -> E,
{
pub fn new(token: CancellationToken, on_cancel: CF) -> Self {
Self { token, on_cancel }
}
}
/// retries passed operation until one of the following conditions are met:
/// Encountered error is considered as permanent (non-retryable)
/// Retries have been exhausted.
/// `is_permanent` closure should be used to provide distinction between permanent/non-permanent errors
/// When attempts cross `warn_threshold` function starts to emit log warnings.
/// `description` argument is added to log messages. Its value should identify the `op` is doing
/// `cancel` argument is required: any time we are looping on retry, we should be using a CancellationToken
/// to drop out promptly on shutdown.
pub async fn retry<T, O, F, E, CF>(
mut op: O,
is_permanent: impl Fn(&E) -> bool,
warn_threshold: u32,
max_retries: u32,
description: &str,
cancel: Cancel<E, CF>,
) -> Result<T, E>
where
// Not std::error::Error because anyhow::Error doesnt implement it.
// For context see https://github.com/dtolnay/anyhow/issues/63
E: Display + Debug + 'static,
O: FnMut() -> F,
F: Future<Output = Result<T, E>>,
CF: Fn() -> E,
{
let mut attempts = 0;
loop {
if cancel.token.is_cancelled() {
return Err((cancel.on_cancel)());
}
let result = op().await;
match result {
Ok(_) => {
if attempts > 0 {
tracing::info!("{description} succeeded after {attempts} retries");
}
return result;
}
// These are "permanent" errors that should not be retried.
Err(ref e) if is_permanent(e) => {
return result;
}
// Assume that any other failure might be transient, and the operation might
// succeed if we just keep trying.
Err(err) if attempts < warn_threshold => {
tracing::info!("{description} failed, will retry (attempt {attempts}): {err:#}");
}
Err(err) if attempts < max_retries => {
tracing::warn!("{description} failed, will retry (attempt {attempts}): {err:#}");
}
Err(ref err) => {
// Operation failed `max_attempts` times. Time to give up.
tracing::warn!(
"{description} still failed after {attempts} retries, giving up: {err:?}"
);
return result;
}
}
// sleep and retry
exponential_backoff(
attempts,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
&cancel.token,
)
.await;
attempts += 1;
}
}
#[cfg(test)]
mod tests {
use std::io;
use tokio::sync::Mutex;
use super::*;
#[test]
fn backoff_defaults_produce_growing_backoff_sequence() {
let mut current_backoff_value = None;
for i in 0..10_000 {
let new_backoff_value = exponential_backoff_duration_seconds(
i,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
);
if let Some(old_backoff_value) = current_backoff_value.replace(new_backoff_value) {
assert!(
old_backoff_value <= new_backoff_value,
"{i}th backoff value {new_backoff_value} is smaller than the previous one {old_backoff_value}"
)
}
}
assert_eq!(
current_backoff_value.expect("Should have produced backoff values to compare"),
DEFAULT_MAX_BACKOFF_SECONDS,
"Given big enough of retries, backoff should reach its allowed max value"
);
}
#[tokio::test(start_paused = true)]
async fn retry_always_error() {
let count = Mutex::new(0);
let err_result = retry(
|| async {
*count.lock().await += 1;
Result::<(), io::Error>::Err(io::Error::from(io::ErrorKind::Other))
},
|_e| false,
1,
1,
"work",
Cancel::new(CancellationToken::new(), || -> io::Error { unreachable!() }),
)
.await;
assert!(err_result.is_err());
assert_eq!(*count.lock().await, 2);
}
#[tokio::test(start_paused = true)]
async fn retry_ok_after_err() {
let count = Mutex::new(0);
retry(
|| async {
let mut locked = count.lock().await;
if *locked > 1 {
Ok(())
} else {
*locked += 1;
Err(io::Error::from(io::ErrorKind::Other))
}
},
|_e| false,
2,
2,
"work",
Cancel::new(CancellationToken::new(), || -> io::Error { unreachable!() }),
)
.await
.unwrap();
}
#[tokio::test(start_paused = true)]
async fn dont_retry_permanent_errors() {
let count = Mutex::new(0);
let _ = retry(
|| async {
let mut locked = count.lock().await;
if *locked > 1 {
Ok(())
} else {
*locked += 1;
Err(io::Error::from(io::ErrorKind::Other))
}
},
|_e| true,
2,
2,
"work",
Cancel::new(CancellationToken::new(), || -> io::Error { unreachable!() }),
)
.await
.unwrap_err();
assert_eq!(*count.lock().await, 1);
}
}

View File

@@ -111,6 +111,10 @@ pub fn fsync(path: &Path) -> io::Result<()> {
.map_err(|e| io::Error::new(e.kind(), format!("Failed to fsync file {path:?}: {e}")))
}
pub async fn fsync_async(path: impl AsRef<std::path::Path>) -> Result<(), std::io::Error> {
tokio::fs::File::open(path).await?.sync_all().await
}
#[cfg(test)]
mod tests {
use tempfile::tempdir;

View File

@@ -24,11 +24,42 @@ pub async fn is_directory_empty(path: impl AsRef<Path>) -> anyhow::Result<bool>
Ok(dir.next_entry().await?.is_none())
}
pub async fn list_dir(path: impl AsRef<Path>) -> anyhow::Result<Vec<String>> {
let mut dir = tokio::fs::read_dir(&path)
.await
.context(format!("read_dir({})", path.as_ref().display()))?;
let mut content = vec![];
while let Some(next) = dir.next_entry().await? {
let file_name = next.file_name();
content.push(file_name.to_string_lossy().to_string());
}
Ok(content)
}
pub fn ignore_not_found(e: io::Error) -> io::Result<()> {
if e.kind() == io::ErrorKind::NotFound {
Ok(())
} else {
Err(e)
}
}
pub fn ignore_absent_files<F>(fs_operation: F) -> io::Result<()>
where
F: Fn() -> io::Result<()>,
{
fs_operation().or_else(ignore_not_found)
}
#[cfg(test)]
mod test {
use std::path::PathBuf;
use crate::fs_ext::is_directory_empty;
use crate::fs_ext::{is_directory_empty, list_dir};
use super::ignore_absent_files;
#[test]
fn is_empty_dir() {
@@ -75,4 +106,42 @@ mod test {
std::fs::remove_file(&file_path).unwrap();
assert!(is_directory_empty(file_path).await.is_err());
}
#[test]
fn ignore_absent_files_works() {
let dir = tempfile::tempdir().unwrap();
let dir_path = dir.path();
let file_path: PathBuf = dir_path.join("testfile");
ignore_absent_files(|| std::fs::remove_file(&file_path)).expect("should execute normally");
let f = std::fs::File::create(&file_path).unwrap();
drop(f);
ignore_absent_files(|| std::fs::remove_file(&file_path)).expect("should execute normally");
assert!(!file_path.exists());
}
#[tokio::test]
async fn list_dir_works() {
let dir = tempfile::tempdir().unwrap();
let dir_path = dir.path();
assert!(list_dir(dir_path).await.unwrap().is_empty());
let file_path: PathBuf = dir_path.join("testfile");
let _ = std::fs::File::create(&file_path).unwrap();
assert_eq!(&list_dir(dir_path).await.unwrap(), &["testfile"]);
let another_dir_path: PathBuf = dir_path.join("testdir");
std::fs::create_dir(another_dir_path).unwrap();
let expected = &["testdir", "testfile"];
let mut actual = list_dir(dir_path).await.unwrap();
actual.sort();
assert_eq!(actual, expected);
}
}

View File

@@ -0,0 +1,113 @@
use std::fmt::Debug;
use serde::{Deserialize, Serialize};
/// Tenant generations are used to provide split-brain safety and allow
/// multiple pageservers to attach the same tenant concurrently.
///
/// See docs/rfcs/025-generation-numbers.md for detail on how generation
/// numbers are used.
#[derive(Copy, Clone, Eq, PartialEq, PartialOrd, Ord)]
pub enum Generation {
// Generations with this magic value will not add a suffix to S3 keys, and will not
// be included in persisted index_part.json. This value is only to be used
// during migration from pre-generation metadata to generation-aware metadata,
// and should eventually go away.
//
// A special Generation is used rather than always wrapping Generation in an Option,
// so that code handling generations doesn't have to be aware of the legacy
// case everywhere it touches a generation.
None,
// Generations with this magic value may never be used to construct S3 keys:
// we will panic if someone tries to. This is for Tenants in the "Broken" state,
// so that we can satisfy their constructor with a Generation without risking
// a code bug using it in an S3 write (broken tenants should never write)
Broken,
Valid(u32),
}
/// The Generation type represents a number associated with a Tenant, which
/// increments every time the tenant is attached to a new pageserver, or
/// an attached pageserver restarts.
///
/// It is included as a suffix in S3 keys, as a protection against split-brain
/// scenarios where pageservers might otherwise issue conflicting writes to
/// remote storage
impl Generation {
/// Create a new Generation that represents a legacy key format with
/// no generation suffix
pub fn none() -> Self {
Self::None
}
// Create a new generation that will panic if you try to use get_suffix
pub fn broken() -> Self {
Self::Broken
}
pub fn new(v: u32) -> Self {
Self::Valid(v)
}
pub fn is_none(&self) -> bool {
matches!(self, Self::None)
}
pub fn get_suffix(&self) -> String {
match self {
Self::Valid(v) => {
format!("-{:08x}", v)
}
Self::None => "".into(),
Self::Broken => {
panic!("Tried to use a broken generation");
}
}
}
}
impl Serialize for Generation {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
if let Self::Valid(v) = self {
v.serialize(serializer)
} else {
// We should never be asked to serialize a None or Broken. Structures
// that include an optional generation should convert None to an
// Option<Generation>::None
Err(serde::ser::Error::custom(
"Tried to serialize invalid generation ({self})",
))
}
}
}
impl<'de> Deserialize<'de> for Generation {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
Ok(Self::Valid(u32::deserialize(deserializer)?))
}
}
// We intentionally do not implement Display for Generation, to reduce the
// risk of a bug where the generation is used in a format!() string directly
// instead of using get_suffix().
impl Debug for Generation {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::Valid(v) => {
write!(f, "{:08x}", v)
}
Self::None => {
write!(f, "<none>")
}
Self::Broken => {
write!(f, "<broken>")
}
}
}
}

View File

@@ -9,7 +9,6 @@ use metrics::{register_int_counter, Encoder, IntCounter, TextEncoder};
use once_cell::sync::Lazy;
use routerify::ext::RequestExt;
use routerify::{Middleware, RequestInfo, Router, RouterBuilder};
use tokio::task::JoinError;
use tracing::{self, debug, info, info_span, warn, Instrument};
use std::future::Future;
@@ -148,26 +147,140 @@ impl Drop for RequestCancelled {
}
async fn prometheus_metrics_handler(_req: Request<Body>) -> Result<Response<Body>, ApiError> {
use bytes::{Bytes, BytesMut};
use std::io::Write as _;
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;
SERVE_METRICS_COUNT.inc();
let mut buffer = vec![];
let encoder = TextEncoder::new();
/// An [`std::io::Write`] implementation on top of a channel sending [`bytes::Bytes`] chunks.
struct ChannelWriter {
buffer: BytesMut,
tx: mpsc::Sender<std::io::Result<Bytes>>,
written: usize,
}
let metrics = tokio::task::spawn_blocking(move || {
// Currently we take a lot of mutexes while collecting metrics, so it's
// better to spawn a blocking task to avoid blocking the event loop.
metrics::gather()
})
.await
.map_err(|e: JoinError| ApiError::InternalServerError(e.into()))?;
encoder.encode(&metrics, &mut buffer).unwrap();
impl ChannelWriter {
fn new(buf_len: usize, tx: mpsc::Sender<std::io::Result<Bytes>>) -> Self {
assert_ne!(buf_len, 0);
ChannelWriter {
// split about half off the buffer from the start, because we flush depending on
// capacity. first flush will come sooner than without this, but now resizes will
// have better chance of picking up the "other" half. not guaranteed of course.
buffer: BytesMut::with_capacity(buf_len).split_off(buf_len / 2),
tx,
written: 0,
}
}
fn flush0(&mut self) -> std::io::Result<usize> {
let n = self.buffer.len();
if n == 0 {
return Ok(0);
}
tracing::trace!(n, "flushing");
let ready = self.buffer.split().freeze();
// not ideal to call from blocking code to block_on, but we are sure that this
// operation does not spawn_blocking other tasks
let res: Result<(), ()> = tokio::runtime::Handle::current().block_on(async {
self.tx.send(Ok(ready)).await.map_err(|_| ())?;
// throttle sending to allow reuse of our buffer in `write`.
self.tx.reserve().await.map_err(|_| ())?;
// now the response task has picked up the buffer and hopefully started
// sending it to the client.
Ok(())
});
if res.is_err() {
return Err(std::io::ErrorKind::BrokenPipe.into());
}
self.written += n;
Ok(n)
}
fn flushed_bytes(&self) -> usize {
self.written
}
}
impl std::io::Write for ChannelWriter {
fn write(&mut self, mut buf: &[u8]) -> std::io::Result<usize> {
let remaining = self.buffer.capacity() - self.buffer.len();
let out_of_space = remaining < buf.len();
let original_len = buf.len();
if out_of_space {
let can_still_fit = buf.len() - remaining;
self.buffer.extend_from_slice(&buf[..can_still_fit]);
buf = &buf[can_still_fit..];
self.flush0()?;
}
// assume that this will often under normal operation just move the pointer back to the
// beginning of allocation, because previous split off parts are already sent and
// dropped.
self.buffer.extend_from_slice(buf);
Ok(original_len)
}
fn flush(&mut self) -> std::io::Result<()> {
self.flush0().map(|_| ())
}
}
let started_at = std::time::Instant::now();
let (tx, rx) = mpsc::channel(1);
let body = Body::wrap_stream(ReceiverStream::new(rx));
let mut writer = ChannelWriter::new(128 * 1024, tx);
let encoder = TextEncoder::new();
let response = Response::builder()
.status(200)
.header(CONTENT_TYPE, encoder.format_type())
.body(Body::from(buffer))
.body(body)
.unwrap();
let span = info_span!("blocking");
tokio::task::spawn_blocking(move || {
let _span = span.entered();
let metrics = metrics::gather();
let res = encoder
.encode(&metrics, &mut writer)
.and_then(|_| writer.flush().map_err(|e| e.into()));
match res {
Ok(()) => {
tracing::info!(
bytes = writer.flushed_bytes(),
elapsed_ms = started_at.elapsed().as_millis(),
"responded /metrics"
);
}
Err(e) => {
tracing::warn!("failed to write out /metrics response: {e:#}");
// semantics of this error are quite... unclear. we want to error the stream out to
// abort the response to somehow notify the client that we failed.
//
// though, most likely the reason for failure is that the receiver is already gone.
drop(
writer
.tx
.blocking_send(Err(std::io::ErrorKind::BrokenPipe.into())),
);
}
}
});
Ok(response)
}

View File

@@ -1,5 +1,7 @@
use std::ffi::OsStr;
use std::{fmt, str::FromStr};
use anyhow::Context;
use hex::FromHex;
use rand::Rng;
use serde::{Deserialize, Serialize};
@@ -213,6 +215,18 @@ pub struct TimelineId(Id);
id_newtype!(TimelineId);
impl TryFrom<Option<&OsStr>> for TimelineId {
type Error = anyhow::Error;
fn try_from(value: Option<&OsStr>) -> Result<Self, Self::Error> {
value
.and_then(OsStr::to_str)
.unwrap_or_default()
.parse::<TimelineId>()
.with_context(|| format!("Could not parse timeline id from {:?}", value))
}
}
/// Neon Tenant Id represents identifiar of a particular tenant.
/// Is used for distinguishing requests and data belonging to different users.
///

View File

@@ -1,6 +1,8 @@
//! `utils` is intended to be a place to put code that is shared
//! between other crates in this repository.
pub mod backoff;
/// `Lsn` type implements common tasks on Log Sequence Numbers
pub mod lsn;
/// SeqWait allows waiting for a future sequence number to arrive
@@ -25,6 +27,9 @@ pub mod id;
// http endpoint utils
pub mod http;
// definition of the Generation type for pageserver attachment APIs
pub mod generation;
// common log initialisation routine
pub mod logging;
@@ -56,6 +61,8 @@ pub mod serde_regex;
pub mod pageserver_feedback;
pub mod postgres_client;
pub mod tracing_span_assert;
pub mod rate_limit;
@@ -66,44 +73,6 @@ pub mod completion;
/// Reporting utilities
pub mod error;
mod failpoint_macro_helpers {
/// use with fail::cfg("$name", "return(2000)")
///
/// The effect is similar to a "sleep(2000)" action, i.e. we sleep for the
/// specified time (in milliseconds). The main difference is that we use async
/// tokio sleep function. Another difference is that we print lines to the log,
/// which can be useful in tests to check that the failpoint was hit.
#[macro_export]
macro_rules! failpoint_sleep_millis_async {
($name:literal) => {{
// If the failpoint is used with a "return" action, set should_sleep to the
// returned value (as string). Otherwise it's set to None.
let should_sleep = (|| {
::fail::fail_point!($name, |x| x);
::std::option::Option::None
})();
// Sleep if the action was a returned value
if let ::std::option::Option::Some(duration_str) = should_sleep {
$crate::failpoint_sleep_helper($name, duration_str).await
}
}};
}
// Helper function used by the macro. (A function has nicer scoping so we
// don't need to decorate everything with "::")
pub async fn failpoint_sleep_helper(name: &'static str, duration_str: String) {
let millis = duration_str.parse::<u64>().unwrap();
let d = std::time::Duration::from_millis(millis);
tracing::info!("failpoint {:?}: sleeping for {:?}", name, d);
tokio::time::sleep(d).await;
tracing::info!("failpoint {:?}: sleep done", name);
}
}
pub use failpoint_macro_helpers::failpoint_sleep_helper;
/// This is a shortcut to embed git sha into binaries and avoid copying the same build script to all packages
///
/// we have several cases:

View File

@@ -0,0 +1,37 @@
//! Postgres client connection code common to other crates (safekeeper and
//! pageserver) which depends on tenant/timeline ids and thus not fitting into
//! postgres_connection crate.
use anyhow::Context;
use postgres_connection::{parse_host_port, PgConnectionConfig};
use crate::id::TenantTimelineId;
/// Create client config for fetching WAL from safekeeper on particular timeline.
/// listen_pg_addr_str is in form host:\[port\].
pub fn wal_stream_connection_config(
TenantTimelineId {
tenant_id,
timeline_id,
}: TenantTimelineId,
listen_pg_addr_str: &str,
auth_token: Option<&str>,
availability_zone: Option<&str>,
) -> anyhow::Result<PgConnectionConfig> {
let (host, port) =
parse_host_port(listen_pg_addr_str).context("Unable to parse listen_pg_addr_str")?;
let port = port.unwrap_or(5432);
let mut connstr = PgConnectionConfig::new_host_port(host, port)
.extend_options([
"-c".to_owned(),
format!("timeline_id={}", timeline_id),
format!("tenant_id={}", tenant_id),
])
.set_password(auth_token.map(|s| s.to_owned()));
if let Some(availability_zone) = availability_zone {
connstr = connstr.extend_options([format!("availability_zone={}", availability_zone)]);
}
Ok(connstr)
}

View File

@@ -0,0 +1,31 @@
[package]
name = "vm_monitor"
version = "0.1.0"
edition.workspace = true
license.workspace = true
[[bin]]
name = "vm-monitor"
path = "./src/bin/monitor.rs"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
anyhow.workspace = true
axum.workspace = true
clap.workspace = true
futures.workspace = true
inotify.workspace = true
serde.workspace = true
serde_json.workspace = true
sysinfo.workspace = true
tokio.workspace = true
tokio-postgres.workspace = true
tokio-stream.workspace = true
tokio-util.workspace = true
tracing.workspace = true
tracing-subscriber.workspace = true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
[target.'cfg(target_os = "linux")'.dependencies]
cgroups-rs = "0.3.3"

34
libs/vm_monitor/README.md Normal file
View File

@@ -0,0 +1,34 @@
# `vm-monitor`
The `vm-monitor` (or just monitor) is a core component of the autoscaling system,
along with the `autoscale-scheduler` and the `autoscaler-agent`s. The monitor has
two primary roles: 1) notifying agents when immediate upscaling is necessary due
to memory conditions and 2) managing Postgres' file cache and a cgroup to carry
out upscaling and downscaling decisions.
## More on scaling
We scale CPU and memory using NeonVM, our in-house QEMU tool for use with Kubernetes.
To control thresholds for receiving memory usage notifications, we start Postgres
in the `neon-postgres` cgroup and set its `memory.{max,high}`.
* See also: [`neondatabase/autoscaling`](https://github.com/neondatabase/autoscaling/)
* See also: [`neondatabase/vm-monitor`](https://github.com/neondatabase/vm-monitor/),
where initial development of the monitor happened. The repository is no longer
maintained but the commit history may be useful for debugging.
## Structure
The `vm-monitor` is loosely comprised of a few systems. These are:
* the server: this is just a simple `axum` server that accepts requests and
upgrades them to websocket connections. The server only allows one connection at
a time. This means that upon receiving a new connection, the server will terminate
and old one if it exists.
* the filecache: a struct that allows communication with the Postgres file cache.
On startup, we connect to the filecache and hold on to the connection for the
entire monitor lifetime.
* the cgroup watcher: the `CgroupWatcher` manages the `neon-postgres` cgroup by
listening for `memory.high` events and setting its `memory.{high,max}` values.
* the runner: the runner marries the filecache and cgroup watcher together,
communicating with the agent throught the `Dispatcher`, and then calling filecache
and cgroup watcher functions as needed to upscale and downscale

View File

@@ -0,0 +1,33 @@
// We expose a standalone binary _and_ start the monitor in `compute_ctl` so that
// we can test the monitor as part of the entire autoscaling system in
// neondatabase/autoscaling.
//
// The monitor was previously started by vm-builder, and for testing purposes,
// we can mimic that setup with this binary.
#[cfg(target_os = "linux")]
#[tokio::main]
async fn main() -> anyhow::Result<()> {
use clap::Parser;
use tokio_util::sync::CancellationToken;
use tracing_subscriber::EnvFilter;
use vm_monitor::Args;
let subscriber = tracing_subscriber::fmt::Subscriber::builder()
.json()
.with_file(true)
.with_line_number(true)
.with_span_list(true)
.with_env_filter(EnvFilter::from_default_env())
.finish();
tracing::subscriber::set_global_default(subscriber)?;
let args: &'static Args = Box::leak(Box::new(Args::parse()));
let token = CancellationToken::new();
vm_monitor::start(args, token).await
}
#[cfg(not(target_os = "linux"))]
fn main() {
panic!("the monitor requires cgroups, which are only available on linux")
}

View File

@@ -0,0 +1,693 @@
use std::{
fmt::{Debug, Display},
fs,
pin::pin,
sync::atomic::{AtomicU64, Ordering},
};
use anyhow::{anyhow, bail, Context};
use cgroups_rs::{
freezer::FreezerController,
hierarchies::{self, is_cgroup2_unified_mode, UNIFIED_MOUNTPOINT},
memory::MemController,
MaxValue,
Subsystem::{Freezer, Mem},
};
use inotify::{EventStream, Inotify, WatchMask};
use tokio::sync::mpsc::{self, error::TryRecvError};
use tokio::time::{Duration, Instant};
use tokio_stream::{Stream, StreamExt};
use tracing::{info, warn};
use crate::protocol::Resources;
use crate::MiB;
/// Monotonically increasing counter of the number of memory.high events
/// the cgroup has experienced.
///
/// We use this to determine if a modification to the `memory.events` file actually
/// changed the `high` field. If not, we don't care about the change. When we
/// read the file, we check the `high` field in the file against `MEMORY_EVENT_COUNT`
/// to see if it changed since last time.
pub static MEMORY_EVENT_COUNT: AtomicU64 = AtomicU64::new(0);
/// Monotonically increasing counter that gives each cgroup event a unique id.
///
/// This allows us to answer questions like "did this upscale arrive before this
/// memory.high?". This static is also used by the `Sequenced` type to "tag" values
/// with a sequence number. As such, prefer to used the `Sequenced` type rather
/// than this static directly.
static EVENT_SEQUENCE_NUMBER: AtomicU64 = AtomicU64::new(0);
/// A memory event type reported in memory.events.
#[derive(Debug, Eq, PartialEq, Copy, Clone)]
pub enum MemoryEvent {
Low,
High,
Max,
Oom,
OomKill,
OomGroupKill,
}
impl MemoryEvent {
fn as_str(&self) -> &str {
match self {
MemoryEvent::Low => "low",
MemoryEvent::High => "high",
MemoryEvent::Max => "max",
MemoryEvent::Oom => "oom",
MemoryEvent::OomKill => "oom_kill",
MemoryEvent::OomGroupKill => "oom_group_kill",
}
}
}
impl Display for MemoryEvent {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.write_str(self.as_str())
}
}
/// Configuration for a `CgroupWatcher`
#[derive(Debug, Clone)]
pub struct Config {
// The target difference between the total memory reserved for the cgroup
// and the value of the cgroup's memory.high.
//
// In other words, memory.high + oom_buffer_bytes will equal the total memory that the cgroup may
// use (equal to system memory, minus whatever's taken out for the file cache).
oom_buffer_bytes: u64,
// The amount of memory, in bytes, below a proposed new value for
// memory.high that the cgroup's memory usage must be for us to downscale
//
// In other words, we can downscale only when:
//
// memory.current + memory_high_buffer_bytes < (proposed) memory.high
//
// TODO: there's some minor issues with this approach -- in particular, that we might have
// memory in use by the kernel's page cache that we're actually ok with getting rid of.
pub(crate) memory_high_buffer_bytes: u64,
// The maximum duration, in milliseconds, that we're allowed to pause
// the cgroup for while waiting for the autoscaler-agent to upscale us
max_upscale_wait: Duration,
// The required minimum time, in milliseconds, that we must wait before re-freezing
// the cgroup while waiting for the autoscaler-agent to upscale us.
do_not_freeze_more_often_than: Duration,
// The amount of memory, in bytes, that we should periodically increase memory.high
// by while waiting for the autoscaler-agent to upscale us.
//
// This exists to avoid the excessive throttling that happens when a cgroup is above its
// memory.high for too long. See more here:
// https://github.com/neondatabase/autoscaling/issues/44#issuecomment-1522487217
memory_high_increase_by_bytes: u64,
// The period, in milliseconds, at which we should repeatedly increase the value
// of the cgroup's memory.high while we're waiting on upscaling and memory.high
// is still being hit.
//
// Technically speaking, this actually serves as a rate limit to moderate responding to
// memory.high events, but these are roughly equivalent if the process is still allocating
// memory.
memory_high_increase_every: Duration,
}
impl Config {
/// Calculate the new value for the cgroups memory.high based on system memory
pub fn calculate_memory_high_value(&self, total_system_mem: u64) -> u64 {
total_system_mem.saturating_sub(self.oom_buffer_bytes)
}
}
impl Default for Config {
fn default() -> Self {
Self {
oom_buffer_bytes: 100 * MiB,
memory_high_buffer_bytes: 100 * MiB,
// while waiting for upscale, don't freeze for more than 20ms every 1s
max_upscale_wait: Duration::from_millis(20),
do_not_freeze_more_often_than: Duration::from_millis(1000),
// while waiting for upscale, increase memory.high by 10MiB every 25ms
memory_high_increase_by_bytes: 10 * MiB,
memory_high_increase_every: Duration::from_millis(25),
}
}
}
/// Used to represent data that is associated with a certain point in time, such
/// as an upscale request or memory.high event.
///
/// Internally, creating a `Sequenced` uses a static atomic counter to obtain
/// a unique sequence number. Sequence numbers are monotonically increasing,
/// allowing us to answer questions like "did this upscale happen after this
/// memory.high event?" by comparing the sequence numbers of the two events.
#[derive(Debug, Clone)]
pub struct Sequenced<T> {
seqnum: u64,
data: T,
}
impl<T> Sequenced<T> {
pub fn new(data: T) -> Self {
Self {
seqnum: EVENT_SEQUENCE_NUMBER.fetch_add(1, Ordering::AcqRel),
data,
}
}
}
/// Responds to `MonitorEvents` to manage the cgroup: preventing it from being
/// OOM killed or throttling.
///
/// The `CgroupWatcher` primarily achieves this by reading from a stream of
/// `MonitorEvent`s. See `main_signals_loop` for details on how to keep the
/// cgroup happy.
#[derive(Debug)]
pub struct CgroupWatcher {
pub config: Config,
/// The sequence number of the last upscale.
///
/// If we receive a memory.high event that has a _lower_ sequence number than
/// `last_upscale_seqnum`, then we know it occured before the upscale, and we
/// can safely ignore it.
///
/// Note: Like the `events` field, this doesn't _need_ interior mutability but we
/// use it anyways so that methods take `&self`, not `&mut self`.
last_upscale_seqnum: AtomicU64,
/// A channel on which we send messages to request upscale from the dispatcher.
upscale_requester: mpsc::Sender<()>,
/// The actual cgroup we are watching and managing.
cgroup: cgroups_rs::Cgroup,
}
/// Read memory.events for the desired event type.
///
/// `path` specifies the path to the desired `memory.events` file.
/// For more info, see the `memory.events` section of the [kernel docs]
/// <https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files>
fn get_event_count(path: &str, event: MemoryEvent) -> anyhow::Result<u64> {
let contents = fs::read_to_string(path)
.with_context(|| format!("failed to read memory.events from {path}"))?;
// Then contents of the file look like:
// low 42
// high 101
// ...
contents
.lines()
.filter_map(|s| s.split_once(' '))
.find(|(e, _)| *e == event.as_str())
.ok_or_else(|| anyhow!("failed to find entry for memory.{event} events in {path}"))
.and_then(|(_, count)| {
count
.parse::<u64>()
.with_context(|| format!("failed to parse memory.{event} as u64"))
})
}
/// Create an event stream that produces events whenever the file at the provided
/// path is modified.
fn create_file_watcher(path: &str) -> anyhow::Result<EventStream<[u8; 1024]>> {
info!("creating file watcher for {path}");
let inotify = Inotify::init().context("failed to initialize file watcher")?;
inotify
.watches()
.add(path, WatchMask::MODIFY)
.with_context(|| format!("failed to start watching {path}"))?;
inotify
// The inotify docs use [0u8; 1024] so we'll just copy them. We only need
// to store one event at a time - if the event gets written over, that's
// ok. We still see that there is an event. For more information, see:
// https://man7.org/linux/man-pages/man7/inotify.7.html
.into_event_stream([0u8; 1024])
.context("failed to start inotify event stream")
}
impl CgroupWatcher {
/// Create a new `CgroupWatcher`.
#[tracing::instrument(skip_all, fields(%name))]
pub fn new(
name: String,
// A channel on which to send upscale requests
upscale_requester: mpsc::Sender<()>,
) -> anyhow::Result<(Self, impl Stream<Item = Sequenced<u64>>)> {
// TODO: clarify exactly why we need v2
// Make sure cgroups v2 (aka unified) are supported
if !is_cgroup2_unified_mode() {
anyhow::bail!("cgroups v2 not supported");
}
let cgroup = cgroups_rs::Cgroup::load(hierarchies::auto(), &name);
// Start monitoring the cgroup for memory events. In general, for
// cgroups v2 (aka unified), metrics are reported in files like
// > `/sys/fs/cgroup/{name}/{metric}`
// We are looking for `memory.high` events, which are stored in the
// file `memory.events`. For more info, see the `memory.events` section
// of https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files
let path = format!("{}/{}/memory.events", UNIFIED_MOUNTPOINT, &name);
let memory_events = create_file_watcher(&path)
.with_context(|| format!("failed to create event watcher for {path}"))?
// This would be nice with with .inspect_err followed by .ok
.filter_map(move |_| match get_event_count(&path, MemoryEvent::High) {
Ok(high) => Some(high),
Err(error) => {
// TODO: Might want to just panic here
warn!(?error, "failed to read high events count from {}", &path);
None
}
})
// Only report the event if the memory.high count increased
.filter_map(|high| {
if MEMORY_EVENT_COUNT.fetch_max(high, Ordering::AcqRel) < high {
Some(high)
} else {
None
}
})
.map(Sequenced::new);
let initial_count = get_event_count(
&format!("{}/{}/memory.events", UNIFIED_MOUNTPOINT, &name),
MemoryEvent::High,
)?;
info!(initial_count, "initial memory.high event count");
// Hard update `MEMORY_EVENT_COUNT` since there could have been processes
// running in the cgroup before that caused it to be non-zero.
MEMORY_EVENT_COUNT.fetch_max(initial_count, Ordering::AcqRel);
Ok((
Self {
cgroup,
upscale_requester,
last_upscale_seqnum: AtomicU64::new(0),
config: Default::default(),
},
memory_events,
))
}
/// The entrypoint for the `CgroupWatcher`.
#[tracing::instrument(skip_all)]
pub async fn watch<E>(
&self,
// These are ~dependency injected~ (fancy, I know) because this function
// should never return.
// -> therefore: when we tokio::spawn it, we don't await the JoinHandle.
// -> therefore: if we want to stick it in an Arc so many threads can access
// it, methods can never take mutable access.
// - note: we use the Arc strategy so that a) we can call this function
// right here and b) the runner can call the set/get_memory methods
// -> since calling recv() on a tokio::sync::mpsc::Receiver takes &mut self,
// we just pass them in here instead of holding them in fields, as that
// would require this method to take &mut self.
mut upscales: mpsc::Receiver<Sequenced<Resources>>,
events: E,
) -> anyhow::Result<()>
where
E: Stream<Item = Sequenced<u64>>,
{
// There are several actions might do when receiving a `memory.high`,
// such as freezing the cgroup, or increasing its `memory.high`. We don't
// want to do these things too often (because postgres needs to run, and
// we only have so much memory). These timers serve as rate limits for this.
let mut wait_to_freeze = pin!(tokio::time::sleep(Duration::ZERO));
let mut wait_to_increase_memory_high = pin!(tokio::time::sleep(Duration::ZERO));
let mut events = pin!(events);
// Are we waiting to be upscaled? Could be true if we request upscale due
// to a memory.high event and it does not arrive in time.
let mut waiting_on_upscale = false;
loop {
tokio::select! {
upscale = upscales.recv() => {
let Sequenced { seqnum, data } = upscale
.context("failed to listen on upscale notification channel")?;
self.last_upscale_seqnum.store(seqnum, Ordering::Release);
info!(cpu = data.cpu, mem_bytes = data.mem, "received upscale");
}
event = events.next() => {
let Some(Sequenced { seqnum, .. }) = event else {
bail!("failed to listen for memory.high events")
};
// The memory.high came before our last upscale, so we consider
// it resolved
if self.last_upscale_seqnum.fetch_max(seqnum, Ordering::AcqRel) > seqnum {
info!(
"received memory.high event, but it came before our last upscale -> ignoring it"
);
continue;
}
// The memory.high came after our latest upscale. We don't
// want to do anything yet, so peek the next event in hopes
// that it's an upscale.
if let Some(upscale_num) = self
.upscaled(&mut upscales)
.context("failed to check if we were upscaled")?
{
if upscale_num > seqnum {
info!(
"received memory.high event, but it came before our last upscale -> ignoring it"
);
continue;
}
}
// If it's been long enough since we last froze, freeze the
// cgroup and request upscale
if wait_to_freeze.is_elapsed() {
info!("received memory.high event -> requesting upscale");
waiting_on_upscale = self
.handle_memory_high_event(&mut upscales)
.await
.context("failed to handle upscale")?;
wait_to_freeze
.as_mut()
.reset(Instant::now() + self.config.do_not_freeze_more_often_than);
continue;
}
// Ok, we can't freeze, just request upscale
if !waiting_on_upscale {
info!("received memory.high event, but too soon to refreeze -> requesting upscale");
// Make check to make sure we haven't been upscaled in the
// meantine (can happen if the agent independently decides
// to upscale us again)
if self
.upscaled(&mut upscales)
.context("failed to check if we were upscaled")?
.is_some()
{
info!("no need to request upscaling because we got upscaled");
continue;
}
self.upscale_requester
.send(())
.await
.context("failed to request upscale")?;
continue;
}
// Shoot, we can't freeze or and we're still waiting on upscale,
// increase memory.high to reduce throttling
if wait_to_increase_memory_high.is_elapsed() {
info!(
"received memory.high event, \
but too soon to refreeze and already requested upscale \
-> increasing memory.high"
);
// Make check to make sure we haven't been upscaled in the
// meantine (can happen if the agent independently decides
// to upscale us again)
if self
.upscaled(&mut upscales)
.context("failed to check if we were upscaled")?
.is_some()
{
info!("no need to increase memory.high because got upscaled");
continue;
}
// Request upscale anyways (the agent will handle deduplicating
// requests)
self.upscale_requester
.send(())
.await
.context("failed to request upscale")?;
let memory_high =
self.get_high_bytes().context("failed to get memory.high")?;
let new_high = memory_high + self.config.memory_high_increase_by_bytes;
info!(
current_high_bytes = memory_high,
new_high_bytes = new_high,
"updating memory.high"
);
self.set_high_bytes(new_high)
.context("failed to set memory.high")?;
wait_to_increase_memory_high
.as_mut()
.reset(Instant::now() + self.config.memory_high_increase_every)
}
// we can't do anything
}
};
}
}
/// Handle a `memory.high`, returning whether we are still waiting on upscale
/// by the time the function returns.
///
/// The general plan for handling a `memory.high` event is as follows:
/// 1. Freeze the cgroup
/// 2. Start a timer for `self.config.max_upscale_wait`
/// 3. Request upscale
/// 4. After the timer elapses or we receive upscale, thaw the cgroup.
/// 5. Return whether or not we are still waiting for upscale. If we are,
/// we'll increase the cgroups memory.high to avoid getting oom killed
#[tracing::instrument(skip_all)]
async fn handle_memory_high_event(
&self,
upscales: &mut mpsc::Receiver<Sequenced<Resources>>,
) -> anyhow::Result<bool> {
// Immediately freeze the cgroup before doing anything else.
info!("received memory.high event -> freezing cgroup");
self.freeze().context("failed to freeze cgroup")?;
// We'll use this for logging durations
let start_time = Instant::now();
// Await the upscale until we have to unfreeze
let timed =
tokio::time::timeout(self.config.max_upscale_wait, self.await_upscale(upscales));
// Request the upscale
info!(
wait = ?self.config.max_upscale_wait,
"sending request for immediate upscaling",
);
self.upscale_requester
.send(())
.await
.context("failed to request upscale")?;
let waiting_on_upscale = match timed.await {
Ok(Ok(())) => {
info!(elapsed = ?start_time.elapsed(), "received upscale in time");
false
}
// **important**: unfreeze the cgroup before ?-reporting the error
Ok(Err(e)) => {
info!("error waiting for upscale -> thawing cgroup");
self.thaw()
.context("failed to thaw cgroup after errored waiting for upscale")?;
Err(e.context("failed to await upscale"))?
}
Err(_) => {
info!(elapsed = ?self.config.max_upscale_wait, "timed out waiting for upscale");
true
}
};
info!("thawing cgroup");
self.thaw().context("failed to thaw cgroup")?;
Ok(waiting_on_upscale)
}
/// Checks whether we were just upscaled, returning the upscale's sequence
/// number if so.
#[tracing::instrument(skip_all)]
fn upscaled(
&self,
upscales: &mut mpsc::Receiver<Sequenced<Resources>>,
) -> anyhow::Result<Option<u64>> {
let Sequenced { seqnum, data } = match upscales.try_recv() {
Ok(upscale) => upscale,
Err(TryRecvError::Empty) => return Ok(None),
Err(TryRecvError::Disconnected) => {
bail!("upscale notification channel was disconnected")
}
};
// Make sure to update the last upscale sequence number
self.last_upscale_seqnum.store(seqnum, Ordering::Release);
info!(cpu = data.cpu, mem_bytes = data.mem, "received upscale");
Ok(Some(seqnum))
}
/// Await an upscale event, discarding any `memory.high` events received in
/// the process.
///
/// This is used in `handle_memory_high_event`, where we need to listen
/// for upscales in particular so we know if we can thaw the cgroup early.
#[tracing::instrument(skip_all)]
async fn await_upscale(
&self,
upscales: &mut mpsc::Receiver<Sequenced<Resources>>,
) -> anyhow::Result<()> {
let Sequenced { seqnum, .. } = upscales
.recv()
.await
.context("error listening for upscales")?;
self.last_upscale_seqnum.store(seqnum, Ordering::Release);
Ok(())
}
/// Get the cgroup's name.
pub fn path(&self) -> &str {
self.cgroup.path()
}
}
/// Represents a set of limits we apply to a cgroup to control memory usage.
///
/// Setting these values also affects the thresholds for receiving usage alerts.
#[derive(Debug)]
pub struct MemoryLimits {
high: u64,
max: u64,
}
impl MemoryLimits {
pub fn new(high: u64, max: u64) -> Self {
Self { max, high }
}
}
// Methods for manipulating the actual cgroup
impl CgroupWatcher {
/// Get a handle on the freezer subsystem.
fn freezer(&self) -> anyhow::Result<&FreezerController> {
if let Some(Freezer(freezer)) = self
.cgroup
.subsystems()
.iter()
.find(|sub| matches!(sub, Freezer(_)))
{
Ok(freezer)
} else {
anyhow::bail!("could not find freezer subsystem")
}
}
/// Attempt to freeze the cgroup.
pub fn freeze(&self) -> anyhow::Result<()> {
self.freezer()
.context("failed to get freezer subsystem")?
.freeze()
.context("failed to freeze")
}
/// Attempt to thaw the cgroup.
pub fn thaw(&self) -> anyhow::Result<()> {
self.freezer()
.context("failed to get freezer subsystem")?
.thaw()
.context("failed to thaw")
}
/// Get a handle on the memory subsystem.
///
/// Note: this method does not require `self.memory_update_lock` because
/// getting a handle to the subsystem does not access any of the files we
/// care about, such as memory.high and memory.events
fn memory(&self) -> anyhow::Result<&MemController> {
if let Some(Mem(memory)) = self
.cgroup
.subsystems()
.iter()
.find(|sub| matches!(sub, Mem(_)))
{
Ok(memory)
} else {
anyhow::bail!("could not find memory subsystem")
}
}
/// Get cgroup current memory usage.
pub fn current_memory_usage(&self) -> anyhow::Result<u64> {
Ok(self
.memory()
.context("failed to get memory subsystem")?
.memory_stat()
.usage_in_bytes)
}
/// Set cgroup memory.high threshold.
pub fn set_high_bytes(&self, bytes: u64) -> anyhow::Result<()> {
self.memory()
.context("failed to get memory subsystem")?
.set_mem(cgroups_rs::memory::SetMemory {
low: None,
high: Some(MaxValue::Value(u64::min(bytes, i64::MAX as u64) as i64)),
min: None,
max: None,
})
.context("failed to set memory.high")
}
/// Set cgroup memory.high and memory.max.
pub fn set_limits(&self, limits: &MemoryLimits) -> anyhow::Result<()> {
info!(
limits.high,
limits.max,
path = self.path(),
"writing new memory limits",
);
self.memory()
.context("failed to get memory subsystem while setting memory limits")?
.set_mem(cgroups_rs::memory::SetMemory {
min: None,
low: None,
high: Some(MaxValue::Value(
u64::min(limits.high, i64::MAX as u64) as i64
)),
max: Some(MaxValue::Value(u64::min(limits.max, i64::MAX as u64) as i64)),
})
.context("failed to set memory limits")
}
/// Given some amount of available memory, set the desired cgroup memory limits
pub fn set_memory_limits(&mut self, available_memory: u64) -> anyhow::Result<()> {
let new_high = self.config.calculate_memory_high_value(available_memory);
let limits = MemoryLimits::new(new_high, available_memory);
info!(
path = self.path(),
memory = ?limits,
"setting cgroup memory",
);
self.set_limits(&limits)
.context("failed to set cgroup memory limits")?;
Ok(())
}
/// Get memory.high threshold.
pub fn get_high_bytes(&self) -> anyhow::Result<u64> {
let high = self
.memory()
.context("failed to get memory subsystem while getting memory statistics")?
.get_mem()
.map(|mem| mem.high)
.context("failed to get memory statistics from subsystem")?;
match high {
Some(MaxValue::Max) => Ok(i64::MAX as u64),
Some(MaxValue::Value(high)) => Ok(high as u64),
None => anyhow::bail!("failed to read memory.high from memory subsystem"),
}
}
}

View File

@@ -0,0 +1,153 @@
//! Managing the websocket connection and other signals in the monitor.
//!
//! Contains types that manage the interaction (not data interchange, see `protocol`)
//! between agent and monitor, allowing us to to process and send messages in a
//! straightforward way. The dispatcher also manages that signals that come from
//! the cgroup (requesting upscale), and the signals that go to the cgroup
//! (notifying it of upscale).
use anyhow::{bail, Context};
use axum::extract::ws::{Message, WebSocket};
use futures::{
stream::{SplitSink, SplitStream},
SinkExt, StreamExt,
};
use tokio::sync::mpsc;
use tracing::info;
use crate::cgroup::Sequenced;
use crate::protocol::{
OutboundMsg, ProtocolRange, ProtocolResponse, ProtocolVersion, Resources, PROTOCOL_MAX_VERSION,
PROTOCOL_MIN_VERSION,
};
/// The central handler for all communications in the monitor.
///
/// The dispatcher has two purposes:
/// 1. Manage the connection to the agent, sending and receiving messages.
/// 2. Communicate with the cgroup manager, notifying it when upscale is received,
/// and sending a message to the agent when the cgroup manager requests
/// upscale.
#[derive(Debug)]
pub struct Dispatcher {
/// We read agent messages of of `source`
pub(crate) source: SplitStream<WebSocket>,
/// We send messages to the agent through `sink`
sink: SplitSink<WebSocket, Message>,
/// Used to notify the cgroup when we are upscaled.
pub(crate) notify_upscale_events: mpsc::Sender<Sequenced<Resources>>,
/// When the cgroup requests upscale it will send on this channel. In response
/// we send an `UpscaleRequst` to the agent.
pub(crate) request_upscale_events: mpsc::Receiver<()>,
/// The protocol version we have agreed to use with the agent. This is negotiated
/// during the creation of the dispatcher, and should be the highest shared protocol
/// version.
///
// NOTE: currently unused, but will almost certainly be used in the futures
// as the protocol changes
#[allow(unused)]
pub(crate) proto_version: ProtocolVersion,
}
impl Dispatcher {
/// Creates a new dispatcher using the passed-in connection.
///
/// Performs a negotiation with the agent to determine the highest protocol
/// version that both support. This consists of two steps:
/// 1. Wait for the agent to sent the range of protocols it supports.
/// 2. Send a protocol version that works for us as well, or an error if there
/// is no compatible version.
pub async fn new(
stream: WebSocket,
notify_upscale_events: mpsc::Sender<Sequenced<Resources>>,
request_upscale_events: mpsc::Receiver<()>,
) -> anyhow::Result<Self> {
let (mut sink, mut source) = stream.split();
// Figure out the highest protocol version we both support
info!("waiting for agent to send protocol version range");
let Some(message) = source.next().await else {
bail!("websocket connection closed while performing protocol handshake")
};
let message = message.context("failed to read protocol version range off connection")?;
let Message::Text(message_text) = message else {
// All messages should be in text form, since we don't do any
// pinging/ponging. See nhooyr/websocket's implementation and the
// agent for more info
bail!("received non-text message during proocol handshake: {message:?}")
};
let monitor_range = ProtocolRange {
min: PROTOCOL_MIN_VERSION,
max: PROTOCOL_MAX_VERSION,
};
let agent_range: ProtocolRange = serde_json::from_str(&message_text)
.context("failed to deserialize protocol version range")?;
info!(range = ?agent_range, "received protocol version range");
let highest_shared_version = match monitor_range.highest_shared_version(&agent_range) {
Ok(version) => {
sink.send(Message::Text(
serde_json::to_string(&ProtocolResponse::Version(version)).unwrap(),
))
.await
.context("failed to notify agent of negotiated protocol version")?;
version
}
Err(e) => {
sink.send(Message::Text(
serde_json::to_string(&ProtocolResponse::Error(format!(
"Received protocol version range {} which does not overlap with {}",
agent_range, monitor_range
)))
.unwrap(),
))
.await
.context("failed to notify agent of no overlap between protocol version ranges")?;
Err(e).context("error determining suitable protocol version range")?
}
};
Ok(Self {
sink,
source,
notify_upscale_events,
request_upscale_events,
proto_version: highest_shared_version,
})
}
/// Notify the cgroup manager that we have received upscale and wait for
/// the acknowledgement.
#[tracing::instrument(skip_all, fields(?resources))]
pub async fn notify_upscale(&self, resources: Sequenced<Resources>) -> anyhow::Result<()> {
self.notify_upscale_events
.send(resources)
.await
.context("failed to send resources and oneshot sender across channel")
}
/// Send a message to the agent.
///
/// Although this function is small, it has one major benefit: it is the only
/// way to send data accross the connection, and you can only pass in a proper
/// `MonitorMessage`. Without safeguards like this, it's easy to accidentally
/// serialize the wrong thing and send it, since `self.sink.send` will take
/// any string.
pub async fn send(&mut self, message: OutboundMsg) -> anyhow::Result<()> {
info!(?message, "sending message");
let json = serde_json::to_string(&message).context("failed to serialize message")?;
self.sink
.send(Message::Text(json))
.await
.context("stream error sending message")
}
}

View File

@@ -0,0 +1,316 @@
//! Logic for configuring and scaling the Postgres file cache.
use std::num::NonZeroU64;
use crate::MiB;
use anyhow::{anyhow, Context};
use tokio_postgres::{types::ToSql, Client, NoTls, Row};
use tokio_util::sync::CancellationToken;
use tracing::{error, info};
/// Manages Postgres' file cache by keeping a connection open.
#[derive(Debug)]
pub struct FileCacheState {
client: Client,
conn_str: String,
pub(crate) config: FileCacheConfig,
/// A token for cancelling spawned threads during shutdown.
token: CancellationToken,
}
#[derive(Debug)]
pub struct FileCacheConfig {
/// Whether the file cache is *actually* stored in memory (e.g. by writing to
/// a tmpfs or shmem file). If true, the size of the file cache will be counted against the
/// memory available for the cgroup.
pub(crate) in_memory: bool,
/// The size of the file cache, in terms of the size of the resource it consumes
/// (currently: only memory)
///
/// For example, setting `resource_multipler = 0.75` gives the cache a target size of 75% of total
/// resources.
///
/// This value must be strictly between 0 and 1.
resource_multiplier: f64,
/// The required minimum amount of memory, in bytes, that must remain available
/// after subtracting the file cache.
///
/// This value must be non-zero.
min_remaining_after_cache: NonZeroU64,
/// Controls the rate of increase in the file cache's size as it grows from zero
/// (when total resources equals min_remaining_after_cache) to the desired size based on
/// `resource_multiplier`.
///
/// A `spread_factor` of zero means that all additional resources will go to the cache until it
/// reaches the desired size. Setting `spread_factor` to N roughly means "for every 1 byte added to
/// the cache's size, N bytes are reserved for the rest of the system, until the cache gets to
/// its desired size".
///
/// This value must be >= 0, and must retain an increase that is more than what would be given by
/// `resource_multiplier`. For example, setting `resource_multiplier` = 0.75 but `spread_factor` = 1
/// would be invalid, because `spread_factor` would induce only 50% usage - never reaching the 75%
/// as desired by `resource_multiplier`.
///
/// `spread_factor` is too large if `(spread_factor + 1) * resource_multiplier >= 1`.
spread_factor: f64,
}
impl FileCacheConfig {
pub fn default_in_memory() -> Self {
Self {
in_memory: true,
// 75 %
resource_multiplier: 0.75,
// 640 MiB; (512 + 128)
min_remaining_after_cache: NonZeroU64::new(640 * MiB).unwrap(),
// ensure any increase in file cache size is split 90-10 with 10% to other memory
spread_factor: 0.1,
}
}
pub fn default_on_disk() -> Self {
Self {
in_memory: false,
resource_multiplier: 0.75,
// 256 MiB - lower than when in memory because overcommitting is safe; if we don't have
// memory, the kernel will just evict from its page cache, rather than e.g. killing
// everything.
min_remaining_after_cache: NonZeroU64::new(256 * MiB).unwrap(),
spread_factor: 0.1,
}
}
/// Make sure fields of the config are consistent.
pub fn validate(&self) -> anyhow::Result<()> {
// Single field validity
anyhow::ensure!(
0.0 < self.resource_multiplier && self.resource_multiplier < 1.0,
"resource_multiplier must be between 0.0 and 1.0 exclusive, got {}",
self.resource_multiplier
);
anyhow::ensure!(
self.spread_factor >= 0.0,
"spread_factor must be >= 0, got {}",
self.spread_factor
);
// Check that `resource_multiplier` and `spread_factor` are valid w.r.t. each other.
//
// As shown in `calculate_cache_size`, we have two lines resulting from `resource_multiplier` and
// `spread_factor`, respectively. They are:
//
// `total` `min_remaining_after_cache`
// size = ————————————————————— - —————————————————————————————
// `spread_factor` + 1 `spread_factor` + 1
//
// and
//
// size = `resource_multiplier` × total
//
// .. where `total` is the total resources. These are isomorphic to the typical 'y = mx + b'
// form, with y = "size" and x = "total".
//
// These lines intersect at:
//
// `min_remaining_after_cache`
// ———————————————————————————————————————————————————
// 1 - `resource_multiplier` × (`spread_factor` + 1)
//
// We want to ensure that this value (a) exists, and (b) is >= `min_remaining_after_cache`. This is
// guaranteed when '`resource_multiplier` × (`spread_factor` + 1)' is less than 1.
// (We also need it to be >= 0, but that's already guaranteed.)
let intersect_factor = self.resource_multiplier * (self.spread_factor + 1.0);
anyhow::ensure!(
intersect_factor < 1.0,
"incompatible resource_multipler and spread_factor"
);
Ok(())
}
/// Calculate the desired size of the cache, given the total memory
pub fn calculate_cache_size(&self, total: u64) -> u64 {
// *Note*: all units are in bytes, until the very last line.
let available = total.saturating_sub(self.min_remaining_after_cache.get());
if available == 0 {
return 0;
}
// Conversions to ensure we don't overflow from floating-point ops
let size_from_spread =
i64::max(0, (available as f64 / (1.0 + self.spread_factor)) as i64) as u64;
let size_from_normal = (total as f64 * self.resource_multiplier) as u64;
let byte_size = u64::min(size_from_spread, size_from_normal);
// The file cache operates in units of mebibytes, so the sizes we produce should
// be rounded to a mebibyte. We round down to be conservative.
byte_size / MiB * MiB
}
}
impl FileCacheState {
/// Connect to the file cache.
#[tracing::instrument(skip_all, fields(%conn_str, ?config))]
pub async fn new(
conn_str: &str,
config: FileCacheConfig,
token: CancellationToken,
) -> anyhow::Result<Self> {
config.validate().context("file cache config is invalid")?;
info!(conn_str, "connecting to Postgres file cache");
let client = FileCacheState::connect(conn_str, token.clone())
.await
.context("failed to connect to postgres file cache")?;
let conn_str = conn_str.to_string();
Ok(Self {
client,
config,
conn_str,
token,
})
}
/// Connect to Postgres.
///
/// Aborts the spawned thread if the kill signal is received. This is not
/// a method as it is called in [`FileCacheState::new`].
#[tracing::instrument(skip_all, fields(%conn_str))]
async fn connect(conn_str: &str, token: CancellationToken) -> anyhow::Result<Client> {
let (client, conn) = tokio_postgres::connect(conn_str, NoTls)
.await
.context("failed to connect to pg client")?;
// The connection object performs the actual communication with the database,
// so spawn it off to run on its own. See tokio-postgres docs.
crate::spawn_with_cancel(
token,
|res| {
if let Err(error) = res {
error!(%error, "postgres error")
}
},
conn,
);
Ok(client)
}
/// Execute a query with a retry if necessary.
///
/// If the initial query fails, we restart the database connection and attempt
/// if again.
#[tracing::instrument(skip_all, fields(%statement))]
pub async fn query_with_retry(
&mut self,
statement: &str,
params: &[&(dyn ToSql + Sync)],
) -> anyhow::Result<Vec<Row>> {
match self
.client
.query(statement, params)
.await
.context("failed to execute query")
{
Ok(rows) => Ok(rows),
Err(e) => {
error!(error = ?e, "postgres error: {e} -> retrying");
let client = FileCacheState::connect(&self.conn_str, self.token.clone())
.await
.context("failed to connect to postgres file cache")?;
info!("successfully reconnected to postgres client");
// Replace the old client and attempt the query with the new one
self.client = client;
self.client
.query(statement, params)
.await
.context("failed to execute query a second time")
}
}
}
/// Get the current size of the file cache.
#[tracing::instrument(skip_all)]
pub async fn get_file_cache_size(&mut self) -> anyhow::Result<u64> {
self.query_with_retry(
// The file cache GUC variable is in MiB, but the conversion with
// pg_size_bytes means that the end result we get is in bytes.
"SELECT pg_size_bytes(current_setting('neon.file_cache_size_limit'));",
&[],
)
.await
.context("failed to query pg for file cache size")?
.first()
.ok_or_else(|| anyhow!("file cache size query returned no rows"))?
// pg_size_bytes returns a bigint which is the same as an i64.
.try_get::<_, i64>(0)
// Since the size of the table is not negative, the cast is sound.
.map(|bytes| bytes as u64)
.context("failed to extract file cache size from query result")
}
/// Attempt to set the file cache size, returning the size it was actually
/// set to.
#[tracing::instrument(skip_all, fields(%num_bytes))]
pub async fn set_file_cache_size(&mut self, num_bytes: u64) -> anyhow::Result<u64> {
let max_bytes = self
// The file cache GUC variable is in MiB, but the conversion with pg_size_bytes
// means that the end result we get is in bytes.
.query_with_retry(
"SELECT pg_size_bytes(current_setting('neon.max_file_cache_size'));",
&[],
)
.await
.context("failed to query pg for max file cache size")?
.first()
.ok_or_else(|| anyhow!("max file cache size query returned no rows"))?
.try_get::<_, i64>(0)
.map(|bytes| bytes as u64)
.context("failed to extract max file cache size from query result")?;
let max_mb = max_bytes / MiB;
let num_mb = u64::min(num_bytes, max_bytes) / MiB;
let capped = if num_bytes > max_bytes {
" (capped by maximum size)"
} else {
""
};
info!(
size = num_mb,
max = max_mb,
"updating file cache size {capped}",
);
// note: even though the normal ways to get the cache size produce values with trailing "MB"
// (hence why we call pg_size_bytes in `get_file_cache_size`'s query), the format
// it expects to set the value is "integer number of MB" without trailing units.
// For some reason, this *really* wasn't working with normal arguments, so that's
// why we're constructing the query here.
self.client
.query(
&format!("ALTER SYSTEM SET neon.file_cache_size_limit = {};", num_mb),
&[],
)
.await
.context("failed to change file cache size limit")?;
// must use pg_reload_conf to have the settings change take effect
self.client
.execute("SELECT pg_reload_conf();", &[])
.await
.context("failed to reload config")?;
Ok(num_mb * MiB)
}
}

215
libs/vm_monitor/src/lib.rs Normal file
View File

@@ -0,0 +1,215 @@
#![cfg(target_os = "linux")]
use anyhow::Context;
use axum::{
extract::{ws::WebSocket, State, WebSocketUpgrade},
response::Response,
};
use axum::{routing::get, Router, Server};
use clap::Parser;
use futures::Future;
use std::{fmt::Debug, time::Duration};
use sysinfo::{RefreshKind, System, SystemExt};
use tokio::{sync::broadcast, task::JoinHandle};
use tokio_util::sync::CancellationToken;
use tracing::{error, info};
use runner::Runner;
// Code that interfaces with agent
pub mod dispatcher;
pub mod protocol;
pub mod cgroup;
pub mod filecache;
pub mod runner;
/// The vm-monitor is an autoscaling component started by compute_ctl.
///
/// It carries out autoscaling decisions (upscaling/downscaling) and responds to
/// memory pressure by making requests to the autoscaler-agent.
#[derive(Debug, Parser)]
pub struct Args {
/// The name of the cgroup we should monitor for memory.high events. This
/// is the cgroup that postgres should be running in.
#[arg(short, long)]
pub cgroup: Option<String>,
/// The connection string for the Postgres file cache we should manage.
#[arg(short, long)]
pub pgconnstr: Option<String>,
/// Flag to signal that the Postgres file cache is on disk (i.e. not in memory aside from the
/// kernel's page cache), and therefore should not count against available memory.
//
// NB: Ideally this flag would directly refer to whether the file cache is in memory (rather
// than a roundabout way, via whether it's on disk), but in order to be backwards compatible
// during the switch away from an in-memory file cache, we had to default to the previous
// behavior.
#[arg(long)]
pub file_cache_on_disk: bool,
/// The address we should listen on for connection requests. For the
/// agent, this is 0.0.0.0:10301. For the informant, this is 127.0.0.1:10369.
#[arg(short, long)]
pub addr: String,
}
impl Args {
pub fn addr(&self) -> &str {
&self.addr
}
}
/// The number of bytes in one mebibyte.
#[allow(non_upper_case_globals)]
const MiB: u64 = 1 << 20;
/// Convert a quantity in bytes to a quantity in mebibytes, generally for display
/// purposes. (Most calculations in this crate use bytes directly)
pub fn bytes_to_mebibytes(bytes: u64) -> f32 {
(bytes as f32) / (MiB as f32)
}
pub fn get_total_system_memory() -> u64 {
System::new_with_specifics(RefreshKind::new().with_memory()).total_memory()
}
/// Global app state for the Axum server
#[derive(Debug, Clone)]
pub struct ServerState {
/// Used to close old connections.
///
/// When a new connection is made, we send a message signalling to the old
/// connection to close.
pub sender: broadcast::Sender<()>,
/// Used to cancel all spawned threads in the monitor.
pub token: CancellationToken,
// The CLI args
pub args: &'static Args,
}
/// Spawn a thread that may get cancelled by the provided [`CancellationToken`].
///
/// This is mainly meant to be called with futures that will be pending for a very
/// long time, or are not mean to return. If it is not desirable for the future to
/// ever resolve, such as in the case of [`cgroup::CgroupWatcher::watch`], the error can
/// be logged with `f`.
pub fn spawn_with_cancel<T, F>(
token: CancellationToken,
f: F,
future: T,
) -> JoinHandle<Option<T::Output>>
where
T: Future + Send + 'static,
T::Output: Send + 'static,
F: FnOnce(&T::Output) + Send + 'static,
{
tokio::spawn(async move {
tokio::select! {
_ = token.cancelled() => {
info!("received global kill signal");
None
}
res = future => {
f(&res);
Some(res)
}
}
})
}
/// The entrypoint to the binary.
///
/// Set up tracing, parse arguments, and start an http server.
pub async fn start(args: &'static Args, token: CancellationToken) -> anyhow::Result<()> {
// This channel is used to close old connections. When a new connection is
// made, we send a message signalling to the old connection to close.
let (sender, _) = tokio::sync::broadcast::channel::<()>(1);
let app = Router::new()
// This route gets upgraded to a websocket connection. We only support
// one connection at a time, which we enforce by killing old connections
// when we receive a new one.
.route("/monitor", get(ws_handler))
.with_state(ServerState {
sender,
token,
args,
});
let addr = args.addr();
let bound = Server::try_bind(&addr.parse().expect("parsing address should not fail"))
.with_context(|| format!("failed to bind to {addr}"))?;
info!(addr, "server bound");
bound
.serve(app.into_make_service())
.await
.context("server exited")?;
Ok(())
}
/// Handles incoming websocket connections.
///
/// If we are already to connected to an agent, we kill that old connection
/// and accept the new one.
#[tracing::instrument(name = "/monitor", skip_all, fields(?args))]
pub async fn ws_handler(
ws: WebSocketUpgrade,
State(ServerState {
sender,
token,
args,
}): State<ServerState>,
) -> Response {
// Kill the old monitor
info!("closing old connection if there is one");
let _ = sender.send(());
// Start the new one. Wow, the cycle of death and rebirth
let closer = sender.subscribe();
ws.on_upgrade(|ws| start_monitor(ws, args, closer, token))
}
/// Starts the monitor. If startup fails or the monitor exits, an error will
/// be logged and our internal state will be reset to allow for new connections.
#[tracing::instrument(skip_all, fields(?args))]
async fn start_monitor(
ws: WebSocket,
args: &Args,
kill: broadcast::Receiver<()>,
token: CancellationToken,
) {
info!("accepted new websocket connection -> starting monitor");
let timeout = Duration::from_secs(4);
let monitor = tokio::time::timeout(
timeout,
Runner::new(Default::default(), args, ws, kill, token),
)
.await;
let mut monitor = match monitor {
Ok(Ok(monitor)) => monitor,
Ok(Err(error)) => {
error!(?error, "failed to create monitor");
return;
}
Err(_) => {
error!(
?timeout,
"creating monitor timed out (probably waiting to receive protocol range)"
);
return;
}
};
info!("connected to agent");
match monitor.run().await {
Ok(()) => info!("monitor was killed due to new connection"),
Err(e) => error!(error = ?e, "monitor terminated unexpectedly"),
}
}

View File

@@ -0,0 +1,241 @@
//! Types representing protocols and actual agent-monitor messages.
//!
//! The pervasive use of serde modifiers throughout this module is to ease
//! serialization on the go side. Because go does not have enums (which model
//! messages well), it is harder to model messages, and we accomodate that with
//! serde.
//!
//! *Note*: the agent sends and receives messages in different ways.
//!
//! The agent serializes messages in the form and then sends them. The use
//! of `#[serde(tag = "type", content = "content")]` allows us to use `Type`
//! to determine how to deserialize `Content`.
//! ```ignore
//! struct {
//! Content any
//! Type string
//! Id uint64
//! }
//! ```
//! and receives messages in the form:
//! ```ignore
//! struct {
//! {fields embedded}
//! Type string
//! Id uint64
//! }
//! ```
//! After reading the type field, the agent will decode the entire message
//! again, this time into the correct type using the embedded fields.
//! Because the agent cannot just extract the json contained in a certain field
//! (it initially deserializes to `map[string]interface{}`), we keep the fields
//! at the top level, so the entire piece of json can be deserialized into a struct,
//! such as a `DownscaleResult`, with the `Type` and `Id` fields ignored.
use core::fmt;
use std::cmp;
use serde::{de::Error, Deserialize, Serialize};
/// A Message we send to the agent.
#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct OutboundMsg {
#[serde(flatten)]
pub(crate) inner: OutboundMsgKind,
pub(crate) id: usize,
}
impl OutboundMsg {
pub fn new(inner: OutboundMsgKind, id: usize) -> Self {
Self { inner, id }
}
}
/// The different underlying message types we can send to the agent.
#[derive(Serialize, Deserialize, Debug, Clone)]
#[serde(tag = "type")]
pub enum OutboundMsgKind {
/// Indicates that the agent sent an invalid message, i.e, we couldn't
/// properly deserialize it.
InvalidMessage { error: String },
/// Indicates that we experienced an internal error while processing a message.
/// For example, if a cgroup operation fails while trying to handle an upscale,
/// we return `InternalError`.
InternalError { error: String },
/// Returned to the agent once we have finished handling an upscale. If the
/// handling was unsuccessful, an `InternalError` will get returned instead.
/// *Note*: this is a struct variant because of the way go serializes struct{}
UpscaleConfirmation {},
/// Indicates to the monitor that we are urgently requesting resources.
/// *Note*: this is a struct variant because of the way go serializes struct{}
UpscaleRequest {},
/// Returned to the agent once we have finished attempting to downscale. If
/// an error occured trying to do so, an `InternalError` will get returned instead.
/// However, if we are simply unsuccessful (for example, do to needing the resources),
/// that gets included in the `DownscaleResult`.
DownscaleResult {
// FIXME for the future (once the informant is deprecated)
// As of the time of writing, the agent/informant version of this struct is
// called api.DownscaleResult. This struct has uppercase fields which are
// serialized as such. Thus, we serialize using uppercase names so we don't
// have to make a breaking change to the agent<->informant protocol. Once
// the informant has been superseded by the monitor, we can add the correct
// struct tags to api.DownscaleResult without causing a breaking change,
// since we don't need to support the agent<->informant protocol anymore.
#[serde(rename = "Ok")]
ok: bool,
#[serde(rename = "Status")]
status: String,
},
/// Part of the bidirectional heartbeat. The heartbeat is initiated by the
/// agent.
/// *Note*: this is a struct variant because of the way go serializes struct{}
HealthCheck {},
}
/// A message received form the agent.
#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct InboundMsg {
#[serde(flatten)]
pub(crate) inner: InboundMsgKind,
pub(crate) id: usize,
}
/// The different underlying message types we can receive from the agent.
#[derive(Serialize, Deserialize, Debug, Clone)]
#[serde(tag = "type", content = "content")]
pub enum InboundMsgKind {
/// Indicates that the we sent an invalid message, i.e, we couldn't
/// properly deserialize it.
InvalidMessage { error: String },
/// Indicates that the informan experienced an internal error while processing
/// a message. For example, if it failed to request upsacle from the agent, it
/// would return an `InternalError`.
InternalError { error: String },
/// Indicates to us that we have been granted more resources. We should respond
/// with an `UpscaleConfirmation` when done handling the resources (increasins
/// file cache size, cgorup memory limits).
UpscaleNotification { granted: Resources },
/// A request to reduce resource usage. We should response with a `DownscaleResult`,
/// when done.
DownscaleRequest { target: Resources },
/// Part of the bidirectional heartbeat. The heartbeat is initiated by the
/// agent.
/// *Note*: this is a struct variant because of the way go serializes struct{}
HealthCheck {},
}
/// Represents the resources granted to a VM.
#[derive(Serialize, Deserialize, Debug, Clone, Copy)]
// Renamed because the agent has multiple resources types:
// `Resources` (milliCPU/memory slots)
// `Allocation` (vCPU/bytes) <- what we correspond to
#[serde(rename(serialize = "Allocation", deserialize = "Allocation"))]
pub struct Resources {
/// Number of vCPUs
pub(crate) cpu: f64,
/// Bytes of memory
pub(crate) mem: u64,
}
impl Resources {
pub fn new(cpu: f64, mem: u64) -> Self {
Self { cpu, mem }
}
}
pub const PROTOCOL_MIN_VERSION: ProtocolVersion = ProtocolVersion::V1_0;
pub const PROTOCOL_MAX_VERSION: ProtocolVersion = ProtocolVersion::V1_0;
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd, Ord, Eq, Serialize, Deserialize)]
pub struct ProtocolVersion(u8);
impl ProtocolVersion {
/// Represents v1.0 of the agent<-> monitor protocol - the initial version
///
/// Currently the latest version.
const V1_0: ProtocolVersion = ProtocolVersion(1);
}
impl fmt::Display for ProtocolVersion {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match *self {
ProtocolVersion(0) => f.write_str("<invalid: zero>"),
ProtocolVersion::V1_0 => f.write_str("v1.0"),
other => write!(f, "<unknown: {other}>"),
}
}
}
/// A set of protocol bounds that determines what we are speaking.
///
/// These bounds are inclusive.
#[derive(Debug)]
pub struct ProtocolRange {
pub min: ProtocolVersion,
pub max: ProtocolVersion,
}
// Use a custom deserialize impl to ensure that `self.min <= self.max`
impl<'de> Deserialize<'de> for ProtocolRange {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
#[derive(Deserialize)]
struct InnerProtocolRange {
min: ProtocolVersion,
max: ProtocolVersion,
}
let InnerProtocolRange { min, max } = InnerProtocolRange::deserialize(deserializer)?;
if min > max {
Err(D::Error::custom(format!(
"min version = {min} is greater than max version = {max}",
)))
} else {
Ok(ProtocolRange { min, max })
}
}
}
impl fmt::Display for ProtocolRange {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
if self.min == self.max {
f.write_fmt(format_args!("{}", self.max))
} else {
f.write_fmt(format_args!("{} to {}", self.min, self.max))
}
}
}
impl ProtocolRange {
/// Find the highest shared version between two `ProtocolRange`'s
pub fn highest_shared_version(&self, other: &Self) -> anyhow::Result<ProtocolVersion> {
// We first have to make sure the ranges are overlapping. Once we know
// this, we can merge the ranges by taking the max of the mins and the
// mins of the maxes.
if self.min > other.max {
anyhow::bail!(
"Non-overlapping bounds: other.max = {} was less than self.min = {}",
other.max,
self.min,
)
} else if self.max < other.min {
anyhow::bail!(
"Non-overlappinng bounds: self.max = {} was less than other.min = {}",
self.max,
other.min
)
} else {
Ok(cmp::min(self.max, other.max))
}
}
}
/// We send this to the monitor after negotiating which protocol to use
#[derive(Serialize, Debug)]
#[serde(rename_all = "camelCase")]
pub enum ProtocolResponse {
Error(String),
Version(ProtocolVersion),
}

View File

@@ -0,0 +1,460 @@
//! Exposes the `Runner`, which handles messages received from agent and
//! sends upscale requests.
//!
//! This is the "Monitor" part of the monitor binary and is the main entrypoint for
//! all functionality.
use std::sync::Arc;
use std::{fmt::Debug, mem};
use anyhow::{bail, Context};
use axum::extract::ws::{Message, WebSocket};
use futures::StreamExt;
use tokio::sync::broadcast;
use tokio::sync::mpsc;
use tokio_util::sync::CancellationToken;
use tracing::{error, info, warn};
use crate::cgroup::{CgroupWatcher, MemoryLimits, Sequenced};
use crate::dispatcher::Dispatcher;
use crate::filecache::{FileCacheConfig, FileCacheState};
use crate::protocol::{InboundMsg, InboundMsgKind, OutboundMsg, OutboundMsgKind, Resources};
use crate::{bytes_to_mebibytes, get_total_system_memory, spawn_with_cancel, Args, MiB};
/// Central struct that interacts with agent, dispatcher, and cgroup to handle
/// signals from the agent.
#[derive(Debug)]
pub struct Runner {
config: Config,
filecache: Option<FileCacheState>,
cgroup: Option<Arc<CgroupWatcher>>,
dispatcher: Dispatcher,
/// We "mint" new message ids by incrementing this counter and taking the value.
///
/// **Note**: This counter is always odd, so that we avoid collisions between the IDs generated
/// by us vs the autoscaler-agent.
counter: usize,
/// A signal to kill the main thread produced by `self.run()`. This is triggered
/// when the server receives a new connection. When the thread receives the
/// signal off this channel, it will gracefully shutdown.
kill: broadcast::Receiver<()>,
}
/// Configuration for a `Runner`
#[derive(Debug)]
pub struct Config {
/// `sys_buffer_bytes` gives the estimated amount of memory, in bytes, that the kernel uses before
/// handing out the rest to userspace. This value is the estimated difference between the
/// *actual* physical memory and the amount reported by `grep MemTotal /proc/meminfo`.
///
/// For more information, refer to `man 5 proc`, which defines MemTotal as "Total usable RAM
/// (i.e., physical RAM minus a few reserved bits and the kernel binary code)".
///
/// We only use `sys_buffer_bytes` when calculating the system memory from the *external* memory
/// size, rather than the self-reported memory size, according to the kernel.
///
/// TODO: this field is only necessary while we still have to trust the autoscaler-agent's
/// upscale resource amounts (because we might not *actually* have been upscaled yet). This field
/// should be removed once we have a better solution there.
sys_buffer_bytes: u64,
}
impl Default for Config {
fn default() -> Self {
Self {
sys_buffer_bytes: 100 * MiB,
}
}
}
impl Runner {
/// Create a new monitor.
#[tracing::instrument(skip_all, fields(?config, ?args))]
pub async fn new(
config: Config,
args: &Args,
ws: WebSocket,
kill: broadcast::Receiver<()>,
token: CancellationToken,
) -> anyhow::Result<Runner> {
anyhow::ensure!(
config.sys_buffer_bytes != 0,
"invalid monitor Config: sys_buffer_bytes cannot be 0"
);
// *NOTE*: the dispatcher and cgroup manager talk through these channels
// so make sure they each get the correct half, nothing is droppped, etc.
let (notified_send, notified_recv) = mpsc::channel(1);
let (requesting_send, requesting_recv) = mpsc::channel(1);
let dispatcher = Dispatcher::new(ws, notified_send, requesting_recv)
.await
.context("error creating new dispatcher")?;
let mut state = Runner {
config,
filecache: None,
cgroup: None,
dispatcher,
counter: 1, // NB: must be odd, see the comment about the field for more.
kill,
};
let mut file_cache_reserved_bytes = 0;
let mem = get_total_system_memory();
// We need to process file cache initialization before cgroup initialization, so that the memory
// allocated to the file cache is appropriately taken into account when we decide the cgroup's
// memory limits.
if let Some(connstr) = &args.pgconnstr {
info!("initializing file cache");
let config = match args.file_cache_on_disk {
true => FileCacheConfig::default_on_disk(),
false => FileCacheConfig::default_in_memory(),
};
let mut file_cache = FileCacheState::new(connstr, config, token.clone())
.await
.context("failed to create file cache")?;
let size = file_cache
.get_file_cache_size()
.await
.context("error getting file cache size")?;
let new_size = file_cache.config.calculate_cache_size(mem);
info!(
initial = bytes_to_mebibytes(size),
new = bytes_to_mebibytes(new_size),
"setting initial file cache size",
);
// note: even if size == new_size, we want to explicitly set it, just
// to make sure that we have the permissions to do so
let actual_size = file_cache
.set_file_cache_size(new_size)
.await
.context("failed to set file cache size, possibly due to inadequate permissions")?;
if actual_size != new_size {
info!("file cache size actually got set to {actual_size}")
}
// Mark the resources given to the file cache as reserved, but only if it's in memory.
if !args.file_cache_on_disk {
file_cache_reserved_bytes = actual_size;
}
state.filecache = Some(file_cache);
}
if let Some(name) = &args.cgroup {
let (mut cgroup, cgroup_event_stream) =
CgroupWatcher::new(name.clone(), requesting_send)
.context("failed to create cgroup manager")?;
let available = mem - file_cache_reserved_bytes;
cgroup
.set_memory_limits(available)
.context("failed to set cgroup memory limits")?;
let cgroup = Arc::new(cgroup);
// Some might call this . . . cgroup v2
let cgroup_clone = Arc::clone(&cgroup);
spawn_with_cancel(token, |_| error!("cgroup watcher terminated"), async move {
cgroup_clone.watch(notified_recv, cgroup_event_stream).await
});
state.cgroup = Some(cgroup);
} else {
// *NOTE*: We need to forget the sender so that its drop impl does not get ran.
// This allows us to poll it in `Monitor::run` regardless of whether we
// are managing a cgroup or not. If we don't forget it, all receives will
// immediately return an error because the sender is droped and it will
// claim all select! statements, effectively turning `Monitor::run` into
// `loop { fail to receive }`.
mem::forget(requesting_send);
}
Ok(state)
}
/// Attempt to downscale filecache + cgroup
#[tracing::instrument(skip_all, fields(?target))]
pub async fn try_downscale(&mut self, target: Resources) -> anyhow::Result<(bool, String)> {
// Nothing to adjust
if self.cgroup.is_none() && self.filecache.is_none() {
info!("no action needed for downscale (no cgroup or file cache enabled)");
return Ok((
true,
"monitor is not managing cgroup or file cache".to_string(),
));
}
let requested_mem = target.mem;
let usable_system_memory = requested_mem.saturating_sub(self.config.sys_buffer_bytes);
let expected_file_cache_mem_usage = self
.filecache
.as_ref()
.map(|file_cache| file_cache.config.calculate_cache_size(usable_system_memory))
.unwrap_or(0);
let mut new_cgroup_mem_high = 0;
if let Some(cgroup) = &self.cgroup {
new_cgroup_mem_high = cgroup
.config
.calculate_memory_high_value(usable_system_memory - expected_file_cache_mem_usage);
let current = cgroup
.current_memory_usage()
.context("failed to fetch cgroup memory")?;
if new_cgroup_mem_high < current + cgroup.config.memory_high_buffer_bytes {
let status = format!(
"{}: {} MiB (new high) < {} (current usage) + {} (buffer)",
"calculated memory.high too low",
bytes_to_mebibytes(new_cgroup_mem_high),
bytes_to_mebibytes(current),
bytes_to_mebibytes(cgroup.config.memory_high_buffer_bytes)
);
info!(status, "discontinuing downscale");
return Ok((false, status));
}
}
// The downscaling has been approved. Downscale the file cache, then the cgroup.
let mut status = vec![];
let mut file_cache_mem_usage = 0;
if let Some(file_cache) = &mut self.filecache {
let actual_usage = file_cache
.set_file_cache_size(expected_file_cache_mem_usage)
.await
.context("failed to set file cache size")?;
if file_cache.config.in_memory {
file_cache_mem_usage = actual_usage;
}
let message = format!(
"set file cache size to {} MiB (in memory = {})",
bytes_to_mebibytes(actual_usage),
file_cache.config.in_memory,
);
info!("downscale: {message}");
status.push(message);
}
if let Some(cgroup) = &self.cgroup {
let available_memory = usable_system_memory - file_cache_mem_usage;
if file_cache_mem_usage != expected_file_cache_mem_usage {
new_cgroup_mem_high = cgroup.config.calculate_memory_high_value(available_memory);
}
let limits = MemoryLimits::new(
// new_cgroup_mem_high is initialized to 0 but it is guarancontextd to not be here
// since it is properly initialized in the previous cgroup if let block
new_cgroup_mem_high,
available_memory,
);
cgroup
.set_limits(&limits)
.context("failed to set cgroup memory limits")?;
let message = format!(
"set cgroup memory.high to {} MiB, of new max {} MiB",
bytes_to_mebibytes(new_cgroup_mem_high),
bytes_to_mebibytes(available_memory)
);
info!("downscale: {message}");
status.push(message);
}
// TODO: make this status thing less jank
let status = status.join("; ");
Ok((true, status))
}
/// Handle new resources
#[tracing::instrument(skip_all, fields(?resources))]
pub async fn handle_upscale(&mut self, resources: Resources) -> anyhow::Result<()> {
if self.filecache.is_none() && self.cgroup.is_none() {
info!("no action needed for upscale (no cgroup or file cache enabled)");
return Ok(());
}
let new_mem = resources.mem;
let usable_system_memory = new_mem.saturating_sub(self.config.sys_buffer_bytes);
// Get the file cache's expected contribution to the memory usage
let mut file_cache_mem_usage = 0;
if let Some(file_cache) = &mut self.filecache {
let expected_usage = file_cache.config.calculate_cache_size(usable_system_memory);
info!(
target = bytes_to_mebibytes(expected_usage),
total = bytes_to_mebibytes(new_mem),
"updating file cache size",
);
let actual_usage = file_cache
.set_file_cache_size(expected_usage)
.await
.context("failed to set file cache size")?;
if file_cache.config.in_memory {
file_cache_mem_usage = actual_usage;
}
if actual_usage != expected_usage {
warn!(
"file cache was set to a different size that we wanted: target = {} Mib, actual= {} Mib",
bytes_to_mebibytes(expected_usage),
bytes_to_mebibytes(actual_usage)
)
}
}
if let Some(cgroup) = &self.cgroup {
let available_memory = usable_system_memory - file_cache_mem_usage;
let new_cgroup_mem_high = cgroup.config.calculate_memory_high_value(available_memory);
info!(
target = bytes_to_mebibytes(new_cgroup_mem_high),
total = bytes_to_mebibytes(new_mem),
name = cgroup.path(),
"updating cgroup memory.high",
);
let limits = MemoryLimits::new(new_cgroup_mem_high, available_memory);
cgroup
.set_limits(&limits)
.context("failed to set file cache size")?;
}
Ok(())
}
/// Take in a message and perform some action, such as downscaling or upscaling,
/// and return a message to be send back.
#[tracing::instrument(skip_all, fields(%id, message = ?inner))]
pub async fn process_message(
&mut self,
InboundMsg { inner, id }: InboundMsg,
) -> anyhow::Result<Option<OutboundMsg>> {
match inner {
InboundMsgKind::UpscaleNotification { granted } => {
self.handle_upscale(granted)
.await
.context("failed to handle upscale")?;
self.dispatcher
.notify_upscale(Sequenced::new(granted))
.await
.context("failed to notify notify cgroup of upscale")?;
Ok(Some(OutboundMsg::new(
OutboundMsgKind::UpscaleConfirmation {},
id,
)))
}
InboundMsgKind::DownscaleRequest { target } => self
.try_downscale(target)
.await
.context("failed to downscale")
.map(|(ok, status)| {
Some(OutboundMsg::new(
OutboundMsgKind::DownscaleResult { ok, status },
id,
))
}),
InboundMsgKind::InvalidMessage { error } => {
warn!(
%error, id, "received notification of an invalid message we sent"
);
Ok(None)
}
InboundMsgKind::InternalError { error } => {
warn!(error, id, "agent experienced an internal error");
Ok(None)
}
InboundMsgKind::HealthCheck {} => {
Ok(Some(OutboundMsg::new(OutboundMsgKind::HealthCheck {}, id)))
}
}
}
// TODO: don't propagate errors, probably just warn!?
#[tracing::instrument(skip_all)]
pub async fn run(&mut self) -> anyhow::Result<()> {
info!("starting dispatcher");
loop {
tokio::select! {
signal = self.kill.recv() => {
match signal {
Ok(()) => return Ok(()),
Err(e) => bail!("failed to receive kill signal: {e}")
}
}
// we need to propagate an upscale request
request = self.dispatcher.request_upscale_events.recv() => {
if request.is_none() {
bail!("failed to listen for upscale event from cgroup")
}
info!("cgroup asking for upscale; forwarding request");
self.counter += 2; // Increment, preserving parity (i.e. keep the
// counter odd). See the field comment for more.
self.dispatcher
.send(OutboundMsg::new(OutboundMsgKind::UpscaleRequest {}, self.counter))
.await
.context("failed to send message")?;
}
// there is a message from the agent
msg = self.dispatcher.source.next() => {
if let Some(msg) = msg {
// Don't use 'message' as a key as the string also uses
// that for its key
info!(?msg, "received message");
match msg {
Ok(msg) => {
let message: InboundMsg = match msg {
Message::Text(text) => {
serde_json::from_str(&text).context("failed to deserialize text message")?
}
other => {
warn!(
// Don't use 'message' as a key as the
// string also uses that for its key
msg = ?other,
"agent should only send text messages but received different type"
);
continue
},
};
let out = match self.process_message(message.clone()).await {
Ok(Some(out)) => out,
Ok(None) => continue,
Err(e) => {
let error = e.to_string();
warn!(?error, "error handling message");
OutboundMsg::new(
OutboundMsgKind::InternalError {
error
},
message.id
)
}
};
self.dispatcher
.send(out)
.await
.context("failed to send message")?;
}
Err(e) => warn!("{e}"),
}
} else {
anyhow::bail!("dispatcher connection closed")
}
}
}
}
}
}

View File

@@ -35,6 +35,8 @@ humantime-serde.workspace = true
hyper.workspace = true
itertools.workspace = true
nix.workspace = true
# hack to get the number of worker threads tokio uses
num_cpus = { version = "1.15" }
num-traits.workspace = true
once_cell.workspace = true
pin-project-lite.workspace = true
@@ -49,6 +51,7 @@ serde.workspace = true
serde_json = { workspace = true, features = ["raw_value"] }
serde_with.workspace = true
signal-hook.workspace = true
smallvec = { workspace = true, features = ["write"] }
svg_fmt.workspace = true
sync_wrapper.workspace = true
tokio-tar.workspace = true

View File

@@ -215,7 +215,6 @@ fn bench_sequential(c: &mut Criterion) {
TimelineId::generate(),
zero.add(10 * i32)..zero.add(10 * i32 + 1),
Lsn(i),
false,
0,
);
updates.insert_historic(layer);

View File

@@ -13,6 +13,7 @@ clap = { workspace = true, features = ["string"] }
git-version.workspace = true
pageserver = { path = ".." }
postgres_ffi.workspace = true
tokio.workspace = true
utils.workspace = true
svg_fmt.workspace = true
workspace_hack.workspace = true

View File

@@ -23,6 +23,7 @@
//! <https://grafana.com/tutorials/build-a-panel-plugin/>
use anyhow::Result;
use pageserver::repository::Key;
use pageserver::METADATA_FILE_NAME;
use std::cmp::Ordering;
use std::io::{self, BufRead};
use std::path::PathBuf;
@@ -71,6 +72,10 @@ pub fn main() -> Result<()> {
let line = PathBuf::from_str(&line).unwrap();
let filename = line.file_name().unwrap();
let filename = filename.to_str().unwrap();
if filename == METADATA_FILE_NAME {
// Don't try and parse "metadata" like a key-lsn range
continue;
}
let range = parse_filename(filename);
ranges.push(range);
}

View File

@@ -10,7 +10,7 @@ use std::{fs, path::Path, str};
use pageserver::page_cache::PAGE_SZ;
use pageserver::repository::{Key, KEY_SIZE};
use pageserver::tenant::block_io::{BlockReader, FileBlockReader};
use pageserver::tenant::block_io::FileBlockReader;
use pageserver::tenant::disk_btree::{DiskBtreeReader, VisitDirection};
use pageserver::tenant::storage_layer::delta_layer::{Summary, DELTA_KEY_SIZE};
use pageserver::tenant::storage_layer::range_overlaps;
@@ -95,9 +95,9 @@ pub(crate) fn parse_filename(name: &str) -> Option<LayerFile> {
}
// Finds the max_holes largest holes, ignoring any that are smaller than MIN_HOLE_LENGTH"
fn get_holes(path: &Path, max_holes: usize) -> Result<Vec<Hole>> {
async fn get_holes(path: &Path, max_holes: usize) -> Result<Vec<Hole>> {
let file = FileBlockReader::new(VirtualFile::open(path)?);
let summary_blk = file.read_blk(0)?;
let summary_blk = file.read_blk(0).await?;
let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;
let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
actual_summary.index_start_blk,
@@ -107,29 +107,31 @@ fn get_holes(path: &Path, max_holes: usize) -> Result<Vec<Hole>> {
// min-heap (reserve space for one more element added before eviction)
let mut heap: BinaryHeap<Hole> = BinaryHeap::with_capacity(max_holes + 1);
let mut prev_key: Option<Key> = None;
tree_reader.visit(
&[0u8; DELTA_KEY_SIZE],
VisitDirection::Forwards,
|key, _value| {
let curr = Key::from_slice(&key[..KEY_SIZE]);
if let Some(prev) = prev_key {
if curr.to_i128() - prev.to_i128() >= MIN_HOLE_LENGTH {
heap.push(Hole(prev..curr));
if heap.len() > max_holes {
heap.pop(); // remove smallest hole
tree_reader
.visit(
&[0u8; DELTA_KEY_SIZE],
VisitDirection::Forwards,
|key, _value| {
let curr = Key::from_slice(&key[..KEY_SIZE]);
if let Some(prev) = prev_key {
if curr.to_i128() - prev.to_i128() >= MIN_HOLE_LENGTH {
heap.push(Hole(prev..curr));
if heap.len() > max_holes {
heap.pop(); // remove smallest hole
}
}
}
}
prev_key = Some(curr.next());
true
},
)?;
prev_key = Some(curr.next());
true
},
)
.await?;
let mut holes = heap.into_vec();
holes.sort_by_key(|hole| hole.0.start);
Ok(holes)
}
pub(crate) fn main(cmd: &AnalyzeLayerMapCmd) -> Result<()> {
pub(crate) async fn main(cmd: &AnalyzeLayerMapCmd) -> Result<()> {
let storage_path = &cmd.path;
let max_holes = cmd.max_holes.unwrap_or(DEFAULT_MAX_HOLES);
@@ -160,7 +162,7 @@ pub(crate) fn main(cmd: &AnalyzeLayerMapCmd) -> Result<()> {
parse_filename(&layer.file_name().into_string().unwrap())
{
if layer_file.is_delta {
layer_file.holes = get_holes(&layer.path(), max_holes)?;
layer_file.holes = get_holes(&layer.path(), max_holes).await?;
n_deltas += 1;
}
layers.push(layer_file);

View File

@@ -43,15 +43,12 @@ pub(crate) enum LayerCmd {
},
}
fn read_delta_file(path: impl AsRef<Path>) -> Result<()> {
use pageserver::tenant::blob_io::BlobCursor;
use pageserver::tenant::block_io::BlockReader;
async fn read_delta_file(path: impl AsRef<Path>) -> Result<()> {
let path = path.as_ref();
virtual_file::init(10);
page_cache::init(100);
let file = FileBlockReader::new(VirtualFile::open(path)?);
let summary_blk = file.read_blk(0)?;
let summary_blk = file.read_blk(0).await?;
let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;
let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
actual_summary.index_start_blk,
@@ -60,25 +57,27 @@ fn read_delta_file(path: impl AsRef<Path>) -> Result<()> {
);
// TODO(chi): dedup w/ `delta_layer.rs` by exposing the API.
let mut all = vec![];
tree_reader.visit(
&[0u8; DELTA_KEY_SIZE],
VisitDirection::Forwards,
|key, value_offset| {
let curr = Key::from_slice(&key[..KEY_SIZE]);
all.push((curr, BlobRef(value_offset)));
true
},
)?;
let mut cursor = BlockCursor::new(&file);
tree_reader
.visit(
&[0u8; DELTA_KEY_SIZE],
VisitDirection::Forwards,
|key, value_offset| {
let curr = Key::from_slice(&key[..KEY_SIZE]);
all.push((curr, BlobRef(value_offset)));
true
},
)
.await?;
let cursor = BlockCursor::new_fileblockreader_virtual(&file);
for (k, v) in all {
let value = cursor.read_blob(v.pos())?;
let value = cursor.read_blob(v.pos()).await?;
println!("key:{} value_len:{}", k, value.len());
}
// TODO(chi): special handling for last key?
Ok(())
}
pub(crate) fn main(cmd: &LayerCmd) -> Result<()> {
pub(crate) async fn main(cmd: &LayerCmd) -> Result<()> {
match cmd {
LayerCmd::List { path } => {
for tenant in fs::read_dir(path.join("tenants"))? {
@@ -153,7 +152,7 @@ pub(crate) fn main(cmd: &LayerCmd) -> Result<()> {
);
if layer_file.is_delta {
read_delta_file(layer.path())?;
read_delta_file(layer.path()).await?;
} else {
anyhow::bail!("not supported yet :(");
}

View File

@@ -72,12 +72,13 @@ struct AnalyzeLayerMapCmd {
max_holes: Option<usize>,
}
fn main() -> anyhow::Result<()> {
#[tokio::main]
async fn main() -> anyhow::Result<()> {
let cli = CliOpts::parse();
match cli.command {
Commands::Layer(cmd) => {
layers::main(&cmd)?;
layers::main(&cmd).await?;
}
Commands::Metadata(cmd) => {
handle_metadata(&cmd)?;
@@ -86,7 +87,7 @@ fn main() -> anyhow::Result<()> {
draw_timeline_dir::main()?;
}
Commands::AnalyzeLayerMap(cmd) => {
layer_map_analyzer::main(&cmd)?;
layer_map_analyzer::main(&cmd).await?;
}
Commands::PrintLayerFile(cmd) => {
if let Err(e) = read_pg_control_file(&cmd.path) {
@@ -94,7 +95,7 @@ fn main() -> anyhow::Result<()> {
"Failed to read input file as a pg control one: {e:#}\n\
Attempting to read it as layer file"
);
print_layerfile(&cmd.path)?;
print_layerfile(&cmd.path).await?;
}
}
};
@@ -113,12 +114,12 @@ fn read_pg_control_file(control_file_path: &Path) -> anyhow::Result<()> {
Ok(())
}
fn print_layerfile(path: &Path) -> anyhow::Result<()> {
async fn print_layerfile(path: &Path) -> anyhow::Result<()> {
// Basic initialization of things that don't change after startup
virtual_file::init(10);
page_cache::init(100);
let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
dump_layerfile_from_path(path, true, &ctx)
dump_layerfile_from_path(path, true, &ctx).await
}
fn handle_metadata(

View File

@@ -6,11 +6,14 @@ use std::{env, ops::ControlFlow, path::Path, str::FromStr};
use anyhow::{anyhow, Context};
use clap::{Arg, ArgAction, Command};
use fail::FailScenario;
use metrics::launch_timestamp::{set_launch_timestamp_metric, LaunchTimestamp};
use pageserver::disk_usage_eviction_task::{self, launch_disk_usage_global_eviction_task};
use pageserver::metrics::{STARTUP_DURATION, STARTUP_IS_LOADING};
use pageserver::task_mgr::WALRECEIVER_RUNTIME;
use pageserver::tenant::TenantSharedResources;
use remote_storage::GenericRemoteStorage;
use tokio::time::Instant;
use tracing::*;
use metrics::set_build_info_metric;
@@ -38,8 +41,6 @@ const PID_FILE_NAME: &str = "pageserver.pid";
const FEATURES: &[&str] = &[
#[cfg(feature = "testing")]
"testing",
#[cfg(feature = "fail/failpoints")]
"fail/failpoints",
];
fn version() -> String {
@@ -121,7 +122,7 @@ fn main() -> anyhow::Result<()> {
}
// Initialize up failpoints support
let scenario = FailScenario::setup();
let scenario = pageserver::failpoint_support::init();
// Basic initialization of things that don't change after startup
virtual_file::init(conf.max_file_descriptors);
@@ -226,6 +227,19 @@ fn start_pageserver(
launch_ts: &'static LaunchTimestamp,
conf: &'static PageServerConf,
) -> anyhow::Result<()> {
// Monotonic time for later calculating startup duration
let started_startup_at = Instant::now();
let startup_checkpoint = move |phase: &str, human_phase: &str| {
let elapsed = started_startup_at.elapsed();
let secs = elapsed.as_secs_f64();
STARTUP_DURATION.with_label_values(&[phase]).set(secs);
info!(
elapsed_ms = elapsed.as_millis(),
"{human_phase} ({secs:.3}s since start)"
)
};
// Print version and launch timestamp to the log,
// and expose them as prometheus metrics.
// A changed version string indicates changed software.
@@ -335,6 +349,11 @@ fn start_pageserver(
// Set up remote storage client
let remote_storage = create_remote_storage_client(conf)?;
// Up to this point no significant I/O has been done: this should have been fast. Record
// duration prior to starting I/O intensive phase of startup.
startup_checkpoint("initial", "Starting loading tenants");
STARTUP_IS_LOADING.set(1);
// Startup staging or optimizing:
//
// We want to minimize downtime for `page_service` connections, and trying not to overload
@@ -355,18 +374,19 @@ fn start_pageserver(
let order = pageserver::InitializationOrder {
initial_tenant_load: Some(init_done_tx),
initial_logical_size_can_start: init_done_rx.clone(),
initial_logical_size_attempt: init_logical_size_done_tx,
initial_logical_size_attempt: Some(init_logical_size_done_tx),
background_jobs_can_start: background_jobs_barrier.clone(),
};
// Scan the local 'tenants/' directory and start loading the tenants
let init_started_at = std::time::Instant::now();
let shutdown_pageserver = tokio_util::sync::CancellationToken::new();
BACKGROUND_RUNTIME.block_on(mgr::init_tenant_mgr(
conf,
broker_client.clone(),
remote_storage.clone(),
TenantSharedResources {
broker_client: broker_client.clone(),
remote_storage: remote_storage.clone(),
},
order,
))?;
@@ -378,18 +398,13 @@ fn start_pageserver(
let guard = scopeguard::guard_on_success((), |_| tracing::info!("Cancelled before initial load completed"));
init_done_rx.wait().await;
startup_checkpoint("initial_tenant_load", "Initial load completed");
STARTUP_IS_LOADING.set(0);
// initial logical sizes can now start, as they were waiting on init_done_rx.
scopeguard::ScopeGuard::into_inner(guard);
let init_done = std::time::Instant::now();
let elapsed = init_done - init_started_at;
tracing::info!(
elapsed_millis = elapsed.as_millis(),
"Initial load completed"
);
let mut init_sizes_done = std::pin::pin!(init_logical_size_done_rx.wait());
let timeout = conf.background_task_maximum_delay;
@@ -398,12 +413,7 @@ fn start_pageserver(
let init_sizes_done = match tokio::time::timeout(timeout, &mut init_sizes_done).await {
Ok(_) => {
let now = std::time::Instant::now();
tracing::info!(
from_init_done_millis = (now - init_done).as_millis(),
from_init_millis = (now - init_started_at).as_millis(),
"Initial logical sizes completed"
);
startup_checkpoint("initial_logical_sizes", "Initial logical sizes completed");
None
}
Err(_) => {
@@ -419,6 +429,7 @@ fn start_pageserver(
// allow background jobs to start
drop(background_jobs_can_start);
startup_checkpoint("background_jobs_can_start", "Starting background jobs");
if let Some(init_sizes_done) = init_sizes_done {
// ending up here is not a bug; at the latest logical sizes will be queried by
@@ -428,14 +439,11 @@ fn start_pageserver(
scopeguard::ScopeGuard::into_inner(guard);
let now = std::time::Instant::now();
tracing::info!(
from_init_done_millis = (now - init_done).as_millis(),
from_init_millis = (now - init_started_at).as_millis(),
"Initial logical sizes completed after timeout (background jobs already started)"
);
startup_checkpoint("initial_logical_sizes", "Initial logical sizes completed after timeout (background jobs already started)");
}
startup_checkpoint("complete", "Startup complete");
};
async move {

View File

@@ -31,9 +31,12 @@ use utils::{
use crate::disk_usage_eviction_task::DiskUsageEvictionTaskConfig;
use crate::tenant::config::TenantConf;
use crate::tenant::config::TenantConfOpt;
use crate::tenant::{TENANT_ATTACHING_MARKER_FILENAME, TIMELINES_SEGMENT_NAME};
use crate::tenant::{
TENANT_ATTACHING_MARKER_FILENAME, TENANT_DELETED_MARKER_FILE_NAME, TIMELINES_SEGMENT_NAME,
};
use crate::{
IGNORED_TENANT_FILE_NAME, METADATA_FILE_NAME, TENANT_CONFIG_NAME, TIMELINE_UNINIT_MARK_SUFFIX,
IGNORED_TENANT_FILE_NAME, METADATA_FILE_NAME, TENANT_CONFIG_NAME, TIMELINE_DELETE_MARK_SUFFIX,
TIMELINE_UNINIT_MARK_SUFFIX,
};
pub mod defaults {
@@ -601,6 +604,22 @@ impl PageServerConf {
)
}
pub fn timeline_delete_mark_file_path(
&self,
tenant_id: TenantId,
timeline_id: TimelineId,
) -> PathBuf {
path_with_suffix_extension(
self.timeline_path(&tenant_id, &timeline_id),
TIMELINE_DELETE_MARK_SUFFIX,
)
}
pub fn tenant_deleted_mark_file_path(&self, tenant_id: &TenantId) -> PathBuf {
self.tenant_path(tenant_id)
.join(TENANT_DELETED_MARKER_FILE_NAME)
}
pub fn traces_path(&self) -> PathBuf {
self.workdir.join("traces")
}
@@ -624,23 +643,6 @@ impl PageServerConf {
.join(METADATA_FILE_NAME)
}
/// Files on the remote storage are stored with paths, relative to the workdir.
/// That path includes in itself both tenant and timeline ids, allowing to have a unique remote storage path.
///
/// Errors if the path provided does not start from pageserver's workdir.
pub fn remote_path(&self, local_path: &Path) -> anyhow::Result<RemotePath> {
local_path
.strip_prefix(&self.workdir)
.context("Failed to strip workdir prefix")
.and_then(RemotePath::new)
.with_context(|| {
format!(
"Failed to resolve remote part of path {:?} for base {:?}",
local_path, self.workdir
)
})
}
/// Turns storage remote path of a file into its local path.
pub fn local_path(&self, remote_path: &RemotePath) -> PathBuf {
remote_path.with_base(&self.workdir)

View File

@@ -7,27 +7,23 @@ use crate::context::{DownloadBehavior, RequestContext};
use crate::task_mgr::{self, TaskKind, BACKGROUND_RUNTIME};
use crate::tenant::{mgr, LogicalSizeCalculationCause};
use anyhow;
use chrono::Utc;
use chrono::{DateTime, Utc};
use consumption_metrics::{idempotency_key, Event, EventChunk, EventType, CHUNK_SIZE};
use pageserver_api::models::TenantState;
use reqwest::Url;
use serde::Serialize;
use serde_with::{serde_as, DisplayFromStr};
use std::collections::HashMap;
use std::time::Duration;
use std::sync::Arc;
use std::time::{Duration, SystemTime};
use tracing::*;
use utils::id::{NodeId, TenantId, TimelineId};
const WRITTEN_SIZE: &str = "written_size";
const SYNTHETIC_STORAGE_SIZE: &str = "synthetic_storage_size";
const RESIDENT_SIZE: &str = "resident_size";
const REMOTE_STORAGE_SIZE: &str = "remote_storage_size";
const TIMELINE_LOGICAL_SIZE: &str = "timeline_logical_size";
use utils::lsn::Lsn;
const DEFAULT_HTTP_REPORTING_TIMEOUT: Duration = Duration::from_secs(60);
#[serde_as]
#[derive(Serialize, Debug)]
#[derive(Serialize, Debug, Clone, Copy)]
struct Ids {
#[serde_as(as = "DisplayFromStr")]
tenant_id: TenantId,
@@ -38,10 +34,142 @@ struct Ids {
/// Key that uniquely identifies the object, this metric describes.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PageserverConsumptionMetricsKey {
pub tenant_id: TenantId,
pub timeline_id: Option<TimelineId>,
pub metric: &'static str,
struct MetricsKey {
tenant_id: TenantId,
timeline_id: Option<TimelineId>,
metric: &'static str,
}
impl MetricsKey {
const fn absolute_values(self) -> AbsoluteValueFactory {
AbsoluteValueFactory(self)
}
const fn incremental_values(self) -> IncrementalValueFactory {
IncrementalValueFactory(self)
}
}
/// Helper type which each individual metric kind can return to produce only absolute values.
struct AbsoluteValueFactory(MetricsKey);
impl AbsoluteValueFactory {
fn at(self, time: DateTime<Utc>, val: u64) -> (MetricsKey, (EventType, u64)) {
let key = self.0;
(key, (EventType::Absolute { time }, val))
}
}
/// Helper type which each individual metric kind can return to produce only incremental values.
struct IncrementalValueFactory(MetricsKey);
impl IncrementalValueFactory {
#[allow(clippy::wrong_self_convention)]
fn from_previous_up_to(
self,
prev_end: DateTime<Utc>,
up_to: DateTime<Utc>,
val: u64,
) -> (MetricsKey, (EventType, u64)) {
let key = self.0;
// cannot assert prev_end < up_to because these are realtime clock based
(
key,
(
EventType::Incremental {
start_time: prev_end,
stop_time: up_to,
},
val,
),
)
}
fn key(&self) -> &MetricsKey {
&self.0
}
}
// the static part of a MetricsKey
impl MetricsKey {
/// Absolute value of [`Timeline::get_last_record_lsn`].
///
/// [`Timeline::get_last_record_lsn`]: crate::tenant::Timeline::get_last_record_lsn
const fn written_size(tenant_id: TenantId, timeline_id: TimelineId) -> AbsoluteValueFactory {
MetricsKey {
tenant_id,
timeline_id: Some(timeline_id),
metric: "written_size",
}
.absolute_values()
}
/// Values will be the difference of the latest [`MetricsKey::written_size`] to what we
/// previously sent, starting from the previously sent incremental time range ending at the
/// latest absolute measurement.
const fn written_size_delta(
tenant_id: TenantId,
timeline_id: TimelineId,
) -> IncrementalValueFactory {
MetricsKey {
tenant_id,
timeline_id: Some(timeline_id),
// the name here is correctly about data not size, because that is what is wanted by
// downstream pipeline
metric: "written_data_bytes_delta",
}
.incremental_values()
}
/// Exact [`Timeline::get_current_logical_size`].
///
/// [`Timeline::get_current_logical_size`]: crate::tenant::Timeline::get_current_logical_size
const fn timeline_logical_size(
tenant_id: TenantId,
timeline_id: TimelineId,
) -> AbsoluteValueFactory {
MetricsKey {
tenant_id,
timeline_id: Some(timeline_id),
metric: "timeline_logical_size",
}
.absolute_values()
}
/// [`Tenant::remote_size`]
///
/// [`Tenant::remote_size`]: crate::tenant::Tenant::remote_size
const fn remote_storage_size(tenant_id: TenantId) -> AbsoluteValueFactory {
MetricsKey {
tenant_id,
timeline_id: None,
metric: "remote_storage_size",
}
.absolute_values()
}
/// Sum of [`Timeline::resident_physical_size`] for each `Tenant`.
///
/// [`Timeline::resident_physical_size`]: crate::tenant::Timeline::resident_physical_size
const fn resident_size(tenant_id: TenantId) -> AbsoluteValueFactory {
MetricsKey {
tenant_id,
timeline_id: None,
metric: "resident_size",
}
.absolute_values()
}
/// [`Tenant::cached_synthetic_size`] as refreshed by [`calculate_synthetic_size_worker`].
///
/// [`Tenant::cached_synthetic_size`]: crate::tenant::Tenant::cached_synthetic_size
const fn synthetic_size(tenant_id: TenantId) -> AbsoluteValueFactory {
MetricsKey {
tenant_id,
timeline_id: None,
metric: "synthetic_storage_size",
}
.absolute_values()
}
}
/// Main thread that serves metrics collection
@@ -79,7 +207,7 @@ pub async fn collect_metrics(
.timeout(DEFAULT_HTTP_REPORTING_TIMEOUT)
.build()
.expect("Failed to create http client with timeout");
let mut cached_metrics: HashMap<PageserverConsumptionMetricsKey, u64> = HashMap::new();
let mut cached_metrics = HashMap::new();
let mut prev_iteration_time: std::time::Instant = std::time::Instant::now();
loop {
@@ -119,15 +247,15 @@ pub async fn collect_metrics(
///
/// TODO
/// - refactor this function (chunking+sending part) to reuse it in proxy module;
pub async fn collect_metrics_iteration(
async fn collect_metrics_iteration(
client: &reqwest::Client,
cached_metrics: &mut HashMap<PageserverConsumptionMetricsKey, u64>,
cached_metrics: &mut HashMap<MetricsKey, (EventType, u64)>,
metric_collection_endpoint: &reqwest::Url,
node_id: NodeId,
ctx: &RequestContext,
send_cached: bool,
) {
let mut current_metrics: Vec<(PageserverConsumptionMetricsKey, u64)> = Vec::new();
let mut current_metrics: Vec<(MetricsKey, (EventType, u64))> = Vec::new();
trace!(
"starting collect_metrics_iteration. metric_collection_endpoint: {}",
metric_collection_endpoint
@@ -161,99 +289,65 @@ pub async fn collect_metrics_iteration(
let mut tenant_resident_size = 0;
// iterate through list of timelines in tenant
for timeline in tenant.list_timelines().iter() {
for timeline in tenant.list_timelines() {
// collect per-timeline metrics only for active timelines
if timeline.is_active() {
let timeline_written_size = u64::from(timeline.get_last_record_lsn());
current_metrics.push((
PageserverConsumptionMetricsKey {
let timeline_id = timeline.timeline_id;
match TimelineSnapshot::collect(&timeline, ctx) {
Ok(Some(snap)) => {
snap.to_metrics(
tenant_id,
timeline_id: Some(timeline.timeline_id),
metric: WRITTEN_SIZE,
},
timeline_written_size,
));
let span = info_span!("collect_metrics_iteration", tenant_id = %timeline.tenant_id, timeline_id = %timeline.timeline_id);
match span.in_scope(|| timeline.get_current_logical_size(ctx)) {
// Only send timeline logical size when it is fully calculated.
Ok((size, is_exact)) if is_exact => {
current_metrics.push((
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: Some(timeline.timeline_id),
metric: TIMELINE_LOGICAL_SIZE,
},
size,
));
}
Ok((_, _)) => {}
Err(err) => {
error!(
"failed to get current logical size for timeline {}: {err:?}",
timeline.timeline_id
);
continue;
}
};
timeline_id,
Utc::now(),
&mut current_metrics,
cached_metrics,
);
}
Ok(None) => {}
Err(e) => {
error!(
"failed to get metrics values for tenant {tenant_id} timeline {}: {e:#?}",
timeline.timeline_id
);
continue;
}
}
let timeline_resident_size = timeline.get_resident_physical_size();
tenant_resident_size += timeline_resident_size;
tenant_resident_size += timeline.resident_physical_size();
}
match tenant.get_remote_size().await {
Ok(tenant_remote_size) => {
current_metrics.push((
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: None,
metric: REMOTE_STORAGE_SIZE,
},
tenant_remote_size,
));
}
Err(err) => {
error!(
"failed to get remote size for tenant {}: {err:?}",
tenant_id
);
}
}
current_metrics
.push(MetricsKey::remote_storage_size(tenant_id).at(Utc::now(), tenant.remote_size()));
current_metrics.push((
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: None,
metric: RESIDENT_SIZE,
},
tenant_resident_size,
));
current_metrics
.push(MetricsKey::resident_size(tenant_id).at(Utc::now(), tenant_resident_size));
// Note that this metric is calculated in a separate bgworker
// Here we only use cached value, which may lag behind the real latest one
let tenant_synthetic_size = tenant.get_cached_synthetic_size();
let synthetic_size = tenant.cached_synthetic_size();
if tenant_synthetic_size != 0 {
if synthetic_size != 0 {
// only send non-zeroes because otherwise these show up as errors in logs
current_metrics.push((
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: None,
metric: SYNTHETIC_STORAGE_SIZE,
},
tenant_synthetic_size,
));
current_metrics
.push(MetricsKey::synthetic_size(tenant_id).at(Utc::now(), synthetic_size));
}
}
// Filter metrics, unless we want to send all metrics, including cached ones.
// See: https://github.com/neondatabase/neon/issues/3485
if !send_cached {
current_metrics.retain(|(curr_key, curr_val)| match cached_metrics.get(curr_key) {
Some(val) => val != curr_val,
None => true,
current_metrics.retain(|(curr_key, (kind, curr_val))| {
if kind.is_incremental() {
// incremental values (currently only written_size_delta) should not get any cache
// deduplication because they will be used by upstream for "is still alive."
true
} else {
match cached_metrics.get(curr_key) {
Some((_, val)) => val != curr_val,
None => true,
}
}
});
}
@@ -268,14 +362,16 @@ pub async fn collect_metrics_iteration(
let mut chunk_to_send: Vec<Event<Ids>> = Vec::with_capacity(CHUNK_SIZE);
let node_id = node_id.to_string();
for chunk in chunks {
chunk_to_send.clear();
// enrich metrics with type,timestamp and idempotency key before sending
chunk_to_send.extend(chunk.iter().map(|(curr_key, curr_val)| Event {
kind: EventType::Absolute { time: Utc::now() },
chunk_to_send.extend(chunk.iter().map(|(curr_key, (when, curr_val))| Event {
kind: *when,
metric: curr_key.metric,
idempotency_key: idempotency_key(node_id.to_string()),
idempotency_key: idempotency_key(&node_id),
value: *curr_val,
extra: Ids {
tenant_id: curr_key.tenant_id,
@@ -283,17 +379,14 @@ pub async fn collect_metrics_iteration(
},
}));
let chunk_json = serde_json::value::to_raw_value(&EventChunk {
events: &chunk_to_send,
})
.expect("PageserverConsumptionMetric should not fail serialization");
const MAX_RETRIES: u32 = 3;
for attempt in 0..MAX_RETRIES {
let res = client
.post(metric_collection_endpoint.clone())
.json(&chunk_json)
.json(&EventChunk {
events: (&chunk_to_send).into(),
})
.send()
.await;
@@ -329,6 +422,130 @@ pub async fn collect_metrics_iteration(
}
}
/// Internal type to make timeline metric production testable.
///
/// As this value type contains all of the information needed from a timeline to produce the
/// metrics, it can easily be created with different values in test.
struct TimelineSnapshot {
loaded_at: (Lsn, SystemTime),
last_record_lsn: Lsn,
current_exact_logical_size: Option<u64>,
}
impl TimelineSnapshot {
/// Collect the metrics from an actual timeline.
///
/// Fails currently only when [`Timeline::get_current_logical_size`] fails.
///
/// [`Timeline::get_current_logical_size`]: crate::tenant::Timeline::get_current_logical_size
fn collect(
t: &Arc<crate::tenant::Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<Option<Self>> {
use anyhow::Context;
if !t.is_active() {
// no collection for broken or stopping needed, we will still keep the cached values
// though at the caller.
Ok(None)
} else {
let loaded_at = t.loaded_at;
let last_record_lsn = t.get_last_record_lsn();
let current_exact_logical_size = {
let span = info_span!("collect_metrics_iteration", tenant_id = %t.tenant_id, timeline_id = %t.timeline_id);
let res = span
.in_scope(|| t.get_current_logical_size(ctx))
.context("get_current_logical_size");
match res? {
// Only send timeline logical size when it is fully calculated.
(size, is_exact) if is_exact => Some(size),
(_, _) => None,
}
};
Ok(Some(TimelineSnapshot {
loaded_at,
last_record_lsn,
current_exact_logical_size,
}))
}
}
/// Produce the timeline consumption metrics into the `metrics` argument.
fn to_metrics(
&self,
tenant_id: TenantId,
timeline_id: TimelineId,
now: DateTime<Utc>,
metrics: &mut Vec<(MetricsKey, (EventType, u64))>,
cache: &HashMap<MetricsKey, (EventType, u64)>,
) {
let timeline_written_size = u64::from(self.last_record_lsn);
let (key, written_size_now) =
MetricsKey::written_size(tenant_id, timeline_id).at(now, timeline_written_size);
// last_record_lsn can only go up, right now at least, TODO: #2592 or related
// features might change this.
let written_size_delta_key = MetricsKey::written_size_delta(tenant_id, timeline_id);
// use this when available, because in a stream of incremental values, it will be
// accurate where as when last_record_lsn stops moving, we will only cache the last
// one of those.
let last_stop_time = cache
.get(written_size_delta_key.key())
.map(|(until, _val)| {
until
.incremental_timerange()
.expect("never create EventType::Absolute for written_size_delta")
.end
});
// by default, use the last sent written_size as the basis for
// calculating the delta. if we don't yet have one, use the load time value.
let prev = cache
.get(&key)
.map(|(prev_at, prev)| {
// use the prev time from our last incremental update, or default to latest
// absolute update on the first round.
let prev_at = prev_at
.absolute_time()
.expect("never create EventType::Incremental for written_size");
let prev_at = last_stop_time.unwrap_or(prev_at);
(*prev_at, *prev)
})
.unwrap_or_else(|| {
// if we don't have a previous point of comparison, compare to the load time
// lsn.
let (disk_consistent_lsn, loaded_at) = &self.loaded_at;
(DateTime::from(*loaded_at), disk_consistent_lsn.0)
});
// written_size_bytes_delta
metrics.extend(
if let Some(delta) = written_size_now.1.checked_sub(prev.1) {
let up_to = written_size_now
.0
.absolute_time()
.expect("never create EventType::Incremental for written_size");
let key_value = written_size_delta_key.from_previous_up_to(prev.0, *up_to, delta);
Some(key_value)
} else {
None
},
);
// written_size
metrics.push((key, written_size_now));
if let Some(size) = self.current_exact_logical_size {
metrics.push(MetricsKey::timeline_logical_size(tenant_id, timeline_id).at(now, size));
}
}
}
/// Caclculate synthetic size for each active tenant
pub async fn calculate_synthetic_size_worker(
synthetic_size_calculation_interval: Duration,
@@ -343,7 +560,7 @@ pub async fn calculate_synthetic_size_worker(
_ = task_mgr::shutdown_watcher() => {
return Ok(());
},
tick_at = ticker.tick() => {
tick_at = ticker.tick() => {
let tenants = match mgr::list_tenants().await {
Ok(tenants) => tenants,
@@ -379,3 +596,149 @@ pub async fn calculate_synthetic_size_worker(
}
}
}
#[cfg(test)]
mod tests {
use std::collections::HashMap;
use std::time::SystemTime;
use utils::{
id::{TenantId, TimelineId},
lsn::Lsn,
};
use crate::consumption_metrics::MetricsKey;
use super::TimelineSnapshot;
use chrono::{DateTime, Utc};
#[test]
fn startup_collected_timeline_metrics_before_advancing() {
let tenant_id = TenantId::generate();
let timeline_id = TimelineId::generate();
let mut metrics = Vec::new();
let cache = HashMap::new();
let initdb_lsn = Lsn(0x10000);
let disk_consistent_lsn = Lsn(initdb_lsn.0 * 2);
let snap = TimelineSnapshot {
loaded_at: (disk_consistent_lsn, SystemTime::now()),
last_record_lsn: disk_consistent_lsn,
current_exact_logical_size: Some(0x42000),
};
let now = DateTime::<Utc>::from(SystemTime::now());
snap.to_metrics(tenant_id, timeline_id, now, &mut metrics, &cache);
assert_eq!(
metrics,
&[
MetricsKey::written_size_delta(tenant_id, timeline_id).from_previous_up_to(
snap.loaded_at.1.into(),
now,
0
),
MetricsKey::written_size(tenant_id, timeline_id).at(now, disk_consistent_lsn.0),
MetricsKey::timeline_logical_size(tenant_id, timeline_id).at(now, 0x42000)
]
);
}
#[test]
fn startup_collected_timeline_metrics_second_round() {
let tenant_id = TenantId::generate();
let timeline_id = TimelineId::generate();
let [now, before, init] = time_backwards();
let now = DateTime::<Utc>::from(now);
let before = DateTime::<Utc>::from(before);
let initdb_lsn = Lsn(0x10000);
let disk_consistent_lsn = Lsn(initdb_lsn.0 * 2);
let mut metrics = Vec::new();
let cache = HashMap::from([
MetricsKey::written_size(tenant_id, timeline_id).at(before, disk_consistent_lsn.0)
]);
let snap = TimelineSnapshot {
loaded_at: (disk_consistent_lsn, init),
last_record_lsn: disk_consistent_lsn,
current_exact_logical_size: Some(0x42000),
};
snap.to_metrics(tenant_id, timeline_id, now, &mut metrics, &cache);
assert_eq!(
metrics,
&[
MetricsKey::written_size_delta(tenant_id, timeline_id)
.from_previous_up_to(before, now, 0),
MetricsKey::written_size(tenant_id, timeline_id).at(now, disk_consistent_lsn.0),
MetricsKey::timeline_logical_size(tenant_id, timeline_id).at(now, 0x42000)
]
);
}
#[test]
fn startup_collected_timeline_metrics_nth_round_at_same_lsn() {
let tenant_id = TenantId::generate();
let timeline_id = TimelineId::generate();
let [now, just_before, before, init] = time_backwards();
let now = DateTime::<Utc>::from(now);
let just_before = DateTime::<Utc>::from(just_before);
let before = DateTime::<Utc>::from(before);
let initdb_lsn = Lsn(0x10000);
let disk_consistent_lsn = Lsn(initdb_lsn.0 * 2);
let mut metrics = Vec::new();
let cache = HashMap::from([
// at t=before was the last time the last_record_lsn changed
MetricsKey::written_size(tenant_id, timeline_id).at(before, disk_consistent_lsn.0),
// end time of this event is used for the next ones
MetricsKey::written_size_delta(tenant_id, timeline_id).from_previous_up_to(
before,
just_before,
0,
),
]);
let snap = TimelineSnapshot {
loaded_at: (disk_consistent_lsn, init),
last_record_lsn: disk_consistent_lsn,
current_exact_logical_size: Some(0x42000),
};
snap.to_metrics(tenant_id, timeline_id, now, &mut metrics, &cache);
assert_eq!(
metrics,
&[
MetricsKey::written_size_delta(tenant_id, timeline_id).from_previous_up_to(
just_before,
now,
0
),
MetricsKey::written_size(tenant_id, timeline_id).at(now, disk_consistent_lsn.0),
MetricsKey::timeline_logical_size(tenant_id, timeline_id).at(now, 0x42000)
]
);
}
fn time_backwards<const N: usize>() -> [std::time::SystemTime; N] {
let mut times = [std::time::SystemTime::UNIX_EPOCH; N];
times[0] = std::time::SystemTime::now();
for behind in 1..N {
times[behind] = times[0] - std::time::Duration::from_secs(behind as u64);
}
times
}
}

View File

@@ -85,6 +85,7 @@
//! The solution is that all code paths are infected with precisely one
//! [`RequestContext`] argument. Functions in the middle of the call chain
//! only need to pass it on.
use crate::task_mgr::TaskKind;
// The main structure of this module, see module-level comment.
@@ -92,6 +93,7 @@ use crate::task_mgr::TaskKind;
pub struct RequestContext {
task_kind: TaskKind,
download_behavior: DownloadBehavior,
access_stats_behavior: AccessStatsBehavior,
}
/// Desired behavior if the operation requires an on-demand download
@@ -109,6 +111,67 @@ pub enum DownloadBehavior {
Error,
}
/// Whether this request should update access times used in LRU eviction
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub(crate) enum AccessStatsBehavior {
/// Update access times: this request's access to data should be taken
/// as a hint that the accessed layer is likely to be accessed again
Update,
/// Do not update access times: this request is accessing the layer
/// but does not want to indicate that the layer should be retained in cache,
/// perhaps because the requestor is a compaction routine that will soon cover
/// this layer with another.
Skip,
}
pub struct RequestContextBuilder {
inner: RequestContext,
}
impl RequestContextBuilder {
/// A new builder with default settings
pub fn new(task_kind: TaskKind) -> Self {
Self {
inner: RequestContext {
task_kind,
download_behavior: DownloadBehavior::Download,
access_stats_behavior: AccessStatsBehavior::Update,
},
}
}
pub fn extend(original: &RequestContext) -> Self {
Self {
// This is like a Copy, but avoid implementing Copy because ordinary users of
// RequestContext should always move or ref it.
inner: RequestContext {
task_kind: original.task_kind,
download_behavior: original.download_behavior,
access_stats_behavior: original.access_stats_behavior,
},
}
}
/// Configure the DownloadBehavior of the context: whether to
/// download missing layers, and/or warn on the download.
pub fn download_behavior(mut self, b: DownloadBehavior) -> Self {
self.inner.download_behavior = b;
self
}
/// Configure the AccessStatsBehavior of the context: whether layer
/// accesses should update the access time of the layer.
pub(crate) fn access_stats_behavior(mut self, b: AccessStatsBehavior) -> Self {
self.inner.access_stats_behavior = b;
self
}
pub fn build(self) -> RequestContext {
self.inner
}
}
impl RequestContext {
/// Create a new RequestContext that has no parent.
///
@@ -123,10 +186,9 @@ impl RequestContext {
/// because someone explicitly canceled it.
/// It has no parent, so it cannot inherit cancellation from there.
pub fn new(task_kind: TaskKind, download_behavior: DownloadBehavior) -> Self {
RequestContext {
task_kind,
download_behavior,
}
RequestContextBuilder::new(task_kind)
.download_behavior(download_behavior)
.build()
}
/// Create a detached child context for a task that may outlive `self`.
@@ -187,10 +249,7 @@ impl RequestContext {
}
fn child_impl(&self, task_kind: TaskKind, download_behavior: DownloadBehavior) -> Self {
RequestContext {
task_kind,
download_behavior,
}
Self::new(task_kind, download_behavior)
}
pub fn task_kind(&self) -> TaskKind {
@@ -200,4 +259,8 @@ impl RequestContext {
pub fn download_behavior(&self) -> DownloadBehavior {
self.download_behavior
}
pub(crate) fn access_stats_behavior(&self) -> AccessStatsBehavior {
self.access_stats_behavior
}
}

View File

@@ -304,17 +304,18 @@ pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
// Debug-log the list of candidates
let now = SystemTime::now();
for (i, (partition, candidate)) in candidates.iter().enumerate() {
let desc = candidate.layer.layer_desc();
debug!(
"cand {}/{}: size={}, no_access_for={}us, partition={:?}, {}/{}/{}",
i + 1,
candidates.len(),
candidate.layer.file_size(),
desc.file_size,
now.duration_since(candidate.last_activity_ts)
.unwrap()
.as_micros(),
partition,
candidate.layer.get_tenant_id(),
candidate.layer.get_timeline_id(),
desc.tenant_id,
desc.timeline_id,
candidate.layer,
);
}
@@ -346,7 +347,7 @@ pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
warned = Some(usage_planned);
}
usage_planned.add_available_bytes(candidate.layer.file_size());
usage_planned.add_available_bytes(candidate.layer.layer_desc().file_size);
batched
.entry(TimelineKey(candidate.timeline))
@@ -389,15 +390,16 @@ pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
Ok(results) => {
assert_eq!(results.len(), batch.len());
for (result, layer) in results.into_iter().zip(batch.iter()) {
let file_size = layer.layer_desc().file_size;
match result {
Some(Ok(())) => {
usage_assumed.add_available_bytes(layer.file_size());
usage_assumed.add_available_bytes(file_size);
}
Some(Err(EvictionError::CannotEvictRemoteLayer)) => {
unreachable!("get_local_layers_for_disk_usage_eviction finds only local layers")
}
Some(Err(EvictionError::FileNotFound)) => {
evictions_failed.file_sizes += layer.file_size();
evictions_failed.file_sizes += file_size;
evictions_failed.count += 1;
}
Some(Err(
@@ -406,7 +408,7 @@ pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
)) => {
let e = utils::error::report_compact_sources(&e);
warn!(%layer, "failed to evict layer: {e}");
evictions_failed.file_sizes += layer.file_size();
evictions_failed.file_sizes += file_size;
evictions_failed.count += 1;
}
None => {
@@ -545,12 +547,12 @@ async fn collect_eviction_candidates(
// We could be better here, e.g., sum of all L0 layers + most recent L1 layer.
// That's what's typically used by the various background loops.
//
// The default can be overriden with a fixed value in the tenant conf.
// The default can be overridden with a fixed value in the tenant conf.
// A default override can be put in the default tenant conf in the pageserver.toml.
let min_resident_size = if let Some(s) = tenant.get_min_resident_size_override() {
debug!(
tenant_id=%tenant.tenant_id(),
overriden_size=s,
overridden_size=s,
"using overridden min resident size for tenant"
);
s

View File

@@ -0,0 +1,86 @@
/// use with fail::cfg("$name", "return(2000)")
///
/// The effect is similar to a "sleep(2000)" action, i.e. we sleep for the
/// specified time (in milliseconds). The main difference is that we use async
/// tokio sleep function. Another difference is that we print lines to the log,
/// which can be useful in tests to check that the failpoint was hit.
#[macro_export]
macro_rules! __failpoint_sleep_millis_async {
($name:literal) => {{
// If the failpoint is used with a "return" action, set should_sleep to the
// returned value (as string). Otherwise it's set to None.
let should_sleep = (|| {
::fail::fail_point!($name, |x| x);
::std::option::Option::None
})();
// Sleep if the action was a returned value
if let ::std::option::Option::Some(duration_str) = should_sleep {
$crate::failpoint_support::failpoint_sleep_helper($name, duration_str).await
}
}};
}
pub use __failpoint_sleep_millis_async as sleep_millis_async;
// Helper function used by the macro. (A function has nicer scoping so we
// don't need to decorate everything with "::")
#[doc(hidden)]
pub(crate) async fn failpoint_sleep_helper(name: &'static str, duration_str: String) {
let millis = duration_str.parse::<u64>().unwrap();
let d = std::time::Duration::from_millis(millis);
tracing::info!("failpoint {:?}: sleeping for {:?}", name, d);
tokio::time::sleep(d).await;
tracing::info!("failpoint {:?}: sleep done", name);
}
pub fn init() -> fail::FailScenario<'static> {
// The failpoints lib provides support for parsing the `FAILPOINTS` env var.
// We want non-default behavior for `exit`, though, so, we handle it separately.
//
// Format for FAILPOINTS is "name=actions" separated by ";".
let actions = std::env::var("FAILPOINTS");
if actions.is_ok() {
std::env::remove_var("FAILPOINTS");
} else {
// let the library handle non-utf8, or nothing for not present
}
let scenario = fail::FailScenario::setup();
if let Ok(val) = actions {
val.split(';')
.enumerate()
.map(|(i, s)| s.split_once('=').ok_or((i, s)))
.for_each(|res| {
let (name, actions) = match res {
Ok(t) => t,
Err((i, s)) => {
panic!(
"startup failpoints: missing action on the {}th failpoint; try `{s}=return`",
i + 1,
);
}
};
if let Err(e) = apply_failpoint(name, actions) {
panic!("startup failpoints: failed to apply failpoint {name}={actions}: {e}");
}
});
}
scenario
}
pub(crate) fn apply_failpoint(name: &str, actions: &str) -> Result<(), String> {
if actions == "exit" {
fail::cfg_callback(name, exit_failpoint)
} else {
fail::cfg(name, actions)
}
}
#[inline(never)]
fn exit_failpoint() {
tracing::info!("Exit requested by failpoint");
std::process::exit(1);
}

View File

@@ -93,6 +93,47 @@ paths:
application/json:
schema:
$ref: "#/components/schemas/Error"
delete:
description: |
Attempts to delete specified tenant. 500 and 409 errors should be retried until 404 is retrieved.
404 means that deletion successfully finished"
responses:
"400":
description: Error when no tenant id found in path
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
"401":
description: Unauthorized Error
content:
application/json:
schema:
$ref: "#/components/schemas/UnauthorizedError"
"403":
description: Forbidden Error
content:
application/json:
schema:
$ref: "#/components/schemas/ForbiddenError"
"404":
description: Tenant not found
content:
application/json:
schema:
$ref: "#/components/schemas/NotFoundError"
"409":
description: Deletion is already in progress, continue polling
content:
application/json:
schema:
$ref: "#/components/schemas/ConflictError"
"500":
description: Generic operation error
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
/v1/tenant/{tenant_id}/timeline:
parameters:
@@ -820,6 +861,7 @@ paths:
application/json:
schema:
$ref: "#/components/schemas/Error"
/v1/tenant/config:
put:
description: |

View File

@@ -187,7 +187,7 @@ impl From<crate::tenant::DeleteTimelineError> for ApiError {
format!("Cannot delete timeline which has child timelines: {children:?}")
.into_boxed_str(),
),
a @ AlreadyInProgress => ApiError::Conflict(a.to_string()),
a @ AlreadyInProgress(_) => ApiError::Conflict(a.to_string()),
Other(e) => ApiError::InternalServerError(e),
}
}
@@ -208,6 +208,19 @@ impl From<crate::tenant::mgr::DeleteTimelineError> for ApiError {
}
}
impl From<crate::tenant::delete::DeleteTenantError> for ApiError {
fn from(value: crate::tenant::delete::DeleteTenantError) -> Self {
use crate::tenant::delete::DeleteTenantError::*;
match value {
Get(g) => ApiError::from(g),
e @ AlreadyInProgress => ApiError::Conflict(e.to_string()),
Timeline(t) => ApiError::from(t),
Other(o) => ApiError::InternalServerError(o),
e @ InvalidState(_) => ApiError::PreconditionFailed(e.to_string().into_boxed_str()),
}
}
}
// Helper function to construct a TimelineInfo struct for a timeline
async fn build_timeline_info(
timeline: &Arc<Timeline>,
@@ -504,7 +517,6 @@ async fn timeline_delete_handler(
.instrument(info_span!("timeline_delete", %tenant_id, %timeline_id))
.await?;
// FIXME: needs to be an error for console to retry it. Ideally Accepted should be used and retried until 404.
json_response(StatusCode::ACCEPTED, ())
}
@@ -617,6 +629,23 @@ async fn tenant_status(
json_response(StatusCode::OK, tenant_info)
}
async fn tenant_delete_handler(
request: Request<Body>,
_cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
// TODO openapi spec
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
let state = get_state(&request);
mgr::delete_tenant(state.conf, state.remote_storage.clone(), tenant_id)
.instrument(info_span!("tenant_delete_handler", %tenant_id))
.await?;
json_response(StatusCode::ACCEPTED, ())
}
/// HTTP endpoint to query the current tenant_size of a tenant.
///
/// This is not used by consumption metrics under [`crate::consumption_metrics`], but can be used
@@ -950,14 +979,7 @@ async fn failpoints_handler(
// We recognize one extra "action" that's not natively recognized
// by the failpoints crate: exit, to immediately kill the process
let cfg_result = if fp.actions == "exit" {
fail::cfg_callback(fp.name, || {
info!("Exit requested by failpoint");
std::process::exit(1);
})
} else {
fail::cfg(fp.name, &fp.actions)
};
let cfg_result = crate::failpoint_support::apply_failpoint(&fp.name, &fp.actions);
if let Err(err_msg) = cfg_result {
return Err(ApiError::BadRequest(anyhow!(
@@ -994,31 +1016,29 @@ async fn timeline_gc_handler(
// Run compaction immediately on given timeline.
async fn timeline_compact_handler(
request: Request<Body>,
_cancel: CancellationToken,
cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
check_permission(&request, Some(tenant_id))?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let result_receiver = mgr::immediate_compact(tenant_id, timeline_id, &ctx)
.await
.context("spawn compaction task")
.map_err(ApiError::InternalServerError)?;
let result: anyhow::Result<()> = result_receiver
.await
.context("receive compaction result")
.map_err(ApiError::InternalServerError)?;
result.map_err(ApiError::InternalServerError)?;
json_response(StatusCode::OK, ())
async {
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?;
timeline
.compact(&cancel, &ctx)
.await
.map_err(ApiError::InternalServerError)?;
json_response(StatusCode::OK, ())
}
.instrument(info_span!("manual_compaction", %tenant_id, %timeline_id))
.await
}
// Run checkpoint immediately on given timeline.
async fn timeline_checkpoint_handler(
request: Request<Body>,
_cancel: CancellationToken,
cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
@@ -1031,13 +1051,13 @@ async fn timeline_checkpoint_handler(
.await
.map_err(ApiError::InternalServerError)?;
timeline
.compact(&ctx)
.compact(&cancel, &ctx)
.await
.map_err(ApiError::InternalServerError)?;
json_response(StatusCode::OK, ())
}
.instrument(info_span!("manual_checkpoint", tenant_id = %tenant_id, timeline_id = %timeline_id))
.instrument(info_span!("manual_checkpoint", %tenant_id, %timeline_id))
.await
}
@@ -1347,6 +1367,9 @@ pub fn make_router(
.get("/v1/tenant", |r| api_handler(r, tenant_list_handler))
.post("/v1/tenant", |r| api_handler(r, tenant_create_handler))
.get("/v1/tenant/:tenant_id", |r| api_handler(r, tenant_status))
.delete("/v1/tenant/:tenant_id", |r| {
api_handler(r, tenant_delete_handler)
})
.get("/v1/tenant/:tenant_id/synthetic_size", |r| {
api_handler(r, tenant_size_handler)
})

View File

@@ -7,7 +7,7 @@ pub mod disk_usage_eviction_task;
pub mod http;
pub mod import_datadir;
pub mod keyspace;
pub(crate) mod metrics;
pub mod metrics;
pub mod page_cache;
pub mod page_service;
pub mod pgdatadir_mapping;
@@ -21,6 +21,8 @@ pub mod walingest;
pub mod walrecord;
pub mod walredo;
pub mod failpoint_support;
use std::path::Path;
use crate::task_mgr::TaskKind;
@@ -47,50 +49,54 @@ pub use crate::metrics::preinitialize_metrics;
#[tracing::instrument]
pub async fn shutdown_pageserver(exit_code: i32) {
use std::time::Duration;
// Shut down the libpq endpoint task. This prevents new connections from
// being accepted.
task_mgr::shutdown_tasks(Some(TaskKind::LibpqEndpointListener), None, None).await;
timed(
task_mgr::shutdown_tasks(Some(TaskKind::LibpqEndpointListener), None, None),
"shutdown LibpqEndpointListener",
Duration::from_secs(1),
)
.await;
// Shut down any page service tasks.
task_mgr::shutdown_tasks(Some(TaskKind::PageRequestHandler), None, None).await;
timed(
task_mgr::shutdown_tasks(Some(TaskKind::PageRequestHandler), None, None),
"shutdown PageRequestHandlers",
Duration::from_secs(1),
)
.await;
// Shut down all the tenants. This flushes everything to disk and kills
// the checkpoint and GC tasks.
tenant::mgr::shutdown_all_tenants().await;
timed(
tenant::mgr::shutdown_all_tenants(),
"shutdown all tenants",
Duration::from_secs(5),
)
.await;
// Shut down the HTTP endpoint last, so that you can still check the server's
// status while it's shutting down.
// FIXME: We should probably stop accepting commands like attach/detach earlier.
task_mgr::shutdown_tasks(Some(TaskKind::HttpEndpointListener), None, None).await;
timed(
task_mgr::shutdown_tasks(Some(TaskKind::HttpEndpointListener), None, None),
"shutdown http",
Duration::from_secs(1),
)
.await;
// There should be nothing left, but let's be sure
task_mgr::shutdown_tasks(None, None, None).await;
timed(
task_mgr::shutdown_tasks(None, None, None),
"shutdown leftovers",
Duration::from_secs(1),
)
.await;
info!("Shut down successfully completed");
std::process::exit(exit_code);
}
const DEFAULT_BASE_BACKOFF_SECONDS: f64 = 0.1;
const DEFAULT_MAX_BACKOFF_SECONDS: f64 = 3.0;
async fn exponential_backoff(n: u32, base_increment: f64, max_seconds: f64) {
let backoff_duration_seconds =
exponential_backoff_duration_seconds(n, base_increment, max_seconds);
if backoff_duration_seconds > 0.0 {
info!(
"Backoff: waiting {backoff_duration_seconds} seconds before processing with the task",
);
tokio::time::sleep(std::time::Duration::from_secs_f64(backoff_duration_seconds)).await;
}
}
pub fn exponential_backoff_duration_seconds(n: u32, base_increment: f64, max_seconds: f64) -> f64 {
if n == 0 {
0.0
} else {
(1.0 + base_increment).powf(f64::from(n)).min(max_seconds)
}
}
/// The name of the metadata file pageserver creates per timeline.
/// Full path: `tenants/<tenant_id>/timelines/<timeline_id>/metadata`.
pub const METADATA_FILE_NAME: &str = "metadata";
@@ -109,6 +115,8 @@ pub const TEMP_FILE_SUFFIX: &str = "___temp";
/// Full path: `tenants/<tenant_id>/timelines/<timeline_id>___uninit`.
pub const TIMELINE_UNINIT_MARK_SUFFIX: &str = "___uninit";
pub const TIMELINE_DELETE_MARK_SUFFIX: &str = "___delete";
/// A marker file to prevent pageserver from loading a certain tenant on restart.
/// Different from [`TIMELINE_UNINIT_MARK_SUFFIX`] due to semantics of the corresponding
/// `ignore` management API command, that expects the ignored tenant to be properly loaded
@@ -123,15 +131,30 @@ pub fn is_temporary(path: &Path) -> bool {
}
}
pub fn is_uninit_mark(path: &Path) -> bool {
fn ends_with_suffix(path: &Path, suffix: &str) -> bool {
match path.file_name() {
Some(name) => name
.to_string_lossy()
.ends_with(TIMELINE_UNINIT_MARK_SUFFIX),
Some(name) => name.to_string_lossy().ends_with(suffix),
None => false,
}
}
pub fn is_uninit_mark(path: &Path) -> bool {
ends_with_suffix(path, TIMELINE_UNINIT_MARK_SUFFIX)
}
pub fn is_delete_mark(path: &Path) -> bool {
ends_with_suffix(path, TIMELINE_DELETE_MARK_SUFFIX)
}
fn is_walkdir_io_not_found(e: &walkdir::Error) -> bool {
if let Some(e) = e.io_error() {
if e.kind() == std::io::ErrorKind::NotFound {
return true;
}
}
false
}
/// During pageserver startup, we need to order operations not to exhaust tokio worker threads by
/// blocking.
///
@@ -147,7 +170,7 @@ pub struct InitializationOrder {
/// Each timeline owns a clone of this to be consumed on the initial logical size calculation
/// attempt. It is important to drop this once the attempt has completed.
pub initial_logical_size_attempt: utils::completion::Completion,
pub initial_logical_size_attempt: Option<utils::completion::Completion>,
/// Barrier for when we can start any background jobs.
///
@@ -155,33 +178,75 @@ pub struct InitializationOrder {
pub background_jobs_can_start: utils::completion::Barrier,
}
#[cfg(test)]
mod backoff_defaults_tests {
use super::*;
/// Time the future with a warning when it exceeds a threshold.
async fn timed<Fut: std::future::Future>(
fut: Fut,
name: &str,
warn_at: std::time::Duration,
) -> <Fut as std::future::Future>::Output {
let started = std::time::Instant::now();
#[test]
fn backoff_defaults_produce_growing_backoff_sequence() {
let mut current_backoff_value = None;
let mut fut = std::pin::pin!(fut);
for i in 0..10_000 {
let new_backoff_value = exponential_backoff_duration_seconds(
i,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
match tokio::time::timeout(warn_at, &mut fut).await {
Ok(ret) => {
tracing::info!(
task = name,
elapsed_ms = started.elapsed().as_millis(),
"completed"
);
ret
}
Err(_) => {
tracing::info!(
task = name,
elapsed_ms = started.elapsed().as_millis(),
"still waiting, taking longer than expected..."
);
if let Some(old_backoff_value) = current_backoff_value.replace(new_backoff_value) {
assert!(
old_backoff_value <= new_backoff_value,
"{i}th backoff value {new_backoff_value} is smaller than the previous one {old_backoff_value}"
)
}
}
let ret = fut.await;
assert_eq!(
current_backoff_value.expect("Should have produced backoff values to compare"),
DEFAULT_MAX_BACKOFF_SECONDS,
"Given big enough of retries, backoff should reach its allowed max value"
);
// this has a global allowed_errors
tracing::warn!(
task = name,
elapsed_ms = started.elapsed().as_millis(),
"completed, took longer than expected"
);
ret
}
}
}
#[cfg(test)]
mod timed_tests {
use super::timed;
use std::time::Duration;
#[tokio::test]
async fn timed_completes_when_inner_future_completes() {
// A future that completes on time should have its result returned
let r1 = timed(
async move {
tokio::time::sleep(Duration::from_millis(10)).await;
123
},
"test 1",
Duration::from_millis(50),
)
.await;
assert_eq!(r1, 123);
// A future that completes too slowly should also have its result returned
let r1 = timed(
async move {
tokio::time::sleep(Duration::from_millis(50)).await;
456
},
"test 1",
Duration::from_millis(10),
)
.await;
assert_eq!(r1, 456);
}
}

View File

@@ -1,13 +1,12 @@
use metrics::metric_vec_duration::DurationResultObserver;
use metrics::{
register_counter_vec, register_histogram, register_histogram_vec, register_int_counter,
register_int_counter_vec, register_int_gauge, register_int_gauge_vec, register_uint_gauge,
register_uint_gauge_vec, Counter, CounterVec, Histogram, HistogramVec, IntCounter,
IntCounterVec, IntGauge, IntGaugeVec, UIntGauge, UIntGaugeVec,
register_counter_vec, register_gauge_vec, register_histogram, register_histogram_vec,
register_int_counter, register_int_counter_vec, register_int_gauge, register_int_gauge_vec,
register_uint_gauge, register_uint_gauge_vec, Counter, CounterVec, GaugeVec, Histogram,
HistogramVec, IntCounter, IntCounterVec, IntGauge, IntGaugeVec, UIntGauge, UIntGaugeVec,
};
use once_cell::sync::Lazy;
use pageserver_api::models::TenantState;
use strum::VariantNames;
use strum::{EnumCount, IntoEnumIterator, VariantNames};
use strum_macros::{EnumVariantNames, IntoStaticStr};
use utils::id::{TenantId, TimelineId};
@@ -74,7 +73,7 @@ pub static STORAGE_TIME_COUNT_PER_TIMELINE: Lazy<IntCounterVec> = Lazy::new(|| {
// Buckets for background operations like compaction, GC, size calculation
const STORAGE_OP_BUCKETS: &[f64] = &[0.010, 0.100, 1.0, 10.0, 100.0, 1000.0];
pub static STORAGE_TIME_GLOBAL: Lazy<HistogramVec> = Lazy::new(|| {
pub(crate) static STORAGE_TIME_GLOBAL: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_storage_operations_seconds_global",
"Time spent on storage operations",
@@ -84,18 +83,17 @@ pub static STORAGE_TIME_GLOBAL: Lazy<HistogramVec> = Lazy::new(|| {
.expect("failed to define a metric")
});
static READ_NUM_FS_LAYERS: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
pub(crate) static READ_NUM_FS_LAYERS: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"pageserver_read_num_fs_layers",
"Number of persistent layers accessed for processing a read request, including those in the cache",
&["tenant_id", "timeline_id"],
vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 20.0, 50.0, 100.0],
)
.expect("failed to define a metric")
});
// Metrics collected on operations on the storage repository.
pub static RECONSTRUCT_TIME: Lazy<Histogram> = Lazy::new(|| {
pub(crate) static RECONSTRUCT_TIME: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"pageserver_getpage_reconstruct_seconds",
"Time spent in reconstruct_value (reconstruct a page from deltas)",
@@ -104,7 +102,7 @@ pub static RECONSTRUCT_TIME: Lazy<Histogram> = Lazy::new(|| {
.expect("failed to define a metric")
});
pub static MATERIALIZED_PAGE_CACHE_HIT_DIRECT: Lazy<IntCounter> = Lazy::new(|| {
pub(crate) static MATERIALIZED_PAGE_CACHE_HIT_DIRECT: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pageserver_materialized_cache_hits_direct_total",
"Number of cache hits from materialized page cache without redo",
@@ -112,17 +110,16 @@ pub static MATERIALIZED_PAGE_CACHE_HIT_DIRECT: Lazy<IntCounter> = Lazy::new(|| {
.expect("failed to define a metric")
});
static GET_RECONSTRUCT_DATA_TIME: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
pub(crate) static GET_RECONSTRUCT_DATA_TIME: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"pageserver_getpage_get_reconstruct_data_seconds",
"Time spent in get_reconstruct_value_data",
&["tenant_id", "timeline_id"],
CRITICAL_OP_BUCKETS.into(),
)
.expect("failed to define a metric")
});
pub static MATERIALIZED_PAGE_CACHE_HIT: Lazy<IntCounter> = Lazy::new(|| {
pub(crate) static MATERIALIZED_PAGE_CACHE_HIT: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pageserver_materialized_cache_hits_total",
"Number of cache hits from materialized page cache",
@@ -246,11 +243,10 @@ pub static PAGE_CACHE_SIZE: Lazy<PageCacheSizeMetrics> = Lazy::new(|| PageCacheS
},
});
static WAIT_LSN_TIME: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
pub(crate) static WAIT_LSN_TIME: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"pageserver_wait_lsn_seconds",
"Time spent waiting for WAL to arrive",
&["tenant_id", "timeline_id"],
CRITICAL_OP_BUCKETS.into(),
)
.expect("failed to define a metric")
@@ -284,7 +280,7 @@ static REMOTE_PHYSICAL_SIZE: Lazy<UIntGaugeVec> = Lazy::new(|| {
.expect("failed to define a metric")
});
pub static REMOTE_ONDEMAND_DOWNLOADED_LAYERS: Lazy<IntCounter> = Lazy::new(|| {
pub(crate) static REMOTE_ONDEMAND_DOWNLOADED_LAYERS: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pageserver_remote_ondemand_downloaded_layers_total",
"Total on-demand downloaded layers"
@@ -292,7 +288,7 @@ pub static REMOTE_ONDEMAND_DOWNLOADED_LAYERS: Lazy<IntCounter> = Lazy::new(|| {
.unwrap()
});
pub static REMOTE_ONDEMAND_DOWNLOADED_BYTES: Lazy<IntCounter> = Lazy::new(|| {
pub(crate) static REMOTE_ONDEMAND_DOWNLOADED_BYTES: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pageserver_remote_ondemand_downloaded_bytes_total",
"Total bytes of layers on-demand downloaded",
@@ -309,16 +305,29 @@ static CURRENT_LOGICAL_SIZE: Lazy<UIntGaugeVec> = Lazy::new(|| {
.expect("failed to define current logical size metric")
});
pub static TENANT_STATE_METRIC: Lazy<UIntGaugeVec> = Lazy::new(|| {
pub(crate) static TENANT_STATE_METRIC: Lazy<UIntGaugeVec> = Lazy::new(|| {
register_uint_gauge_vec!(
"pageserver_tenant_states_count",
"Count of tenants per state",
&["tenant_id", "state"]
&["state"]
)
.expect("Failed to register pageserver_tenant_states_count metric")
});
pub static TENANT_SYNTHETIC_SIZE_METRIC: Lazy<UIntGaugeVec> = Lazy::new(|| {
/// A set of broken tenants.
///
/// These are expected to be so rare that a set is fine. Set as in a new timeseries per each broken
/// tenant.
pub(crate) static BROKEN_TENANTS_SET: Lazy<UIntGaugeVec> = Lazy::new(|| {
register_uint_gauge_vec!(
"pageserver_broken_tenants_count",
"Set of broken tenants",
&["tenant_id"]
)
.expect("Failed to register pageserver_tenant_states_count metric")
});
pub(crate) static TENANT_SYNTHETIC_SIZE_METRIC: Lazy<UIntGaugeVec> = Lazy::new(|| {
register_uint_gauge_vec!(
"pageserver_tenant_synthetic_cached_size_bytes",
"Synthetic size of each tenant in bytes",
@@ -376,7 +385,7 @@ static EVICTIONS_WITH_LOW_RESIDENCE_DURATION: Lazy<IntCounterVec> = Lazy::new(||
.expect("failed to define a metric")
});
pub static UNEXPECTED_ONDEMAND_DOWNLOADS: Lazy<IntCounter> = Lazy::new(|| {
pub(crate) static UNEXPECTED_ONDEMAND_DOWNLOADS: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pageserver_unexpected_ondemand_downloads_count",
"Number of unexpected on-demand downloads. \
@@ -385,6 +394,35 @@ pub static UNEXPECTED_ONDEMAND_DOWNLOADS: Lazy<IntCounter> = Lazy::new(|| {
.expect("failed to define a metric")
});
/// How long did we take to start up? Broken down by labels to describe
/// different phases of startup.
pub static STARTUP_DURATION: Lazy<GaugeVec> = Lazy::new(|| {
register_gauge_vec!(
"pageserver_startup_duration_seconds",
"Time taken by phases of pageserver startup, in seconds",
&["phase"]
)
.expect("Failed to register pageserver_startup_duration_seconds metric")
});
pub static STARTUP_IS_LOADING: Lazy<UIntGauge> = Lazy::new(|| {
register_uint_gauge!(
"pageserver_startup_is_loading",
"1 while in initial startup load of tenants, 0 at other times"
)
.expect("Failed to register pageserver_startup_is_loading")
});
/// How long did tenants take to go from construction to active state?
pub(crate) static TENANT_ACTIVATION: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"pageserver_tenant_activation_seconds",
"Time taken by tenants to activate, in seconds",
CRITICAL_OP_BUCKETS.into()
)
.expect("Failed to register pageserver_tenant_activation_seconds metric")
});
/// Each `Timeline`'s [`EVICTIONS_WITH_LOW_RESIDENCE_DURATION`] metric.
#[derive(Debug)]
pub struct EvictionsWithLowResidenceDuration {
@@ -499,23 +537,31 @@ const STORAGE_IO_TIME_BUCKETS: &[f64] = &[
30.000, // 30000 ms
];
const STORAGE_IO_TIME_OPERATIONS: &[&str] = &[
"open", "close", "read", "write", "seek", "fsync", "gc", "metadata",
];
const STORAGE_IO_SIZE_OPERATIONS: &[&str] = &["read", "write"];
pub static STORAGE_IO_TIME: Lazy<HistogramVec> = Lazy::new(|| {
/// Tracks time taken by fs operations near VirtualFile.
///
/// Operations:
/// - open ([`std::fs::OpenOptions::open`])
/// - close (dropping [`std::fs::File`])
/// - close-by-replace (close by replacement algorithm)
/// - read (`read_at`)
/// - write (`write_at`)
/// - seek (modify internal position or file length query)
/// - fsync ([`std::fs::File::sync_all`])
/// - metadata ([`std::fs::File::metadata`])
pub(crate) static STORAGE_IO_TIME: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_io_operations_seconds",
"Time spent in IO operations",
&["operation", "tenant_id", "timeline_id"],
&["operation"],
STORAGE_IO_TIME_BUCKETS.into()
)
.expect("failed to define a metric")
});
pub static STORAGE_IO_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
const STORAGE_IO_SIZE_OPERATIONS: &[&str] = &["read", "write"];
// Needed for the https://neonprod.grafana.net/d/5uK9tHL4k/picking-tenant-for-relocation?orgId=1
pub(crate) static STORAGE_IO_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
register_int_gauge_vec!(
"pageserver_io_operations_bytes_total",
"Total amount of bytes read/written in IO operations",
@@ -524,23 +570,160 @@ pub static STORAGE_IO_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
.expect("failed to define a metric")
});
const SMGR_QUERY_TIME_OPERATIONS: &[&str] = &[
"get_rel_exists",
"get_rel_size",
"get_page_at_lsn",
"get_db_size",
];
#[derive(Debug)]
struct GlobalAndPerTimelineHistogram {
global: Histogram,
per_tenant_timeline: Histogram,
}
pub static SMGR_QUERY_TIME: Lazy<HistogramVec> = Lazy::new(|| {
impl GlobalAndPerTimelineHistogram {
fn observe(&self, value: f64) {
self.global.observe(value);
self.per_tenant_timeline.observe(value);
}
}
struct GlobalAndPerTimelineHistogramTimer<'a> {
h: &'a GlobalAndPerTimelineHistogram,
start: std::time::Instant,
}
impl<'a> Drop for GlobalAndPerTimelineHistogramTimer<'a> {
fn drop(&mut self) {
let elapsed = self.start.elapsed();
self.h.observe(elapsed.as_secs_f64());
}
}
#[derive(
Debug,
Clone,
Copy,
IntoStaticStr,
strum_macros::EnumCount,
strum_macros::EnumIter,
strum_macros::FromRepr,
)]
#[strum(serialize_all = "snake_case")]
pub enum SmgrQueryType {
GetRelExists,
GetRelSize,
GetPageAtLsn,
GetDbSize,
}
#[derive(Debug)]
pub struct SmgrQueryTimePerTimeline {
metrics: [GlobalAndPerTimelineHistogram; SmgrQueryType::COUNT],
}
static SMGR_QUERY_TIME_PER_TENANT_TIMELINE: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_smgr_query_seconds",
"Time spent on smgr query handling",
"Time spent on smgr query handling, aggegated by query type and tenant/timeline.",
&["smgr_query_type", "tenant_id", "timeline_id"],
CRITICAL_OP_BUCKETS.into(),
)
.expect("failed to define a metric")
});
static SMGR_QUERY_TIME_GLOBAL: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_smgr_query_seconds_global",
"Time spent on smgr query handling, aggregated by query type.",
&["smgr_query_type"],
CRITICAL_OP_BUCKETS.into(),
)
.expect("failed to define a metric")
});
impl SmgrQueryTimePerTimeline {
pub(crate) fn new(tenant_id: &TenantId, timeline_id: &TimelineId) -> Self {
let tenant_id = tenant_id.to_string();
let timeline_id = timeline_id.to_string();
let metrics = std::array::from_fn(|i| {
let op = SmgrQueryType::from_repr(i).unwrap();
let global = SMGR_QUERY_TIME_GLOBAL
.get_metric_with_label_values(&[op.into()])
.unwrap();
let per_tenant_timeline = SMGR_QUERY_TIME_PER_TENANT_TIMELINE
.get_metric_with_label_values(&[op.into(), &tenant_id, &timeline_id])
.unwrap();
GlobalAndPerTimelineHistogram {
global,
per_tenant_timeline,
}
});
Self { metrics }
}
pub(crate) fn start_timer(&self, op: SmgrQueryType) -> impl Drop + '_ {
let metric = &self.metrics[op as usize];
GlobalAndPerTimelineHistogramTimer {
h: metric,
start: std::time::Instant::now(),
}
}
}
#[cfg(test)]
mod smgr_query_time_tests {
use strum::IntoEnumIterator;
use utils::id::{TenantId, TimelineId};
// Regression test, we used hard-coded string constants before using an enum.
#[test]
fn op_label_name() {
use super::SmgrQueryType::*;
let expect: [(super::SmgrQueryType, &'static str); 4] = [
(GetRelExists, "get_rel_exists"),
(GetRelSize, "get_rel_size"),
(GetPageAtLsn, "get_page_at_lsn"),
(GetDbSize, "get_db_size"),
];
for (op, expect) in expect {
let actual: &'static str = op.into();
assert_eq!(actual, expect);
}
}
#[test]
fn basic() {
let ops: Vec<_> = super::SmgrQueryType::iter().collect();
for op in &ops {
let tenant_id = TenantId::generate();
let timeline_id = TimelineId::generate();
let metrics = super::SmgrQueryTimePerTimeline::new(&tenant_id, &timeline_id);
let get_counts = || {
let global: u64 = ops
.iter()
.map(|op| metrics.metrics[*op as usize].global.get_sample_count())
.sum();
let per_tenant_timeline: u64 = ops
.iter()
.map(|op| {
metrics.metrics[*op as usize]
.per_tenant_timeline
.get_sample_count()
})
.sum();
(global, per_tenant_timeline)
};
let (pre_global, pre_per_tenant_timeline) = get_counts();
assert_eq!(pre_per_tenant_timeline, 0);
let timer = metrics.start_timer(*op);
drop(timer);
let (post_global, post_per_tenant_timeline) = get_counts();
assert_eq!(post_per_tenant_timeline, 1);
assert!(post_global > pre_global);
}
}
}
// keep in sync with control plane Go code so that we can validate
// compute's basebackup_ms metric with our perspective in the context of SLI/SLO.
static COMPUTE_STARTUP_BUCKETS: Lazy<[f64; 28]> = Lazy::new(|| {
@@ -605,7 +788,7 @@ static REMOTE_TIMELINE_CLIENT_CALLS_STARTED_HIST: Lazy<HistogramVec> = Lazy::new
at a given instant. It gives you a better idea of the queue depth \
than plotting the gauge directly, since operations may complete faster \
than the sampling interval.",
&["tenant_id", "timeline_id", "file_kind", "op_kind"],
&["file_kind", "op_kind"],
// The calls_unfinished gauge is an integer gauge, hence we have integer buckets.
vec![0.0, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 15.0, 20.0, 40.0, 60.0, 80.0, 100.0, 500.0],
)
@@ -662,18 +845,18 @@ impl RemoteOpFileKind {
}
}
pub static REMOTE_OPERATION_TIME: Lazy<HistogramVec> = Lazy::new(|| {
pub(crate) static REMOTE_OPERATION_TIME: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_remote_operation_seconds",
"Time spent on remote storage operations. \
Grouped by tenant, timeline, operation_kind and status. \
Does not account for time spent waiting in remote timeline client's queues.",
&["tenant_id", "timeline_id", "file_kind", "op_kind", "status"]
&["file_kind", "op_kind", "status"]
)
.expect("failed to define a metric")
});
pub static TENANT_TASK_EVENTS: Lazy<IntCounterVec> = Lazy::new(|| {
pub(crate) static TENANT_TASK_EVENTS: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_task_events",
"Number of task start/stop/fail events.",
@@ -682,7 +865,7 @@ pub static TENANT_TASK_EVENTS: Lazy<IntCounterVec> = Lazy::new(|| {
.expect("Failed to register tenant_task_events metric")
});
pub static BACKGROUND_LOOP_PERIOD_OVERRUN_COUNT: Lazy<IntCounterVec> = Lazy::new(|| {
pub(crate) static BACKGROUND_LOOP_PERIOD_OVERRUN_COUNT: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_background_loop_period_overrun_count",
"Incremented whenever warn_when_period_overrun() logs a warning.",
@@ -693,7 +876,7 @@ pub static BACKGROUND_LOOP_PERIOD_OVERRUN_COUNT: Lazy<IntCounterVec> = Lazy::new
// walreceiver metrics
pub static WALRECEIVER_STARTED_CONNECTIONS: Lazy<IntCounter> = Lazy::new(|| {
pub(crate) static WALRECEIVER_STARTED_CONNECTIONS: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pageserver_walreceiver_started_connections_total",
"Number of started walreceiver connections"
@@ -701,7 +884,7 @@ pub static WALRECEIVER_STARTED_CONNECTIONS: Lazy<IntCounter> = Lazy::new(|| {
.expect("failed to define a metric")
});
pub static WALRECEIVER_ACTIVE_MANAGERS: Lazy<IntGauge> = Lazy::new(|| {
pub(crate) static WALRECEIVER_ACTIVE_MANAGERS: Lazy<IntGauge> = Lazy::new(|| {
register_int_gauge!(
"pageserver_walreceiver_active_managers",
"Number of active walreceiver managers"
@@ -709,7 +892,7 @@ pub static WALRECEIVER_ACTIVE_MANAGERS: Lazy<IntGauge> = Lazy::new(|| {
.expect("failed to define a metric")
});
pub static WALRECEIVER_SWITCHES: Lazy<IntCounterVec> = Lazy::new(|| {
pub(crate) static WALRECEIVER_SWITCHES: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_walreceiver_switches_total",
"Number of walreceiver manager change_connection calls",
@@ -718,7 +901,7 @@ pub static WALRECEIVER_SWITCHES: Lazy<IntCounterVec> = Lazy::new(|| {
.expect("failed to define a metric")
});
pub static WALRECEIVER_BROKER_UPDATES: Lazy<IntCounter> = Lazy::new(|| {
pub(crate) static WALRECEIVER_BROKER_UPDATES: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pageserver_walreceiver_broker_updates_total",
"Number of received broker updates in walreceiver"
@@ -726,7 +909,7 @@ pub static WALRECEIVER_BROKER_UPDATES: Lazy<IntCounter> = Lazy::new(|| {
.expect("failed to define a metric")
});
pub static WALRECEIVER_CANDIDATES_EVENTS: Lazy<IntCounterVec> = Lazy::new(|| {
pub(crate) static WALRECEIVER_CANDIDATES_EVENTS: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_walreceiver_candidates_events_total",
"Number of walreceiver candidate events",
@@ -735,10 +918,10 @@ pub static WALRECEIVER_CANDIDATES_EVENTS: Lazy<IntCounterVec> = Lazy::new(|| {
.expect("failed to define a metric")
});
pub static WALRECEIVER_CANDIDATES_ADDED: Lazy<IntCounter> =
pub(crate) static WALRECEIVER_CANDIDATES_ADDED: Lazy<IntCounter> =
Lazy::new(|| WALRECEIVER_CANDIDATES_EVENTS.with_label_values(&["add"]));
pub static WALRECEIVER_CANDIDATES_REMOVED: Lazy<IntCounter> =
pub(crate) static WALRECEIVER_CANDIDATES_REMOVED: Lazy<IntCounter> =
Lazy::new(|| WALRECEIVER_CANDIDATES_EVENTS.with_label_values(&["remove"]));
// Metrics collected on WAL redo operations
@@ -785,7 +968,7 @@ macro_rules! redo_bytes_histogram_count_buckets {
};
}
pub static WAL_REDO_TIME: Lazy<Histogram> = Lazy::new(|| {
pub(crate) static WAL_REDO_TIME: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"pageserver_wal_redo_seconds",
"Time spent on WAL redo",
@@ -794,7 +977,7 @@ pub static WAL_REDO_TIME: Lazy<Histogram> = Lazy::new(|| {
.expect("failed to define a metric")
});
pub static WAL_REDO_WAIT_TIME: Lazy<Histogram> = Lazy::new(|| {
pub(crate) static WAL_REDO_WAIT_TIME: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"pageserver_wal_redo_wait_seconds",
"Time spent waiting for access to the Postgres WAL redo process",
@@ -803,7 +986,7 @@ pub static WAL_REDO_WAIT_TIME: Lazy<Histogram> = Lazy::new(|| {
.expect("failed to define a metric")
});
pub static WAL_REDO_RECORDS_HISTOGRAM: Lazy<Histogram> = Lazy::new(|| {
pub(crate) static WAL_REDO_RECORDS_HISTOGRAM: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"pageserver_wal_redo_records_histogram",
"Histogram of number of records replayed per redo in the Postgres WAL redo process",
@@ -812,7 +995,7 @@ pub static WAL_REDO_RECORDS_HISTOGRAM: Lazy<Histogram> = Lazy::new(|| {
.expect("failed to define a metric")
});
pub static WAL_REDO_BYTES_HISTOGRAM: Lazy<Histogram> = Lazy::new(|| {
pub(crate) static WAL_REDO_BYTES_HISTOGRAM: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"pageserver_wal_redo_bytes_histogram",
"Histogram of number of records replayed per redo sent to Postgres",
@@ -821,7 +1004,8 @@ pub static WAL_REDO_BYTES_HISTOGRAM: Lazy<Histogram> = Lazy::new(|| {
.expect("failed to define a metric")
});
pub static WAL_REDO_RECORD_COUNTER: Lazy<IntCounter> = Lazy::new(|| {
// FIXME: isn't this already included by WAL_REDO_RECORDS_HISTOGRAM which has _count?
pub(crate) static WAL_REDO_RECORD_COUNTER: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pageserver_replayed_wal_records_total",
"Number of WAL records replayed in WAL redo process"
@@ -897,7 +1081,6 @@ impl StorageTimeMetrics {
pub struct TimelineMetrics {
tenant_id: String,
timeline_id: String,
pub get_reconstruct_data_time_histo: Histogram,
pub flush_time_histo: StorageTimeMetrics,
pub compact_time_histo: StorageTimeMetrics,
pub create_images_time_histo: StorageTimeMetrics,
@@ -906,9 +1089,7 @@ pub struct TimelineMetrics {
pub load_layer_map_histo: StorageTimeMetrics,
pub garbage_collect_histo: StorageTimeMetrics,
pub last_record_gauge: IntGauge,
pub wait_lsn_time_histo: Histogram,
pub resident_physical_size_gauge: UIntGauge,
pub read_num_fs_layers: Histogram,
/// copy of LayeredTimeline.current_logical_size
pub current_logical_size_gauge: UIntGauge,
pub num_persistent_files_created: IntCounter,
@@ -925,9 +1106,6 @@ impl TimelineMetrics {
) -> Self {
let tenant_id = tenant_id.to_string();
let timeline_id = timeline_id.to_string();
let get_reconstruct_data_time_histo = GET_RECONSTRUCT_DATA_TIME
.get_metric_with_label_values(&[&tenant_id, &timeline_id])
.unwrap();
let flush_time_histo =
StorageTimeMetrics::new(StorageTimeOperation::LayerFlush, &tenant_id, &timeline_id);
let compact_time_histo =
@@ -948,9 +1126,6 @@ impl TimelineMetrics {
let last_record_gauge = LAST_RECORD_LSN
.get_metric_with_label_values(&[&tenant_id, &timeline_id])
.unwrap();
let wait_lsn_time_histo = WAIT_LSN_TIME
.get_metric_with_label_values(&[&tenant_id, &timeline_id])
.unwrap();
let resident_physical_size_gauge = RESIDENT_PHYSICAL_SIZE
.get_metric_with_label_values(&[&tenant_id, &timeline_id])
.unwrap();
@@ -966,16 +1141,12 @@ impl TimelineMetrics {
let evictions = EVICTIONS
.get_metric_with_label_values(&[&tenant_id, &timeline_id])
.unwrap();
let read_num_fs_layers = READ_NUM_FS_LAYERS
.get_metric_with_label_values(&[&tenant_id, &timeline_id])
.unwrap();
let evictions_with_low_residence_duration =
evictions_with_low_residence_duration_builder.build(&tenant_id, &timeline_id);
TimelineMetrics {
tenant_id,
timeline_id,
get_reconstruct_data_time_histo,
flush_time_histo,
compact_time_histo,
create_images_time_histo,
@@ -984,7 +1155,6 @@ impl TimelineMetrics {
garbage_collect_histo,
load_layer_map_histo,
last_record_gauge,
wait_lsn_time_histo,
resident_physical_size_gauge,
current_logical_size_gauge,
num_persistent_files_created,
@@ -993,7 +1163,6 @@ impl TimelineMetrics {
evictions_with_low_residence_duration: std::sync::RwLock::new(
evictions_with_low_residence_duration,
),
read_num_fs_layers,
}
}
}
@@ -1002,36 +1171,40 @@ impl Drop for TimelineMetrics {
fn drop(&mut self) {
let tenant_id = &self.tenant_id;
let timeline_id = &self.timeline_id;
let _ = GET_RECONSTRUCT_DATA_TIME.remove_label_values(&[tenant_id, timeline_id]);
let _ = LAST_RECORD_LSN.remove_label_values(&[tenant_id, timeline_id]);
let _ = WAIT_LSN_TIME.remove_label_values(&[tenant_id, timeline_id]);
let _ = RESIDENT_PHYSICAL_SIZE.remove_label_values(&[tenant_id, timeline_id]);
let _ = CURRENT_LOGICAL_SIZE.remove_label_values(&[tenant_id, timeline_id]);
let _ = NUM_PERSISTENT_FILES_CREATED.remove_label_values(&[tenant_id, timeline_id]);
let _ = PERSISTENT_BYTES_WRITTEN.remove_label_values(&[tenant_id, timeline_id]);
let _ = EVICTIONS.remove_label_values(&[tenant_id, timeline_id]);
let _ = READ_NUM_FS_LAYERS.remove_label_values(&[tenant_id, timeline_id]);
self.evictions_with_low_residence_duration
.write()
.unwrap()
.remove(tenant_id, timeline_id);
// The following metrics are born outside of the TimelineMetrics lifecycle but still
// removed at the end of it. The idea is to have the metrics outlive the
// entity during which they're observed, e.g., the smgr metrics shall
// outlive an individual smgr connection, but not the timeline.
for op in StorageTimeOperation::VARIANTS {
let _ =
STORAGE_TIME_SUM_PER_TIMELINE.remove_label_values(&[op, tenant_id, timeline_id]);
let _ =
STORAGE_TIME_COUNT_PER_TIMELINE.remove_label_values(&[op, tenant_id, timeline_id]);
}
for op in STORAGE_IO_TIME_OPERATIONS {
let _ = STORAGE_IO_TIME.remove_label_values(&[op, tenant_id, timeline_id]);
}
for op in STORAGE_IO_SIZE_OPERATIONS {
let _ = STORAGE_IO_SIZE.remove_label_values(&[op, tenant_id, timeline_id]);
}
for op in SMGR_QUERY_TIME_OPERATIONS {
let _ = SMGR_QUERY_TIME.remove_label_values(&[op, tenant_id, timeline_id]);
for op in SmgrQueryType::iter() {
let _ = SMGR_QUERY_TIME_PER_TENANT_TIMELINE.remove_label_values(&[
op.into(),
tenant_id,
timeline_id,
]);
}
}
}
@@ -1039,9 +1212,7 @@ impl Drop for TimelineMetrics {
pub fn remove_tenant_metrics(tenant_id: &TenantId) {
let tid = tenant_id.to_string();
let _ = TENANT_SYNTHETIC_SIZE_METRIC.remove_label_values(&[&tid]);
for state in TenantState::VARIANTS {
let _ = TENANT_STATE_METRIC.remove_label_values(&[&tid, state]);
}
// we leave the BROKEN_TENANTS_SET entry if any
}
use futures::Future;
@@ -1056,9 +1227,7 @@ pub struct RemoteTimelineClientMetrics {
tenant_id: String,
timeline_id: String,
remote_physical_size_gauge: Mutex<Option<UIntGauge>>,
remote_operation_time: Mutex<HashMap<(&'static str, &'static str, &'static str), Histogram>>,
calls_unfinished_gauge: Mutex<HashMap<(&'static str, &'static str), IntGauge>>,
calls_started_hist: Mutex<HashMap<(&'static str, &'static str), Histogram>>,
bytes_started_counter: Mutex<HashMap<(&'static str, &'static str), IntCounter>>,
bytes_finished_counter: Mutex<HashMap<(&'static str, &'static str), IntCounter>>,
}
@@ -1068,14 +1237,13 @@ impl RemoteTimelineClientMetrics {
RemoteTimelineClientMetrics {
tenant_id: tenant_id.to_string(),
timeline_id: timeline_id.to_string(),
remote_operation_time: Mutex::new(HashMap::default()),
calls_unfinished_gauge: Mutex::new(HashMap::default()),
calls_started_hist: Mutex::new(HashMap::default()),
bytes_started_counter: Mutex::new(HashMap::default()),
bytes_finished_counter: Mutex::new(HashMap::default()),
remote_physical_size_gauge: Mutex::new(None),
}
}
pub fn remote_physical_size_gauge(&self) -> UIntGauge {
let mut guard = self.remote_physical_size_gauge.lock().unwrap();
guard
@@ -1089,26 +1257,17 @@ impl RemoteTimelineClientMetrics {
})
.clone()
}
pub fn remote_operation_time(
&self,
file_kind: &RemoteOpFileKind,
op_kind: &RemoteOpKind,
status: &'static str,
) -> Histogram {
let mut guard = self.remote_operation_time.lock().unwrap();
let key = (file_kind.as_str(), op_kind.as_str(), status);
let metric = guard.entry(key).or_insert_with(move || {
REMOTE_OPERATION_TIME
.get_metric_with_label_values(&[
&self.tenant_id.to_string(),
&self.timeline_id.to_string(),
key.0,
key.1,
key.2,
])
.unwrap()
});
metric.clone()
REMOTE_OPERATION_TIME
.get_metric_with_label_values(&[key.0, key.1, key.2])
.unwrap()
}
fn calls_unfinished_gauge(
@@ -1136,19 +1295,10 @@ impl RemoteTimelineClientMetrics {
file_kind: &RemoteOpFileKind,
op_kind: &RemoteOpKind,
) -> Histogram {
let mut guard = self.calls_started_hist.lock().unwrap();
let key = (file_kind.as_str(), op_kind.as_str());
let metric = guard.entry(key).or_insert_with(move || {
REMOTE_TIMELINE_CLIENT_CALLS_STARTED_HIST
.get_metric_with_label_values(&[
&self.tenant_id.to_string(),
&self.timeline_id.to_string(),
key.0,
key.1,
])
.unwrap()
});
metric.clone()
REMOTE_TIMELINE_CLIENT_CALLS_STARTED_HIST
.get_metric_with_label_values(&[key.0, key.1])
.unwrap()
}
fn bytes_started_counter(
@@ -1328,15 +1478,10 @@ impl Drop for RemoteTimelineClientMetrics {
tenant_id,
timeline_id,
remote_physical_size_gauge,
remote_operation_time,
calls_unfinished_gauge,
calls_started_hist,
bytes_started_counter,
bytes_finished_counter,
} = self;
for ((a, b, c), _) in remote_operation_time.get_mut().unwrap().drain() {
let _ = REMOTE_OPERATION_TIME.remove_label_values(&[tenant_id, timeline_id, a, b, c]);
}
for ((a, b), _) in calls_unfinished_gauge.get_mut().unwrap().drain() {
let _ = REMOTE_TIMELINE_CLIENT_CALLS_UNFINISHED_GAUGE.remove_label_values(&[
tenant_id,
@@ -1345,14 +1490,6 @@ impl Drop for RemoteTimelineClientMetrics {
b,
]);
}
for ((a, b), _) in calls_started_hist.get_mut().unwrap().drain() {
let _ = REMOTE_TIMELINE_CLIENT_CALLS_STARTED_HIST.remove_label_values(&[
tenant_id,
timeline_id,
a,
b,
]);
}
for ((a, b), _) in bytes_started_counter.get_mut().unwrap().drain() {
let _ = REMOTE_TIMELINE_CLIENT_BYTES_STARTED_COUNTER.remove_label_values(&[
tenant_id,
@@ -1434,15 +1571,51 @@ impl<F: Future<Output = Result<O, E>>, O, E> Future for MeasuredRemoteOp<F> {
}
pub fn preinitialize_metrics() {
// We want to alert on this metric increasing.
// Initialize it eagerly, so that our alert rule can distinguish absence of the metric from metric value 0.
assert_eq!(UNEXPECTED_ONDEMAND_DOWNLOADS.get(), 0);
UNEXPECTED_ONDEMAND_DOWNLOADS.reset();
// Python tests need these and on some we do alerting.
//
// FIXME(4813): make it so that we have no top level metrics as this fn will easily fall out of
// order:
// - global metrics reside in a Lazy<PageserverMetrics>
// - access via crate::metrics::PS_METRICS.materialized_page_cache_hit.inc()
// - could move the statics into TimelineMetrics::new()?
// Same as above for this metric, but, it's a Vec-type metric for which we don't know all the labels.
BACKGROUND_LOOP_PERIOD_OVERRUN_COUNT.reset();
// counters
[
&MATERIALIZED_PAGE_CACHE_HIT,
&MATERIALIZED_PAGE_CACHE_HIT_DIRECT,
&UNEXPECTED_ONDEMAND_DOWNLOADS,
&WALRECEIVER_STARTED_CONNECTIONS,
&WALRECEIVER_BROKER_UPDATES,
&WALRECEIVER_CANDIDATES_ADDED,
&WALRECEIVER_CANDIDATES_REMOVED,
]
.into_iter()
.for_each(|c| {
Lazy::force(c);
});
// Python tests need these.
MATERIALIZED_PAGE_CACHE_HIT_DIRECT.get();
MATERIALIZED_PAGE_CACHE_HIT.get();
// countervecs
[&BACKGROUND_LOOP_PERIOD_OVERRUN_COUNT]
.into_iter()
.for_each(|c| {
Lazy::force(c);
});
// gauges
WALRECEIVER_ACTIVE_MANAGERS.get();
// histograms
[
&READ_NUM_FS_LAYERS,
&RECONSTRUCT_TIME,
&WAIT_LSN_TIME,
&WAL_REDO_TIME,
&WAL_REDO_WAIT_TIME,
&WAL_REDO_RECORDS_HISTOGRAM,
&WAL_REDO_BYTES_HISTOGRAM,
]
.into_iter()
.for_each(|h| {
Lazy::force(h);
});
}

View File

@@ -10,6 +10,42 @@
//! PostgreSQL buffer size, and a Slot struct for each buffer to contain
//! information about what's stored in the buffer.
//!
//! # Types Of Pages
//!
//! [`PageCache`] only supports immutable pages.
//! Hence there is no need to worry about coherency.
//!
//! Two types of pages are supported:
//!
//! * **Materialized pages**, filled & used by page reconstruction
//! * **Immutable File pages**, filled & used by [`crate::tenant::block_io`] and [`crate::tenant::ephemeral_file`].
//!
//! Note that [`crate::tenant::ephemeral_file::EphemeralFile`] is generally mutable, but, it's append-only.
//! It uses the page cache only for the blocks that are already fully written and immutable.
//!
//! # Filling The Page Cache
//!
//! Page cache maps from a cache key to a buffer slot.
//! The cache key uniquely identifies the piece of data that is being cached.
//!
//! The cache key for **materialized pages** is [`TenantId`], [`TimelineId`], [`Key`], and [`Lsn`].
//! Use [`PageCache::memorize_materialized_page`] and [`PageCache::lookup_materialized_page`] for fill & access.
//!
//! The cache key for **immutable file** pages is [`FileId`] and a block number.
//! Users of page cache that wish to page-cache an arbitrary (immutable!) on-disk file do the following:
//! * Have a mechanism to deterministically associate the on-disk file with a [`FileId`].
//! * Get a [`FileId`] using [`next_file_id`].
//! * Use the mechanism to associate the on-disk file with the returned [`FileId`].
//! * Use [`PageCache::read_immutable_buf`] to get a [`ReadBufResult`].
//! * If the page was already cached, it'll be the [`ReadBufResult::Found`] variant that contains
//! a read guard for the page. Just use it.
//! * If the page was not cached, it'll be the [`ReadBufResult::NotFound`] variant that contains
//! a write guard for the page. Fill the page with the contents of the on-disk file.
//! Then call [`PageWriteGuard::mark_valid`] to mark the page as valid.
//! Then try again to [`PageCache::read_immutable_buf`].
//! Unless there's high cache pressure, the page should now be cached.
//! (TODO: allow downgrading the write guard to a read guard to ensure forward progress.)
//!
//! # Locking
//!
//! There are two levels of locking involved: There's one lock for the "mapping"
@@ -39,21 +75,16 @@
use std::{
collections::{hash_map::Entry, HashMap},
convert::TryInto,
sync::{
atomic::{AtomicU8, AtomicUsize, Ordering},
RwLock, RwLockReadGuard, RwLockWriteGuard, TryLockError,
},
sync::atomic::{AtomicU64, AtomicU8, AtomicUsize, Ordering},
};
use anyhow::Context;
use once_cell::sync::OnceCell;
use tracing::error;
use utils::{
id::{TenantId, TimelineId},
lsn::Lsn,
};
use crate::tenant::writeback_ephemeral_file;
use crate::{metrics::PageCacheSizeMetrics, repository::Key};
static PAGE_CACHE: OnceCell<PageCache> = OnceCell::new();
@@ -87,6 +118,17 @@ pub fn get() -> &'static PageCache {
pub const PAGE_SZ: usize = postgres_ffi::BLCKSZ as usize;
const MAX_USAGE_COUNT: u8 = 5;
/// See module-level comment.
#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
pub struct FileId(u64);
static NEXT_ID: AtomicU64 = AtomicU64::new(1);
/// See module-level comment.
pub fn next_file_id() -> FileId {
FileId(NEXT_ID.fetch_add(1, Ordering::Relaxed))
}
///
/// CacheKey uniquely identifies a "thing" to cache in the page cache.
///
@@ -97,12 +139,8 @@ enum CacheKey {
hash_key: MaterializedPageHashKey,
lsn: Lsn,
},
EphemeralPage {
file_id: u64,
blkno: u32,
},
ImmutableFilePage {
file_id: u64,
file_id: FileId,
blkno: u32,
},
}
@@ -121,14 +159,13 @@ struct Version {
}
struct Slot {
inner: RwLock<SlotInner>,
inner: tokio::sync::RwLock<SlotInner>,
usage_count: AtomicU8,
}
struct SlotInner {
key: Option<CacheKey>,
buf: &'static mut [u8; PAGE_SZ],
dirty: bool,
}
impl Slot {
@@ -163,6 +200,11 @@ impl Slot {
Err(usage_count) => usage_count,
}
}
/// Sets the usage count to a specific value.
fn set_usage_count(&self, count: u8) {
self.usage_count.store(count, Ordering::Relaxed);
}
}
pub struct PageCache {
@@ -175,11 +217,9 @@ pub struct PageCache {
///
/// If you add support for caching different kinds of objects, each object kind
/// can have a separate mapping map, next to this field.
materialized_page_map: RwLock<HashMap<MaterializedPageHashKey, Vec<Version>>>,
materialized_page_map: std::sync::RwLock<HashMap<MaterializedPageHashKey, Vec<Version>>>,
ephemeral_page_map: RwLock<HashMap<(u64, u32), usize>>,
immutable_page_map: RwLock<HashMap<(u64, u32), usize>>,
immutable_page_map: std::sync::RwLock<HashMap<(FileId, u32), usize>>,
/// The actual buffers with their metadata.
slots: Box<[Slot]>,
@@ -195,7 +235,7 @@ pub struct PageCache {
/// PageReadGuard is a "lease" on a buffer, for reading. The page is kept locked
/// until the guard is dropped.
///
pub struct PageReadGuard<'i>(RwLockReadGuard<'i, SlotInner>);
pub struct PageReadGuard<'i>(tokio::sync::RwLockReadGuard<'i, SlotInner>);
impl std::ops::Deref for PageReadGuard<'_> {
type Target = [u8; PAGE_SZ];
@@ -222,9 +262,10 @@ impl AsRef<[u8; PAGE_SZ]> for PageReadGuard<'_> {
/// to initialize.
///
pub struct PageWriteGuard<'i> {
inner: RwLockWriteGuard<'i, SlotInner>,
inner: tokio::sync::RwLockWriteGuard<'i, SlotInner>,
// Are the page contents currently valid?
// Used to mark pages as invalid that are assigned but not yet filled with data.
valid: bool,
}
@@ -258,14 +299,6 @@ impl PageWriteGuard<'_> {
);
self.valid = true;
}
pub fn mark_dirty(&mut self) {
// only ephemeral pages can be dirty ATM.
assert!(matches!(
self.inner.key,
Some(CacheKey::EphemeralPage { .. })
));
self.inner.dirty = true;
}
}
impl Drop for PageWriteGuard<'_> {
@@ -280,7 +313,6 @@ impl Drop for PageWriteGuard<'_> {
let self_key = self.inner.key.as_ref().unwrap();
PAGE_CACHE.get().unwrap().remove_mapping(self_key);
self.inner.key = None;
self.inner.dirty = false;
}
}
}
@@ -308,7 +340,7 @@ impl PageCache {
/// The 'lsn' is an upper bound, this will return the latest version of
/// the given block, but not newer than 'lsn'. Returns the actual LSN of the
/// returned page.
pub fn lookup_materialized_page(
pub async fn lookup_materialized_page(
&self,
tenant_id: TenantId,
timeline_id: TimelineId,
@@ -328,7 +360,7 @@ impl PageCache {
lsn,
};
if let Some(guard) = self.try_lock_for_read(&mut cache_key) {
if let Some(guard) = self.try_lock_for_read(&mut cache_key).await {
if let CacheKey::MaterializedPage {
hash_key: _,
lsn: available_lsn,
@@ -355,7 +387,7 @@ impl PageCache {
///
/// Store an image of the given page in the cache.
///
pub fn memorize_materialized_page(
pub async fn memorize_materialized_page(
&self,
tenant_id: TenantId,
timeline_id: TimelineId,
@@ -372,7 +404,7 @@ impl PageCache {
lsn,
};
match self.lock_for_write(&cache_key)? {
match self.lock_for_write(&cache_key).await? {
WriteBufResult::Found(write_guard) => {
// We already had it in cache. Another thread must've put it there
// concurrently. Check that it had the same contents that we
@@ -388,68 +420,16 @@ impl PageCache {
Ok(())
}
// Section 1.2: Public interface functions for working with Ephemeral pages.
// Section 1.2: Public interface functions for working with immutable file pages.
pub fn read_ephemeral_buf(&self, file_id: u64, blkno: u32) -> anyhow::Result<ReadBufResult> {
let mut cache_key = CacheKey::EphemeralPage { file_id, blkno };
self.lock_for_read(&mut cache_key)
}
pub fn write_ephemeral_buf(&self, file_id: u64, blkno: u32) -> anyhow::Result<WriteBufResult> {
let cache_key = CacheKey::EphemeralPage { file_id, blkno };
self.lock_for_write(&cache_key)
}
/// Immediately drop all buffers belonging to given file, without writeback
pub fn drop_buffers_for_ephemeral(&self, drop_file_id: u64) {
for slot_idx in 0..self.slots.len() {
let slot = &self.slots[slot_idx];
let mut inner = slot.inner.write().unwrap();
if let Some(key) = &inner.key {
match key {
CacheKey::EphemeralPage { file_id, blkno: _ } if *file_id == drop_file_id => {
// remove mapping for old buffer
self.remove_mapping(key);
inner.key = None;
inner.dirty = false;
}
_ => {}
}
}
}
}
// Section 1.3: Public interface functions for working with immutable file pages.
pub fn read_immutable_buf(&self, file_id: u64, blkno: u32) -> anyhow::Result<ReadBufResult> {
pub async fn read_immutable_buf(
&self,
file_id: FileId,
blkno: u32,
) -> anyhow::Result<ReadBufResult> {
let mut cache_key = CacheKey::ImmutableFilePage { file_id, blkno };
self.lock_for_read(&mut cache_key)
}
/// Immediately drop all buffers belonging to given file, without writeback
pub fn drop_buffers_for_immutable(&self, drop_file_id: u64) {
for slot_idx in 0..self.slots.len() {
let slot = &self.slots[slot_idx];
let mut inner = slot.inner.write().unwrap();
if let Some(key) = &inner.key {
match key {
CacheKey::ImmutableFilePage { file_id, blkno: _ }
if *file_id == drop_file_id =>
{
// remove mapping for old buffer
self.remove_mapping(key);
inner.key = None;
inner.dirty = false;
}
_ => {}
}
}
}
self.lock_for_read(&mut cache_key).await
}
//
@@ -469,14 +449,14 @@ impl PageCache {
///
/// If no page is found, returns None and *cache_key is left unmodified.
///
fn try_lock_for_read(&self, cache_key: &mut CacheKey) -> Option<PageReadGuard> {
async fn try_lock_for_read(&self, cache_key: &mut CacheKey) -> Option<PageReadGuard> {
let cache_key_orig = cache_key.clone();
if let Some(slot_idx) = self.search_mapping(cache_key) {
// The page was found in the mapping. Lock the slot, and re-check
// that it's still what we expected (because we released the mapping
// lock already, another thread could have evicted the page)
let slot = &self.slots[slot_idx];
let inner = slot.inner.read().unwrap();
let inner = slot.inner.read().await;
if inner.key.as_ref() == Some(cache_key) {
slot.inc_usage_count();
return Some(PageReadGuard(inner));
@@ -517,15 +497,11 @@ impl PageCache {
/// }
/// ```
///
fn lock_for_read(&self, cache_key: &mut CacheKey) -> anyhow::Result<ReadBufResult> {
async fn lock_for_read(&self, cache_key: &mut CacheKey) -> anyhow::Result<ReadBufResult> {
let (read_access, hit) = match cache_key {
CacheKey::MaterializedPage { .. } => {
unreachable!("Materialized pages use lookup_materialized_page")
}
CacheKey::EphemeralPage { .. } => (
&crate::metrics::PAGE_CACHE.read_accesses_ephemeral,
&crate::metrics::PAGE_CACHE.read_hits_ephemeral,
),
CacheKey::ImmutableFilePage { .. } => (
&crate::metrics::PAGE_CACHE.read_accesses_immutable,
&crate::metrics::PAGE_CACHE.read_hits_immutable,
@@ -536,7 +512,7 @@ impl PageCache {
let mut is_first_iteration = true;
loop {
// First check if the key already exists in the cache.
if let Some(read_guard) = self.try_lock_for_read(cache_key) {
if let Some(read_guard) = self.try_lock_for_read(cache_key).await {
if is_first_iteration {
hit.inc();
}
@@ -566,8 +542,7 @@ impl PageCache {
// Make the slot ready
let slot = &self.slots[slot_idx];
inner.key = Some(cache_key.clone());
inner.dirty = false;
slot.usage_count.store(1, Ordering::Relaxed);
slot.set_usage_count(1);
return Ok(ReadBufResult::NotFound(PageWriteGuard {
inner,
@@ -580,13 +555,13 @@ impl PageCache {
/// found, returns None.
///
/// When locking a page for writing, the search criteria is always "exact".
fn try_lock_for_write(&self, cache_key: &CacheKey) -> Option<PageWriteGuard> {
async fn try_lock_for_write(&self, cache_key: &CacheKey) -> Option<PageWriteGuard> {
if let Some(slot_idx) = self.search_mapping_for_write(cache_key) {
// The page was found in the mapping. Lock the slot, and re-check
// that it's still what we expected (because we don't released the mapping
// lock already, another thread could have evicted the page)
let slot = &self.slots[slot_idx];
let inner = slot.inner.write().unwrap();
let inner = slot.inner.write().await;
if inner.key.as_ref() == Some(cache_key) {
slot.inc_usage_count();
return Some(PageWriteGuard { inner, valid: true });
@@ -599,10 +574,10 @@ impl PageCache {
///
/// Similar to lock_for_read(), but the returned buffer is write-locked and
/// may be modified by the caller even if it's already found in the cache.
fn lock_for_write(&self, cache_key: &CacheKey) -> anyhow::Result<WriteBufResult> {
async fn lock_for_write(&self, cache_key: &CacheKey) -> anyhow::Result<WriteBufResult> {
loop {
// First check if the key already exists in the cache.
if let Some(write_guard) = self.try_lock_for_write(cache_key) {
if let Some(write_guard) = self.try_lock_for_write(cache_key).await {
return Ok(WriteBufResult::Found(write_guard));
}
@@ -628,8 +603,7 @@ impl PageCache {
// Make the slot ready
let slot = &self.slots[slot_idx];
inner.key = Some(cache_key.clone());
inner.dirty = false;
slot.usage_count.store(1, Ordering::Relaxed);
slot.set_usage_count(1);
return Ok(WriteBufResult::NotFound(PageWriteGuard {
inner,
@@ -667,10 +641,6 @@ impl PageCache {
*lsn = version.lsn;
Some(version.slot_idx)
}
CacheKey::EphemeralPage { file_id, blkno } => {
let map = self.ephemeral_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
}
CacheKey::ImmutableFilePage { file_id, blkno } => {
let map = self.immutable_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
@@ -694,10 +664,6 @@ impl PageCache {
None
}
}
CacheKey::EphemeralPage { file_id, blkno } => {
let map = self.ephemeral_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
}
CacheKey::ImmutableFilePage { file_id, blkno } => {
let map = self.immutable_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
@@ -731,12 +697,6 @@ impl PageCache {
panic!("could not find old key in mapping")
}
}
CacheKey::EphemeralPage { file_id, blkno } => {
let mut map = self.ephemeral_page_map.write().unwrap();
map.remove(&(*file_id, *blkno))
.expect("could not find old key in mapping");
self.size_metrics.current_bytes_ephemeral.sub_page_sz(1);
}
CacheKey::ImmutableFilePage { file_id, blkno } => {
let mut map = self.immutable_page_map.write().unwrap();
map.remove(&(*file_id, *blkno))
@@ -776,17 +736,7 @@ impl PageCache {
}
}
}
CacheKey::EphemeralPage { file_id, blkno } => {
let mut map = self.ephemeral_page_map.write().unwrap();
match map.entry((*file_id, *blkno)) {
Entry::Occupied(entry) => Some(*entry.get()),
Entry::Vacant(entry) => {
entry.insert(slot_idx);
self.size_metrics.current_bytes_ephemeral.add_page_sz(1);
None
}
}
}
CacheKey::ImmutableFilePage { file_id, blkno } => {
let mut map = self.immutable_page_map.write().unwrap();
match map.entry((*file_id, *blkno)) {
@@ -808,7 +758,7 @@ impl PageCache {
/// Find a slot to evict.
///
/// On return, the slot is empty and write-locked.
fn find_victim(&self) -> anyhow::Result<(usize, RwLockWriteGuard<SlotInner>)> {
fn find_victim(&self) -> anyhow::Result<(usize, tokio::sync::RwLockWriteGuard<SlotInner>)> {
let iter_limit = self.slots.len() * 10;
let mut iters = 0;
loop {
@@ -820,10 +770,7 @@ impl PageCache {
if slot.dec_usage_count() == 0 {
let mut inner = match slot.inner.try_write() {
Ok(inner) => inner,
Err(TryLockError::Poisoned(err)) => {
anyhow::bail!("buffer lock was poisoned: {err:?}")
}
Err(TryLockError::WouldBlock) => {
Err(_err) => {
// If we have looped through the whole buffer pool 10 times
// and still haven't found a victim buffer, something's wrong.
// Maybe all the buffers were in locked. That could happen in
@@ -837,25 +784,8 @@ impl PageCache {
}
};
if let Some(old_key) = &inner.key {
if inner.dirty {
if let Err(err) = Self::writeback(old_key, inner.buf) {
// Writing the page to disk failed.
//
// FIXME: What to do here, when? We could propagate the error to the
// caller, but victim buffer is generally unrelated to the original
// call. It can even belong to a different tenant. Currently, we
// report the error to the log and continue the clock sweep to find
// a different victim. But if the problem persists, the page cache
// could fill up with dirty pages that we cannot evict, and we will
// loop retrying the writebacks indefinitely.
error!("writeback of buffer {:?} failed: {}", old_key, err);
continue;
}
}
// remove mapping for old buffer
self.remove_mapping(old_key);
inner.dirty = false;
inner.key = None;
}
return Ok((slot_idx, inner));
@@ -863,39 +793,18 @@ impl PageCache {
}
}
fn writeback(cache_key: &CacheKey, buf: &[u8]) -> Result<(), std::io::Error> {
match cache_key {
CacheKey::MaterializedPage {
hash_key: _,
lsn: _,
} => Err(std::io::Error::new(
std::io::ErrorKind::Other,
"unexpected dirty materialized page",
)),
CacheKey::EphemeralPage { file_id, blkno } => {
writeback_ephemeral_file(*file_id, *blkno, buf)
}
CacheKey::ImmutableFilePage {
file_id: _,
blkno: _,
} => Err(std::io::Error::new(
std::io::ErrorKind::Other,
"unexpected dirty immutable page",
)),
}
}
/// Initialize a new page cache
///
/// This should be called only once at page server startup.
fn new(num_pages: usize) -> Self {
assert!(num_pages > 0, "page cache size must be > 0");
// We use Box::leak here and into_boxed_slice to avoid leaking uninitialized
// memory that Vec's might contain.
let page_buffer = Box::leak(vec![0u8; num_pages * PAGE_SZ].into_boxed_slice());
let size_metrics = &crate::metrics::PAGE_CACHE_SIZE;
size_metrics.max_bytes.set_page_sz(num_pages);
size_metrics.current_bytes_ephemeral.set_page_sz(0);
size_metrics.current_bytes_immutable.set_page_sz(0);
size_metrics.current_bytes_materialized_page.set_page_sz(0);
@@ -905,11 +814,7 @@ impl PageCache {
let buf: &mut [u8; PAGE_SZ] = chunk.try_into().unwrap();
Slot {
inner: RwLock::new(SlotInner {
key: None,
buf,
dirty: false,
}),
inner: tokio::sync::RwLock::new(SlotInner { key: None, buf }),
usage_count: AtomicU8::new(0),
}
})
@@ -917,7 +822,6 @@ impl PageCache {
Self {
materialized_page_map: Default::default(),
ephemeral_page_map: Default::default(),
immutable_page_map: Default::default(),
slots,
next_evict_slot: AtomicUsize::new(0),

View File

@@ -50,7 +50,8 @@ use crate::basebackup;
use crate::config::PageServerConf;
use crate::context::{DownloadBehavior, RequestContext};
use crate::import_datadir::import_wal_from_tar;
use crate::metrics::{LIVE_CONNECTIONS_COUNT, SMGR_QUERY_TIME};
use crate::metrics;
use crate::metrics::LIVE_CONNECTIONS_COUNT;
use crate::task_mgr;
use crate::task_mgr::TaskKind;
use crate::tenant;
@@ -306,39 +307,6 @@ async fn page_service_conn_main(
}
}
struct PageRequestMetrics {
get_rel_exists: metrics::Histogram,
get_rel_size: metrics::Histogram,
get_page_at_lsn: metrics::Histogram,
get_db_size: metrics::Histogram,
}
impl PageRequestMetrics {
fn new(tenant_id: &TenantId, timeline_id: &TimelineId) -> Self {
let tenant_id = tenant_id.to_string();
let timeline_id = timeline_id.to_string();
let get_rel_exists =
SMGR_QUERY_TIME.with_label_values(&["get_rel_exists", &tenant_id, &timeline_id]);
let get_rel_size =
SMGR_QUERY_TIME.with_label_values(&["get_rel_size", &tenant_id, &timeline_id]);
let get_page_at_lsn =
SMGR_QUERY_TIME.with_label_values(&["get_page_at_lsn", &tenant_id, &timeline_id]);
let get_db_size =
SMGR_QUERY_TIME.with_label_values(&["get_db_size", &tenant_id, &timeline_id]);
Self {
get_rel_exists,
get_rel_size,
get_page_at_lsn,
get_db_size,
}
}
}
struct PageServerHandler {
_conf: &'static PageServerConf,
broker_client: storage_broker::BrokerClientChannel,
@@ -406,7 +374,7 @@ impl PageServerHandler {
pgb.write_message_noflush(&BeMessage::CopyBothResponse)?;
pgb.flush().await?;
let metrics = PageRequestMetrics::new(&tenant_id, &timeline_id);
let metrics = metrics::SmgrQueryTimePerTimeline::new(&tenant_id, &timeline_id);
loop {
let msg = tokio::select! {
@@ -446,21 +414,21 @@ impl PageServerHandler {
let response = match neon_fe_msg {
PagestreamFeMessage::Exists(req) => {
let _timer = metrics.get_rel_exists.start_timer();
let _timer = metrics.start_timer(metrics::SmgrQueryType::GetRelExists);
self.handle_get_rel_exists_request(&timeline, &req, &ctx)
.await
}
PagestreamFeMessage::Nblocks(req) => {
let _timer = metrics.get_rel_size.start_timer();
let _timer = metrics.start_timer(metrics::SmgrQueryType::GetRelSize);
self.handle_get_nblocks_request(&timeline, &req, &ctx).await
}
PagestreamFeMessage::GetPage(req) => {
let _timer = metrics.get_page_at_lsn.start_timer();
let _timer = metrics.start_timer(metrics::SmgrQueryType::GetPageAtLsn);
self.handle_get_page_at_lsn_request(&timeline, &req, &ctx)
.await
}
PagestreamFeMessage::DbSize(req) => {
let _timer = metrics.get_db_size.start_timer();
let _timer = metrics.start_timer(metrics::SmgrQueryType::GetDbSize);
self.handle_db_size_request(&timeline, &req, &ctx).await
}
};
@@ -501,7 +469,9 @@ impl PageServerHandler {
// Create empty timeline
info!("creating new timeline");
let tenant = get_active_tenant_with_timeout(tenant_id, &ctx).await?;
let timeline = tenant.create_empty_timeline(timeline_id, base_lsn, pg_version, &ctx)?;
let timeline = tenant
.create_empty_timeline(timeline_id, base_lsn, pg_version, &ctx)
.await?;
// TODO mark timeline as not ready until it reaches end_lsn.
// We might have some wal to import as well, and we should prevent compute
@@ -984,8 +954,8 @@ where
false
};
metrics::metric_vec_duration::observe_async_block_duration_by_result(
&*crate::metrics::BASEBACKUP_QUERY_TIME,
::metrics::metric_vec_duration::observe_async_block_duration_by_result(
&*metrics::BASEBACKUP_QUERY_TIME,
async move {
self.handle_basebackup_request(
pgb,

View File

@@ -130,11 +130,25 @@ pub static WALRECEIVER_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
pub static BACKGROUND_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
tokio::runtime::Builder::new_multi_thread()
.thread_name("background op worker")
// if you change the number of worker threads please change the constant below
.enable_all()
.build()
.expect("Failed to create background op runtime")
});
pub(crate) static BACKGROUND_RUNTIME_WORKER_THREADS: Lazy<usize> = Lazy::new(|| {
// force init and thus panics
let _ = BACKGROUND_RUNTIME.handle();
// replicates tokio-1.28.1::loom::sys::num_cpus which is not available publicly
// tokio would had already panicked for parsing errors or NotUnicode
//
// this will be wrong if any of the runtimes gets their worker threads configured to something
// else, but that has not been needed in a long time.
std::env::var("TOKIO_WORKER_THREADS")
.map(|s| s.parse::<usize>().unwrap())
.unwrap_or_else(|_e| usize::max(1, num_cpus::get()))
});
#[derive(Debug, Clone, Copy)]
pub struct PageserverTaskId(u64);
@@ -545,7 +559,7 @@ pub fn current_task_id() -> Option<PageserverTaskId> {
pub async fn shutdown_watcher() {
let token = SHUTDOWN_TOKEN
.try_with(|t| t.clone())
.expect("shutdown_requested() called in an unexpected task or thread");
.expect("shutdown_watcher() called in an unexpected task or thread");
token.cancelled().await;
}

File diff suppressed because it is too large Load Diff

View File

@@ -12,41 +12,28 @@
//! len >= 128: 1XXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
//!
use crate::page_cache::PAGE_SZ;
use crate::tenant::block_io::{BlockCursor, BlockReader};
use crate::tenant::block_io::BlockCursor;
use std::cmp::min;
use std::io::{Error, ErrorKind};
/// For reading
pub trait BlobCursor {
impl<'a> BlockCursor<'a> {
/// Read a blob into a new buffer.
fn read_blob(&mut self, offset: u64) -> Result<Vec<u8>, std::io::Error> {
pub async fn read_blob(&self, offset: u64) -> Result<Vec<u8>, std::io::Error> {
let mut buf = Vec::new();
self.read_blob_into_buf(offset, &mut buf)?;
self.read_blob_into_buf(offset, &mut buf).await?;
Ok(buf)
}
/// Read blob into the given buffer. Any previous contents in the buffer
/// are overwritten.
fn read_blob_into_buf(
&mut self,
offset: u64,
dstbuf: &mut Vec<u8>,
) -> Result<(), std::io::Error>;
}
impl<R> BlobCursor for BlockCursor<R>
where
R: BlockReader,
{
fn read_blob_into_buf(
&mut self,
pub async fn read_blob_into_buf(
&self,
offset: u64,
dstbuf: &mut Vec<u8>,
) -> Result<(), std::io::Error> {
let mut blknum = (offset / PAGE_SZ as u64) as u32;
let mut off = (offset % PAGE_SZ as u64) as usize;
let mut buf = self.read_blk(blknum)?;
let mut buf = self.read_blk(blknum).await?;
// peek at the first byte, to determine if it's a 1- or 4-byte length
let first_len_byte = buf[off];
@@ -62,7 +49,7 @@ where
// it is split across two pages
len_buf[..thislen].copy_from_slice(&buf[off..PAGE_SZ]);
blknum += 1;
buf = self.read_blk(blknum)?;
buf = self.read_blk(blknum).await?;
len_buf[thislen..].copy_from_slice(&buf[0..4 - thislen]);
off = 4 - thislen;
} else {
@@ -83,7 +70,7 @@ where
if page_remain == 0 {
// continue on next page
blknum += 1;
buf = self.read_blk(blknum)?;
buf = self.read_blk(blknum).await?;
off = 0;
page_remain = PAGE_SZ;
}
@@ -109,18 +96,12 @@ pub trait BlobWriter {
/// An implementation of BlobWriter to write blobs to anything that
/// implements std::io::Write.
///
pub struct WriteBlobWriter<W>
where
W: std::io::Write,
{
pub struct WriteBlobWriter<W> {
inner: W,
offset: u64,
}
impl<W> WriteBlobWriter<W>
where
W: std::io::Write,
{
impl<W> WriteBlobWriter<W> {
pub fn new(inner: W, start_offset: u64) -> Self {
WriteBlobWriter {
inner,
@@ -142,33 +123,38 @@ where
}
}
impl<W> BlobWriter for WriteBlobWriter<W>
where
W: std::io::Write,
{
fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error> {
let offset = self.offset;
macro_rules! write_blob_impl {
(WriteBlobWriter<$ty:ty>) => {
impl WriteBlobWriter<$ty> {
pub async fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error> {
use std::io::Write;
let offset = self.offset;
if srcbuf.len() < 128 {
// Short blob. Write a 1-byte length header
let len_buf = srcbuf.len() as u8;
self.inner.write_all(&[len_buf])?;
self.offset += 1;
} else {
// Write a 4-byte length header
if srcbuf.len() > 0x7fff_ffff {
return Err(Error::new(
ErrorKind::Other,
format!("blob too large ({} bytes)", srcbuf.len()),
));
if srcbuf.len() < 128 {
// Short blob. Write a 1-byte length header
let len_buf = srcbuf.len() as u8;
self.inner.write_all(&[len_buf])?;
self.offset += 1;
} else {
// Write a 4-byte length header
if srcbuf.len() > 0x7fff_ffff {
return Err(Error::new(
ErrorKind::Other,
format!("blob too large ({} bytes)", srcbuf.len()),
));
}
let mut len_buf = ((srcbuf.len()) as u32).to_be_bytes();
len_buf[0] |= 0x80;
self.inner.write_all(&len_buf)?;
self.offset += 4;
}
self.inner.write_all(srcbuf)?;
self.offset += srcbuf.len() as u64;
Ok(offset)
}
let mut len_buf = ((srcbuf.len()) as u32).to_be_bytes();
len_buf[0] |= 0x80;
self.inner.write_all(&len_buf)?;
self.offset += 4;
}
self.inner.write_all(srcbuf)?;
self.offset += srcbuf.len() as u64;
Ok(offset)
}
};
}
write_blob_impl!(WriteBlobWriter<crate::tenant::io::BufWriter<crate::virtual_file::VirtualFile>>);
write_blob_impl!(WriteBlobWriter<crate::virtual_file::VirtualFile>);

View File

@@ -2,12 +2,12 @@
//! Low-level Block-oriented I/O functions
//!
use crate::page_cache;
use crate::page_cache::{ReadBufResult, PAGE_SZ};
use super::ephemeral_file::EphemeralFile;
use super::storage_layer::delta_layer::{Adapter, DeltaLayerInner};
use crate::page_cache::{self, PageReadGuard, ReadBufResult, PAGE_SZ};
use crate::virtual_file::VirtualFile;
use bytes::Bytes;
use std::ops::{Deref, DerefMut};
use std::os::unix::fs::FileExt;
use std::sync::atomic::AtomicU64;
/// This is implemented by anything that can read 8 kB (PAGE_SZ)
/// blocks, using the page cache
@@ -15,36 +15,80 @@ use std::sync::atomic::AtomicU64;
/// There are currently two implementations: EphemeralFile, and FileBlockReader
/// below.
pub trait BlockReader {
type BlockLease: Deref<Target = [u8; PAGE_SZ]> + 'static;
///
/// Read a block. Returns a "lease" object that can be used to
/// access to the contents of the page. (For the page cache, the
/// lease object represents a lock on the buffer.)
///
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error>;
///
/// Create a new "cursor" for reading from this reader.
///
/// A cursor caches the last accessed page, allowing for faster
/// access if the same block is accessed repeatedly.
fn block_cursor(&self) -> BlockCursor<&Self>
where
Self: Sized,
{
BlockCursor::new(self)
}
fn block_cursor(&self) -> BlockCursor<'_>;
}
impl<B> BlockReader for &B
where
B: BlockReader,
{
type BlockLease = B::BlockLease;
fn block_cursor(&self) -> BlockCursor<'_> {
(*self).block_cursor()
}
}
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> {
(*self).read_blk(blknum)
/// Reference to an in-memory copy of an immutable on-disk block.
pub enum BlockLease<'a> {
PageReadGuard(PageReadGuard<'static>),
EphemeralFileMutableTail(&'a [u8; PAGE_SZ]),
#[cfg(test)]
Arc(std::sync::Arc<[u8; PAGE_SZ]>),
}
impl From<PageReadGuard<'static>> for BlockLease<'static> {
fn from(value: PageReadGuard<'static>) -> BlockLease<'static> {
BlockLease::PageReadGuard(value)
}
}
#[cfg(test)]
impl<'a> From<std::sync::Arc<[u8; PAGE_SZ]>> for BlockLease<'a> {
fn from(value: std::sync::Arc<[u8; PAGE_SZ]>) -> Self {
BlockLease::Arc(value)
}
}
impl<'a> Deref for BlockLease<'a> {
type Target = [u8; PAGE_SZ];
fn deref(&self) -> &Self::Target {
match self {
BlockLease::PageReadGuard(v) => v.deref(),
BlockLease::EphemeralFileMutableTail(v) => v,
#[cfg(test)]
BlockLease::Arc(v) => v.deref(),
}
}
}
/// Provides the ability to read blocks from different sources,
/// similar to using traits for this purpose.
///
/// Unlike traits, we also support the read function to be async though.
pub(crate) enum BlockReaderRef<'a> {
FileBlockReaderVirtual(&'a FileBlockReader<VirtualFile>),
EphemeralFile(&'a EphemeralFile),
Adapter(Adapter<&'a DeltaLayerInner>),
#[cfg(test)]
TestDisk(&'a super::disk_btree::tests::TestDisk),
}
impl<'a> BlockReaderRef<'a> {
#[inline(always)]
async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
use BlockReaderRef::*;
match self {
FileBlockReaderVirtual(r) => r.read_blk(blknum).await,
EphemeralFile(r) => r.read_blk(blknum).await,
Adapter(r) => r.read_blk(blknum).await,
#[cfg(test)]
TestDisk(r) => r.read_blk(blknum),
}
}
}
@@ -65,26 +109,31 @@ where
/// // do stuff with 'buf'
/// ```
///
pub struct BlockCursor<R>
where
R: BlockReader,
{
reader: R,
pub struct BlockCursor<'a> {
reader: BlockReaderRef<'a>,
}
impl<R> BlockCursor<R>
where
R: BlockReader,
{
pub fn new(reader: R) -> Self {
impl<'a> BlockCursor<'a> {
pub(crate) fn new(reader: BlockReaderRef<'a>) -> Self {
BlockCursor { reader }
}
// Needed by cli
pub fn new_fileblockreader_virtual(reader: &'a FileBlockReader<VirtualFile>) -> Self {
BlockCursor {
reader: BlockReaderRef::FileBlockReaderVirtual(reader),
}
}
pub fn read_blk(&mut self, blknum: u32) -> Result<R::BlockLease, std::io::Error> {
self.reader.read_blk(blknum)
/// Read a block.
///
/// Returns a "lease" object that can be used to
/// access to the contents of the page. (For the page cache, the
/// lease object represents a lock on the buffer.)
#[inline(always)]
pub async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
self.reader.read_blk(blknum).await
}
}
static NEXT_ID: AtomicU64 = AtomicU64::new(1);
/// An adapter for reading a (virtual) file using the page cache.
///
@@ -94,48 +143,42 @@ pub struct FileBlockReader<F> {
pub file: F,
/// Unique ID of this file, used as key in the page cache.
file_id: u64,
file_id: page_cache::FileId,
}
impl<F> FileBlockReader<F>
where
F: FileExt,
{
pub fn new(file: F) -> Self {
let file_id = NEXT_ID.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
impl FileBlockReader<VirtualFile> {
pub fn new(file: VirtualFile) -> Self {
let file_id = page_cache::next_file_id();
FileBlockReader { file_id, file }
}
/// Read a page from the underlying file into given buffer.
fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), std::io::Error> {
async fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), std::io::Error> {
assert!(buf.len() == PAGE_SZ);
self.file.read_exact_at(buf, blkno as u64 * PAGE_SZ as u64)
}
}
impl<F> BlockReader for FileBlockReader<F>
where
F: FileExt,
{
type BlockLease = page_cache::PageReadGuard<'static>;
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> {
// Look up the right page
/// Read a block.
///
/// Returns a "lease" object that can be used to
/// access to the contents of the page. (For the page cache, the
/// lease object represents a lock on the buffer.)
pub async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
let cache = page_cache::get();
loop {
match cache
.read_immutable_buf(self.file_id, blknum)
.await
.map_err(|e| {
std::io::Error::new(
std::io::ErrorKind::Other,
format!("Failed to read immutable buf: {e:#}"),
)
})? {
ReadBufResult::Found(guard) => break Ok(guard),
ReadBufResult::Found(guard) => break Ok(guard.into()),
ReadBufResult::NotFound(mut write_guard) => {
// Read the page from disk into the buffer
self.fill_buffer(write_guard.deref_mut(), blknum)?;
self.fill_buffer(write_guard.deref_mut(), blknum).await?;
write_guard.mark_valid();
// Swap for read lock
@@ -146,6 +189,12 @@ where
}
}
impl BlockReader for FileBlockReader<VirtualFile> {
fn block_cursor(&self) -> BlockCursor<'_> {
BlockCursor::new(BlockReaderRef::FileBlockReaderVirtual(self))
}
}
///
/// Trait for block-oriented output
///

View File

@@ -0,0 +1,617 @@
use std::{
path::{Path, PathBuf},
sync::Arc,
};
use anyhow::Context;
use pageserver_api::models::TenantState;
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
use tokio::sync::OwnedMutexGuard;
use tokio_util::sync::CancellationToken;
use tracing::{error, info, instrument, warn, Instrument, Span};
use utils::{
backoff, completion, crashsafe, fs_ext,
id::{TenantId, TimelineId},
};
use crate::{
config::PageServerConf,
context::RequestContext,
task_mgr::{self, TaskKind},
InitializationOrder,
};
use super::{
mgr::{GetTenantError, TenantsMap},
remote_timeline_client::{FAILED_REMOTE_OP_RETRIES, FAILED_UPLOAD_WARN_THRESHOLD},
span,
timeline::delete::DeleteTimelineFlow,
tree_sort_timelines, DeleteTimelineError, Tenant,
};
const SHOULD_RESUME_DELETION_FETCH_MARK_ATTEMPTS: u32 = 3;
#[derive(Debug, thiserror::Error)]
pub enum DeleteTenantError {
#[error("GetTenant {0}")]
Get(#[from] GetTenantError),
#[error("Invalid state {0}. Expected Active or Broken")]
InvalidState(TenantState),
#[error("Tenant deletion is already in progress")]
AlreadyInProgress,
#[error("Timeline {0}")]
Timeline(#[from] DeleteTimelineError),
#[error(transparent)]
Other(#[from] anyhow::Error),
}
type DeletionGuard = tokio::sync::OwnedMutexGuard<DeleteTenantFlow>;
fn remote_tenant_delete_mark_path(
conf: &PageServerConf,
tenant_id: &TenantId,
) -> anyhow::Result<RemotePath> {
let tenant_remote_path = conf
.tenant_path(tenant_id)
.strip_prefix(&conf.workdir)
.context("Failed to strip workdir prefix")
.and_then(RemotePath::new)
.context("tenant path")?;
Ok(tenant_remote_path.join(Path::new("deleted")))
}
async fn create_remote_delete_mark(
conf: &PageServerConf,
remote_storage: &GenericRemoteStorage,
tenant_id: &TenantId,
) -> Result<(), DeleteTenantError> {
let remote_mark_path = remote_tenant_delete_mark_path(conf, tenant_id)?;
let data: &[u8] = &[];
backoff::retry(
|| async {
remote_storage
.upload(data, 0, &remote_mark_path, None)
.await
},
|_e| false,
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"mark_upload",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
)
.await
.context("mark_upload")?;
Ok(())
}
async fn create_local_delete_mark(
conf: &PageServerConf,
tenant_id: &TenantId,
) -> Result<(), DeleteTenantError> {
let marker_path = conf.tenant_deleted_mark_file_path(tenant_id);
// Note: we're ok to replace existing file.
let _ = std::fs::OpenOptions::new()
.write(true)
.create(true)
.open(&marker_path)
.with_context(|| format!("could not create delete marker file {marker_path:?}"))?;
crashsafe::fsync_file_and_parent(&marker_path).context("sync_mark")?;
Ok(())
}
async fn schedule_ordered_timeline_deletions(
tenant: &Arc<Tenant>,
) -> Result<Vec<(Arc<tokio::sync::Mutex<DeleteTimelineFlow>>, TimelineId)>, DeleteTenantError> {
// Tenant is stopping at this point. We know it will be deleted.
// No new timelines should be created.
// Tree sort timelines to delete from leafs to the root.
// NOTE: by calling clone we release the mutex which creates a possibility for a race: pending deletion
// can complete and remove timeline from the map in between our call to clone
// and `DeleteTimelineFlow::run`, so `run` wont find timeline in `timelines` map.
// timelines.lock is currently synchronous so we cant hold it across await point.
// So just ignore NotFound error if we get it from `run`.
// Beware: in case it becomes async and we try to hold it here, `run` also locks it, which can create a deadlock.
let timelines = tenant.timelines.lock().unwrap().clone();
let sorted =
tree_sort_timelines(timelines, |t| t.get_ancestor_timeline_id()).context("tree sort")?;
let mut already_running_deletions = vec![];
for (timeline_id, _) in sorted.into_iter().rev() {
if let Err(e) = DeleteTimelineFlow::run(tenant, timeline_id, true).await {
match e {
DeleteTimelineError::NotFound => {
// Timeline deletion finished after call to clone above but before call
// to `DeleteTimelineFlow::run` and removed timeline from the map.
continue;
}
DeleteTimelineError::AlreadyInProgress(guard) => {
already_running_deletions.push((guard, timeline_id));
continue;
}
e => return Err(DeleteTenantError::Timeline(e)),
}
}
}
Ok(already_running_deletions)
}
async fn ensure_timelines_dir_empty(timelines_path: &Path) -> Result<(), DeleteTenantError> {
// Assert timelines dir is empty.
if !fs_ext::is_directory_empty(timelines_path).await? {
// Display first 10 items in directory
let list = &fs_ext::list_dir(timelines_path).await.context("list_dir")?[..10];
return Err(DeleteTenantError::Other(anyhow::anyhow!(
"Timelines directory is not empty after all timelines deletion: {list:?}"
)));
}
Ok(())
}
async fn remove_tenant_remote_delete_mark(
conf: &PageServerConf,
remote_storage: Option<&GenericRemoteStorage>,
tenant_id: &TenantId,
) -> Result<(), DeleteTenantError> {
if let Some(remote_storage) = remote_storage {
let path = remote_tenant_delete_mark_path(conf, tenant_id)?;
backoff::retry(
|| async { remote_storage.delete(&path).await },
|_e| false,
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"remove_tenant_remote_delete_mark",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
)
.await
.context("remove_tenant_remote_delete_mark")?;
}
Ok(())
}
// Cleanup fs traces: tenant config, timelines dir local delete mark, tenant dir
async fn cleanup_remaining_fs_traces(
conf: &PageServerConf,
tenant_id: &TenantId,
) -> Result<(), DeleteTenantError> {
let rm = |p: PathBuf, is_dir: bool| async move {
if is_dir {
tokio::fs::remove_dir(&p).await
} else {
tokio::fs::remove_file(&p).await
}
.or_else(fs_ext::ignore_not_found)
.with_context(|| {
let to_display = p.display();
format!("failed to delete {to_display}")
})
};
rm(conf.tenant_config_path(tenant_id), false).await?;
fail::fail_point!("tenant-delete-before-remove-timelines-dir", |_| {
Err(anyhow::anyhow!(
"failpoint: tenant-delete-before-remove-timelines-dir"
))?
});
rm(conf.timelines_path(tenant_id), true).await?;
fail::fail_point!("tenant-delete-before-remove-deleted-mark", |_| {
Err(anyhow::anyhow!(
"failpoint: tenant-delete-before-remove-deleted-mark"
))?
});
// Make sure previous deletions are ordered before mark removal.
// Otherwise there is no guarantee that they reach the disk before mark deletion.
// So its possible for mark to reach disk first and for other deletions
// to be reordered later and thus missed if a crash occurs.
// Note that we dont need to sync after mark file is removed
// because we can tolerate the case when mark file reappears on startup.
let tenant_path = &conf.tenant_path(tenant_id);
if tenant_path.exists() {
crashsafe::fsync_async(&conf.tenant_path(tenant_id))
.await
.context("fsync_pre_mark_remove")?;
}
rm(conf.tenant_deleted_mark_file_path(tenant_id), false).await?;
fail::fail_point!("tenant-delete-before-remove-tenant-dir", |_| {
Err(anyhow::anyhow!(
"failpoint: tenant-delete-before-remove-tenant-dir"
))?
});
rm(conf.tenant_path(tenant_id), true).await?;
Ok(())
}
pub(crate) async fn remote_delete_mark_exists(
conf: &PageServerConf,
tenant_id: &TenantId,
remote_storage: &GenericRemoteStorage,
) -> anyhow::Result<bool> {
// If remote storage is there we rely on it
let remote_mark_path = remote_tenant_delete_mark_path(conf, tenant_id).context("path")?;
let result = backoff::retry(
|| async { remote_storage.download(&remote_mark_path).await },
|e| matches!(e, DownloadError::NotFound),
SHOULD_RESUME_DELETION_FETCH_MARK_ATTEMPTS,
SHOULD_RESUME_DELETION_FETCH_MARK_ATTEMPTS,
"fetch_tenant_deletion_mark",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
)
.await;
match result {
Ok(_) => Ok(true),
Err(DownloadError::NotFound) => Ok(false),
Err(e) => Err(anyhow::anyhow!(e)).context("remote_delete_mark_exists")?,
}
}
/// Orchestrates tenant shut down of all tasks, removes its in-memory structures,
/// and deletes its data from both disk and s3.
/// The sequence of steps:
/// 1. Upload remote deletion mark.
/// 2. Create local mark file.
/// 3. Shutdown tasks
/// 4. Run ordered timeline deletions
/// 5. Wait for timeline deletion operations that were scheduled before tenant deletion was requested
/// 6. Remove remote mark
/// 7. Cleanup remaining fs traces, tenant dir, config, timelines dir, local delete mark
/// It is resumable from any step in case a crash/restart occurs.
/// There are three entrypoints to the process:
/// 1. [`DeleteTenantFlow::run`] this is the main one called by a management api handler.
/// 2. [`DeleteTenantFlow::resume_from_load`] is called during restarts when local or remote deletion marks are still there.
/// 3. [`DeleteTenantFlow::resume_from_attach`] is called when deletion is resumed tenant is found to be deleted during attach process.
/// Note the only other place that messes around timeline delete mark is the `Tenant::spawn_load` function.
#[derive(Default)]
pub enum DeleteTenantFlow {
#[default]
NotStarted,
InProgress,
Finished,
}
impl DeleteTenantFlow {
// These steps are run in the context of management api request handler.
// Long running steps are continued to run in the background.
// NB: If this fails half-way through, and is retried, the retry will go through
// all the same steps again. Make sure the code here is idempotent, and don't
// error out if some of the shutdown tasks have already been completed!
// NOTE: static needed for background part.
// We assume that calling code sets up the span with tenant_id.
#[instrument(skip_all)]
pub(crate) async fn run(
conf: &'static PageServerConf,
remote_storage: Option<GenericRemoteStorage>,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
tenant_id: TenantId,
) -> Result<(), DeleteTenantError> {
span::debug_assert_current_span_has_tenant_id();
let (tenant, mut guard) = Self::prepare(tenants, tenant_id).await?;
if let Err(e) = Self::run_inner(&mut guard, conf, remote_storage.as_ref(), &tenant).await {
tenant.set_broken(format!("{e:#}")).await;
return Err(e);
}
Self::schedule_background(guard, conf, remote_storage, tenants, tenant);
Ok(())
}
// Helper function needed to be able to match once on returned error and transition tenant into broken state.
// This is needed because tenant.shutwodn is not idempotent. If tenant state is set to stopping another call to tenant.shutdown
// will result in an error, but here we need to be able to retry shutdown when tenant deletion is retried.
// So the solution is to set tenant state to broken.
async fn run_inner(
guard: &mut OwnedMutexGuard<Self>,
conf: &'static PageServerConf,
remote_storage: Option<&GenericRemoteStorage>,
tenant: &Tenant,
) -> Result<(), DeleteTenantError> {
guard.mark_in_progress()?;
fail::fail_point!("tenant-delete-before-create-remote-mark", |_| {
Err(anyhow::anyhow!(
"failpoint: tenant-delete-before-create-remote-mark"
))?
});
// IDEA: implement detach as delete without remote storage. Then they would use the same lock (deletion_progress) so wont contend.
// Though sounds scary, different mark name?
// Detach currently uses remove_dir_all so in case of a crash we can end up in a weird state.
if let Some(remote_storage) = &remote_storage {
create_remote_delete_mark(conf, remote_storage, &tenant.tenant_id)
.await
.context("remote_mark")?
}
fail::fail_point!("tenant-delete-before-create-local-mark", |_| {
Err(anyhow::anyhow!(
"failpoint: tenant-delete-before-create-local-mark"
))?
});
create_local_delete_mark(conf, &tenant.tenant_id)
.await
.context("local delete mark")?;
fail::fail_point!("tenant-delete-before-background", |_| {
Err(anyhow::anyhow!(
"failpoint: tenant-delete-before-background"
))?
});
Ok(())
}
fn mark_in_progress(&mut self) -> anyhow::Result<()> {
match self {
Self::Finished => anyhow::bail!("Bug. Is in finished state"),
Self::InProgress { .. } => { /* We're in a retry */ }
Self::NotStarted => { /* Fresh start */ }
}
*self = Self::InProgress;
Ok(())
}
pub async fn should_resume_deletion(
conf: &'static PageServerConf,
remote_storage: Option<&GenericRemoteStorage>,
tenant: &Tenant,
) -> Result<Option<DeletionGuard>, DeleteTenantError> {
let acquire = |t: &Tenant| {
Some(
Arc::clone(&t.delete_progress)
.try_lock_owned()
.expect("we're the only owner during init"),
)
};
let tenant_id = tenant.tenant_id;
// Check local mark first, if its there there is no need to go to s3 to check whether remote one exists.
if conf.tenant_deleted_mark_file_path(&tenant_id).exists() {
return Ok(acquire(tenant));
}
let remote_storage = match remote_storage {
Some(remote_storage) => remote_storage,
None => return Ok(None),
};
if remote_delete_mark_exists(conf, &tenant_id, remote_storage).await? {
Ok(acquire(tenant))
} else {
Ok(None)
}
}
pub(crate) async fn resume_from_load(
guard: DeletionGuard,
tenant: &Arc<Tenant>,
init_order: Option<&InitializationOrder>,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
ctx: &RequestContext,
) -> Result<(), DeleteTenantError> {
let (_, progress) = completion::channel();
tenant
.set_stopping(progress, true, false)
.await
.expect("cant be stopping or broken");
// Do not consume valuable resources during the load phase, continue deletion once init phase is complete.
let background_jobs_can_start = init_order.as_ref().map(|x| &x.background_jobs_can_start);
if let Some(background) = background_jobs_can_start {
info!("waiting for backgound jobs barrier");
background.clone().wait().await;
info!("ready for backgound jobs barrier");
}
// Tenant may not be loadable if we fail late in cleanup_remaining_fs_traces (e g remove timelines dir)
let timelines_path = tenant.conf.timelines_path(&tenant.tenant_id);
if timelines_path.exists() {
tenant.load(init_order, ctx).await.context("load")?;
}
Self::background(
guard,
tenant.conf,
tenant.remote_storage.clone(),
tenants,
tenant,
)
.await
}
pub(crate) async fn resume_from_attach(
guard: DeletionGuard,
tenant: &Arc<Tenant>,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
ctx: &RequestContext,
) -> Result<(), DeleteTenantError> {
let (_, progress) = completion::channel();
tenant
.set_stopping(progress, false, true)
.await
.expect("cant be stopping or broken");
tenant.attach(ctx).await.context("attach")?;
Self::background(
guard,
tenant.conf,
tenant.remote_storage.clone(),
tenants,
tenant,
)
.await
}
async fn prepare(
tenants: &tokio::sync::RwLock<TenantsMap>,
tenant_id: TenantId,
) -> Result<(Arc<Tenant>, tokio::sync::OwnedMutexGuard<Self>), DeleteTenantError> {
let m = tenants.read().await;
let tenant = m
.get(&tenant_id)
.ok_or(GetTenantError::NotFound(tenant_id))?;
// FIXME: unsure about active only. Our init jobs may not be cancellable properly,
// so at least for now allow deletions only for active tenants. TODO recheck
// Broken and Stopping is needed for retries.
if !matches!(
tenant.current_state(),
TenantState::Active | TenantState::Broken { .. }
) {
return Err(DeleteTenantError::InvalidState(tenant.current_state()));
}
let guard = Arc::clone(&tenant.delete_progress)
.try_lock_owned()
.map_err(|_| DeleteTenantError::AlreadyInProgress)?;
fail::fail_point!("tenant-delete-before-shutdown", |_| {
Err(anyhow::anyhow!("failpoint: tenant-delete-before-shutdown"))?
});
// make pageserver shutdown not to wait for our completion
let (_, progress) = completion::channel();
// It would be good to only set stopping here and continue shutdown in the background, but shutdown is not idempotent.
// i e it is an error to do:
// tenant.set_stopping
// tenant.shutdown
// Its also bad that we're holding tenants.read here.
// TODO relax set_stopping to be idempotent?
if tenant.shutdown(progress, false).await.is_err() {
return Err(DeleteTenantError::Other(anyhow::anyhow!(
"tenant shutdown is already in progress"
)));
}
Ok((Arc::clone(tenant), guard))
}
fn schedule_background(
guard: OwnedMutexGuard<Self>,
conf: &'static PageServerConf,
remote_storage: Option<GenericRemoteStorage>,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
tenant: Arc<Tenant>,
) {
let tenant_id = tenant.tenant_id;
task_mgr::spawn(
task_mgr::BACKGROUND_RUNTIME.handle(),
TaskKind::TimelineDeletionWorker,
Some(tenant_id),
None,
"tenant_delete",
false,
async move {
if let Err(err) =
Self::background(guard, conf, remote_storage, tenants, &tenant).await
{
error!("Error: {err:#}");
tenant.set_broken(format!("{err:#}")).await;
};
Ok(())
}
.instrument({
let span = tracing::info_span!(parent: None, "delete_tenant", tenant_id=%tenant_id);
span.follows_from(Span::current());
span
}),
);
}
async fn background(
mut guard: OwnedMutexGuard<Self>,
conf: &PageServerConf,
remote_storage: Option<GenericRemoteStorage>,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
tenant: &Arc<Tenant>,
) -> Result<(), DeleteTenantError> {
// Tree sort timelines, schedule delete for them. Mention retries from the console side.
// Note that if deletion fails we dont mark timelines as broken,
// the whole tenant will become broken as by `Self::schedule_background` logic
let already_running_timeline_deletions = schedule_ordered_timeline_deletions(tenant)
.await
.context("schedule_ordered_timeline_deletions")?;
fail::fail_point!("tenant-delete-before-polling-ongoing-deletions", |_| {
Err(anyhow::anyhow!(
"failpoint: tenant-delete-before-polling-ongoing-deletions"
))?
});
// Wait for deletions that were already running at the moment when tenant deletion was requested.
// When we can lock deletion guard it means that corresponding timeline deletion finished.
for (guard, timeline_id) in already_running_timeline_deletions {
let flow = guard.lock().await;
if !flow.is_finished() {
return Err(DeleteTenantError::Other(anyhow::anyhow!(
"already running timeline deletion failed: {timeline_id}"
)));
}
}
let timelines_path = conf.timelines_path(&tenant.tenant_id);
// May not exist if we fail in cleanup_remaining_fs_traces after removing it
if timelines_path.exists() {
// sanity check to guard against layout changes
ensure_timelines_dir_empty(&timelines_path)
.await
.context("timelines dir not empty")?;
}
remove_tenant_remote_delete_mark(conf, remote_storage.as_ref(), &tenant.tenant_id).await?;
fail::fail_point!("tenant-delete-before-cleanup-remaining-fs-traces", |_| {
Err(anyhow::anyhow!(
"failpoint: tenant-delete-before-cleanup-remaining-fs-traces"
))?
});
cleanup_remaining_fs_traces(conf, &tenant.tenant_id)
.await
.context("cleanup_remaining_fs_traces")?;
let mut locked = tenants.write().await;
if locked.remove(&tenant.tenant_id).is_none() {
warn!("Tenant got removed from tenants map during deletion");
};
*guard = Self::Finished;
Ok(())
}
}

View File

@@ -20,6 +20,7 @@
//!
use byteorder::{ReadBytesExt, BE};
use bytes::{BufMut, Bytes, BytesMut};
use either::Either;
use hex;
use std::{cmp::Ordering, io, result};
use thiserror::Error;
@@ -230,14 +231,15 @@ where
///
/// Read the value for given key. Returns the value, or None if it doesn't exist.
///
pub fn get(&self, search_key: &[u8; L]) -> Result<Option<u64>> {
pub async fn get(&self, search_key: &[u8; L]) -> Result<Option<u64>> {
let mut result: Option<u64> = None;
self.visit(search_key, VisitDirection::Forwards, |key, value| {
if key == search_key {
result = Some(value);
}
false
})?;
})
.await?;
Ok(result)
}
@@ -246,7 +248,7 @@ where
/// will be called for every key >= 'search_key' (or <= 'search_key', if scanning
/// backwards)
///
pub fn visit<V>(
pub async fn visit<V>(
&self,
search_key: &[u8; L],
dir: VisitDirection,
@@ -255,117 +257,78 @@ where
where
V: FnMut(&[u8], u64) -> bool,
{
self.search_recurse(self.root_blk, search_key, dir, &mut visitor)
}
let mut stack = Vec::new();
stack.push((self.root_blk, None));
let block_cursor = self.reader.block_cursor();
while let Some((node_blknum, opt_iter)) = stack.pop() {
// Locate the node.
let node_buf = block_cursor.read_blk(self.start_blk + node_blknum).await?;
fn search_recurse<V>(
&self,
node_blknum: u32,
search_key: &[u8; L],
dir: VisitDirection,
visitor: &mut V,
) -> Result<bool>
where
V: FnMut(&[u8], u64) -> bool,
{
// Locate the node.
let blk = self.reader.read_blk(self.start_blk + node_blknum)?;
let node = OnDiskNode::deparse(node_buf.as_ref())?;
let prefix_len = node.prefix_len as usize;
let suffix_len = node.suffix_len as usize;
// Search all entries on this node
self.search_node(blk.as_ref(), search_key, dir, visitor)
}
assert!(node.num_children > 0);
fn search_node<V>(
&self,
node_buf: &[u8],
search_key: &[u8; L],
dir: VisitDirection,
visitor: &mut V,
) -> Result<bool>
where
V: FnMut(&[u8], u64) -> bool,
{
let node = OnDiskNode::deparse(node_buf)?;
let prefix_len = node.prefix_len as usize;
let suffix_len = node.suffix_len as usize;
let mut keybuf = Vec::new();
keybuf.extend(node.prefix);
keybuf.resize(prefix_len + suffix_len, 0);
assert!(node.num_children > 0);
let mut keybuf = Vec::new();
keybuf.extend(node.prefix);
keybuf.resize(prefix_len + suffix_len, 0);
if dir == VisitDirection::Forwards {
// Locate the first match
let mut idx = match node.binary_search(search_key, keybuf.as_mut_slice()) {
Ok(idx) => idx,
Err(idx) => {
if node.level == 0 {
// Imagine that the node contains the following keys:
//
// 1
// 3 <-- idx
// 5
//
// If the search key is '2' and there is exact match,
// the binary search would return the index of key
// '3'. That's cool, '3' is the first key to return.
let mut iter = if let Some(iter) = opt_iter {
iter
} else if dir == VisitDirection::Forwards {
// Locate the first match
let idx = match node.binary_search(search_key, keybuf.as_mut_slice()) {
Ok(idx) => idx,
Err(idx) => {
if node.level == 0 {
// Imagine that the node contains the following keys:
//
// 1
// 3 <-- idx
// 5
//
// If the search key is '2' and there is exact match,
// the binary search would return the index of key
// '3'. That's cool, '3' is the first key to return.
idx
} else {
// This is an internal page, so each key represents a lower
// bound for what's in the child page. If there is no exact
// match, we have to return the *previous* entry.
//
// 1 <-- return this
// 3 <-- idx
// 5
idx.saturating_sub(1)
}
}
};
Either::Left(idx..node.num_children.into())
} else {
let idx = match node.binary_search(search_key, keybuf.as_mut_slice()) {
Ok(idx) => {
// Exact match. That's the first entry to return, and walk
// backwards from there.
idx
} else {
// This is an internal page, so each key represents a lower
// bound for what's in the child page. If there is no exact
// match, we have to return the *previous* entry.
//
// 1 <-- return this
// 3 <-- idx
// 5
idx.saturating_sub(1)
}
}
};
// idx points to the first match now. Keep going from there
let mut key_off = idx * suffix_len;
while idx < node.num_children as usize {
let suffix = &node.keys[key_off..key_off + suffix_len];
keybuf[prefix_len..].copy_from_slice(suffix);
let value = node.value(idx);
#[allow(clippy::collapsible_if)]
if node.level == 0 {
// leaf
if !visitor(&keybuf, value.to_u64()) {
return Ok(false);
Err(idx) => {
// No exact match. The binary search returned the index of the
// first key that's > search_key. Back off by one, and walk
// backwards from there.
if let Some(idx) = idx.checked_sub(1) {
idx
} else {
return Ok(false);
}
}
} else {
#[allow(clippy::collapsible_if)]
if !self.search_recurse(value.to_blknum(), search_key, dir, visitor)? {
return Ok(false);
}
}
idx += 1;
key_off += suffix_len;
}
} else {
let mut idx = match node.binary_search(search_key, keybuf.as_mut_slice()) {
Ok(idx) => {
// Exact match. That's the first entry to return, and walk
// backwards from there. (The loop below starts from 'idx -
// 1', so add one here to compensate.)
idx + 1
}
Err(idx) => {
// No exact match. The binary search returned the index of the
// first key that's > search_key. Back off by one, and walk
// backwards from there. (The loop below starts from idx - 1,
// so we don't need to subtract one here)
idx
}
};
Either::Right((0..=idx).rev())
};
// idx points to the first match + 1 now. Keep going from there.
let mut key_off = idx * suffix_len;
while idx > 0 {
idx -= 1;
key_off -= suffix_len;
// idx points to the first match now. Keep going from there
while let Some(idx) = iter.next() {
let key_off = idx * suffix_len;
let suffix = &node.keys[key_off..key_off + suffix_len];
keybuf[prefix_len..].copy_from_slice(suffix);
let value = node.value(idx);
@@ -376,12 +339,8 @@ where
return Ok(false);
}
} else {
#[allow(clippy::collapsible_if)]
if !self.search_recurse(value.to_blknum(), search_key, dir, visitor)? {
return Ok(false);
}
}
if idx == 0 {
stack.push((node_blknum, Some(iter)));
stack.push((value.to_blknum(), None));
break;
}
}
@@ -390,39 +349,44 @@ where
}
#[allow(dead_code)]
pub fn dump(&self) -> Result<()> {
self.dump_recurse(self.root_blk, &[], 0)
}
pub async fn dump(&self) -> Result<()> {
let mut stack = Vec::new();
fn dump_recurse(&self, blknum: u32, path: &[u8], depth: usize) -> Result<()> {
let blk = self.reader.read_blk(self.start_blk + blknum)?;
let buf: &[u8] = blk.as_ref();
stack.push((self.root_blk, String::new(), 0, 0, 0));
let node = OnDiskNode::<L>::deparse(buf)?;
let block_cursor = self.reader.block_cursor();
print!("{:indent$}", "", indent = depth * 2);
println!(
"blk #{}: path {}: prefix {}, suffix_len {}",
blknum,
hex::encode(path),
hex::encode(node.prefix),
node.suffix_len
);
while let Some((blknum, path, depth, child_idx, key_off)) = stack.pop() {
let blk = block_cursor.read_blk(self.start_blk + blknum).await?;
let buf: &[u8] = blk.as_ref();
let node = OnDiskNode::<L>::deparse(buf)?;
let mut idx = 0;
let mut key_off = 0;
while idx < node.num_children {
if child_idx == 0 {
print!("{:indent$}", "", indent = depth * 2);
let path_prefix = stack
.iter()
.map(|(_blknum, path, ..)| path.as_str())
.collect::<String>();
println!(
"blk #{blknum}: path {path_prefix}{path}: prefix {}, suffix_len {}",
hex::encode(node.prefix),
node.suffix_len
);
}
if child_idx + 1 < node.num_children {
let key_off = key_off + node.suffix_len as usize;
stack.push((blknum, path.clone(), depth, child_idx + 1, key_off));
}
let key = &node.keys[key_off..key_off + node.suffix_len as usize];
let val = node.value(idx as usize);
let val = node.value(child_idx as usize);
print!("{:indent$}", "", indent = depth * 2 + 2);
println!("{}: {}", hex::encode(key), hex::encode(val.0));
if node.level > 0 {
let child_path = [path, node.prefix].concat();
self.dump_recurse(val.to_blknum(), &child_path, depth + 1)?;
stack.push((val.to_blknum(), hex::encode(node.prefix), depth + 1, 0, 0));
}
idx += 1;
key_off += node.suffix_len as usize;
}
Ok(())
}
@@ -722,28 +686,30 @@ impl<const L: usize> BuildNode<L> {
}
#[cfg(test)]
mod tests {
pub(crate) mod tests {
use super::*;
use crate::tenant::block_io::{BlockCursor, BlockLease, BlockReaderRef};
use rand::Rng;
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicUsize, Ordering};
#[derive(Clone, Default)]
struct TestDisk {
pub(crate) struct TestDisk {
blocks: Vec<Bytes>,
}
impl TestDisk {
fn new() -> Self {
Self::default()
}
}
impl BlockReader for TestDisk {
type BlockLease = std::rc::Rc<[u8; PAGE_SZ]>;
fn read_blk(&self, blknum: u32) -> io::Result<Self::BlockLease> {
pub(crate) fn read_blk(&self, blknum: u32) -> io::Result<BlockLease> {
let mut buf = [0u8; PAGE_SZ];
buf.copy_from_slice(&self.blocks[blknum as usize]);
Ok(std::rc::Rc::new(buf))
Ok(std::sync::Arc::new(buf).into())
}
}
impl BlockReader for TestDisk {
fn block_cursor(&self) -> BlockCursor<'_> {
BlockCursor::new(BlockReaderRef::TestDisk(self))
}
}
impl BlockWriter for &mut TestDisk {
@@ -754,8 +720,8 @@ mod tests {
}
}
#[test]
fn basic() -> Result<()> {
#[tokio::test]
async fn basic() -> Result<()> {
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 6>::new(&mut disk);
@@ -775,16 +741,16 @@ mod tests {
let reader = DiskBtreeReader::new(0, root_offset, disk);
reader.dump()?;
reader.dump().await?;
// Test the `get` function on all the keys.
for (key, val) in all_data.iter() {
assert_eq!(reader.get(key)?, Some(*val));
assert_eq!(reader.get(key).await?, Some(*val));
}
// And on some keys that don't exist
assert_eq!(reader.get(b"aaaaaa")?, None);
assert_eq!(reader.get(b"zzzzzz")?, None);
assert_eq!(reader.get(b"xaaabx")?, None);
assert_eq!(reader.get(b"aaaaaa").await?, None);
assert_eq!(reader.get(b"zzzzzz").await?, None);
assert_eq!(reader.get(b"xaaabx").await?, None);
// Test search with `visit` function
let search_key = b"xabaaa";
@@ -795,10 +761,12 @@ mod tests {
.collect();
let mut data = Vec::new();
reader.visit(search_key, VisitDirection::Forwards, |key, value| {
data.push((key.to_vec(), value));
true
})?;
reader
.visit(search_key, VisitDirection::Forwards, |key, value| {
data.push((key.to_vec(), value));
true
})
.await?;
assert_eq!(data, expected);
// Test a backwards scan
@@ -809,16 +777,20 @@ mod tests {
.collect();
expected.reverse();
let mut data = Vec::new();
reader.visit(search_key, VisitDirection::Backwards, |key, value| {
data.push((key.to_vec(), value));
true
})?;
reader
.visit(search_key, VisitDirection::Backwards, |key, value| {
data.push((key.to_vec(), value));
true
})
.await?;
assert_eq!(data, expected);
// Backward scan where nothing matches
reader.visit(b"aaaaaa", VisitDirection::Backwards, |key, value| {
panic!("found unexpected key {}: {}", hex::encode(key), value);
})?;
reader
.visit(b"aaaaaa", VisitDirection::Backwards, |key, value| {
panic!("found unexpected key {}: {}", hex::encode(key), value);
})
.await?;
// Full scan
let expected: Vec<(Vec<u8>, u64)> = all_data
@@ -826,17 +798,19 @@ mod tests {
.map(|(key, value)| (key.to_vec(), *value))
.collect();
let mut data = Vec::new();
reader.visit(&[0u8; 6], VisitDirection::Forwards, |key, value| {
data.push((key.to_vec(), value));
true
})?;
reader
.visit(&[0u8; 6], VisitDirection::Forwards, |key, value| {
data.push((key.to_vec(), value));
true
})
.await?;
assert_eq!(data, expected);
Ok(())
}
#[test]
fn lots_of_keys() -> Result<()> {
#[tokio::test]
async fn lots_of_keys() -> Result<()> {
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 8>::new(&mut disk);
@@ -856,7 +830,7 @@ mod tests {
let reader = DiskBtreeReader::new(0, root_offset, disk);
reader.dump()?;
reader.dump().await?;
use std::sync::Mutex;
@@ -877,13 +851,15 @@ mod tests {
for search_key_int in 0..(NUM_KEYS * 2 + 10) {
let search_key = u64::to_be_bytes(search_key_int);
assert_eq!(
reader.get(&search_key)?,
reader.get(&search_key).await?,
all_data.get(&search_key_int).cloned()
);
// Test a forward scan starting with this key
result.lock().unwrap().clear();
reader.visit(&search_key, VisitDirection::Forwards, take_ten)?;
reader
.visit(&search_key, VisitDirection::Forwards, take_ten)
.await?;
let expected = all_data
.range(search_key_int..)
.take(10)
@@ -893,7 +869,9 @@ mod tests {
// And a backwards scan
result.lock().unwrap().clear();
reader.visit(&search_key, VisitDirection::Backwards, take_ten)?;
reader
.visit(&search_key, VisitDirection::Backwards, take_ten)
.await?;
let expected = all_data
.range(..=search_key_int)
.rev()
@@ -907,7 +885,9 @@ mod tests {
let search_key = u64::to_be_bytes(0);
limit.store(usize::MAX, Ordering::Relaxed);
result.lock().unwrap().clear();
reader.visit(&search_key, VisitDirection::Forwards, take_ten)?;
reader
.visit(&search_key, VisitDirection::Forwards, take_ten)
.await?;
let expected = all_data
.iter()
.map(|(&key, &val)| (key, val))
@@ -918,7 +898,9 @@ mod tests {
let search_key = u64::to_be_bytes(u64::MAX);
limit.store(usize::MAX, Ordering::Relaxed);
result.lock().unwrap().clear();
reader.visit(&search_key, VisitDirection::Backwards, take_ten)?;
reader
.visit(&search_key, VisitDirection::Backwards, take_ten)
.await?;
let expected = all_data
.iter()
.rev()
@@ -929,8 +911,8 @@ mod tests {
Ok(())
}
#[test]
fn random_data() -> Result<()> {
#[tokio::test]
async fn random_data() -> Result<()> {
// Generate random keys with exponential distribution, to
// exercise the prefix compression
const NUM_KEYS: usize = 100000;
@@ -957,19 +939,23 @@ mod tests {
// Test get() operation on all the keys
for (&key, &val) in all_data.iter() {
let search_key = u128::to_be_bytes(key);
assert_eq!(reader.get(&search_key)?, Some(val));
assert_eq!(reader.get(&search_key).await?, Some(val));
}
// Test get() operations on random keys, most of which will not exist
for _ in 0..100000 {
let key_int = rand::thread_rng().gen::<u128>();
let search_key = u128::to_be_bytes(key_int);
assert!(reader.get(&search_key)? == all_data.get(&key_int).cloned());
assert!(reader.get(&search_key).await? == all_data.get(&key_int).cloned());
}
// Test boundary cases
assert!(reader.get(&u128::to_be_bytes(u128::MIN))? == all_data.get(&u128::MIN).cloned());
assert!(reader.get(&u128::to_be_bytes(u128::MAX))? == all_data.get(&u128::MAX).cloned());
assert!(
reader.get(&u128::to_be_bytes(u128::MIN)).await? == all_data.get(&u128::MIN).cloned()
);
assert!(
reader.get(&u128::to_be_bytes(u128::MAX)).await? == all_data.get(&u128::MAX).cloned()
);
Ok(())
}
@@ -994,8 +980,8 @@ mod tests {
///
/// This test contains a particular data set, see disk_btree_test_data.rs
///
#[test]
fn particular_data() -> Result<()> {
#[tokio::test]
async fn particular_data() -> Result<()> {
// Build a tree from it
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 26>::new(&mut disk);
@@ -1011,18 +997,20 @@ mod tests {
// Test get() operation on all the keys
for (key, val) in disk_btree_test_data::TEST_DATA {
assert_eq!(reader.get(&key)?, Some(val));
assert_eq!(reader.get(&key).await?, Some(val));
}
// Test full scan
let mut count = 0;
reader.visit(&[0u8; 26], VisitDirection::Forwards, |_key, _value| {
count += 1;
true
})?;
reader
.visit(&[0u8; 26], VisitDirection::Forwards, |_key, _value| {
count += 1;
true
})
.await?;
assert_eq!(count, disk_btree_test_data::TEST_DATA.len());
reader.dump()?;
reader.dump().await?;
Ok(())
}

View File

@@ -2,46 +2,29 @@
//! used to keep in-memory layers spilled on disk.
use crate::config::PageServerConf;
use crate::page_cache::{self, ReadBufResult, WriteBufResult, PAGE_SZ};
use crate::tenant::blob_io::BlobWriter;
use crate::tenant::block_io::BlockReader;
use crate::page_cache::{self, PAGE_SZ};
use crate::tenant::block_io::{BlockCursor, BlockLease, BlockReader};
use crate::virtual_file::VirtualFile;
use once_cell::sync::Lazy;
use std::cmp::min;
use std::collections::HashMap;
use std::fs::OpenOptions;
use std::io::{self, ErrorKind};
use std::ops::DerefMut;
use std::path::PathBuf;
use std::sync::{Arc, RwLock};
use std::sync::atomic::AtomicU64;
use tracing::*;
use utils::id::{TenantId, TimelineId};
use std::os::unix::fs::FileExt;
///
/// This is the global cache of file descriptors (File objects).
///
static EPHEMERAL_FILES: Lazy<RwLock<EphemeralFiles>> = Lazy::new(|| {
RwLock::new(EphemeralFiles {
next_file_id: 1,
files: HashMap::new(),
})
});
pub struct EphemeralFiles {
next_file_id: u64,
files: HashMap<u64, Arc<VirtualFile>>,
}
pub struct EphemeralFile {
file_id: u64,
page_cache_file_id: page_cache::FileId,
_tenant_id: TenantId,
_timeline_id: TimelineId,
file: Arc<VirtualFile>,
pub size: u64,
file: VirtualFile,
len: u64,
/// An ephemeral file is append-only.
/// We keep the last page, which can still be modified, in [`Self::mutable_tail`].
/// The other pages, which can no longer be modified, are accessed through the page cache.
mutable_tail: [u8; PAGE_SZ],
}
impl EphemeralFile {
@@ -50,71 +33,188 @@ impl EphemeralFile {
tenant_id: TenantId,
timeline_id: TimelineId,
) -> Result<EphemeralFile, io::Error> {
let mut l = EPHEMERAL_FILES.write().unwrap();
let file_id = l.next_file_id;
l.next_file_id += 1;
static NEXT_FILENAME: AtomicU64 = AtomicU64::new(1);
let filename_disambiguator =
NEXT_FILENAME.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
let filename = conf
.timeline_path(&tenant_id, &timeline_id)
.join(PathBuf::from(format!("ephemeral-{}", file_id)));
.join(PathBuf::from(format!("ephemeral-{filename_disambiguator}")));
let file = VirtualFile::open_with_options(
&filename,
OpenOptions::new().read(true).write(true).create(true),
)?;
let file_rc = Arc::new(file);
l.files.insert(file_id, file_rc.clone());
Ok(EphemeralFile {
file_id,
page_cache_file_id: page_cache::next_file_id(),
_tenant_id: tenant_id,
_timeline_id: timeline_id,
file: file_rc,
size: 0,
file,
len: 0,
mutable_tail: [0u8; PAGE_SZ],
})
}
fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), io::Error> {
let mut off = 0;
while off < PAGE_SZ {
let n = self
.file
.read_at(&mut buf[off..], blkno as u64 * PAGE_SZ as u64 + off as u64)?;
if n == 0 {
// Reached EOF. Fill the rest of the buffer with zeros.
const ZERO_BUF: [u8; PAGE_SZ] = [0u8; PAGE_SZ];
buf[off..].copy_from_slice(&ZERO_BUF[off..]);
break;
}
off += n;
}
Ok(())
pub(crate) fn len(&self) -> u64 {
self.len
}
fn get_buf_for_write(&self, blkno: u32) -> Result<page_cache::PageWriteGuard, io::Error> {
// Look up the right page
let cache = page_cache::get();
let mut write_guard = match cache
.write_ephemeral_buf(self.file_id, blkno)
.map_err(|e| to_io_error(e, "Failed to write ephemeral buf"))?
{
WriteBufResult::Found(guard) => guard,
WriteBufResult::NotFound(mut guard) => {
// Read the page from disk into the buffer
// TODO: if we're overwriting the whole page, no need to read it in first
self.fill_buffer(guard.deref_mut(), blkno)?;
guard.mark_valid();
pub(crate) async fn read_blk(&self, blknum: u32) -> Result<BlockLease, io::Error> {
let flushed_blknums = 0..self.len / PAGE_SZ as u64;
if flushed_blknums.contains(&(blknum as u64)) {
let cache = page_cache::get();
loop {
match cache
.read_immutable_buf(self.page_cache_file_id, blknum)
.await
.map_err(|e| {
std::io::Error::new(
std::io::ErrorKind::Other,
// order path before error because error is anyhow::Error => might have many contexts
format!(
"ephemeral file: read immutable page #{}: {}: {:#}",
blknum,
self.file.path.display(),
e,
),
)
})? {
page_cache::ReadBufResult::Found(guard) => {
return Ok(BlockLease::PageReadGuard(guard))
}
page_cache::ReadBufResult::NotFound(mut write_guard) => {
let buf: &mut [u8] = write_guard.deref_mut();
debug_assert_eq!(buf.len(), PAGE_SZ);
self.file
.read_exact_at(&mut buf[..], blknum as u64 * PAGE_SZ as u64)?;
write_guard.mark_valid();
// And then fall through to modify it.
guard
// Swap for read lock
continue;
}
};
}
};
write_guard.mark_dirty();
} else {
debug_assert_eq!(blknum as u64, self.len / PAGE_SZ as u64);
Ok(BlockLease::EphemeralFileMutableTail(&self.mutable_tail))
}
}
Ok(write_guard)
pub(crate) async fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, io::Error> {
struct Writer<'a> {
ephemeral_file: &'a mut EphemeralFile,
/// The block to which the next [`push_bytes`] will write.
blknum: u32,
/// The offset inside the block identified by [`blknum`] to which [`push_bytes`] will write.
off: usize,
}
impl<'a> Writer<'a> {
fn new(ephemeral_file: &'a mut EphemeralFile) -> io::Result<Writer<'a>> {
Ok(Writer {
blknum: (ephemeral_file.len / PAGE_SZ as u64) as u32,
off: (ephemeral_file.len % PAGE_SZ as u64) as usize,
ephemeral_file,
})
}
#[inline(always)]
async fn push_bytes(&mut self, src: &[u8]) -> Result<(), io::Error> {
let mut src_remaining = src;
while !src_remaining.is_empty() {
let dst_remaining = &mut self.ephemeral_file.mutable_tail[self.off..];
let n = min(dst_remaining.len(), src_remaining.len());
dst_remaining[..n].copy_from_slice(&src_remaining[..n]);
self.off += n;
src_remaining = &src_remaining[n..];
if self.off == PAGE_SZ {
match self
.ephemeral_file
.file
.write_all_at(
&self.ephemeral_file.mutable_tail,
self.blknum as u64 * PAGE_SZ as u64,
)
.await
{
Ok(_) => {
// Pre-warm the page cache with what we just wrote.
// This isn't necessary for coherency/correctness, but it's how we've always done it.
let cache = page_cache::get();
match cache
.read_immutable_buf(
self.ephemeral_file.page_cache_file_id,
self.blknum,
)
.await
{
Ok(page_cache::ReadBufResult::Found(_guard)) => {
// This function takes &mut self, so, it shouldn't be possible to reach this point.
unreachable!("we just wrote blknum {} and this function takes &mut self, so, no concurrent read_blk is possible", self.blknum);
}
Ok(page_cache::ReadBufResult::NotFound(mut write_guard)) => {
let buf: &mut [u8] = write_guard.deref_mut();
debug_assert_eq!(buf.len(), PAGE_SZ);
buf.copy_from_slice(&self.ephemeral_file.mutable_tail);
write_guard.mark_valid();
// pre-warm successful
}
Err(e) => {
error!("ephemeral_file write_blob failed to get immutable buf to pre-warm page cache: {e:?}");
// fail gracefully, it's not the end of the world if we can't pre-warm the cache here
}
}
// Zero the buffer for re-use.
// Zeroing is critical for correcntess because the write_blob code below
// and similarly read_blk expect zeroed pages.
self.ephemeral_file.mutable_tail.fill(0);
// This block is done, move to next one.
self.blknum += 1;
self.off = 0;
}
Err(e) => {
return Err(std::io::Error::new(
ErrorKind::Other,
// order error before path because path is long and error is short
format!(
"ephemeral_file: write_blob: write-back full tail blk #{}: {:#}: {}",
self.blknum,
e,
self.ephemeral_file.file.path.display(),
),
));
}
}
}
}
Ok(())
}
}
let pos = self.len;
let mut writer = Writer::new(self)?;
// Write the length field
if srcbuf.len() < 0x80 {
// short one-byte length header
let len_buf = [srcbuf.len() as u8];
writer.push_bytes(&len_buf).await?;
} else {
let mut len_buf = u32::to_be_bytes(srcbuf.len() as u32);
len_buf[0] |= 0x80;
writer.push_bytes(&len_buf).await?;
}
// Write the payload
writer.push_bytes(srcbuf).await?;
if srcbuf.len() < 0x80 {
self.len += 1;
} else {
self.len += 4;
}
self.len += srcbuf.len() as u64;
Ok(pos)
}
}
@@ -127,210 +227,40 @@ pub fn is_ephemeral_file(filename: &str) -> bool {
}
}
impl FileExt for EphemeralFile {
fn read_at(&self, dstbuf: &mut [u8], offset: u64) -> Result<usize, io::Error> {
// Look up the right page
let blkno = (offset / PAGE_SZ as u64) as u32;
let off = offset as usize % PAGE_SZ;
let len = min(PAGE_SZ - off, dstbuf.len());
let read_guard;
let mut write_guard;
let cache = page_cache::get();
let buf = match cache
.read_ephemeral_buf(self.file_id, blkno)
.map_err(|e| to_io_error(e, "Failed to read ephemeral buf"))?
{
ReadBufResult::Found(guard) => {
read_guard = guard;
read_guard.as_ref()
}
ReadBufResult::NotFound(guard) => {
// Read the page from disk into the buffer
write_guard = guard;
self.fill_buffer(write_guard.deref_mut(), blkno)?;
write_guard.mark_valid();
// And then fall through to read the requested slice from the
// buffer.
write_guard.as_ref()
}
};
dstbuf[0..len].copy_from_slice(&buf[off..(off + len)]);
Ok(len)
}
fn write_at(&self, srcbuf: &[u8], offset: u64) -> Result<usize, io::Error> {
// Look up the right page
let blkno = (offset / PAGE_SZ as u64) as u32;
let off = offset as usize % PAGE_SZ;
let len = min(PAGE_SZ - off, srcbuf.len());
let mut write_guard;
let cache = page_cache::get();
let buf = match cache
.write_ephemeral_buf(self.file_id, blkno)
.map_err(|e| to_io_error(e, "Failed to write ephemeral buf"))?
{
WriteBufResult::Found(guard) => {
write_guard = guard;
write_guard.deref_mut()
}
WriteBufResult::NotFound(guard) => {
// Read the page from disk into the buffer
// TODO: if we're overwriting the whole page, no need to read it in first
write_guard = guard;
self.fill_buffer(write_guard.deref_mut(), blkno)?;
write_guard.mark_valid();
// And then fall through to modify it.
write_guard.deref_mut()
}
};
buf[off..(off + len)].copy_from_slice(&srcbuf[0..len]);
write_guard.mark_dirty();
Ok(len)
}
}
impl BlobWriter for EphemeralFile {
fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, io::Error> {
let pos = self.size;
let mut blknum = (self.size / PAGE_SZ as u64) as u32;
let mut off = (pos % PAGE_SZ as u64) as usize;
let mut buf = self.get_buf_for_write(blknum)?;
// Write the length field
if srcbuf.len() < 0x80 {
buf[off] = srcbuf.len() as u8;
off += 1;
} else {
let mut len_buf = u32::to_be_bytes(srcbuf.len() as u32);
len_buf[0] |= 0x80;
let thislen = PAGE_SZ - off;
if thislen < 4 {
// it needs to be split across pages
buf[off..(off + thislen)].copy_from_slice(&len_buf[..thislen]);
blknum += 1;
buf = self.get_buf_for_write(blknum)?;
buf[0..4 - thislen].copy_from_slice(&len_buf[thislen..]);
off = 4 - thislen;
} else {
buf[off..off + 4].copy_from_slice(&len_buf);
off += 4;
}
}
// Write the payload
let mut buf_remain = srcbuf;
while !buf_remain.is_empty() {
let mut page_remain = PAGE_SZ - off;
if page_remain == 0 {
blknum += 1;
buf = self.get_buf_for_write(blknum)?;
off = 0;
page_remain = PAGE_SZ;
}
let this_blk_len = min(page_remain, buf_remain.len());
buf[off..(off + this_blk_len)].copy_from_slice(&buf_remain[..this_blk_len]);
off += this_blk_len;
buf_remain = &buf_remain[this_blk_len..];
}
drop(buf);
if srcbuf.len() < 0x80 {
self.size += 1;
} else {
self.size += 4;
}
self.size += srcbuf.len() as u64;
Ok(pos)
}
}
impl Drop for EphemeralFile {
fn drop(&mut self) {
// drop all pages from page cache
let cache = page_cache::get();
cache.drop_buffers_for_ephemeral(self.file_id);
// remove entry from the hash map
EPHEMERAL_FILES.write().unwrap().files.remove(&self.file_id);
// There might still be pages in the [`crate::page_cache`] for this file.
// We leave them there, [`crate::page_cache::PageCache::find_victim`] will evict them when needed.
// unlink the file
let res = std::fs::remove_file(&self.file.path);
if let Err(e) = res {
warn!(
"could not remove ephemeral file '{}': {}",
self.file.path.display(),
e
);
}
}
}
pub fn writeback(file_id: u64, blkno: u32, buf: &[u8]) -> Result<(), io::Error> {
if let Some(file) = EPHEMERAL_FILES.read().unwrap().files.get(&file_id) {
match file.write_all_at(buf, blkno as u64 * PAGE_SZ as u64) {
Ok(_) => Ok(()),
Err(e) => Err(io::Error::new(
ErrorKind::Other,
format!(
"failed to write back to ephemeral file at {} error: {}",
file.path.display(),
if e.kind() != std::io::ErrorKind::NotFound {
// just never log the not found errors, we cannot do anything for them; on detach
// the tenant directory is already gone.
//
// not found files might also be related to https://github.com/neondatabase/neon/issues/2442
error!(
"could not remove ephemeral file '{}': {}",
self.file.path.display(),
e
),
)),
);
}
}
} else {
Err(io::Error::new(
ErrorKind::Other,
"could not write back page, not found in ephemeral files hash",
))
}
}
impl BlockReader for EphemeralFile {
type BlockLease = page_cache::PageReadGuard<'static>;
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, io::Error> {
// Look up the right page
let cache = page_cache::get();
loop {
match cache
.read_ephemeral_buf(self.file_id, blknum)
.map_err(|e| to_io_error(e, "Failed to read ephemeral buf"))?
{
ReadBufResult::Found(guard) => return Ok(guard),
ReadBufResult::NotFound(mut write_guard) => {
// Read the page from disk into the buffer
self.fill_buffer(write_guard.deref_mut(), blknum)?;
write_guard.mark_valid();
// Swap for read lock
continue;
}
};
}
fn block_cursor(&self) -> super::block_io::BlockCursor<'_> {
BlockCursor::new(super::block_io::BlockReaderRef::EphemeralFile(self))
}
}
fn to_io_error(e: anyhow::Error, context: &str) -> io::Error {
io::Error::new(ErrorKind::Other, format!("{context}: {e:#}"))
}
#[cfg(test)]
mod tests {
use super::*;
use crate::tenant::blob_io::{BlobCursor, BlobWriter};
use crate::tenant::block_io::BlockCursor;
use rand::{seq::SliceRandom, thread_rng, RngCore};
use crate::tenant::block_io::{BlockCursor, BlockReaderRef};
use rand::{thread_rng, RngCore};
use std::fs;
use std::str::FromStr;
@@ -351,78 +281,43 @@ mod tests {
Ok((conf, tenant_id, timeline_id))
}
// Helper function to slurp contents of a file, starting at the current position,
// into a string
fn read_string(efile: &EphemeralFile, offset: u64, len: usize) -> Result<String, io::Error> {
let mut buf = Vec::new();
buf.resize(len, 0u8);
efile.read_exact_at(&mut buf, offset)?;
Ok(String::from_utf8_lossy(&buf)
.trim_end_matches('\0')
.to_string())
}
#[test]
fn test_ephemeral_files() -> Result<(), io::Error> {
let (conf, tenant_id, timeline_id) = harness("ephemeral_files")?;
let file_a = EphemeralFile::create(conf, tenant_id, timeline_id)?;
file_a.write_all_at(b"foo", 0)?;
assert_eq!("foo", read_string(&file_a, 0, 20)?);
file_a.write_all_at(b"bar", 3)?;
assert_eq!("foobar", read_string(&file_a, 0, 20)?);
// Open a lot of files, enough to cause some page evictions.
let mut efiles = Vec::new();
for fileno in 0..100 {
let efile = EphemeralFile::create(conf, tenant_id, timeline_id)?;
efile.write_all_at(format!("file {}", fileno).as_bytes(), 0)?;
assert_eq!(format!("file {}", fileno), read_string(&efile, 0, 10)?);
efiles.push((fileno, efile));
}
// Check that all the files can still be read from. Use them in random order for
// good measure.
efiles.as_mut_slice().shuffle(&mut thread_rng());
for (fileno, efile) in efiles.iter_mut() {
assert_eq!(format!("file {}", fileno), read_string(efile, 0, 10)?);
}
Ok(())
}
#[test]
fn test_ephemeral_blobs() -> Result<(), io::Error> {
#[tokio::test]
async fn test_ephemeral_blobs() -> Result<(), io::Error> {
let (conf, tenant_id, timeline_id) = harness("ephemeral_blobs")?;
let mut file = EphemeralFile::create(conf, tenant_id, timeline_id)?;
let pos_foo = file.write_blob(b"foo")?;
assert_eq!(b"foo", file.block_cursor().read_blob(pos_foo)?.as_slice());
let pos_bar = file.write_blob(b"bar")?;
assert_eq!(b"foo", file.block_cursor().read_blob(pos_foo)?.as_slice());
assert_eq!(b"bar", file.block_cursor().read_blob(pos_bar)?.as_slice());
let pos_foo = file.write_blob(b"foo").await?;
assert_eq!(
b"foo",
file.block_cursor().read_blob(pos_foo).await?.as_slice()
);
let pos_bar = file.write_blob(b"bar").await?;
assert_eq!(
b"foo",
file.block_cursor().read_blob(pos_foo).await?.as_slice()
);
assert_eq!(
b"bar",
file.block_cursor().read_blob(pos_bar).await?.as_slice()
);
let mut blobs = Vec::new();
for i in 0..10000 {
let data = Vec::from(format!("blob{}", i).as_bytes());
let pos = file.write_blob(&data)?;
let pos = file.write_blob(&data).await?;
blobs.push((pos, data));
}
// also test with a large blobs
for i in 0..100 {
let data = format!("blob{}", i).as_bytes().repeat(100);
let pos = file.write_blob(&data)?;
let pos = file.write_blob(&data).await?;
blobs.push((pos, data));
}
let mut cursor = BlockCursor::new(&file);
let cursor = BlockCursor::new(BlockReaderRef::EphemeralFile(&file));
for (pos, expected) in blobs {
let actual = cursor.read_blob(pos)?;
let actual = cursor.read_blob(pos).await?;
assert_eq!(actual, expected);
}
@@ -430,8 +325,8 @@ mod tests {
let mut large_data = Vec::new();
large_data.resize(20000, 0);
thread_rng().fill_bytes(&mut large_data);
let pos_large = file.write_blob(&large_data)?;
let result = file.block_cursor().read_blob(pos_large)?;
let pos_large = file.write_blob(&large_data).await?;
let result = file.block_cursor().read_blob(pos_large).await?;
assert_eq!(result, large_data);
Ok(())

View File

@@ -50,7 +50,6 @@ use crate::context::RequestContext;
use crate::keyspace::KeyPartitioning;
use crate::repository::Key;
use crate::tenant::storage_layer::InMemoryLayer;
use crate::tenant::storage_layer::Layer;
use anyhow::Result;
use std::collections::VecDeque;
use std::ops::Range;
@@ -121,7 +120,7 @@ impl BatchedUpdates<'_> {
///
/// This should be called when the corresponding file on disk has been deleted.
///
pub fn remove_historic(&mut self, layer_desc: PersistentLayerDesc) {
pub fn remove_historic(&mut self, layer_desc: &PersistentLayerDesc) {
self.layer_map.remove_historic_noflush(layer_desc)
}
@@ -253,11 +252,11 @@ impl LayerMap {
///
/// Helper function for BatchedUpdates::remove_historic
///
pub fn remove_historic_noflush(&mut self, layer_desc: PersistentLayerDesc) {
pub fn remove_historic_noflush(&mut self, layer_desc: &PersistentLayerDesc) {
self.historic
.remove(historic_layer_coverage::LayerKey::from(&layer_desc));
.remove(historic_layer_coverage::LayerKey::from(layer_desc));
let layer_key = layer_desc.key();
if Self::is_l0(&layer_desc) {
if Self::is_l0(layer_desc) {
let len_before = self.l0_delta_layers.len();
let mut l0_delta_layers = std::mem::take(&mut self.l0_delta_layers);
l0_delta_layers.retain(|other| other.key() != layer_key);
@@ -626,17 +625,17 @@ impl LayerMap {
/// debugging function to print out the contents of the layer map
#[allow(unused)]
pub fn dump(&self, verbose: bool, ctx: &RequestContext) -> Result<()> {
pub async fn dump(&self, verbose: bool, ctx: &RequestContext) -> Result<()> {
println!("Begin dump LayerMap");
println!("open_layer:");
if let Some(open_layer) = &self.open_layer {
open_layer.dump(verbose, ctx)?;
open_layer.dump(verbose, ctx).await?;
}
println!("frozen_layers:");
for frozen_layer in self.frozen_layers.iter() {
frozen_layer.dump(verbose, ctx)?;
frozen_layer.dump(verbose, ctx).await?;
}
println!("historic_layers:");
@@ -766,8 +765,7 @@ mod tests {
expected_in_counts
);
map.batch_update()
.remove_historic(downloaded.layer_desc().clone());
map.batch_update().remove_historic(downloaded.layer_desc());
assert_eq!(count_layer_in(&map, downloaded.layer_desc()), (0, 0));
}

View File

@@ -26,7 +26,7 @@
//! recovered from this file. This is tracked in
//! <https://github.com/neondatabase/neon/issues/4418>
use std::io::{self, Read, Write};
use std::io::{self, Write};
use crate::virtual_file::VirtualFile;
use anyhow::Result;
@@ -151,11 +151,12 @@ impl Manifest {
/// Load a manifest. Returns the manifest and a list of operations. If the manifest is corrupted,
/// the bool flag will be set to true and the user is responsible to reconstruct a new manifest and
/// backup the current one.
pub fn load(
mut file: VirtualFile,
pub async fn load(
file: VirtualFile,
) -> Result<(Self, Vec<Operation>, ManifestPartiallyCorrupted), ManifestLoadError> {
let mut buf = vec![];
file.read_to_end(&mut buf).map_err(ManifestLoadError::Io)?;
file.read_exact_at(&mut buf, 0)
.map_err(ManifestLoadError::Io)?;
// Read manifest header
let mut buf = Bytes::from(buf);
@@ -241,8 +242,8 @@ mod tests {
use super::*;
#[test]
fn test_read_manifest() {
#[tokio::test]
async fn test_read_manifest() {
let testdir = crate::config::PageServerConf::test_repo_dir("test_read_manifest");
std::fs::create_dir_all(&testdir).unwrap();
let file = VirtualFile::create(&testdir.join("MANIFEST")).unwrap();
@@ -274,7 +275,7 @@ mod tests {
.truncate(false),
)
.unwrap();
let (mut manifest, operations, corrupted) = Manifest::load(file).unwrap();
let (mut manifest, operations, corrupted) = Manifest::load(file).await.unwrap();
assert!(!corrupted.0);
assert_eq!(operations.len(), 2);
assert_eq!(
@@ -306,7 +307,7 @@ mod tests {
.truncate(false),
)
.unwrap();
let (_manifest, operations, corrupted) = Manifest::load(file).unwrap();
let (_manifest, operations, corrupted) = Manifest::load(file).await.unwrap();
assert!(!corrupted.0);
assert_eq!(operations.len(), 3);
assert_eq!(&operations[0], &Operation::Snapshot(snapshot, Lsn::from(0)));

View File

@@ -9,10 +9,11 @@
//! [`remote_timeline_client`]: super::remote_timeline_client
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::io;
use anyhow::{bail, ensure, Context};
use serde::{Deserialize, Serialize};
use anyhow::{ensure, Context};
use serde::{de::Error, Deserialize, Serialize, Serializer};
use thiserror::Error;
use tracing::info_span;
use utils::bin_ser::SerializeError;
use utils::{
@@ -231,8 +232,30 @@ impl TimelineMetadata {
}
}
impl<'de> Deserialize<'de> for TimelineMetadata {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
let bytes = Vec::<u8>::deserialize(deserializer)?;
Self::from_bytes(bytes.as_slice()).map_err(|e| D::Error::custom(format!("{e}")))
}
}
impl Serialize for TimelineMetadata {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: Serializer,
{
let bytes = self
.to_bytes()
.map_err(|e| serde::ser::Error::custom(format!("{e}")))?;
bytes.serialize(serializer)
}
}
/// Save timeline metadata to file
pub fn save_metadata(
pub async fn save_metadata(
conf: &'static PageServerConf,
tenant_id: &TenantId,
timeline_id: &TimelineId,
@@ -250,10 +273,7 @@ pub fn save_metadata(
let metadata_bytes = data.to_bytes().context("Failed to get metadata bytes")?;
if file.write(&metadata_bytes)? != metadata_bytes.len() {
bail!("Could not write all the metadata bytes in a single call");
}
file.sync_all()?;
file.write_and_fsync(&metadata_bytes)?;
// fsync the parent directory to ensure the directory entry is durable
if first_save {
@@ -267,24 +287,24 @@ pub fn save_metadata(
Ok(())
}
#[derive(Error, Debug)]
pub enum LoadMetadataError {
#[error(transparent)]
Read(#[from] io::Error),
#[error(transparent)]
Decode(#[from] anyhow::Error),
}
pub fn load_metadata(
conf: &'static PageServerConf,
tenant_id: &TenantId,
timeline_id: &TimelineId,
) -> anyhow::Result<TimelineMetadata> {
) -> Result<TimelineMetadata, LoadMetadataError> {
let metadata_path = conf.metadata_path(tenant_id, timeline_id);
let metadata_bytes = std::fs::read(&metadata_path).with_context(|| {
format!(
"Failed to read metadata bytes from path {}",
metadata_path.display()
)
})?;
TimelineMetadata::from_bytes(&metadata_bytes).with_context(|| {
format!(
"Failed to parse metadata bytes from path {}",
metadata_path.display()
)
})
let metadata_bytes = std::fs::read(metadata_path)?;
Ok(TimelineMetadata::from_bytes(&metadata_bytes)?)
}
#[cfg(test)]

View File

@@ -20,15 +20,21 @@ use crate::config::PageServerConf;
use crate::context::{DownloadBehavior, RequestContext};
use crate::task_mgr::{self, TaskKind};
use crate::tenant::config::TenantConfOpt;
use crate::tenant::delete::DeleteTenantFlow;
use crate::tenant::{create_tenant_files, CreateTenantFilesMode, Tenant, TenantState};
use crate::{InitializationOrder, IGNORED_TENANT_FILE_NAME};
use utils::fs_ext::PathExt;
use utils::generation::Generation;
use utils::id::{TenantId, TimelineId};
use super::delete::DeleteTenantError;
use super::timeline::delete::DeleteTimelineFlow;
use super::TenantSharedResources;
/// The tenants known to the pageserver.
/// The enum variants are used to distinguish the different states that the pageserver can be in.
enum TenantsMap {
pub(crate) enum TenantsMap {
/// [`init_tenant_mgr`] is not done yet.
Initializing,
/// [`init_tenant_mgr`] is done, all on-disk tenants have been loaded.
@@ -40,13 +46,13 @@ enum TenantsMap {
}
impl TenantsMap {
fn get(&self, tenant_id: &TenantId) -> Option<&Arc<Tenant>> {
pub(crate) fn get(&self, tenant_id: &TenantId) -> Option<&Arc<Tenant>> {
match self {
TenantsMap::Initializing => None,
TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => m.get(tenant_id),
}
}
fn remove(&mut self, tenant_id: &TenantId) -> Option<Arc<Tenant>> {
pub(crate) fn remove(&mut self, tenant_id: &TenantId) -> Option<Arc<Tenant>> {
match self {
TenantsMap::Initializing => None,
TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => m.remove(tenant_id),
@@ -62,8 +68,7 @@ static TENANTS: Lazy<RwLock<TenantsMap>> = Lazy::new(|| RwLock::new(TenantsMap::
#[instrument(skip_all)]
pub async fn init_tenant_mgr(
conf: &'static PageServerConf,
broker_client: storage_broker::BrokerClientChannel,
remote_storage: Option<GenericRemoteStorage>,
resources: TenantSharedResources,
init_order: InitializationOrder,
) -> anyhow::Result<()> {
// Scan local filesystem for attached tenants
@@ -95,7 +100,9 @@ pub async fn init_tenant_mgr(
);
}
} else {
// This case happens if we crash during attach before creating the attach marker file
// This case happens if we:
// * crash during attach before creating the attach marker file
// * crash during tenant delete before removing tenant directory
let is_empty = tenant_dir_path.is_empty_dir().with_context(|| {
format!("Failed to check whether {tenant_dir_path:?} is an empty dir")
})?;
@@ -119,9 +126,9 @@ pub async fn init_tenant_mgr(
match schedule_local_tenant_processing(
conf,
&tenant_dir_path,
broker_client.clone(),
remote_storage.clone(),
resources.clone(),
Some(init_order.clone()),
&TENANTS,
&ctx,
) {
Ok(tenant) => {
@@ -152,12 +159,12 @@ pub async fn init_tenant_mgr(
Ok(())
}
pub fn schedule_local_tenant_processing(
pub(crate) fn schedule_local_tenant_processing(
conf: &'static PageServerConf,
tenant_path: &Path,
broker_client: storage_broker::BrokerClientChannel,
remote_storage: Option<GenericRemoteStorage>,
resources: TenantSharedResources,
init_order: Option<InitializationOrder>,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
ctx: &RequestContext,
) -> anyhow::Result<Arc<Tenant>> {
anyhow::ensure!(
@@ -192,8 +199,16 @@ pub fn schedule_local_tenant_processing(
let tenant = if conf.tenant_attaching_mark_file_path(&tenant_id).exists() {
info!("tenant {tenant_id} has attaching mark file, resuming its attach operation");
if let Some(remote_storage) = remote_storage {
match Tenant::spawn_attach(conf, tenant_id, broker_client, remote_storage, ctx) {
if let Some(remote_storage) = resources.remote_storage {
match Tenant::spawn_attach(
conf,
tenant_id,
Generation::none(),
resources.broker_client,
tenants,
remote_storage,
ctx,
) {
Ok(tenant) => tenant,
Err(e) => {
error!("Failed to spawn_attach tenant {tenant_id}, reason: {e:#}");
@@ -214,9 +229,10 @@ pub fn schedule_local_tenant_processing(
Tenant::spawn_load(
conf,
tenant_id,
broker_client,
remote_storage,
Generation::none(),
resources,
init_order,
tenants,
ctx,
)
};
@@ -264,71 +280,77 @@ async fn shutdown_all_tenants0(tenants: &tokio::sync::RwLock<TenantsMap>) {
}
};
let started_at = std::time::Instant::now();
let mut join_set = JoinSet::new();
for (tenant_id, tenant) in tenants_to_shut_down {
join_set.spawn(
async move {
// ordering shouldn't matter for this, either we store true right away or never
let ordering = std::sync::atomic::Ordering::Relaxed;
let joined_other = std::sync::atomic::AtomicBool::new(false);
let freeze_and_flush = true;
let mut shutdown = std::pin::pin!(async {
let freeze_and_flush = true;
let res = {
let (_guard, shutdown_progress) = completion::channel();
tenant.shutdown(shutdown_progress, freeze_and_flush).await
};
if let Err(other_progress) = res {
// join the another shutdown in progress
joined_other.store(true, ordering);
other_progress.wait().await;
}
});
// in practice we might not have a lot time to go, since systemd is going to
// SIGKILL us at 10s, but we can try. delete tenant might take a while, so put out
// a warning.
let warning = std::time::Duration::from_secs(5);
let mut warning = std::pin::pin!(tokio::time::sleep(warning));
tokio::select! {
_ = &mut shutdown => {},
_ = &mut warning => {
let joined_other = joined_other.load(ordering);
warn!(%joined_other, "waiting for the shutdown to complete");
shutdown.await;
}
let res = {
let (_guard, shutdown_progress) = completion::channel();
tenant.shutdown(shutdown_progress, freeze_and_flush).await
};
if let Err(other_progress) = res {
// join the another shutdown in progress
other_progress.wait().await;
}
// we cannot afford per tenant logging here, because if s3 is degraded, we are
// going to log too many lines
debug!("tenant successfully stopped");
}
.instrument(info_span!("shutdown", %tenant_id)),
);
}
let total = join_set.len();
let mut panicked = 0;
let mut buffering = true;
const BUFFER_FOR: std::time::Duration = std::time::Duration::from_millis(500);
let mut buffered = std::pin::pin!(tokio::time::sleep(BUFFER_FOR));
while let Some(res) = join_set.join_next().await {
match res {
Ok(()) => {}
Err(join_error) if join_error.is_cancelled() => {
unreachable!("we are not cancelling any of the futures");
}
Err(join_error) if join_error.is_panic() => {
// cannot really do anything, as this panic is likely a bug
panicked += 1;
}
Err(join_error) => {
warn!("unknown kind of JoinError: {join_error}");
while !join_set.is_empty() {
tokio::select! {
Some(joined) = join_set.join_next() => {
match joined {
Ok(()) => {}
Err(join_error) if join_error.is_cancelled() => {
unreachable!("we are not cancelling any of the futures");
}
Err(join_error) if join_error.is_panic() => {
// cannot really do anything, as this panic is likely a bug
panicked += 1;
}
Err(join_error) => {
warn!("unknown kind of JoinError: {join_error}");
}
}
if !buffering {
// buffer so that every 500ms since the first update (or starting) we'll log
// how far away we are; this is because we will get SIGKILL'd at 10s, and we
// are not able to log *then*.
buffering = true;
buffered.as_mut().reset(tokio::time::Instant::now() + BUFFER_FOR);
}
},
_ = &mut buffered, if buffering => {
buffering = false;
info!(remaining = join_set.len(), total, elapsed_ms = started_at.elapsed().as_millis(), "waiting for tenants to shutdown");
}
}
}
if panicked > 0 {
warn!(panicked, "observed panicks while shutting down tenants");
warn!(
panicked,
total, "observed panicks while shutting down tenants"
);
}
// caller will log how long we took
}
pub async fn create_tenant(
@@ -347,8 +369,12 @@ pub async fn create_tenant(
// TODO: tenant directory remains on disk if we bail out from here on.
// See https://github.com/neondatabase/neon/issues/4233
let tenant_resources = TenantSharedResources {
broker_client,
remote_storage,
};
let created_tenant =
schedule_local_tenant_processing(conf, &tenant_directory, broker_client, remote_storage, None, ctx)?;
schedule_local_tenant_processing(conf, &tenant_directory, tenant_resources, None, &TENANTS, ctx)?;
// TODO: tenant object & its background loops remain, untracked in tenant map, if we fail here.
// See https://github.com/neondatabase/neon/issues/4233
@@ -409,6 +435,14 @@ pub async fn get_tenant(
}
}
pub async fn delete_tenant(
conf: &'static PageServerConf,
remote_storage: Option<GenericRemoteStorage>,
tenant_id: TenantId,
) -> Result<(), DeleteTenantError> {
DeleteTenantFlow::run(conf, remote_storage, &TENANTS, tenant_id).await
}
#[derive(Debug, thiserror::Error)]
pub enum DeleteTimelineError {
#[error("Tenant {0}")]
@@ -421,12 +455,10 @@ pub enum DeleteTimelineError {
pub async fn delete_timeline(
tenant_id: TenantId,
timeline_id: TimelineId,
ctx: &RequestContext,
_ctx: &RequestContext,
) -> Result<(), DeleteTimelineError> {
let tenant = get_tenant(tenant_id, true).await?;
tenant
.prepare_and_schedule_delete_timeline(timeline_id, ctx)
.await?;
DeleteTimelineFlow::run(&tenant, timeline_id, false).await?;
Ok(())
}
@@ -501,7 +533,11 @@ pub async fn load_tenant(
.with_context(|| format!("Failed to remove tenant ignore mark {tenant_ignore_mark:?} during tenant loading"))?;
}
let new_tenant = schedule_local_tenant_processing(conf, &tenant_path, broker_client, remote_storage, None, ctx)
let resources = TenantSharedResources {
broker_client,
remote_storage,
};
let new_tenant = schedule_local_tenant_processing(conf, &tenant_path, resources, None, &TENANTS, ctx)
.with_context(|| {
format!("Failed to schedule tenant processing in path {tenant_path:?}")
})?;
@@ -582,7 +618,11 @@ pub async fn attach_tenant(
.context("check for attach marker file existence")?;
anyhow::ensure!(marker_file_exists, "create_tenant_files should have created the attach marker file");
let attached_tenant = schedule_local_tenant_processing(conf, &tenant_dir, broker_client, Some(remote_storage), None, ctx)?;
let resources = TenantSharedResources {
broker_client,
remote_storage: Some(remote_storage),
};
let attached_tenant = schedule_local_tenant_processing(conf, &tenant_dir, resources, None, &TENANTS, ctx)?;
// TODO: tenant object & its background loops remain, untracked in tenant map, if we fail here.
// See https://github.com/neondatabase/neon/issues/4233
@@ -768,55 +808,6 @@ pub async fn immediate_gc(
Ok(wait_task_done)
}
pub async fn immediate_compact(
tenant_id: TenantId,
timeline_id: TimelineId,
ctx: &RequestContext,
) -> Result<tokio::sync::oneshot::Receiver<anyhow::Result<()>>, ApiError> {
let guard = TENANTS.read().await;
let tenant = guard
.get(&tenant_id)
.map(Arc::clone)
.with_context(|| format!("tenant {tenant_id}"))
.map_err(|e| ApiError::NotFound(e.into()))?;
let timeline = tenant
.get_timeline(timeline_id, true)
.map_err(|e| ApiError::NotFound(e.into()))?;
// Run in task_mgr to avoid race with tenant_detach operation
let ctx = ctx.detached_child(TaskKind::Compaction, DownloadBehavior::Download);
let (task_done, wait_task_done) = tokio::sync::oneshot::channel();
task_mgr::spawn(
&tokio::runtime::Handle::current(),
TaskKind::Compaction,
Some(tenant_id),
Some(timeline_id),
&format!(
"timeline_compact_handler compaction run for tenant {tenant_id} timeline {timeline_id}"
),
false,
async move {
let result = timeline
.compact(&ctx)
.instrument(info_span!("manual_compact", %tenant_id, %timeline_id))
.await;
match task_done.send(result) {
Ok(_) => (),
Err(result) => error!("failed to send compaction result: {result:?}"),
}
Ok(())
},
);
// drop the guard until after we've spawned the task so that timeline shutdown will wait for the task
drop(guard);
Ok(wait_task_done)
}
#[cfg(test)]
mod tests {
use std::collections::HashMap;

View File

@@ -135,7 +135,7 @@
//! - Initiate upload queue with that [`IndexPart`].
//! - Reschedule all lost operations by comparing the local filesystem state
//! and remote state as per [`IndexPart`]. This is done in
//! [`Tenant::timeline_init_and_sync`] and [`Timeline::reconcile_with_remote`].
//! [`Tenant::timeline_init_and_sync`].
//!
//! Note that if we crash during file deletion between the index update
//! that removes the file from the list of files, and deleting the remote file,
@@ -172,7 +172,6 @@
//! transitioning it from `TenantState::Attaching` to `TenantState::Active` state.
//! This starts the timelines' WAL-receivers and the tenant's GC & Compaction loops.
//!
//! Most of the above steps happen in [`Timeline::reconcile_with_remote`] or its callers.
//! We keep track of the fact that a client is in `Attaching` state in a marker
//! file on the local disk. This is critical because, when we restart the pageserver,
//! we do not want to do the `List timelines` step for each tenant that has already
@@ -192,14 +191,14 @@
//! not created and the uploads are skipped.
//! Theoretically, it should be ok to remove and re-add remote storage configuration to
//! the pageserver config at any time, since it doesn't make a difference to
//! `reconcile_with_remote`.
//! [`Timeline::load_layer_map`].
//! Of course, the remote timeline dir must not change while we have de-configured
//! remote storage, i.e., the pageserver must remain the owner of the given prefix
//! in remote storage.
//! But note that we don't test any of this right now.
//!
//! [`Tenant::timeline_init_and_sync`]: super::Tenant::timeline_init_and_sync
//! [`Timeline::reconcile_with_remote`]: super::Timeline::reconcile_with_remote
//! [`Timeline::load_layer_map`]: super::Timeline::load_layer_map
mod delete;
mod download;
@@ -211,15 +210,18 @@ use chrono::{NaiveDateTime, Utc};
// re-export these
pub use download::{is_temp_download_file, list_remote_timelines};
use scopeguard::ScopeGuard;
use tokio_util::sync::CancellationToken;
use utils::backoff::{
self, exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS,
};
use std::collections::{HashMap, VecDeque};
use std::path::Path;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Mutex};
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
use std::ops::DerefMut;
use tokio::runtime::Runtime;
use tracing::{debug, error, info, instrument, warn};
use tracing::{info_span, Instrument};
use utils::lsn::Lsn;
@@ -229,9 +231,11 @@ use crate::metrics::{
RemoteTimelineClientMetricsCallTrackSize, REMOTE_ONDEMAND_DOWNLOADED_BYTES,
REMOTE_ONDEMAND_DOWNLOADED_LAYERS,
};
use crate::task_mgr::shutdown_token;
use crate::tenant::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::tenant::remote_timeline_client::index::LayerFileMetadata;
use crate::tenant::upload_queue::Delete;
use crate::tenant::TIMELINES_SEGMENT_NAME;
use crate::{
config::PageServerConf,
task_mgr,
@@ -241,7 +245,6 @@ use crate::{
tenant::upload_queue::{
UploadOp, UploadQueue, UploadQueueInitialized, UploadQueueStopped, UploadTask,
},
{exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS},
};
use utils::id::{TenantId, TimelineId};
@@ -250,18 +253,19 @@ use self::index::IndexPart;
use super::storage_layer::LayerFileName;
use super::upload_queue::SetDeletedFlagProgress;
use super::Generation;
// Occasional network issues and such can cause remote operations to fail, and
// that's expected. If a download fails, we log it at info-level, and retry.
// But after FAILED_DOWNLOAD_WARN_THRESHOLD retries, we start to log it at WARN
// level instead, as repeated failures can mean a more serious problem. If it
// fails more than FAILED_DOWNLOAD_RETRIES times, we give up
const FAILED_DOWNLOAD_WARN_THRESHOLD: u32 = 3;
const FAILED_DOWNLOAD_RETRIES: u32 = 10;
pub(crate) const FAILED_DOWNLOAD_WARN_THRESHOLD: u32 = 3;
pub(crate) const FAILED_REMOTE_OP_RETRIES: u32 = 10;
// Similarly log failed uploads and deletions at WARN level, after this many
// retries. Uploads and deletions are retried forever, though.
const FAILED_UPLOAD_WARN_THRESHOLD: u32 = 3;
pub(crate) const FAILED_UPLOAD_WARN_THRESHOLD: u32 = 3;
pub enum MaybeDeletedIndexPart {
IndexPart(IndexPart),
@@ -309,10 +313,11 @@ pub enum PersistIndexPartWithDeletedFlagError {
pub struct RemoteTimelineClient {
conf: &'static PageServerConf,
runtime: &'static Runtime,
runtime: tokio::runtime::Handle,
tenant_id: TenantId,
timeline_id: TimelineId,
generation: Generation,
upload_queue: Mutex<UploadQueue>,
@@ -333,12 +338,14 @@ impl RemoteTimelineClient {
conf: &'static PageServerConf,
tenant_id: TenantId,
timeline_id: TimelineId,
generation: Generation,
) -> RemoteTimelineClient {
RemoteTimelineClient {
conf,
runtime: &BACKGROUND_RUNTIME,
runtime: BACKGROUND_RUNTIME.handle().to_owned(),
tenant_id,
timeline_id,
generation,
storage_impl: remote_storage,
upload_queue: Mutex::new(UploadQueue::Uninitialized),
metrics: Arc::new(RemoteTimelineClientMetrics::new(&tenant_id, &timeline_id)),
@@ -352,6 +359,10 @@ impl RemoteTimelineClient {
let mut upload_queue = self.upload_queue.lock().unwrap();
upload_queue.initialize_with_current_remote_index_part(index_part)?;
self.update_remote_physical_size_gauge(Some(index_part));
info!(
"initialized upload queue from remote index with {} layer files",
index_part.layer_metadata.len()
);
Ok(())
}
@@ -364,6 +375,7 @@ impl RemoteTimelineClient {
let mut upload_queue = self.upload_queue.lock().unwrap();
upload_queue.initialize_empty_remote(local_metadata)?;
self.update_remote_physical_size_gauge(None);
info!("initialized upload queue as empty");
Ok(())
}
@@ -442,10 +454,10 @@ impl RemoteTimelineClient {
);
let index_part = download::download_index_part(
self.conf,
&self.storage_impl,
&self.tenant_id,
&self.timeline_id,
self.generation,
)
.measure_remote_op(
self.tenant_id,
@@ -514,7 +526,7 @@ impl RemoteTimelineClient {
/// updated metadata.
///
/// The upload will be added to the queue immediately, but it
/// won't be performed until all previosuly scheduled layer file
/// won't be performed until all previously scheduled layer file
/// upload operations have completed successfully. This is to
/// ensure that when the index file claims that layers X, Y and Z
/// exist in remote storage, they really do. To wait for the upload
@@ -534,8 +546,7 @@ impl RemoteTimelineClient {
// ahead of what's _actually_ on the remote during index upload.
upload_queue.latest_metadata = metadata.clone();
let metadata_bytes = upload_queue.latest_metadata.to_bytes()?;
self.schedule_index_upload(upload_queue, metadata_bytes);
self.schedule_index_upload(upload_queue, upload_queue.latest_metadata.clone());
Ok(())
}
@@ -555,8 +566,7 @@ impl RemoteTimelineClient {
let upload_queue = guard.initialized_mut()?;
if upload_queue.latest_files_changes_since_metadata_upload_scheduled > 0 {
let metadata_bytes = upload_queue.latest_metadata.to_bytes()?;
self.schedule_index_upload(upload_queue, metadata_bytes);
self.schedule_index_upload(upload_queue, upload_queue.latest_metadata.clone());
}
Ok(())
@@ -566,7 +576,7 @@ impl RemoteTimelineClient {
fn schedule_index_upload(
self: &Arc<Self>,
upload_queue: &mut UploadQueueInitialized,
metadata_bytes: Vec<u8>,
metadata: TimelineMetadata,
) {
info!(
"scheduling metadata upload with {} files ({} changed)",
@@ -579,7 +589,7 @@ impl RemoteTimelineClient {
let index_part = IndexPart::new(
upload_queue.latest_files.clone(),
disk_consistent_lsn,
metadata_bytes,
metadata,
);
let op = UploadOp::UploadMetadata(index_part, disk_consistent_lsn);
self.calls_unfinished_metric_begin(&op);
@@ -625,7 +635,7 @@ impl RemoteTimelineClient {
/// Note: This schedules an index file upload before the deletions. The
/// deletion won't actually be performed, until any previously scheduled
/// upload operations, and the index file upload, have completed
/// succesfully.
/// successfully.
pub fn schedule_layer_file_deletion(
self: &Arc<Self>,
names: &[LayerFileName],
@@ -635,7 +645,7 @@ impl RemoteTimelineClient {
// Deleting layers doesn't affect the values stored in TimelineMetadata,
// so we don't need update it. Just serialize it.
let metadata_bytes = upload_queue.latest_metadata.to_bytes()?;
let metadata = upload_queue.latest_metadata.clone();
// Update the remote index file, removing the to-be-deleted files from the index,
// before deleting the actual files.
@@ -645,21 +655,41 @@ impl RemoteTimelineClient {
// from latest_files, but not yet scheduled for deletion. Use a closure
// to syntactically forbid ? or bail! calls here.
let no_bail_here = || {
for name in names {
upload_queue.latest_files.remove(name);
upload_queue.latest_files_changes_since_metadata_upload_scheduled += 1;
}
// Decorate our list of names with each name's generation, dropping
// makes that are unexpectedly missing from our metadata.
let with_generations: Vec<_> = names
.iter()
.filter_map(|name| {
// Remove from latest_files, learning the file's remote generation in the process
let meta = upload_queue.latest_files.remove(name);
if let Some(meta) = meta {
upload_queue.latest_files_changes_since_metadata_upload_scheduled += 1;
Some((name, meta.generation))
} else {
// This can only happen if we forgot to to schedule the file upload
// before scheduling the delete. Log it because it is a rare/strange
// situation, and in case something is misbehaving, we'd like to know which
// layers experienced this.
info!(
"Deleting layer {name} not found in latest_files list, never uploaded?"
);
None
}
})
.collect();
if upload_queue.latest_files_changes_since_metadata_upload_scheduled > 0 {
self.schedule_index_upload(upload_queue, metadata_bytes);
self.schedule_index_upload(upload_queue, metadata);
}
// schedule the actual deletions
for name in names {
for (name, generation) in with_generations {
let op = UploadOp::Delete(Delete {
file_kind: RemoteOpFileKind::Layer,
layer_file_name: name.clone(),
scheduled_from_timeline_delete: false,
generation,
});
self.calls_unfinished_metric_begin(&op);
upload_queue.queued_operations.push_back(op);
@@ -752,12 +782,25 @@ impl RemoteTimelineClient {
pausable_failpoint!("persist_deleted_index_part");
upload::upload_index_part(
self.conf,
&self.storage_impl,
&self.tenant_id,
&self.timeline_id,
&index_part_with_deleted_at,
backoff::retry(
|| {
upload::upload_index_part(
&self.storage_impl,
&self.tenant_id,
&self.timeline_id,
self.generation,
&index_part_with_deleted_at,
)
},
|_e| false,
1,
// have just a couple of attempts
// when executed as part of timeline deletion this happens in context of api call
// when executed as part of tenant deletion this happens in the background
2,
"persist_index_part_with_deleted_flag",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
)
.await?;
@@ -803,12 +846,14 @@ impl RemoteTimelineClient {
.reserve(stopped.upload_queue_for_deletion.latest_files.len());
// schedule the actual deletions
for name in stopped.upload_queue_for_deletion.latest_files.keys() {
for (name, meta) in &stopped.upload_queue_for_deletion.latest_files {
let op = UploadOp::Delete(Delete {
file_kind: RemoteOpFileKind::Layer,
layer_file_name: name.clone(),
scheduled_from_timeline_delete: true,
generation: meta.generation,
});
self.calls_unfinished_metric_begin(&op);
stopped
.upload_queue_for_deletion
@@ -827,17 +872,26 @@ impl RemoteTimelineClient {
)
};
receiver.changed().await?;
receiver.changed().await.context("upload queue shut down")?;
// Do not delete index part yet, it is needed for possible retry. If we remove it first
// and retry will arrive to different pageserver there wont be any traces of it on remote storage
let timeline_path = self.conf.timeline_path(&self.tenant_id, &self.timeline_id);
let timeline_storage_path = self.conf.remote_path(&timeline_path)?;
let timeline_storage_path = remote_timeline_path(&self.tenant_id, &self.timeline_id);
let remaining = self
.storage_impl
.list_prefixes(Some(&timeline_storage_path))
.await?;
let remaining = backoff::retry(
|| async {
self.storage_impl
.list_files(Some(&timeline_storage_path))
.await
},
|_e| false,
FAILED_DOWNLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"list_prefixes",
backoff::Cancel::new(shutdown_token(), || anyhow::anyhow!("Cancelled!")),
)
.await
.context("list prefixes")?;
let remaining: Vec<RemotePath> = remaining
.into_iter()
@@ -852,13 +906,44 @@ impl RemoteTimelineClient {
.collect();
if !remaining.is_empty() {
self.storage_impl.delete_objects(&remaining).await?;
backoff::retry(
|| async { self.storage_impl.delete_objects(&remaining).await },
|_e| false,
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"delete_objects",
backoff::Cancel::new(shutdown_token(), || anyhow::anyhow!("Cancelled!")),
)
.await
.context("delete_objects")?;
}
fail::fail_point!("timeline-delete-before-index-delete", |_| {
Err(anyhow::anyhow!(
"failpoint: timeline-delete-before-index-delete"
))?
});
let index_file_path = timeline_storage_path.join(Path::new(IndexPart::FILE_NAME));
debug!("deleting index part");
self.storage_impl.delete(&index_file_path).await?;
backoff::retry(
|| async { self.storage_impl.delete(&index_file_path).await },
|_e| false,
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"delete_index",
backoff::Cancel::new(shutdown_token(), || anyhow::anyhow!("Cancelled")),
)
.await
.context("delete_index")?;
fail::fail_point!("timeline-delete-after-index-delete", |_| {
Err(anyhow::anyhow!(
"failpoint: timeline-delete-after-index-delete"
))?
});
info!(prefix=%timeline_storage_path, referenced=deletions_queued, not_referenced=%remaining.len(), "done deleting in timeline prefix, including index_part.json");
@@ -942,7 +1027,7 @@ impl RemoteTimelineClient {
let tenant_id = self.tenant_id;
let timeline_id = self.timeline_id;
task_mgr::spawn(
self.runtime.handle(),
&self.runtime,
TaskKind::RemoteUploadTask,
Some(self.tenant_id),
Some(self.timeline_id),
@@ -995,15 +1080,17 @@ impl RemoteTimelineClient {
let upload_result: anyhow::Result<()> = match &task.op {
UploadOp::UploadLayer(ref layer_file_name, ref layer_metadata) => {
let path = &self
let path = self
.conf
.timeline_path(&self.tenant_id, &self.timeline_id)
.join(layer_file_name.file_name());
upload::upload_timeline_layer(
self.conf,
&self.storage_impl,
path,
&path,
layer_metadata,
self.generation,
)
.measure_remote_op(
self.tenant_id,
@@ -1015,11 +1102,20 @@ impl RemoteTimelineClient {
.await
}
UploadOp::UploadMetadata(ref index_part, _lsn) => {
let mention_having_future_layers = if cfg!(feature = "testing") {
index_part
.layer_metadata
.keys()
.any(|x| x.is_in_future(*_lsn))
} else {
false
};
let res = upload::upload_index_part(
self.conf,
&self.storage_impl,
&self.tenant_id,
&self.timeline_id,
self.generation,
index_part,
)
.measure_remote_op(
@@ -1032,6 +1128,10 @@ impl RemoteTimelineClient {
.await;
if res.is_ok() {
self.update_remote_physical_size_gauge(Some(index_part));
if mention_having_future_layers {
// find rationale near crate::tenant::timeline::init::cleanup_future_layer
tracing::info!(disk_consistent_lsn=%_lsn, "uploaded an index_part.json with future layers -- this is ok! if shutdown now, expect future layer cleanup");
}
}
res
}
@@ -1040,7 +1140,7 @@ impl RemoteTimelineClient {
.conf
.timeline_path(&self.tenant_id, &self.timeline_id)
.join(delete.layer_file_name.file_name());
delete::delete_layer(self.conf, &self.storage_impl, path)
delete::delete_layer(self.conf, &self.storage_impl, path, delete.generation)
.measure_remote_op(
self.tenant_id,
self.timeline_id,
@@ -1083,14 +1183,13 @@ impl RemoteTimelineClient {
}
// sleep until it's time to retry, or we're cancelled
tokio::select! {
_ = task_mgr::shutdown_watcher() => { },
_ = exponential_backoff(
retries,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
) => { },
};
exponential_backoff(
retries,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
&shutdown_token(),
)
.await;
}
}
}
@@ -1105,7 +1204,7 @@ impl RemoteTimelineClient {
debug!("remote task {} completed successfully", task.op);
}
// The task has completed succesfully. Remove it from the in-progress list.
// The task has completed successfully. Remove it from the in-progress list.
{
let mut upload_queue_guard = self.upload_queue.lock().unwrap();
let upload_queue = match upload_queue_guard.deref_mut() {
@@ -1288,6 +1387,71 @@ impl RemoteTimelineClient {
}
}
pub fn remote_timelines_path(tenant_id: &TenantId) -> RemotePath {
let path = format!("tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}");
RemotePath::from_string(&path).expect("Failed to construct path")
}
pub fn remote_timeline_path(tenant_id: &TenantId, timeline_id: &TimelineId) -> RemotePath {
remote_timelines_path(tenant_id).join(&PathBuf::from(timeline_id.to_string()))
}
pub fn remote_layer_path(
tenant_id: &TenantId,
timeline_id: &TimelineId,
layer_file_name: &LayerFileName,
layer_meta: &LayerFileMetadata,
) -> RemotePath {
// Generation-aware key format
let path = format!(
"tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}/{timeline_id}/{0}{1}",
layer_file_name.file_name(),
layer_meta.generation.get_suffix()
);
RemotePath::from_string(&path).expect("Failed to construct path")
}
pub fn remote_index_path(
tenant_id: &TenantId,
timeline_id: &TimelineId,
generation: Generation,
) -> RemotePath {
RemotePath::from_string(&format!(
"tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}/{timeline_id}/{0}{1}",
IndexPart::FILE_NAME,
generation.get_suffix()
))
.expect("Failed to construct path")
}
/// Files on the remote storage are stored with paths, relative to the workdir.
/// That path includes in itself both tenant and timeline ids, allowing to have a unique remote storage path.
///
/// Errors if the path provided does not start from pageserver's workdir.
pub fn remote_path(
conf: &PageServerConf,
local_path: &Path,
generation: Generation,
) -> anyhow::Result<RemotePath> {
let stripped = local_path
.strip_prefix(&conf.workdir)
.context("Failed to strip workdir prefix")?;
let suffixed = format!(
"{0}{1}",
stripped.to_string_lossy(),
generation.get_suffix()
);
RemotePath::new(&PathBuf::from(suffixed)).with_context(|| {
format!(
"to resolve remote part of path {:?} for base {:?}",
local_path, conf.workdir
)
})
}
#[cfg(test)]
mod tests {
use super::*;
@@ -1295,7 +1459,7 @@ mod tests {
context::RequestContext,
tenant::{
harness::{TenantHarness, TIMELINE_ID},
Tenant,
Generation, Tenant, Timeline,
},
DEFAULT_PG_VERSION,
};
@@ -1304,7 +1468,6 @@ mod tests {
collections::HashSet,
path::{Path, PathBuf},
};
use tokio::runtime::EnterGuard;
use utils::lsn::Lsn;
pub(super) fn dummy_contents(name: &str) -> Vec<u8> {
@@ -1338,8 +1501,11 @@ mod tests {
assert_eq!(avec, bvec);
}
fn assert_remote_files(expected: &[&str], remote_path: &Path) {
let mut expected: Vec<String> = expected.iter().map(|x| String::from(*x)).collect();
fn assert_remote_files(expected: &[&str], remote_path: &Path, generation: Generation) {
let mut expected: Vec<String> = expected
.iter()
.map(|x| format!("{}{}", x, generation.get_suffix()))
.collect();
expected.sort();
let mut found: Vec<String> = Vec::new();
@@ -1354,35 +1520,25 @@ mod tests {
}
struct TestSetup {
runtime: &'static tokio::runtime::Runtime,
entered_runtime: EnterGuard<'static>,
harness: TenantHarness,
tenant: Arc<Tenant>,
timeline: Arc<Timeline>,
tenant_ctx: RequestContext,
remote_fs_dir: PathBuf,
client: Arc<RemoteTimelineClient>,
}
impl TestSetup {
fn new(test_name: &str) -> anyhow::Result<Self> {
async fn new(test_name: &str) -> anyhow::Result<Self> {
// Use a current-thread runtime in the test
let runtime = Box::leak(Box::new(
tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()?,
));
let entered_runtime = runtime.enter();
let test_name = Box::leak(Box::new(format!("remote_timeline_client__{test_name}")));
let harness = TenantHarness::create(test_name)?;
let (tenant, ctx) = runtime.block_on(harness.load());
let (tenant, ctx) = harness.load().await;
// create an empty timeline directory
let _ = runtime.block_on(tenant.create_test_timeline(
TIMELINE_ID,
Lsn(8),
DEFAULT_PG_VERSION,
&ctx,
))?;
let timeline = tenant
.create_test_timeline(TIMELINE_ID, Lsn(8), DEFAULT_PG_VERSION, &ctx)
.await?;
let remote_fs_dir = harness.conf.workdir.join("remote_fs");
std::fs::create_dir_all(remote_fs_dir)?;
@@ -1400,13 +1556,16 @@ mod tests {
storage: RemoteStorageKind::LocalFs(remote_fs_dir.clone()),
};
let generation = Generation::new(0xdeadbeef);
let storage = GenericRemoteStorage::from_config(&storage_config).unwrap();
let client = Arc::new(RemoteTimelineClient {
conf: harness.conf,
runtime,
runtime: tokio::runtime::Handle::current(),
tenant_id: harness.tenant_id,
timeline_id: TIMELINE_ID,
generation,
storage_impl: storage,
upload_queue: Mutex::new(UploadQueue::Uninitialized),
metrics: Arc::new(RemoteTimelineClientMetrics::new(
@@ -1416,10 +1575,9 @@ mod tests {
});
Ok(Self {
runtime,
entered_runtime,
harness,
tenant,
timeline,
tenant_ctx: ctx,
remote_fs_dir,
client,
@@ -1428,8 +1586,8 @@ mod tests {
}
// Test scheduling
#[test]
fn upload_scheduling() -> anyhow::Result<()> {
#[tokio::test]
async fn upload_scheduling() {
// Test outline:
//
// Schedule upload of a bunch of layers. Check that they are started immediately, not queued
@@ -1445,25 +1603,28 @@ mod tests {
// Schedule index upload. Check that it's queued
let TestSetup {
runtime,
entered_runtime: _entered_runtime,
harness,
tenant: _tenant,
timeline: _timeline,
tenant_ctx: _tenant_ctx,
remote_fs_dir,
client,
} = TestSetup::new("upload_scheduling").unwrap();
} = TestSetup::new("upload_scheduling").await.unwrap();
let timeline_path = harness.timeline_path(&TIMELINE_ID);
println!("workdir: {}", harness.conf.workdir.display());
let remote_timeline_dir =
remote_fs_dir.join(timeline_path.strip_prefix(&harness.conf.workdir)?);
remote_fs_dir.join(timeline_path.strip_prefix(&harness.conf.workdir).unwrap());
println!("remote_timeline_dir: {}", remote_timeline_dir.display());
let metadata = dummy_metadata(Lsn(0x10));
client.init_upload_queue_for_empty_remote(&metadata)?;
client
.init_upload_queue_for_empty_remote(&metadata)
.unwrap();
let generation = Generation::new(0xdeadbeef);
// Create a couple of dummy files, schedule upload for them
let layer_file_name_1: LayerFileName = "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap();
@@ -1472,26 +1633,32 @@ mod tests {
let content_1 = dummy_contents("foo");
let content_2 = dummy_contents("bar");
let content_3 = dummy_contents("baz");
std::fs::write(
timeline_path.join(layer_file_name_1.file_name()),
&content_1,
)?;
std::fs::write(
timeline_path.join(layer_file_name_2.file_name()),
&content_2,
)?;
std::fs::write(timeline_path.join(layer_file_name_3.file_name()), content_3)?;
client.schedule_layer_file_upload(
&layer_file_name_1,
&LayerFileMetadata::new(content_1.len() as u64),
)?;
client.schedule_layer_file_upload(
&layer_file_name_2,
&LayerFileMetadata::new(content_2.len() as u64),
)?;
for (filename, content) in [
(&layer_file_name_1, &content_1),
(&layer_file_name_2, &content_2),
(&layer_file_name_3, &content_3),
] {
std::fs::write(timeline_path.join(filename.file_name()), content).unwrap();
}
client
.schedule_layer_file_upload(
&layer_file_name_1,
&LayerFileMetadata::new(content_1.len() as u64, generation),
)
.unwrap();
client
.schedule_layer_file_upload(
&layer_file_name_2,
&LayerFileMetadata::new(content_2.len() as u64, generation),
)
.unwrap();
// Check that they are started immediately, not queued
//
// this works because we running within block_on, so any futures are now queued up until
// our next await point.
{
let mut guard = client.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut().unwrap();
@@ -1505,7 +1672,9 @@ mod tests {
// Schedule upload of index. Check that it is queued
let metadata = dummy_metadata(Lsn(0x20));
client.schedule_index_upload_for_metadata_update(&metadata)?;
client
.schedule_index_upload_for_metadata_update(&metadata)
.unwrap();
{
let mut guard = client.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut().unwrap();
@@ -1514,7 +1683,7 @@ mod tests {
}
// Wait for the uploads to finish
runtime.block_on(client.wait_completion())?;
client.wait_completion().await.unwrap();
{
let mut guard = client.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut().unwrap();
@@ -1524,29 +1693,34 @@ mod tests {
}
// Download back the index.json, and check that the list of files is correct
let index_part = match runtime.block_on(client.download_index_file())? {
let index_part = match client.download_index_file().await.unwrap() {
MaybeDeletedIndexPart::IndexPart(index_part) => index_part,
MaybeDeletedIndexPart::Deleted(_) => panic!("unexpectedly got deleted index part"),
};
assert_file_list(
&index_part.timeline_layers,
&index_part
.layer_metadata
.keys()
.map(|f| f.to_owned())
.collect(),
&[
&layer_file_name_1.file_name(),
&layer_file_name_2.file_name(),
],
);
let downloaded_metadata = index_part.parse_metadata()?;
assert_eq!(downloaded_metadata, metadata);
assert_eq!(index_part.metadata, metadata);
// Schedule upload and then a deletion. Check that the deletion is queued
let content_baz = dummy_contents("baz");
std::fs::write(timeline_path.join("baz"), &content_baz)?;
client.schedule_layer_file_upload(
&layer_file_name_3,
&LayerFileMetadata::new(content_baz.len() as u64),
)?;
client.schedule_layer_file_deletion(&[layer_file_name_1.clone()])?;
client
.schedule_layer_file_upload(
&layer_file_name_3,
&LayerFileMetadata::new(content_3.len() as u64, generation),
)
.unwrap();
client
.schedule_layer_file_deletion(&[layer_file_name_1.clone()])
.unwrap();
{
let mut guard = client.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut().unwrap();
@@ -1565,10 +1739,11 @@ mod tests {
"index_part.json",
],
&remote_timeline_dir,
generation,
);
// Finish them
runtime.block_on(client.wait_completion())?;
client.wait_completion().await.unwrap();
assert_remote_files(
&[
@@ -1577,24 +1752,26 @@ mod tests {
"index_part.json",
],
&remote_timeline_dir,
generation,
);
Ok(())
}
#[test]
fn bytes_unfinished_gauge_for_layer_file_uploads() -> anyhow::Result<()> {
#[tokio::test]
async fn bytes_unfinished_gauge_for_layer_file_uploads() {
// Setup
let TestSetup {
runtime,
harness,
tenant: _tenant,
timeline: _timeline,
client,
..
} = TestSetup::new("metrics")?;
} = TestSetup::new("metrics").await.unwrap();
let metadata = dummy_metadata(Lsn(0x10));
client.init_upload_queue_for_empty_remote(&metadata)?;
client
.init_upload_queue_for_empty_remote(&metadata)
.unwrap();
let timeline_path = harness.timeline_path(&TIMELINE_ID);
@@ -1603,7 +1780,8 @@ mod tests {
std::fs::write(
timeline_path.join(layer_file_name_1.file_name()),
&content_1,
)?;
)
.unwrap();
#[derive(Debug, PartialEq)]
struct BytesStartedFinished {
@@ -1627,16 +1805,20 @@ mod tests {
// Test
let generation = Generation::new(0xdeadbeef);
let init = get_bytes_started_stopped();
client.schedule_layer_file_upload(
&layer_file_name_1,
&LayerFileMetadata::new(content_1.len() as u64),
)?;
client
.schedule_layer_file_upload(
&layer_file_name_1,
&LayerFileMetadata::new(content_1.len() as u64, generation),
)
.unwrap();
let pre = get_bytes_started_stopped();
runtime.block_on(client.wait_completion())?;
client.wait_completion().await.unwrap();
let post = get_bytes_started_stopped();
@@ -1664,7 +1846,5 @@ mod tests {
finished: Some(content_1.len())
}
);
Ok(())
}
}

View File

@@ -5,25 +5,30 @@ use tracing::debug;
use remote_storage::GenericRemoteStorage;
use crate::config::PageServerConf;
use crate::{
config::PageServerConf,
tenant::{remote_timeline_client::remote_path, Generation},
};
pub(super) async fn delete_layer<'a>(
conf: &'static PageServerConf,
storage: &'a GenericRemoteStorage,
local_layer_path: &'a Path,
generation: Generation,
) -> anyhow::Result<()> {
fail::fail_point!("before-delete-layer", |_| {
anyhow::bail!("failpoint before-delete-layer")
});
debug!("Deleting layer from remote storage: {local_layer_path:?}",);
let path_to_delete = conf.remote_path(local_layer_path)?;
let path_to_delete = remote_path(conf, local_layer_path, generation)?;
// We don't want to print an error if the delete failed if the file has
// already been deleted. Thankfully, in this situation S3 already
// does not yield an error. While OS-provided local file system APIs do yield
// errors, we avoid them in the `LocalFs` wrapper.
storage.delete(&path_to_delete).await.with_context(|| {
format!("Failed to delete remote layer from storage at {path_to_delete:?}")
})
storage
.delete(&path_to_delete)
.await
.with_context(|| format!("delete remote layer from storage at {path_to_delete:?}"))
}

View File

@@ -11,23 +11,20 @@ use std::time::Duration;
use anyhow::{anyhow, Context};
use tokio::fs;
use tokio::io::AsyncWriteExt;
use tracing::{info, warn};
use tokio_util::sync::CancellationToken;
use utils::{backoff, crashsafe};
use crate::config::PageServerConf;
use crate::tenant::remote_timeline_client::{remote_layer_path, remote_timelines_path};
use crate::tenant::storage_layer::LayerFileName;
use crate::tenant::timeline::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::{exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS};
use crate::tenant::Generation;
use remote_storage::{DownloadError, GenericRemoteStorage};
use utils::crashsafe::path_with_suffix_extension;
use utils::id::{TenantId, TimelineId};
use super::index::{IndexPart, LayerFileMetadata};
use super::{FAILED_DOWNLOAD_RETRIES, FAILED_DOWNLOAD_WARN_THRESHOLD};
async fn fsync_path(path: impl AsRef<std::path::Path>) -> Result<(), std::io::Error> {
fs::File::open(path).await?.sync_all().await
}
use super::{remote_index_path, FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_REMOTE_OP_RETRIES};
static MAX_DOWNLOAD_DURATION: Duration = Duration::from_secs(120);
@@ -46,13 +43,11 @@ pub async fn download_layer_file<'a>(
) -> Result<u64, DownloadError> {
debug_assert_current_span_has_tenant_and_timeline_id();
let timeline_path = conf.timeline_path(&tenant_id, &timeline_id);
let local_path = conf
.timeline_path(&tenant_id, &timeline_id)
.join(layer_file_name.file_name());
let local_path = timeline_path.join(layer_file_name.file_name());
let remote_path = conf
.remote_path(&local_path)
.map_err(DownloadError::Other)?;
let remote_path = remote_layer_path(&tenant_id, &timeline_id, layer_file_name, layer_metadata);
// Perform a rename inspired by durable_rename from file_utils.c.
// The sequence:
@@ -69,33 +64,43 @@ pub async fn download_layer_file<'a>(
let (mut destination_file, bytes_amount) = download_retry(
|| async {
// TODO: this doesn't use the cached fd for some reason?
let mut destination_file = fs::File::create(&temp_file_path).await.with_context(|| {
format!(
"create a destination file for layer '{}'",
temp_file_path.display()
)
})
.map_err(DownloadError::Other)?;
let mut download = storage.download(&remote_path).await.with_context(|| {
format!(
let mut destination_file = fs::File::create(&temp_file_path)
.await
.with_context(|| {
format!(
"create a destination file for layer '{}'",
temp_file_path.display()
)
})
.map_err(DownloadError::Other)?;
let mut download = storage
.download(&remote_path)
.await
.with_context(|| {
format!(
"open a download stream for layer with remote storage path '{remote_path:?}'"
)
})
.map_err(DownloadError::Other)?;
let bytes_amount = tokio::time::timeout(MAX_DOWNLOAD_DURATION, tokio::io::copy(&mut download.download_stream, &mut destination_file))
.await
.map_err(|e| DownloadError::Other(anyhow::anyhow!("Timed out {:?}", e)))?
.with_context(|| {
format!("Failed to download layer with remote storage path '{remote_path:?}' into file {temp_file_path:?}")
})
.map_err(DownloadError::Other)?;
Ok((destination_file, bytes_amount))
let bytes_amount = tokio::time::timeout(
MAX_DOWNLOAD_DURATION,
tokio::io::copy(&mut download.download_stream, &mut destination_file),
)
.await
.map_err(|e| DownloadError::Other(anyhow::anyhow!("Timed out {:?}", e)))?
.with_context(|| {
format!(
"download layer at remote path '{remote_path:?}' into file {temp_file_path:?}"
)
})
.map_err(DownloadError::Other)?;
Ok((destination_file, bytes_amount))
},
&format!("download {remote_path:?}"),
).await?;
)
.await?;
// Tokio doc here: https://docs.rs/tokio/1.17.0/tokio/fs/struct.File.html states that:
// A file will not be closed immediately when it goes out of scope if there are any IO operations
@@ -108,12 +113,7 @@ pub async fn download_layer_file<'a>(
destination_file
.flush()
.await
.with_context(|| {
format!(
"failed to flush source file at {}",
temp_file_path.display()
)
})
.with_context(|| format!("flush source file at {}", temp_file_path.display()))
.map_err(DownloadError::Other)?;
let expected = layer_metadata.file_size();
@@ -144,17 +144,12 @@ pub async fn download_layer_file<'a>(
fs::rename(&temp_file_path, &local_path)
.await
.with_context(|| {
format!(
"Could not rename download layer file to {}",
local_path.display(),
)
})
.with_context(|| format!("rename download layer file to {}", local_path.display(),))
.map_err(DownloadError::Other)?;
fsync_path(&local_path)
crashsafe::fsync_async(&local_path)
.await
.with_context(|| format!("Could not fsync layer file {}", local_path.display(),))
.with_context(|| format!("fsync layer file {}", local_path.display(),))
.map_err(DownloadError::Other)?;
tracing::debug!("download complete: {}", local_path.display());
@@ -178,21 +173,19 @@ pub fn is_temp_download_file(path: &Path) -> bool {
}
/// List timelines of given tenant in remote storage
pub async fn list_remote_timelines<'a>(
storage: &'a GenericRemoteStorage,
conf: &'static PageServerConf,
pub async fn list_remote_timelines(
storage: &GenericRemoteStorage,
tenant_id: TenantId,
) -> anyhow::Result<HashSet<TimelineId>> {
let tenant_path = conf.timelines_path(&tenant_id);
let tenant_storage_path = conf.remote_path(&tenant_path)?;
let remote_path = remote_timelines_path(&tenant_id);
fail::fail_point!("storage-sync-list-remote-timelines", |_| {
anyhow::bail!("storage-sync-list-remote-timelines");
});
let timelines = download_retry(
|| storage.list_prefixes(Some(&tenant_storage_path)),
&format!("list prefixes for {tenant_path:?}"),
|| storage.list_prefixes(Some(&remote_path)),
&format!("list prefixes for {tenant_id}"),
)
.await?;
@@ -207,9 +200,9 @@ pub async fn list_remote_timelines<'a>(
anyhow::anyhow!("failed to get timeline id for remote tenant {tenant_id}")
})?;
let timeline_id: TimelineId = object_name.parse().with_context(|| {
format!("failed to parse object name into timeline id '{object_name}'")
})?;
let timeline_id: TimelineId = object_name
.parse()
.with_context(|| format!("parse object name into timeline id '{object_name}'"))?;
// list_prefixes is assumed to return unique names. Ensure this here.
// NB: it's safer to bail out than warn-log this because the pageserver
@@ -227,21 +220,16 @@ pub async fn list_remote_timelines<'a>(
}
pub(super) async fn download_index_part(
conf: &'static PageServerConf,
storage: &GenericRemoteStorage,
tenant_id: &TenantId,
timeline_id: &TimelineId,
generation: Generation,
) -> Result<IndexPart, DownloadError> {
let index_part_path = conf
.metadata_path(tenant_id, timeline_id)
.with_file_name(IndexPart::FILE_NAME);
let part_storage_path = conf
.remote_path(&index_part_path)
.map_err(DownloadError::BadInput)?;
let remote_path = remote_index_path(tenant_id, timeline_id, generation);
let index_part_bytes = download_retry(
|| async {
let mut index_part_download = storage.download(&part_storage_path).await?;
let mut index_part_download = storage.download(&remote_path).await?;
let mut index_part_bytes = Vec::new();
tokio::io::copy(
@@ -249,26 +237,21 @@ pub(super) async fn download_index_part(
&mut index_part_bytes,
)
.await
.with_context(|| {
format!("Failed to download an index part into file {index_part_path:?}")
})
.with_context(|| format!("download index part at {remote_path:?}"))
.map_err(DownloadError::Other)?;
Ok(index_part_bytes)
},
&format!("download {part_storage_path:?}"),
&format!("download {remote_path:?}"),
)
.await?;
let index_part: IndexPart = serde_json::from_slice(&index_part_bytes)
.with_context(|| {
format!("Failed to deserialize index part file into file {index_part_path:?}")
})
.with_context(|| format!("download index part file at {remote_path:?}"))
.map_err(DownloadError::Other)?;
Ok(index_part)
}
///
/// Helper function to handle retries for a download operation.
///
/// Remote operations can fail due to rate limits (IAM, S3), spurious network
@@ -276,47 +259,21 @@ pub(super) async fn download_index_part(
/// with backoff.
///
/// (See similar logic for uploads in `perform_upload_task`)
async fn download_retry<T, O, F>(mut op: O, description: &str) -> Result<T, DownloadError>
async fn download_retry<T, O, F>(op: O, description: &str) -> Result<T, DownloadError>
where
O: FnMut() -> F,
F: Future<Output = Result<T, DownloadError>>,
{
let mut attempts = 0;
loop {
let result = op().await;
match result {
Ok(_) => {
if attempts > 0 {
info!("{description} succeeded after {attempts} retries");
}
return result;
}
// These are "permanent" errors that should not be retried.
Err(DownloadError::BadInput(_)) | Err(DownloadError::NotFound) => {
return result;
}
// Assume that any other failure might be transient, and the operation might
// succeed if we just keep trying.
Err(DownloadError::Other(err)) if attempts < FAILED_DOWNLOAD_WARN_THRESHOLD => {
info!("{description} failed, will retry (attempt {attempts}): {err:#}");
}
Err(DownloadError::Other(err)) if attempts < FAILED_DOWNLOAD_RETRIES => {
warn!("{description} failed, will retry (attempt {attempts}): {err:#}");
}
Err(DownloadError::Other(ref err)) => {
// Operation failed FAILED_DOWNLOAD_RETRIES times. Time to give up.
warn!("{description} still failed after {attempts} retries, giving up: {err:?}");
return result;
}
}
// sleep and retry
exponential_backoff(
attempts,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
)
.await;
attempts += 1;
}
backoff::retry(
op,
|e| matches!(e, DownloadError::BadInput(_) | DownloadError::NotFound),
FAILED_DOWNLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
description,
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || -> DownloadError {
unreachable!()
}),
)
.await
}

Some files were not shown because too many files have changed in this diff Show More