Compare commits

..

48 Commits

Author SHA1 Message Date
Arpad Müller
d2314afaf6 Make persist_tenant_config async fn 2023-09-01 14:29:06 +02:00
Arpad Müller
0aed765c6d Make tenant_map_insert and its closure async fn 2023-09-01 14:27:27 +02:00
Arpad Müller
aa4763bed1 Use write_and_fsync in persist_tenant_config 2023-09-01 14:18:34 +02:00
Arpad Müller
a13b0cc86b Make read_at async fn 2023-09-01 13:52:21 +02:00
Arpad Müller
167342db13 Make read_exact_at async fn 2023-09-01 13:52:21 +02:00
Arpad Müller
dc6668d4cd Add MaybeVirtualFile and use it in tests 2023-09-01 13:52:21 +02:00
Arpad Müller
df28e7afa0 Use write_and_fsync in save_metadata 2023-09-01 13:51:33 +02:00
Arpad Müller
9cffba255c Make save_metadata async fn 2023-09-01 13:51:33 +02:00
Arpad Müller
8ef27a3bf8 Add write_and_fsync function 2023-09-01 13:51:33 +02:00
Arpad Müller
afbebce051 Remove bounds 2023-09-01 13:51:33 +02:00
Arpad Müller
486a7f6d63 Make write_all_at async 2023-09-01 13:51:33 +02:00
Christian Schwarz
3413937eb2 FileBlockReader::fill_buffer make it obvious that we need to switch to async API 2023-09-01 13:51:33 +02:00
Arpad Müller
67f7e6f192 Remove Read impl that was only used in one place 2023-09-01 13:51:33 +02:00
Arpad Müller
69d046fd7e Move used FileExt functions to inherent impls 2023-09-01 13:51:33 +02:00
Christian Schwarz
353ebefefc FileBlockReader<File> is never used 2023-09-01 13:51:33 +02:00
Arpad Müller
17b3f4bc7d Move VirtualFile::seek to inherent function 2023-09-01 13:51:33 +02:00
John Spray
715077ab5b tests: broaden a log allow regex in test_ignored_tenant_stays_broken_without_metadata (#5168)
## Problem

- https://github.com/neondatabase/neon/issues/5167

## Summary of changes

Accept "will not become active" log line with _either_ Broken or
Stopping state, because we may hit it while in the process of doing the
`/ignore` (earlier in the test than the test expects to see the same
line with Broken)
2023-09-01 08:36:38 +01:00
John Spray
616e7046c7 s3_scrubber: import into the main neon repository (#5141)
## Problem

The S3 scrubber currently lives at
https://github.com/neondatabase/s3-scrubber

We don't have tests that use it, and it has copies of some data
structures that can get stale.

## Summary of changes

- Import the s3-scrubber as `s3_scrubber/
- Replace copied_definitions/ in the scrubber with direct access to the
`utils` and `pageserver` crates
- Modify visibility of a few definitions in `pageserver` to allow the
scrubber to use them
- Update scrubber code for recent changes to `IndexPart`
- Update `KNOWN_VERSIONS` for IndexPart and move the definition into
index.rs so that it is easier to keep up to date

As a future refinement, it would be good to pull the remote persistence
types (like IndexPart) out of `pageserver` into a separate library so
that the scrubber doesn't have to link against the whole pageserver, and
so that it's clearer which types need to be public.

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-08-31 19:01:39 +01:00
Conrad Ludgate
1b916a105a proxy: locked is not retriable (#5162)
## Problem

Management service returns Locked when quotas are exhausted. We cannot
retry on those

## Summary of changes

Makes Locked status unretriable
2023-08-31 15:50:15 +03:00
Conrad Ludgate
d11621d904 Proxy: proxy protocol v2 (#5028)
## Problem

We need to log the client IP, not the IP of the NLB.

## Summary of changes

Parse the proxy [protocol version
2](https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt) if
possible
2023-08-31 14:30:25 +03:00
John Spray
43bb8bfdbb pageserver: fix flake in test_timeline_deletion_with_files_stuck_in_upload_queue (#5149)
## Problem

Test failing on a different ERROR log than it anticipated.

Closes: https://github.com/neondatabase/neon/issues/5148

## Summary of changes

Add the "could not flush frozen layer" error log to the permitted
errors.
2023-08-31 10:42:32 +01:00
John Spray
300a5aa05e pageserver: fix test v4_indexpart_is_parsed (#5157)
## Problem

Two recent PRs raced:
- https://github.com/neondatabase/neon/pull/5153
- https://github.com/neondatabase/neon/pull/5140

## Summary of changes

Add missing `generation` argument to IndexLayerMetadata construction
2023-08-31 10:40:46 +01:00
Nikita Kalyanov
b9c111962f pass JWT to management API (#5151)
support authentication with JWT from env for proxy calls to mgmt API
2023-08-31 12:23:51 +03:00
John Spray
83ae2bd82c pageserver: generation number support in keys and indices (#5140)
## Problem

To implement split brain protection, we need tenants and timelines to be
aware of their current generation, and use it when composing S3 keys.


## Summary of changes

- A `Generation` type is introduced in the `utils` crate -- it is in
this broadly-visible location because it will later be used from
`control_plane/` as well as `pageserver/`. Generations can be a number,
None, or Broken, to support legacy content (None), and Tenants in the
broken state (Broken).
- Tenant, Timeline, and RemoteTimelineClient all get a generation
attribute
- IndexPart's IndexLayerMetadata has a new `generation` attribute.
Legacy layers' metadata will deserialize to Generation::none().
- Remote paths are composed with a trailing generation suffix. If a
generation is equal to Generation::none() (as it currently always is),
then this suffix is an empty string.
- Functions for composing remote storage paths added in
remote_timeline_client: these avoid the way that we currently always
compose a local path and then strip the prefix, and avoid requiring a
PageserverConf reference on functions that want to create remote paths
(the conf is only needed for local paths). These are less DRY than the
old functions, but remote storage paths are a very rarely changing
thing, so it's better to write out our paths clearly in the functions
than to compose timeline paths from tenant paths, etc.
- Code paths that construct a Tenant take a `generation` argument in
anticipation that we will soon load generations on startup before
constructing Tenant.

Until the whole feature is done, we don't want any generation-ful keys
though: so initially we will carry this everywhere with the special
Generation::none() value.

Closes: https://github.com/neondatabase/neon/issues/5135

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-08-31 09:19:34 +01:00
Alexey Kondratov
f2c21447ce [compute_ctl] Create check availability data during full configuration (#5084)
I've moved it to the API handler in the 589cf1ed2 to do not delay
compute start. Yet, we now skip full configuration and catalog updates
in the most hot path -- waking up suspended compute, and only do it at:

- first start
- start with applying new configuration
- start for availability check

so it doesn't really matter anymore.

The problem with creating the table and test record in the API handler
is that someone can fill up timeline till the logical limit. Then it's
suspended and availability check is scheduled, so it fails.

If table + test row are created at the very beginning, we reserve a 8 KB
page for future checks, which theoretically will last almost forever.
For example, my ~1y old branch still has 8 KB sized test table:
```sql
cloud_admin@postgres=# select pg_relation_size('health_check');
 pg_relation_size
------------------
             8192
(1 row)
```

---------

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
2023-08-30 17:44:28 +02:00
Conrad Ludgate
93dcdb293a proxy: password hack hack (#5126)
## Problem

fixes #4881 

## Summary of changes
2023-08-30 16:20:27 +01:00
John Spray
a93274b389 pageserver: remove vestigial timeline_layers attribute (#5153)
## Problem

`timeline_layers` was write-only since
b95addddd5

We deployed the version that no longer requires it for deserializing, so
now we can stop including it when serializing.

## Summary of changes

Fully remove `timeline_layers`.
2023-08-30 16:14:04 +01:00
Anastasia Lubennikova
a7c0e4dcd0 Check if custiom extension is enabled.
This check was lost in the latest refactoring.

If extension is not present in 'public_extensions' or 'custom_extensions' don't download it
2023-08-30 17:47:06 +03:00
Conrad Ludgate
3b81e0c86d chore: remove webpki (#5069)
## Problem

webpki is unmaintained

Closes https://github.com/neondatabase/neon/security/dependabot/33

## Summary of changes

Update all dependents of webpki.
2023-08-30 15:14:03 +01:00
Anastasia Lubennikova
e5a397cf96 Form archive_path for remote extensions on the fly 2023-08-30 13:56:51 +03:00
Joonas Koivunen
05773708d3 fix: add context for ancestor lsn wait (#5143)
In logs it is confusing to see seqwait timeouts which seemingly arise
from the branched lsn but actually are about the ancestor, leading to
questions like "has the last_record_lsn went back".

Noticed by @problame.
2023-08-30 12:21:41 +03:00
John Spray
382473d9a5 docs: add RFC for remote storage generation numbers (#4919)
## Summary

A scheme of logical "generation numbers" for pageservers and their
attachments is proposed, along with
changes to the remote storage format to include these generation numbers
in S3 keys.

Using the control plane as the issuer of these generation numbers
enables strong anti-split-brain
properties in the pageserver cluster without implementing a consensus
mechanism directly
in the pageservers.

## Motivation

Currently, the pageserver's remote storage format does not provide a
mechanism for addressing
split brain conditions that may happen when replacing a node during
failover or when migrating
a tenant from one pageserver to another. From a remote storage
perspective, a split brain condition
occurs whenever two nodes both think they have the same tenant attached,
and both can write to S3. This
can happen in the case of a network partition, pathologically long
delays (e.g. suspended VM), or software
bugs.

This blocks robust implementation of failover from unresponsive
pageservers, due to the risk that
the unresponsive pageserver is still writing to S3.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-08-30 09:49:55 +01:00
Arpad Müller
eb0a698adc Make page cache and read_blk async (#5023)
## Problem

`read_blk` does I/O and thus we would like to make it async. We can't
make the function async as long as the `PageReadGuard` returned by
`read_blk` isn't `Send`. The page cache is called by `read_blk`, and
thus it can't be async without `read_blk` being async. Thus, we have a
circular dependency.

## Summary of changes

Due to the circular dependency, we convert both the page cache and
`read_blk` to async at the same time:

We make the page cache use `tokio::sync` synchronization primitives as
those are `Send`. This makes all the places that acquire a lock require
async though, which we then also do. This includes also asyncification
of the `read_blk` function.

Builds upon #4994, #5015, #5056, and #5129.

Part of #4743.
2023-08-30 09:04:31 +02:00
Arseny Sher
81b6578c44 Allow walsender in recovery mode give WAL till dynamic flush_lsn.
Instead of fixed during the start of replication. To this end, create
term_flush_lsn watch channel similar to commit_lsn one. This allows to continue
recovery streaming if new data appears.
2023-08-29 23:19:40 +03:00
Arseny Sher
bc49c73fee Move wal_stream_connection_config to utils.
It will be used by safekeeper as well.
2023-08-29 23:19:40 +03:00
Arseny Sher
e98580b092 Add term and http endpoint to broker messaged SkTimelineInfo.
We need them for safekeeper peer recovery
https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
804ef23043 Rename TermSwitchEntry to TermLsn.
Add derive Ord for easy comparison of <term, lsn> pairs.

part of https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
87f7d6bce3 Start and stop per timeline recovery task.
Slightly refactors init: now load_tenant_timelines is also async to properly
init the timeline, but to keep global map lock sync we just acquire it anew for
each timeline.

Recovery task itself is just a stub here.

part of
https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
39e3fbbeb0 Add safekeeper peers to TimelineInfo.
Now available under GET /tenant/xxx/timeline/yyy for inspection.
2023-08-29 23:19:40 +03:00
Em Sharnoff
8d2a4aa5f8 vm-monitor: Add flag for when file cache on disk (#5130)
Part 1 of 2, for moving the file cache onto disk.

Because VMs are created by the control plane (and that's where the
filesystem for the file cache is defined), we can't rely on any kind of
synchronization between releases, so the change needs to be
feature-gated (kind of), with the default remaining the same for now.

See also: neondatabase/cloud#6593
2023-08-29 12:44:48 -07:00
Joonas Koivunen
d1fcdf75b3 test: enhanced logging for curious mock_s3 (#5134)
Possible flakyness with mock_s3. Add logging in hopes this will happen
again.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-08-29 14:48:50 +03:00
Alexander Bayandin
7e39a96441 scripts/flaky_tests.py: Improve flaky tests detection (#5094)
## Problem

We still need to rerun some builds manually because flaky tests weren't
detected automatically.
I found two reasons for it:
- If a test is flaky on a particular build type, on a particular
Postgres version, there's a high chance that this test is flaky on all
configurations, but we don't automatically detect such cases.
- We detect flaky tests only on the main branch, which requires manual
retrigger runs for freshly made flaky tests.
Both of them are fixed in the PR.

## Summary of changes
- Spread flakiness of a single test to all configurations
- Detect flaky tests in all branches (not only in the main)
- Look back only at  7 days of test history (instead of 10)
2023-08-29 11:53:24 +01:00
Vadim Kharitonov
babefdd3f9 Upgrade pgvector to 0.5.0 (#5132) 2023-08-29 12:53:50 +03:00
Arpad Müller
805fee1483 page cache: small code cleanups (#5125)
## Problem

I saw these things while working on #5111.

## Summary of changes

* Add a comment explaining why we use `Vec::leak` instead of
`Vec::into_boxed_slice` plus `Box::leak`.
* Add another comment explaining what `valid` is doing, it wasn't very
clear before.
* Add a function `set_usage_count` to not set it directly.
2023-08-29 11:49:04 +03:00
Felix Prasanna
85d6d9dc85 monitor/compute_ctl: remove references to the informant (#5115)
Also added some docs to the monitor :)

Co-authored-by: Em Sharnoff <sharnoff@neon.tech>
2023-08-29 02:59:27 +03:00
Em Sharnoff
e40ee7c3d1 remove unused file 'vm-cgconfig.conf' (#5127)
Honestly no clue why it's still here, should have been removed ages ago.
This is handled by vm-builder now.
2023-08-28 13:04:57 -07:00
Christian Schwarz
0fe3b3646a page cache: don't proactively evict EphemeralFile pages (#5129)
Before this patch, when dropping an EphemeralFile, we'd scan the entire
`slots` to proactively evict its pages (`drop_buffers_for_immutable`).

This was _necessary_ before #4994 because the page cache was a
write-back cache: we'd be deleting the EphemeralFile from disk after,
so, if we hadn't evicted its pages before that, write-back in
`find_victim` wouldhave failed.

But, since #4994, the page cache is a read-only cache, so, it's safe
to keep read-only data cached. It's never going to get accessed again
and eventually, `find_victim` will evict it.

The only remaining advantage of `drop_buffers_for_immutable` over
relying on `find_victim` is that `find_victim` has to do the clock
page replacement iterations until the count reaches 0,
whereas `drop_buffers_for_immutable` can kick the page out right away.

However, weigh that against the cost of `drop_buffers_for_immutable`,
which currently scans the entire `slots` array to find the
EphemeralFile's pages.

Alternatives have been proposed in #5122 and #5128, but, they come
with their own overheads & trade-offs.

Also, the real reason why we're looking into this piece of code is
that we want to make the slots rwlock async in #5023.
Since `drop_buffers_for_immutable` is called from drop, and there
is no async drop, it would be nice to not have to deal with this.

So, let's just stop doing `drop_buffers_for_immutable` and observe
the performance impact in benchmarks.
2023-08-28 20:42:18 +02:00
Em Sharnoff
529f8b5016 compute_ctl: Fix switched vm-monitor args (#5117)
Small switcheroo from #4946.
2023-08-28 14:55:41 +02:00
109 changed files with 8192 additions and 3617 deletions

View File

@@ -14,6 +14,7 @@
!pgxn/
!proxy/
!safekeeper/
!s3_scrubber/
!storage_broker/
!trace/
!vendor/postgres-v14/

View File

@@ -145,7 +145,11 @@ runs:
if [ "${RERUN_FLAKY}" == "true" ]; then
mkdir -p $TEST_OUTPUT
poetry run ./scripts/flaky_tests.py "${TEST_RESULT_CONNSTR}" --days 10 --output "$TEST_OUTPUT/flaky.json"
poetry run ./scripts/flaky_tests.py "${TEST_RESULT_CONNSTR}" \
--days 7 \
--output "$TEST_OUTPUT/flaky.json" \
--pg-version "${DEFAULT_PG_VERSION}" \
--build-type "${BUILD_TYPE}"
EXTRA_PARAMS="--flaky-tests-json $TEST_OUTPUT/flaky.json $EXTRA_PARAMS"
fi

508
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -7,6 +7,7 @@ members = [
"proxy",
"safekeeper",
"storage_broker",
"s3_scrubber",
"workspace_hack",
"trace",
"libs/compute_api",
@@ -37,11 +38,11 @@ async-compression = { version = "0.4.0", features = ["tokio", "gzip"] }
flate2 = "1.0.26"
async-stream = "0.3"
async-trait = "0.1"
aws-config = { version = "0.55", default-features = false, features=["rustls"] }
aws-sdk-s3 = "0.27"
aws-smithy-http = "0.55"
aws-credential-types = "0.55"
aws-types = "0.55"
aws-config = { version = "0.56", default-features = false, features=["rustls"] }
aws-sdk-s3 = "0.29"
aws-smithy-http = "0.56"
aws-credential-types = "0.56"
aws-types = "0.56"
axum = { version = "0.6.20", features = ["ws"] }
base64 = "0.13.0"
bincode = "1.3"
@@ -105,12 +106,12 @@ reqwest-middleware = "0.2.0"
reqwest-retry = "0.2.2"
routerify = "3"
rpds = "0.13"
rustls = "0.20"
rustls = "0.21"
rustls-pemfile = "1"
rustls-split = "0.3"
scopeguard = "1.1"
sysinfo = "0.29.2"
sentry = { version = "0.30", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
sentry = { version = "0.31", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_with = "2.0"
@@ -125,11 +126,11 @@ sync_wrapper = "0.1.2"
tar = "0.4"
test-context = "0.1"
thiserror = "1.0"
tls-listener = { version = "0.6", features = ["rustls", "hyper-h1"] }
tls-listener = { version = "0.7", features = ["rustls", "hyper-h1"] }
tokio = { version = "1.17", features = ["macros"] }
tokio-io-timeout = "1.2.0"
tokio-postgres-rustls = "0.9.0"
tokio-rustls = "0.23"
tokio-postgres-rustls = "0.10.0"
tokio-rustls = "0.24"
tokio-stream = "0.1"
tokio-tar = "0.3"
tokio-util = { version = "0.7", features = ["io"] }
@@ -143,7 +144,7 @@ tracing-subscriber = { version = "0.3", default_features = false, features = ["s
url = "2.2"
uuid = { version = "1.2", features = ["v4", "serde"] }
walkdir = "2.3.2"
webpki-roots = "0.23"
webpki-roots = "0.25"
x509-parser = "0.15"
## TODO replace this with tracing
@@ -182,8 +183,8 @@ workspace_hack = { version = "0.1", path = "./workspace_hack/" }
## Build dependencies
criterion = "0.5.1"
rcgen = "0.10"
rstest = "0.17"
rcgen = "0.11"
rstest = "0.18"
tempfile = "3.4"
tonic-build = "0.9"

View File

@@ -211,8 +211,8 @@ RUN wget https://github.com/df7cb/postgresql-unit/archive/refs/tags/7.7.tar.gz -
FROM build-deps AS vector-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.4.4.tar.gz -O pgvector.tar.gz && \
echo "1cb70a63f8928e396474796c22a20be9f7285a8a013009deb8152445b61b72e6 pgvector.tar.gz" | sha256sum --check && \
RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.5.0.tar.gz -O pgvector.tar.gz && \
echo "d8aa3504b215467ca528525a6de12c3f85f9891b091ce0e5864dd8a9b757f77b pgvector.tar.gz" | sha256sum --check && \
mkdir pgvector-src && cd pgvector-src && tar xvzf ../pgvector.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \

View File

@@ -19,9 +19,10 @@ Also `compute_ctl` spawns two separate service threads:
- `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
last activity requests.
If the `vm-informant` binary is present at `/bin/vm-informant`, it will also be started. For VM
compute nodes, `vm-informant` communicates with the VM autoscaling system. It coordinates
downscaling and (eventually) will request immediate upscaling under resource pressure.
If `AUTOSCALING` environment variable is set, `compute_ctl` will start the
`vm-monitor` located in [`neon/libs/vm_monitor`]. For VM compute nodes,
`vm-monitor` communicates with the VM autoscaling system. It coordinates
downscaling and requests immediate upscaling under resource pressure.
Usage example:
```sh

View File

@@ -20,9 +20,10 @@
//! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
//! last activity requests.
//!
//! If the `vm-informant` binary is present at `/bin/vm-informant`, it will also be started. For VM
//! compute nodes, `vm-informant` communicates with the VM autoscaling system. It coordinates
//! downscaling and (eventually) will request immediate upscaling under resource pressure.
//! If `AUTOSCALING` environment variable is set, `compute_ctl` will start the
//! `vm-monitor` located in [`neon/libs/vm_monitor`]. For VM compute nodes,
//! `vm-monitor` communicates with the VM autoscaling system. It coordinates
//! downscaling and requests immediate upscaling under resource pressure.
//!
//! Usage example:
//! ```sh
@@ -278,8 +279,9 @@ fn main() -> Result<()> {
use tokio_util::sync::CancellationToken;
use tracing::warn;
let vm_monitor_addr = matches.get_one::<String>("vm-monitor-addr");
let cgroup = matches.get_one::<String>("filecache-connstr");
let file_cache_connstr = matches.get_one::<String>("cgroup");
let file_cache_connstr = matches.get_one::<String>("filecache-connstr");
let cgroup = matches.get_one::<String>("cgroup");
let file_cache_on_disk = matches.get_flag("file-cache-on-disk");
// Only make a runtime if we need to.
// Note: it seems like you can make a runtime in an inner scope and
@@ -312,6 +314,7 @@ fn main() -> Result<()> {
cgroup: cgroup.cloned(),
pgconnstr: file_cache_connstr.cloned(),
addr: vm_monitor_addr.cloned().unwrap(),
file_cache_on_disk,
})),
token.clone(),
))
@@ -482,6 +485,11 @@ fn cli() -> clap::Command {
)
.value_name("FILECACHE_CONNSTR"),
)
.arg(
Arg::new("file-cache-on-disk")
.long("file-cache-on-disk")
.action(clap::ArgAction::SetTrue),
)
}
#[test]

View File

@@ -1,12 +1,39 @@
use anyhow::{anyhow, Result};
use anyhow::{anyhow, Ok, Result};
use postgres::Client;
use tokio_postgres::NoTls;
use tracing::{error, instrument};
use crate::compute::ComputeNode;
/// Create a special service table for availability checks
/// only if it does not exist already.
pub fn create_availability_check_data(client: &mut Client) -> Result<()> {
let query = "
DO $$
BEGIN
IF NOT EXISTS(
SELECT 1
FROM pg_catalog.pg_tables
WHERE tablename = 'health_check'
)
THEN
CREATE TABLE health_check (
id serial primary key,
updated_at timestamptz default now()
);
INSERT INTO health_check VALUES (1, now())
ON CONFLICT (id) DO UPDATE
SET updated_at = now();
END IF;
END
$$;";
client.execute(query, &[])?;
Ok(())
}
/// Update timestamp in a row in a special service table to check
/// that we can actually write some data in this particular timeline.
/// Create table if it's missing.
#[instrument(skip_all)]
pub async fn check_writability(compute: &ComputeNode) -> Result<()> {
// Connect to the database.
@@ -24,19 +51,15 @@ pub async fn check_writability(compute: &ComputeNode) -> Result<()> {
});
let query = "
CREATE TABLE IF NOT EXISTS health_check (
id serial primary key,
updated_at timestamptz default now()
);
INSERT INTO health_check VALUES (1, now())
ON CONFLICT (id) DO UPDATE
SET updated_at = now();";
let result = client.simple_query(query).await?;
if result.len() != 2 {
if result.len() != 1 {
return Err(anyhow::format_err!(
"expected 2 query results, but got {}",
"expected 1 query result, but got {}",
result.len()
));
}

View File

@@ -27,6 +27,7 @@ use utils::measured_stream::MeasuredReader;
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
use crate::checker::create_availability_check_data;
use crate::pg_helpers::*;
use crate::spec::*;
use crate::sync_sk::{check_if_synced, ping_safekeeper};
@@ -696,6 +697,7 @@ impl ComputeNode {
handle_role_deletions(spec, self.connstr.as_str(), &mut client)?;
handle_grants(spec, self.connstr.as_str())?;
handle_extensions(spec, &mut client)?;
create_availability_check_data(&mut client)?;
// 'Close' connection
drop(client);
@@ -1078,7 +1080,8 @@ LIMIT 100",
let mut download_tasks = Vec::new();
for library in &libs_vec {
let (ext_name, ext_path) = remote_extensions.get_ext(library, true)?;
let (ext_name, ext_path) =
remote_extensions.get_ext(library, true, &self.build_tag, &self.pgversion)?;
download_tasks.push(self.download_extension(ext_name, ext_path));
}
let results = join_all(download_tasks).await;

View File

@@ -180,7 +180,19 @@ pub async fn download_extension(
// Create extension control files from spec
pub fn create_control_files(remote_extensions: &RemoteExtSpec, pgbin: &str) {
let local_sharedir = Path::new(&get_pg_config("--sharedir", pgbin)).join("extension");
for ext_data in remote_extensions.extension_data.values() {
for (ext_name, ext_data) in remote_extensions.extension_data.iter() {
// Check if extension is present in public or custom.
// If not, then it is not allowed to be used by this compute.
if let Some(public_extensions) = &remote_extensions.public_extensions {
if !public_extensions.contains(ext_name) {
if let Some(custom_extensions) = &remote_extensions.custom_extensions {
if !custom_extensions.contains(ext_name) {
continue; // skip this extension, it is not allowed
}
}
}
}
for (control_name, control_content) in &ext_data.control_data {
let control_path = local_sharedir.join(control_name);
if !control_path.exists() {

View File

@@ -169,7 +169,12 @@ async fn routes(req: Request<Body>, compute: &Arc<ComputeNode>) -> Response<Body
}
};
remote_extensions.get_ext(&filename, is_library)
remote_extensions.get_ext(
&filename,
is_library,
&compute.build_tag,
&compute.pgversion,
)
};
match ext {

View File

@@ -138,7 +138,13 @@ impl ComputeControlPlane {
mode,
tenant_id,
pg_version,
skip_pg_catalog_updates: false,
// We don't setup roles and databases in the spec locally, so we don't need to
// do catalog updates. Catalog updates also include check availability
// data creation. Yet, we have tests that check that size and db dump
// before and after start are the same. So, skip catalog updates,
// with this we basically test a case of waking up an idle compute, where
// we also skip catalog updates in the cloud.
skip_pg_catalog_updates: true,
});
ep.create_endpoint_dir()?;
@@ -152,7 +158,7 @@ impl ComputeControlPlane {
http_port,
pg_port,
pg_version,
skip_pg_catalog_updates: false,
skip_pg_catalog_updates: true,
})?,
)?;
std::fs::write(

View File

@@ -0,0 +1,957 @@
# Pageserver: split-brain safety for remote storage through generation numbers
## Summary
A scheme of logical "generation numbers" for tenant attachment to pageservers is proposed, along with
changes to the remote storage format to include these generation numbers in S3 keys.
Using the control plane as the issuer of these generation numbers enables strong anti-split-brain
properties in the pageserver cluster without implementing a consensus mechanism directly
in the pageservers.
## Motivation
Currently, the pageserver's remote storage format does not provide a mechanism for addressing
split brain conditions that may happen when replacing a node or when migrating
a tenant from one pageserver to another.
From a remote storage perspective, a split brain condition occurs whenever two nodes both think
they have the same tenant attached, and both can write to S3. This can happen in the case of a
network partition, pathologically long delays (e.g. suspended VM), or software bugs.
In the current deployment model, control plane guarantees that a tenant is attached to one
pageserver at a time, thereby ruling out split-brain conditions resulting from dual
attachment (however, there is always the risk of a control plane bug). This control
plane guarantee prevents robust response to failures, as if a pageserver is unresponsive
we may not detach from it. The mechanism in this RFC fixes this, by making it safe to
attach to a new, different pageserver even if an unresponsive pageserver may be running.
Futher, lack of safety during split-brain conditions blocks two important features where occasional
split-brain conditions are part of the design assumptions:
- seamless tenant migration ([RFC PR](https://github.com/neondatabase/neon/pull/5029))
- automatic pageserver instance failure handling (aka "failover") (RFC TBD)
### Prior art
- 020-pageserver-s3-coordination.md
- 023-the-state-of-pageserver-tenant-relocation.md
- 026-pageserver-s3-mvcc.md
This RFC has broad similarities to the proposal to implement a MVCC scheme in
S3 object names, but this RFC avoids a general purpose transaction scheme in
favour of more specialized "generations" that work like a transaction ID that
always has the same lifetime as a pageserver process or tenant attachment, whichever
is shorter.
## Requirements
- Accommodate storage backends with no atomic or fencing capability (i.e. work within
S3's limitation that there are no atomics and clients can't be fenced)
- Don't depend on any STONITH or node fencing in the compute layer (i.e. we will not
assume that we can reliably kill and EC2 instance and have it die)
- Scoped per-tenant, not per-pageserver; for _seamless tenant migration_, we need
per-tenant granularity, and for _failover_, we likely want to spread the workload
of the failed pageserver instance to a number of peers, rather than monolithically
moving the entire workload to another machine.
We do not rule out the latter case, but should not constrain ourselves to it.
## Design Tenets
These are not requirements, but are ideas that guide the following design:
- Avoid implementing another consensus system: we already have a strongly consistent
database in the control plane that can do atomic operations where needed, and we also
have a Paxos implementation in the safekeeper.
- Avoiding locking in to specific models of how failover will work (e.g. do not assume that
all the tenants on a pageserver will fail over as a unit).
- Be strictly correct when it comes to data integrity. Occasional failures of availability
are tolerable, occasional data loss is not.
## Non Goals
The changes in this RFC intentionally isolate the design decision of how to define
logical generations numbers and object storage format in a way that is somewhat flexible with
respect to how actual orchestration of failover works.
This RFC intentionally does not cover:
- Failure detection
- Orchestration of failover
- Standby modes to keep data ready for fast migration
- Intentional multi-writer operation on tenants (multi-writer scenarios are assumed to be transient split-brain situations).
- Sharding.
The interaction between this RFC and those features is discussed in [Appendix B](#appendix-b-interoperability-with-other-features)
## Impacted Components
pageserver, control plane, safekeeper (optional)
## Implementation Part 1: Correctness
### Summary
- A per-tenant **generation number** is introduced to uniquely identifying tenant attachments to pageserver processes.
- This generation number increments each time the control plane modifies a tenant (`Project`)'s assigned pageserver, or when the assigned pageserver restarts.
- the control plane is the authority for generation numbers: only it may
increment a generation number.
- **Object keys are suffixed** with the generation number
- **Safety for multiply-attached tenants** is provided by the
generation number in the object key: the competing pageservers will not
try to write to the same keys.
- **Safety in split brain for multiple nodes running with
the same node ID** is provided by the pageserver calling out to the control plane
on startup, to re-attach and thereby increment the generations of any attached tenants
- **Safety for deletions** is achieved by deferring the DELETE from S3 to a point in time where the deleting node has validated with control plane that no attachment with a higher generation has a reference to the to-be-DELETEd key.
- **The control plane is used to issue generation numbers** to avoid the need for
a built-in consensus system in the pageserver, although this could in principle
be changed without changing the storage format.
### Generation numbers
A generation number is associated with each tenant in the control plane,
and each time the attachment status of the tenant changes, this is incremented.
Changes in attachment status include:
- Attaching the tenant to a different pageserver
- A pageserver restarting, and "re-attaching" its tenants on startup
These increments of attachment generation provide invariants we need to avoid
split-brain issues in storage:
- If two pageservers have the same tenant attached, the attachments are guaranteed to have different generation numbers, because the generation would increment
while attaching the second one.
- If there are multiple pageservers running with the same node ID, all the attachments on all pageservers are guaranteed to have different generation numbers, because the generation would increment
when the second node started and re-attached its tenants.
As long as the infrastructure does not transparently replace an underlying
physical machine, we are totally safe. See the later [unsafe case](#unsafe-case-on-badly-behaved-infrastructure) section for details.
### Object Key Changes
#### Generation suffix
All object keys (layer objects and index objects) will contain the attachment
generation as a [suffix](#why-a-generation-suffix-rather-than-prefix).
This suffix is the primary mechanism for protecting against split-brain situations, and
enabling safe multi-attachment of tenants:
- Two pageservers running with the same node ID (e.g. after a failure, where there is
some rogue pageserver still running) will not try to write to the same objects, because at startup they will have re-attached tenants and thereby incremented
generation numbers.
- Multiple attachments (to different pageservers) of the same tenant will not try to write to the same objects, as each attachment would have a distinct generation.
The generation is appended in hex format (8 byte string representing
u32), to all our existing key names. A u32's range limit would permit
27 restarts _per second_ over a 5 year system lifetime: orders of magnitude more than
is realistic.
The exact meaning of the generation suffix can evolve over time if necessary, for
example if we chose to implement a failover mechanism internally to the pageservers
rather than going via the control plane. The storage format just sees it as a number,
with the only semantic property being that the highest numbered index is the latest.
#### Index changes
Since object keys now include a generation suffix, the index of these keys must also be updated. IndexPart currently stores keys and LSNs sufficient to reconstruct key names: this would be extended to store the generation as well.
This will increase the size of the file, but only modestly: layers are already encoded as
their string-ized form, so the overhead is about 10 bytes per layer. This will be less if/when
the index storage format is migrated to a binary format from JSON.
#### Visibility
_This section doesn't describe code changes, but extends on the consequences of the
object key changes given above_
##### Visibility of objects to pageservers
Pageservers can of course list objects in S3 at any time, but in practice their
visible set is based on the contents of their LayerMap, which is initialized
from the `index_part.json.???` that they load.
Starting with the `index_part` from the most recent previous generation
(see [loading index_part](#finding-the-remote-indices-for-timelines)), a pageserver
initially has visibility of all the objects that were referenced in the loaded index.
These objects are guaranteed to remain visible until the current generation is
superseded, via pageservers in older generations avoiding deletions (see [deletion](#deletion)).
The "most recent previous generation" is _not_ necessarily the most recent
in terms of walltime, it is the one that is readable at the time a new generation
starts. Consider the following sequence of a tenant being re-attached to different
pageserver nodes:
- Create + attach on PS1 in generation 1
- PS1 Do some work, write out index_part.json-0001
- Attach to PS2 in generation 2
- Read index_part.json-0001
- PS2 starts doing some work...
- Attach to PS3 in generation 3
- Read index_part.json-0001
- **...PS2 finishes its work: now it writes index_part.json-0002**
- PS3 writes out index_part.json-0003
In the above sequence, the ancestry of indices is:
```
0001 -> 0002
|
-> 0003
```
This is not an issue for safety: if the 0002 references some object that is
not in 0001, then 0003 simply does not see it, and will re-do whatever
work was required (e.g. ingesting WAL or doing compaction). Objects referenced
by only the 0002 index will never be read by future attachment generations, and
will eventually be cleaned up by a scrub (see [scrubbing](#cleaning-up-orphan-objects-scrubbing)).
##### Visibility of LSNs to clients
Because index_part.json is now written with a generation suffix, which data
is visible depends on which generation the reader is operating in:
- If one was passively reading from S3 from outside of a pageserver, the
visibility of data would depend on which index_part.json-<generation> file
one had chosen to read from.
- If two pageservers have the same tenant attached, they may have different
data visible as they're independently replaying the WAL, and maintaining
independent LayerMaps that are written to independent index_part.json files.
Data does not have to be remotely committed to be visible.
- For a pageserver writing with a stale generation, historic LSNs
remain readable until another pageserver (with a higher generation suffix)
decides to execute GC deletions. At this point, we may think of the stale
attachment's generation as having logically ended: during its existence
the generation had a consistent view of the world.
- For a newly attached pageserver, its highest visible LSN may appears to
go backwards with respect to an earlier attachment, if that earlier
attachment had not uploaded all data to S3 before the new attachment.
### Deletion
#### Generation number validation
While writes are de-conflicted by writers always using their own generation number in the key,
deletions are slightly more challenging: if a pageserver A is isolated, and the true active node is
pageserver B, then it is dangerous for A to do any object deletions, even of objects that it wrote
itself, because pageserver's B metadata might reference those objects.
We solve this by inserting a "generation validation" step between the write of a remote index
that un-links a particular object from the index, and the actual deletion of the object, such
that deletions strictly obey the following ordering:
1. Write out index_part.json: this guarantees that any subsequent reader of the metadata will
not try and read the object we unlinked.
2. Call out to control plane to validate that the generation which we use for our attachment is still the latest.
3. If step 2 passes, it is safe to delete the object. Why? The check-in with control plane
together with our visibility rules guarantees that any later generation
will use either the exact `index_part.json` that we uploaded in step 1, or a successor
of it; not an earlier one. In both cases, the `index_part.json` doesn't reference the
key we are deleting anymore, so, the key is invisible to any later attachment generation.
Hence it's safe to delete it.
Note that at step 2 we are only confirming that deletions of objects _no longer referenced
by the specific `index_part.json` written in step 1_ are safe. If we were attempting other deletions concurrently,
these would need their own generation validation step.
If step 2 fails, we may leak the object. This is safe, but has a cost: see [scrubbing](#cleaning-up-orphan-objects-scrubbing). We may avoid this entirely outside of node
failures, if we do proper flushing of deletions on clean shutdown and clean migration.
To avoid doing a huge number of control plane requests to perform generation validation,
validation of many tenants will be done in a single request, and deletions will be queued up
prior to validation: see [Persistent deletion queue](#persistent-deletion-queue) for more.
#### `remote_consistent_lsn` updates
Remote objects are not the only kind of deletion the pageserver does: it also indirectly deletes
WAL data, by feeding back remote_consistent_lsn to safekeepers, as a signal to the safekeepers that
they may drop data below this LSN.
For the same reasons that deletion of objects must be guarded by an attachment generation number
validation step, updates to `remote_consistent_lsn` are subject to the same rules, using
an ordering as follows:
1. upload the index_part that covers data up to LSN `L0` to S3
2. Call out to control plane to validate that the generation which we use for our attachment is still the latest.
3. advance the `remote_consistent_lsn` that we advertise to the safekeepers to `L0`
If step 2 fails, then the `remote_consistent_lsn` advertised
to safekeepers will not advance again until a pageserver
with the latest generation is ready to do so.
**Note:** at step 3 we are not advertising the _latest_ remote_consistent_lsn, we are
advertising the value in the index_part that we uploaded in step 1. This provides
a strong ordering guarantee.
Internally to the pageserver, each timeline will have two remote_consistent_lsn values: the one that
reflects its latest write to remote storage, and the one that reflects the most
recent validation of generation number. It is only the latter value that may
be advertised to the outside world (i.e. to the safekeeper).
The control plane remains unaware of `remote_consistent_lsn`: it only has to validate
the freshness of generation numbers, thereby granting the pageserver permission to
share the information with the safekeeper.
For convenience, in subsequent sections and RFCs we will use "deletion" to mean both deletion
of objects in S3, and updates to the `remote_consistent_lsn`, as updates to the remote consistent
LSN are de-facto deletions done via the safekeeper, and both kinds of deletion are subject to
the same generation validation requirement.
### Pageserver attach/startup changes
#### Attachment
Calls to `/v1/tenant/{tenant_id}/attach` are augmented with an additional
`generation` field in the body.
The pageserver does not persist this: a generation is only good for the lifetime
of a process.
#### Finding the remote indices for timelines
Because index files are now suffixed with generation numbers, the pageserver
cannot always GET the remote index in one request, because it can't always
know a-priori what the latest remote index is.
Typically, the most recent generation to write an index would be our own
generation minus 1. However, this might not be the case: the previous
node might have started and acquired a generation number, and then crashed
before writing out a remote index.
In the general case and as a fallback, the pageserver may list all the `index_part.json`
files for a timeline, sort them by generation, and pick the highest that is `<=`
its current generation for this attachment. The tenant should never load an index
with an attachment generation _newer_ than its own.
These two rules combined ensure that objects written by later generations are never visible to earlier generations.
Note that if a given attachment picks an index part from an earlier generation (say n-2), but crashes & restarts before it writes its own generation's index part, next time it tries to pick an index part there may be an index part from generation n-1.
It would pick the n-1 index part in that case, because it's sorted higher than the previous one from generation n-2.
So, above rules guarantee no determinism in selecting the index part.
are allowed to be attached with stale attachment generations during a multiply-attached
phase in a migration, and in this instance if the old location's pageserver restarts,
it should not try and load the newer generation's index.
To summarize, on starting a timeline, the pageserver will:
1. Issue a GET for index_part.json-<my generation - 1>
2. If 1 failed, issue a ListObjectsv2 request for index_part.json\* and
pick the newest.
One could optimize this further by using the control plane to record specifically
which generation most recently wrote an index_part.json, if necessary, to increase
the probability of finding the index_part.json in one GET. One could also improve
the chances by having pageservers proactively write out index_part.json after they
get a new generation ID.
#### Re-attachment on startup
On startup, the pageserver will call out to an new control plane `/re-attach`
API (see [Generation API](#generation-api)). This returns a list of
tenants that should be attached to the pageserver, and their generation numbers, which
the control plane will increment before returning.
The pageserver should still scan its local disk on startup, but should _delete_
any local content for tenants not indicated in the `/re-attach` response: their
absence is an implicit detach operation.
**Note** if a tenant is omitted from the re-attach response, its local disk content
will be deleted. This will change in subsequent work, when the control plane gains
the concept of a secondary/standby location: a node with local content may revert
to this status and retain some local content.
#### Cleaning up previous generations' remote indices
Deletion of old indices is not necessary for correctness, although it is necessary
to avoid the ListObjects fallback in the previous section becoming ever more expensive.
Once the new attachment has written out its index_part.json, it may asynchronously clean up historic index_part.json
objects that were found.
We may choose to implement this deletion either as an explicit step after we
write out index_part for the first time in a pageserver's lifetime, or for
simplicity just do it periodically as part of the background scrub (see [scrubbing](#cleaning-up-orphan-objects-scrubbing));
### Control Plane Changes
#### Store generations for attaching tenants
- The `Project` table must store the generation number for use when
attaching the tenant to a new pageserver.
- The `/v1/tenant/:tenant_id/attach` pageserver API will require the generation number,
which the control plane can supply by simply incrementing the `Project`'s
generation number each time the tenant is attached to a different server: the same database
transaction that changes the assigned pageserver should also change the generation number.
#### Generation API
This section describes an API that could be provided directly by the control plane,
or built as a separate microservice. In earlier parts of the RFC, when we
discuss the control plane providing generation numbers, we are referring to this API.
The API endpoints used by the pageserver to acquire and validate generation
numbers are quite simple, and only require access to some persistent and
linerizable storage (such as a database).
Building this into the control plane is proposed as a least-effort option to exploit existing infrastructure and implement generation number issuance in the same transaction that mandates it (i.e., the transaction that updates the `Project` assignment to another pageserver).
However, this is not mandatory: this "Generation Number Issuer" could
be built as a microservice. In practice, we will write such a miniature service
anyway, to enable E2E pageserver/compute testing without control plane.
The endpoints required by pageservers are:
##### `/re-attach`
- Request: `{node_id: <u32>}`
- Response:
- 200 `{tenants: [{id: <TenantId>, gen: <u32>}]}`
- 404: unknown node_id
- (Future: 429: flapping detected, perhaps nodes are fighting for the same node ID,
or perhaps this node was in a retry loop)
- (On unknown tenants, omit tenant from `tenants` array)
- Server behavior: query database for which tenants should be attached to this pageserver.
- for each tenant that should be attached, increment the attachment generation and
include the new generation in the response
- Client behavior:
- for all tenants in the response, activate with the new generation number
- for any local disk content _not_ referenced in the response, act as if we
had been asked to detach it (i.e. delete local files)
**Note** the `node_id` in this request will change in future if we move to ephemeral
node IDs, to be replaced with some correlation ID that helps the control plane realize
if a process is running with the same storage as a previous pageserver process (e.g.
we might use EC instance ID, or we might just write some UUID to the disk the first
time we use it)
##### `/validate`
- Request: `{'tenants': [{tenant: <tenant id>, attach_gen: <gen>}, ...]}'`
- Response:
- 200 `{'tenants': [{tenant: <tenant id>, status: <bool>}...]}`
- (On unknown tenants, omit tenant from `tenants` array)
- Purpose: enable the pageserver to discover for the given attachments whether they are still the latest.
- Server behavior: this is a read-only operation: simply compare the generations in the request with
the generations known to the server, and set status to `true` if they match.
- Client behavior: clients must not do deletions within a tenant's remote data until they have
received a response indicating the generation they hold for the attachment is current.
#### Use of `/load` and `/ignore` APIs
Because the pageserver will be changed to only attach tenants on startup
based on the control plane's response to a `/re-attach` request, the load/ignore
APIs no longer make sense in their current form.
The `/load` API becomes functionally equivalent to attach, and will be removed:
any location that used `/load` before should just attach instead.
The `/ignore` API is equivalent to detaching, but without deleting local files.
### Timeline/Branch creation & deletion
All of the previous arguments for safety have described operations within
a timeline, where we may describe a sequence that includes updates to
index_part.json, and where reads and writes are coming from a postgres
endpoint (writes via the safekeeper).
Creating or destroying timeline is a bit different, because writes
are coming from the control plane.
We must be safe against scenarios such as:
- A tenant is attached to pageserver B while pageserver A is
in the middle of servicing an RPC from the control plane to
create or delete a tenant.
- A pageserver A has been sent a timeline creation request
but becomes unresponsive. The tenant is attached to a
different pageserver B, and the timeline creation request
is sent there too.
#### Timeline Creation
If some very slow node tries to do a timeline creation _after_
a more recent generation node has already created the timeline
and written some data into it, that must not cause harm. This
is provided in timeline creations by the way all the objects
within the timeline's remote path include a generation suffix:
a slow node in an old generation that attempts to "create" a timeline
that already exists will just emit an index_part.json with
an old generation suffix.
Timeline IDs are never reused, so we don't have
to worry about the case of create/delete/create cycles. If they
were re-used during a disaster recovery "un-delete" of a timeline,
that special case can be handled by calling out to all available pageservers
to check that they return 404 for the timeline, and to flush their
deletion queues in case they had any deletions pending from the
timeline.
The above makes it safe for control plane to change the assignment of
tenant to pageserver in control plane while a timeline creation is ongoing.
The reason is that the creation request against the new assigned pageserver
uses a new generation number. However, care must be taken by control plane
to ensure that a "timeline creation successul" response from some pageserver
is checked for the pageserver's generation for that timeline's tenant still being the latest.
If it is not the latest, the response does not constitute a successful timeline creation.
It is acceptable to discard such responses, the scrubber will clean up the S3 state.
It is better to issue a timelien deletion request to the stale attachment.
#### Timeline Deletion
Tenant/timeline deletion operations are exempt from generation validation
on deletes, and therefore don't have to go through the same deletion
queue as GC/compaction layer deletions. This is because once a
delete is issued by the control plane, it is a promise that the
control plane will keep trying until the deletion is done, so even stale
pageservers are permitted to go ahead and delete the objects.
The implications of this for control plane are:
- During timeline/tenant deletion, the control plane must wait for the deletion to
be truly complete (status 404) and also handle the case where the pageserver
becomes unavailable, either by waiting for a replacement with the same node_id,
or by *re-attaching the tenant elsewhere.
- The control plane must persist its intent to delete
a timeline/tenant before issuing any RPCs, and then once it starts, it must
keep retrying until the tenant/timeline is gone. This is already handled
by using a persistent `Operation` record that is retried indefinitely.
Timeline deletion may result in a special kind of object leak, where
the latest generation attachment completes a deletion (including erasing
all objects in the timeline path), but some slow/partitioned node is
writing into the timeline path with a stale generation number. This would
not be caught by any per-timeline scrubbing (see [scrubbing](#cleaning-up-orphan-objects-scrubbing)), since scrubbing happens on the
attached pageserver, and once the timeline is deleted it isn't attached anywhere.
This scenario should be pretty rare, and the control plane can make it even
rarer by ensuring that if a tenant is in a multi-attached state (e.g. during
migration), we wait for that to complete before processing the deletion. Beyond
that, we may implement some other top-level scrub of timelines in
an external tool, to identify any tenant/timeline paths that are not found
in the control plane database.
#### Examples
- Deletion, node restarts partway through:
- By the time we returned 202, we have written a remote delete marker
- Any subsequent incarnation of the same node_id will see the remote
delete marker and continue to process the deletion
- If the original pageserver is lost permanently and no replacement
with the same node_id is available, then the control plane must recover
by re-attaching the tenant to a different node.
- Creation, node becomes unresponsive partway through.
- Control plane will see HTTP request timeout, keep re-issuing
request to whoever is the latest attachment point for the tenant
until it succeeds.
- Stale nodes may be trying to execute timeline creation: they will
write out index_part.json files with
stale attachment generation: these will be eventually cleaned up
by the same mechanism as other old indices.
### Unsafe case on badly behaved infrastructure
This section is only relevant if running on a different environment
than EC2 machines with ephemeral disks.
If we ever run pageservers on infrastructure that might transparently restart
a pageserver while leaving an old process running (e.g. a VM gets rescheduled
without the old one being fenced), then there is a risk of corruption, when
the control plane attaches the tenant, as follows:
- If the control plane sends an `/attach` request to node A, then node A dies
and is replaced, and the control plane's retries the request without
incrementing that attachment ID, then it could end up with two physical nodes
both using the same generation number.
- This is not an issue when using EC2 instances with ephemeral storage, as long
as the control plane never re-uses a node ID, but it would need re-examining
if running on different infrastructure.
- To robustly protect against this class of issue, we would either:
- add a "node generation" to distinguish between different processes holding the
same node_id.
- or, dispense with static node_id entirely and issue an ephemeral ID to each
pageserver process when it starts.
## Implementation Part 2: Optimizations
### Persistent deletion queue
Between writing our a new index_part.json that doesn't reference an object,
and executing the deletion, an object passes through a window where it is
only referenced in memory, and could be leaked if the pageserver is stopped
uncleanly. That introduces conflicting incentives: on the one hand, we would
like to delay and batch deletions to
1. minimize the cost of the mandatory validations calls to control plane, and
2. minimize cost for DeleteObjects requests.
On the other hand we would also like to minimize leakage by executing
deletions promptly.
To resolve this, we may make the deletion queue persistent
and then executing these in the background at a later time.
_Note: The deletion queue's reason for existence is optimization rather than correctness,
so there is a lot of flexibility in exactly how the it should work,
as long as it obeys the rule to validate generations before executing deletions,
so the following details are not essential to the overall RFC._
#### Scope
The deletion queue will be global per pageserver, not per-tenant. There
are several reasons for this choice:
- Use the queue as a central point to coalesce validation requests to the
control plane: this avoids individual `Timeline` objects ever touching
the control plane API, and avoids them having to know the rules about
validating deletions. This separation of concerns will avoid burdening
the already many-LoC `Timeline` type with even more responsibility.
- Decouple the deletion queue from Tenant attachment lifetime: we may
"hibernate" an inactive tenant by tearing down its `Tenant`/`Timeline`
objects in the pageserver, without having to wait for deletions to be done.
- Amortize the cost of I/O for the persistent queue, instead of having many
tiny queues.
- Coalesce deletions into a smaller number of larger DeleteObjects calls
Because of the cost of doing I/O for persistence, and the desire to coalesce
generation validation requests across tenants, and coalesce deletions into
larger DeleteObjects requests, there will be one deletion queue per pageserver
rather than one per tenant. This has the added benefit that when deactivating
a tenant, we do not have to drain their deletion queue: deletions can proceed
for a tenant whose main `Tenant` object has been torn down.
#### Flow of deletion
The flow of a deletion is becomes:
1. Need for deletion of an object (=> layer file) is identified.
2. Unlink the object from all the places that reference it (=> `index_part.json`).
3. Enqueue the deletion to a persistent queue.
Each entry is `tenant_id, attachment_generation, S3 key`.
4. Validate & execute in batches:
4.1 For a batch of entries, call into control plane.
4.2 For the subset of entries that passed validation, execute a `DeleteObjects` S3 DELETE request for their S3 keys.
As outlined in the Part 1 on correctness, it is critical that deletions are only
executed once the key is not referenced anywhere in S3.
This property is obviously upheld by the scheme above.
#### We Accept Object Leakage In Acceptable Circumcstances
If we crash in the flow above between (2) and (3), we lose track of unreferenced object.
Further, enqueuing a single to the persistent queue may not be durable immediately to amortize cost of flush to disk.
This is acceptable for now, it can be caught by [the scrubber](#cleaning-up-orphan-objects-scrubbing).
There are various measures we can take to improve this in the future.
1. Cap amount of time until enqueued entry becomes durable (timeout for flush-to-tisk)
2. Proactively flush:
- On graceful shutdown, as we anticipate that some or
all of our attachments may be re-assigned while we are offline.
- On tenant detach.
3. For each entry, keep track of whether it has passed (2).
Only admit entries to (4) one they have passed (2).
This requires re-writing / two queue entries (intent, commit) per deletion.
The important take-away with any of the above is that it's not
disastrous to leak objects in exceptional circumstances.
#### Operations that may skip the queue
Deletions of an entire timeline are [exempt](#Timeline-Deletion) from generation number validation. Once the
control plane sends the deletion request, there is no requirement to retain the readability
of any data within the timeline, and all objects within the timeline path may be deleted
at any time from the control plane's deletion request onwards.
Since deletions of smaller timelines won't have enough objects to compose a full sized
DeleteObjects request, it is still useful to send these through the last part of the
deletion pipeline to coalesce with other executing deletions: to enable this, the
deletion queue should expose two input channels: one for deletions that must be
processed in a generation-aware way, and a fast path for timeline deletions, where
that fast path may skip validation and the persistent queue.
### Cleaning up orphan objects (scrubbing)
An orphan object is any object which is no longer referenced by a running node or by metadata.
Examples of how orphan objects arise:
- A node PUTs a layer object, then crashes before it writes the
index_part.json that references that layer.
- A stale node carries on running for some time, and writes out an unbounded number of
objects while it believes itself to be the rightful writer for a tenant.
- A pageserver crashes between un-linking an object from the index, and persisting
the object to its deletion queue.
Orphan objects are functionally harmless, but have a small cost due to S3 capacity consumed. We
may clean them up at some time in the future, but doing a ListObjectsv2 operation and cross
referencing with the latest metadata to identify objects which are not referenced.
Scrubbing will be done only by an attached pageserver (not some third party process), and deletions requested during scrub will go through the same
validation as all other deletions: the attachment generation must be
fresh. This avoids the possibility of a stale pageserver incorrectly
thinking than an object written by a newer generation is stale, and deleting
it.
It is not strictly necessary that scrubbing be done by an attached
pageserver: it could also be done externally. However, an external
scrubber would still require the same validation procedure that
a pageserver's deletion queue performs, before actually erasing
objects.
## Operational impact
### Availability
Coordination of generation numbers via the control plane introduce a dependency for certain
operations:
1. Starting new pageservers (or activating pageservers after a restart)
2. Executing enqueued deletions
3. Advertising updated `remote_consistent_lsn` to enable WAL trimming
Item 1. would mean that some in-place restarts that previously would have resumed service even if the control plane were
unavailable, will now not resume service to users until the control plane is available. We could
avoid this by having a timeout on communication with the control plane, and after some timeout,
resume service with the previous generation numbers (assuming this was persisted to disk). However,
this is unlikely to be needed as the control plane is already an essential & highly available component. Also, having a node re-use an old generation number would complicate
reasoning about the system, as it would break the invariant that a generation number uniquely identifies
a tenant's attachment to a given pageserver _process_: it would merely identify the tenant's attachment
to the pageserver _machine_ or its _on-disk-state_.
Item 2. is a non-issue operationally: it's harmless to delay deletions, the only impact of objects pending deletion is
the S3 capacity cost.
Item 3. could be an issue if safekeepers are low on disk space and the control plane is unavailable for a long time. If this became an issue,
we could adjust the safekeeper to delete segments from local disk sooner, as soon as they're uploaded to S3, rather than waiting for
remote_consistent_lsn to advance.
For a managed service, the general approach should be to make sure we are monitoring & respond fast enough
that control plane outages are bounded in time.
There is also the fact that control plane runs in a single region.
The latency for distant regions is not a big concern for us because all request types added by this RFC are either infrequent or not in the way of the data path.
However, we lose region isolation for the operations listed above.
The ongoing work to split console and control will give us per-region control plane, and all operations in this RFC can be handled by these per-region control planes.
With that in mind, we accept the trade-offs outlined in this paragraph.
We will also implement an "escape hatch" config generation numbers, where in a major disaster outage,
we may manually run pageservers with a hand-selected generation number, so that we can bring them online
independently of a control plane.
### Rollout
Although there is coupling between components, we may deploy most of the new data plane components
independently of the control plane: initially they can just use a static generation number.
#### Phase 1
The pageserver is deployed with some special config to:
- Always act like everything is generation 1 and do not wait for a control plane issued generation on attach
- Skip the places in deletion and remote_consistent_lsn updates where we would call into control plane
#### Phase 2
The control plane changes are deployed: control plane will now track and increment generation numbers.
#### Phase 3
The pageserver is deployed with its control-plane-dependent changes enabled: it will now require
the control plane to service re-attach requests on startup, and handle generation
validation requests.
### On-disk backward compatibility
Backward compatibility with existing data is straightforward:
- When reading the index, we may assume that any layer whose metadata doesn't include
generations will have a path without generation suffix.
- When locating the index file on attachment, we may use the "fallback" listing path
and if there is only an index without generation suffix, that is the one we load.
It is not necessary to re-write existing layers: even new index files will be able
to represent generation-less layers.
### On-disk forward compatibility
We will do a two phase rollout, probably over multiple releases because we will naturally
have some of the read-side code ready before the overall functionality is ready:
1. Deploy pageservers which understand the new index format and generation suffixes
in keys, but do not write objects with generation numbers in the keys.
2. Deploy pageservers that write objects with generation numbers in the keys.
Old pageservers will be oblivious to generation numbers. That means that they can't
read objects with generation numbers in the name. This is why we must
first step must deploy the ability to read, before the second step
starts writing them.
# Frequently Asked Questions
## Why a generation _suffix_ rather than _prefix_?
The choice is motivated by object listing, since one can list by prefix but not
suffix.
In [finding remote indices](#finding-the-remote-indices-for-timelines), we rely
on being able to do a prefix listing for `<tenant>/<timeline>/index_part.json*`.
That relies on the prefix listing.
The converse case of using a generation prefix and listing by generation is
not needed: one could imagine listing by generation while scrubbing (so that
a particular generation's layers could be scrubbed), but this is not part
of normal operations, and the [scrubber](#cleaning-up-orphan-objects-scrubbing) probably won't work that way anyway.
## Wouldn't it be simpler to have a separate deletion queue per timeline?
Functionally speaking, we could. That's how RemoteTimelineClient currently works,
but this approach does not map well to a long-lived persistent queue with
generation validation.
Anything we do per-timeline generates tiny random I/O, on a pageserver with
tens of thousands of timelines operating: to be ready for high scale, we should:
- A) Amortize costs where we can (e.g. a shared deletion queue)
- B) Expect to put tenants into a quiescent state while they're not
busy: i.e. we shouldn't keep a tenant alive to service its deletion queue.
This was discussed in the [scope](#scope) part of the deletion queue section.
# Appendix A: Examples of use in high availability/failover
The generation numbers proposed in this RFC are adaptable to a variety of different
failover scenarios and models. The sections below sketch how they would work in practice.
### In-place restart of a pageserver
"In-place" here means that the restart is done before any other element in the system
has taken action in response to the node being down.
- After restart, the node issues a re-attach request to the control plane, and
receives new generation numbers for all its attached tenants.
- Tenants may be activated with the generation number in the re-attach response.
- If any of its attachments were in fact stale (i.e. had be reassigned to another
node while this node was offline), then
- the re-attach response will inform the tenant about this by not including
the tenant of this by _not_ incrementing the generation for that attachment.
- This will implicitly block deletions in the tenant, but as an optimization
the pageserver should also proactively stop doing S3 uploads when it notices this stale-generation state.
- The control plane is expected to eventually detach this tenant from the
pageserver.
If the control plane does not include a tenant in the re-attach response,
but there is still local state for the tenant in the filesystem, the pageserver
deletes the local state in response and does not load/active the tenant.
See the [earlier section on pageserver startup](#pageserver-attachstartup-changes) for details.
Control plane can use this mechanism to clean up a pageserver that has been
down for so long that all its tenants were migrated away before it came back
up again and asked for re-attach.
### Failure of a pageserver
In this context, read "failure" as the most ambiguous possible case, where
a pageserver is unavailable to clients and control plane, but may still be executing and talking
to S3.
#### Case A: re-attachment to other nodes
1. Let's say node 0 becomes unresponsive in a cluster of three nodes 0, 1, 2.
2. Some external mechanism notices that the node is unavailable and initiates
movement of all tenants attached to that node to a different node according
to some distribution rule.
In this example, it would mean incrementing the generation
of all tenants that were attached to node 0, as each tenant's assigned pageserver changes.
3. A tenant which is now attached to node 1 will _also_ still be attached to node
0, from the perspective of node 0. Node 0 will still be using its old generation,
node 1 will be using a newer generation.
4. S3 writes will continue from nodes 0 and 1: there will be an index_part.json-00000001
\_and\* an index_part.json-00000002. Objects written under the old suffix
after the new attachment was created do not matter from the rest of the system's
perspective: the endpoints are reading from the new attachment location. Objects
written by node 0 are just garbage that can be cleaned up at leisure. Node 0 will
not do any deletions because it can't synchronize with control plane, or if it could,
its deletion queue processing would get errors for the validation requests.
#### Case B: direct node replacement with same node_id and drive
This is the scenario we would experience if running pageservers in some dynamic
VM/container environment that would auto-replace a given node_id when it became
unresponsive, with the node's storage supplied by some network block device
that is attached to the replacement VM/container.
1. Let's say node 0 fails, and there may be some other peers but they aren't relevant.
2. Some external mechanism notices that the node is unavailable, and creates
a "new node 0" (Node 0b) which is a physically separate server. The original node 0
(Node 0a) may still be running, because we do not assume the environment fences nodes.
3. On startup, node 0b re-attaches and gets higher generation numbers for
all tenants.
4. S3 writes continue from nodes 0a and 0b, but the writes do not collide due to different
generation in the suffix, and the writes from node 0a are not visible to the rest
of the system because endpoints are reading only from node 0b.
# Appendix B: interoperability with other features
## Sharded Keyspace
The design in this RFC maps neatly to a sharded keyspace design where subsets of the key space
for a tenant are assigned to different pageservers:
- the "unit of work" for attachments becomes something like a TenantShard rather than a Tenant
- TenantShards get generation numbers just as Tenants do.
- Write workload (ingest, compaction) for a tenant is spread out across pageservers via
TenantShards, but each TenantShard still has exactly one valid writer at a time.
## Read replicas
_This section is about a passive reader of S3 pageserver state, not a postgres
read replica_
For historical reads to LSNs below the remote persistent LSN, any node may act as a reader at any
time: remote data is logically immutable data, and the use of deferred deletion in this RFC helps
mitigate the fact that remote data is not _physically_ immutable (i.e. the actual data for a given
page moves around as compaction happens).
A read replica needs to be aware of generations in remote data in order to read the latest
metadata (find the index_part.json with the latest suffix). It may either query this
from the control plane, or find it with ListObjectsv2 request
## Seamless migration
To make tenant migration totally seamless, we will probably want to intentionally double-attach
a tenant briefly, serving reads from the old node while waiting for the new node to be ready.
This RFC enables that double-attachment: two nodes may be attached at the same time, with the migration destination
having a higher generation number. The old node will be able to ingest and serve reads, but not
do any deletes. The new node's attachment must also avoid deleting layers that the old node may
still use. A new piece of state
will be needed for this in the control plane's definition of an attachment.
## Warm secondary locations
To enable faster tenant movement after a pageserver is lost, we will probably want to spend some
disk capacity on keeping standby locations populated with local disk data.
There's no conflict between this RFC and that: implementing warm secondary locations on a per-tenant basis
would be a separate change to the control plane to store standby location(s) for a tenant. Because
the standbys do not write to S3, they do not need to be assigned generation numbers. When a tenant is
re-attached to a standby location, that would increment the tenant attachment generation and this
would work the same as any other attachment change, but with a warm cache.
## Ephemeral node IDs
This RFC intentionally avoids changing anything fundamental about how pageservers are identified
and registered with the control plane, to avoid coupling the implementation of pageserver split
brain protection with more fundamental changes in the management of the pageservers.
Moving to ephemeral node IDs would provide an extra layer of
resilience in the system, as it would prevent the control plane
accidentally attaching to two physical nodes with the same
generation, if somehow there were two physical nodes with
the same node IDs (currently we rely on EC2 guarantees to
eliminate this scenario). With ephemeral node IDs, there would be
no possibility of that happening, no matter the behavior of
underlying infrastructure.
Nothing fundamental in the pageserver's handling of generations needs to change to handle ephemeral node IDs, since we hardly use the
`node_id` anywhere. The `/re-attach` API would be extended
to enable the pageserver to obtain its ephemeral ID, and provide
some correlation identifier (e.g. EC instance ID), to help the
control plane re-attach tenants to the same physical server that
previously had them attached.

View File

@@ -89,6 +89,8 @@ impl RemoteExtSpec {
&self,
ext_name: &str,
is_library: bool,
build_tag: &str,
pg_major_version: &str,
) -> anyhow::Result<(String, RemotePath)> {
let mut real_ext_name = ext_name;
if is_library {
@@ -104,11 +106,32 @@ impl RemoteExtSpec {
.ok_or(anyhow::anyhow!("library {} is not found", lib_raw_name))?;
}
// Check if extension is present in public or custom.
// If not, then it is not allowed to be used by this compute.
if let Some(public_extensions) = &self.public_extensions {
if !public_extensions.contains(&real_ext_name.to_string()) {
if let Some(custom_extensions) = &self.custom_extensions {
if !custom_extensions.contains(&real_ext_name.to_string()) {
return Err(anyhow::anyhow!("extension {} is not found", real_ext_name));
}
}
}
}
match self.extension_data.get(real_ext_name) {
Some(ext_data) => Ok((
real_ext_name.to_string(),
RemotePath::from_string(&ext_data.archive_path)?,
)),
Some(_ext_data) => {
// Construct the path to the extension archive
// BUILD_TAG/PG_MAJOR_VERSION/extensions/EXTENSION_NAME.tar.zst
//
// Keep it in sync with path generation in
// https://github.com/neondatabase/build-custom-extensions/tree/main
let archive_path_str =
format!("{build_tag}/{pg_major_version}/extensions/{real_ext_name}.tar.zst");
Ok((
real_ext_name.to_string(),
RemotePath::from_string(&archive_path_str)?,
))
}
None => Err(anyhow::anyhow!(
"real_ext_name {} is not found",
real_ext_name

View File

@@ -31,6 +31,8 @@ fn lsn_invalid() -> Lsn {
#[serde_as]
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct SkTimelineInfo {
/// Term.
pub term: Option<u64>,
/// Term of the last entry.
pub last_log_term: Option<u64>,
/// LSN of the last record.
@@ -58,4 +60,6 @@ pub struct SkTimelineInfo {
/// A connection string to use for WAL receiving.
#[serde(default)]
pub safekeeper_connstr: Option<String>,
#[serde(default)]
pub http_connstr: Option<String>,
}

View File

@@ -38,6 +38,7 @@ url.workspace = true
uuid.workspace = true
pq_proto.workspace = true
postgres_connection.workspace = true
metrics.workspace = true
workspace_hack.workspace = true

View File

@@ -0,0 +1,113 @@
use std::fmt::Debug;
use serde::{Deserialize, Serialize};
/// Tenant generations are used to provide split-brain safety and allow
/// multiple pageservers to attach the same tenant concurrently.
///
/// See docs/rfcs/025-generation-numbers.md for detail on how generation
/// numbers are used.
#[derive(Copy, Clone, Eq, PartialEq, PartialOrd, Ord)]
pub enum Generation {
// Generations with this magic value will not add a suffix to S3 keys, and will not
// be included in persisted index_part.json. This value is only to be used
// during migration from pre-generation metadata to generation-aware metadata,
// and should eventually go away.
//
// A special Generation is used rather than always wrapping Generation in an Option,
// so that code handling generations doesn't have to be aware of the legacy
// case everywhere it touches a generation.
None,
// Generations with this magic value may never be used to construct S3 keys:
// we will panic if someone tries to. This is for Tenants in the "Broken" state,
// so that we can satisfy their constructor with a Generation without risking
// a code bug using it in an S3 write (broken tenants should never write)
Broken,
Valid(u32),
}
/// The Generation type represents a number associated with a Tenant, which
/// increments every time the tenant is attached to a new pageserver, or
/// an attached pageserver restarts.
///
/// It is included as a suffix in S3 keys, as a protection against split-brain
/// scenarios where pageservers might otherwise issue conflicting writes to
/// remote storage
impl Generation {
/// Create a new Generation that represents a legacy key format with
/// no generation suffix
pub fn none() -> Self {
Self::None
}
// Create a new generation that will panic if you try to use get_suffix
pub fn broken() -> Self {
Self::Broken
}
pub fn new(v: u32) -> Self {
Self::Valid(v)
}
pub fn is_none(&self) -> bool {
matches!(self, Self::None)
}
pub fn get_suffix(&self) -> String {
match self {
Self::Valid(v) => {
format!("-{:08x}", v)
}
Self::None => "".into(),
Self::Broken => {
panic!("Tried to use a broken generation");
}
}
}
}
impl Serialize for Generation {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
if let Self::Valid(v) = self {
v.serialize(serializer)
} else {
// We should never be asked to serialize a None or Broken. Structures
// that include an optional generation should convert None to an
// Option<Generation>::None
Err(serde::ser::Error::custom(
"Tried to serialize invalid generation ({self})",
))
}
}
}
impl<'de> Deserialize<'de> for Generation {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
Ok(Self::Valid(u32::deserialize(deserializer)?))
}
}
// We intentionally do not implement Display for Generation, to reduce the
// risk of a bug where the generation is used in a format!() string directly
// instead of using get_suffix().
impl Debug for Generation {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::Valid(v) => {
write!(f, "{:08x}", v)
}
Self::None => {
write!(f, "<none>")
}
Self::Broken => {
write!(f, "<broken>")
}
}
}
}

View File

@@ -27,6 +27,9 @@ pub mod id;
// http endpoint utils
pub mod http;
// definition of the Generation type for pageserver attachment APIs
pub mod generation;
// common log initialisation routine
pub mod logging;
@@ -58,6 +61,8 @@ pub mod serde_regex;
pub mod pageserver_feedback;
pub mod postgres_client;
pub mod tracing_span_assert;
pub mod rate_limit;
@@ -68,8 +73,6 @@ pub mod completion;
/// Reporting utilities
pub mod error;
pub mod sync;
/// This is a shortcut to embed git sha into binaries and avoid copying the same build script to all packages
///
/// we have several cases:

View File

@@ -0,0 +1,37 @@
//! Postgres client connection code common to other crates (safekeeper and
//! pageserver) which depends on tenant/timeline ids and thus not fitting into
//! postgres_connection crate.
use anyhow::Context;
use postgres_connection::{parse_host_port, PgConnectionConfig};
use crate::id::TenantTimelineId;
/// Create client config for fetching WAL from safekeeper on particular timeline.
/// listen_pg_addr_str is in form host:\[port\].
pub fn wal_stream_connection_config(
TenantTimelineId {
tenant_id,
timeline_id,
}: TenantTimelineId,
listen_pg_addr_str: &str,
auth_token: Option<&str>,
availability_zone: Option<&str>,
) -> anyhow::Result<PgConnectionConfig> {
let (host, port) =
parse_host_port(listen_pg_addr_str).context("Unable to parse listen_pg_addr_str")?;
let port = port.unwrap_or(5432);
let mut connstr = PgConnectionConfig::new_host_port(host, port)
.extend_options([
"-c".to_owned(),
format!("timeline_id={}", timeline_id),
format!("tenant_id={}", tenant_id),
])
.set_password(auth_token.map(|s| s.to_owned()));
if let Some(availability_zone) = availability_zone {
connstr = connstr.extend_options([format!("availability_zone={}", availability_zone)]);
}
Ok(connstr)
}

View File

@@ -1 +0,0 @@
pub mod heavier_once_cell;

View File

@@ -1,306 +0,0 @@
use std::sync::{Arc, Mutex, MutexGuard};
use tokio::sync::Semaphore;
/// Custom design like [`tokio::sync::OnceCell`] but using [`OwnedSemaphorePermit`] instead of
/// `SemaphorePermit`, allowing use of `take` which does not require holding an outer mutex guard
/// for the duration of initialization.
///
/// Has no unsafe, builds upon [`tokio::sync::Semaphore`] and [`std::sync::Mutex`].
///
/// [`OwnedSemaphorePermit`]: tokio::sync::OwnedSemaphorePermit
pub struct OnceCell<T> {
inner: Mutex<Inner<T>>,
}
impl<T> Default for OnceCell<T> {
/// Create new uninitialized [`OnceCell`].
fn default() -> Self {
Self {
inner: Default::default(),
}
}
}
/// Semaphore is the current state:
/// - open semaphore means the value is `None`, not yet initialized
/// - closed semaphore means the value has been initialized
#[derive(Debug)]
struct Inner<T> {
init_semaphore: Arc<Semaphore>,
value: Option<T>,
}
impl<T> Default for Inner<T> {
fn default() -> Self {
Self {
init_semaphore: Arc::new(Semaphore::new(1)),
value: None,
}
}
}
impl<T> OnceCell<T> {
/// Creates an already initialized `OnceCell` with the given value.
pub fn new(value: T) -> Self {
let sem = Semaphore::new(1);
sem.close();
Self {
inner: Mutex::new(Inner {
init_semaphore: Arc::new(sem),
value: Some(value),
}),
}
}
/// Returns a guard to an existing initialized value, or uniquely initializes the value before
/// returning the guard.
///
/// Initializing might wait on any existing [`Guard::take_and_deinit`] deinitialization.
///
/// Initialization is panic-safe and cancellation-safe.
pub async fn get_or_init<F, Fut, E>(&self, factory: F) -> Result<Guard<'_, T>, E>
where
F: FnOnce() -> Fut,
Fut: std::future::Future<Output = Result<T, E>>,
{
let sem = {
let guard = self.inner.lock().unwrap();
if guard.value.is_some() {
return Ok(Guard(guard));
}
guard.init_semaphore.clone()
};
let permit = sem.acquire_owned().await;
if permit.is_err() {
let guard = self.inner.lock().unwrap();
assert!(
guard.value.is_some(),
"semaphore got closed, must be initialized"
);
return Ok(Guard(guard));
} else {
// now we try
let value = factory().await?;
let mut guard = self.inner.lock().unwrap();
assert!(
guard.value.is_none(),
"we won permit, must not be initialized"
);
guard.value = Some(value);
guard.init_semaphore.close();
Ok(Guard(guard))
}
}
/// Returns a guard to an existing initialized value, if any.
pub fn get(&self) -> Option<Guard<'_, T>> {
let guard = self.inner.lock().unwrap();
if guard.value.is_some() {
Some(Guard(guard))
} else {
None
}
}
}
/// Uninteresting guard object to allow short-lived access to inspect or clone the held,
/// initialized value.
#[derive(Debug)]
pub struct Guard<'a, T>(MutexGuard<'a, Inner<T>>);
impl<T> std::ops::Deref for Guard<'_, T> {
type Target = T;
fn deref(&self) -> &Self::Target {
self.0
.value
.as_ref()
.expect("guard is not created unless value has been initialized")
}
}
impl<T> std::ops::DerefMut for Guard<'_, T> {
fn deref_mut(&mut self) -> &mut Self::Target {
self.0
.value
.as_mut()
.expect("guard is not created unless value has been initialized")
}
}
impl<'a, T> Guard<'a, T> {
/// Take the current value, and a new permit for it's deinitialization.
///
/// The permit will be on a semaphore part of the new internal value, and any following
/// [`OnceCell::get_or_init`] will wait on it to complete.
pub fn take_and_deinit(&mut self) -> (T, tokio::sync::OwnedSemaphorePermit) {
let mut swapped = Inner::default();
let permit = swapped
.init_semaphore
.clone()
.try_acquire_owned()
.expect("we just created this");
std::mem::swap(&mut *self.0, &mut swapped);
swapped
.value
.map(|v| (v, permit))
.expect("guard is not created unless value has been initialized")
}
}
#[cfg(test)]
mod tests {
use super::*;
use std::{
convert::Infallible,
sync::atomic::{AtomicUsize, Ordering},
time::Duration,
};
#[tokio::test]
async fn many_initializers() {
#[derive(Default, Debug)]
struct Counters {
factory_got_to_run: AtomicUsize,
future_polled: AtomicUsize,
winners: AtomicUsize,
}
let initializers = 100;
let cell = Arc::new(OnceCell::default());
let counters = Arc::new(Counters::default());
let barrier = Arc::new(tokio::sync::Barrier::new(initializers + 1));
let mut js = tokio::task::JoinSet::new();
for i in 0..initializers {
js.spawn({
let cell = cell.clone();
let counters = counters.clone();
let barrier = barrier.clone();
async move {
barrier.wait().await;
let won = {
let g = cell
.get_or_init(|| {
counters.factory_got_to_run.fetch_add(1, Ordering::Relaxed);
async {
counters.future_polled.fetch_add(1, Ordering::Relaxed);
Ok::<_, Infallible>(i)
}
})
.await
.unwrap();
*g == i
};
if won {
counters.winners.fetch_add(1, Ordering::Relaxed);
}
}
});
}
barrier.wait().await;
while let Some(next) = js.join_next().await {
next.expect("no panics expected");
}
let mut counters = Arc::try_unwrap(counters).unwrap();
assert_eq!(*counters.factory_got_to_run.get_mut(), 1);
assert_eq!(*counters.future_polled.get_mut(), 1);
assert_eq!(*counters.winners.get_mut(), 1);
}
#[tokio::test(start_paused = true)]
async fn reinit_waits_for_deinit() {
// with he tokio::time paused, we will "sleep" for 1s while holding the reinitialization
let sleep_for = Duration::from_secs(1);
let initial = 42;
let reinit = 1;
let cell = Arc::new(OnceCell::new(initial));
let deinitialization_started = Arc::new(tokio::sync::Barrier::new(2));
let jh = tokio::spawn({
let cell = cell.clone();
let deinitialization_started = deinitialization_started.clone();
async move {
let (answer, _permit) = cell.get().expect("initialized to value").take_and_deinit();
assert_eq!(answer, initial);
deinitialization_started.wait().await;
tokio::time::sleep(sleep_for).await;
}
});
deinitialization_started.wait().await;
let started_at = tokio::time::Instant::now();
cell.get_or_init(|| async { Ok::<_, Infallible>(reinit) })
.await
.unwrap();
let elapsed = started_at.elapsed();
assert!(
elapsed >= sleep_for,
"initialization should had taken at least the time time slept with permit"
);
jh.await.unwrap();
assert_eq!(*cell.get().unwrap(), reinit);
}
#[tokio::test]
async fn initialization_attemptable_until_ok() {
let cell = OnceCell::default();
for _ in 0..10 {
cell.get_or_init(|| async { Err("whatever error") })
.await
.unwrap_err();
}
let g = cell
.get_or_init(|| async { Ok::<_, Infallible>("finally success") })
.await
.unwrap();
assert_eq!(*g, "finally success");
}
#[tokio::test]
async fn initialization_is_cancellation_safe() {
let cell = OnceCell::default();
let barrier = tokio::sync::Barrier::new(2);
let initializer = cell.get_or_init(|| async {
barrier.wait().await;
futures::future::pending::<()>().await;
Ok::<_, Infallible>("never reached")
});
tokio::select! {
_ = initializer => { unreachable!("cannot complete; stuck in pending().await") },
_ = barrier.wait() => {}
};
// now initializer is dropped
assert!(cell.get().is_none());
let g = cell
.get_or_init(|| async { Ok::<_, Infallible>("now initialized") })
.await
.unwrap();
assert_eq!(*g, "now initialized");
}
}

View File

@@ -16,3 +16,19 @@ in the `neon-postgres` cgroup and set its `memory.{max,high}`.
* See also: [`neondatabase/vm-monitor`](https://github.com/neondatabase/vm-monitor/),
where initial development of the monitor happened. The repository is no longer
maintained but the commit history may be useful for debugging.
## Structure
The `vm-monitor` is loosely comprised of a few systems. These are:
* the server: this is just a simple `axum` server that accepts requests and
upgrades them to websocket connections. The server only allows one connection at
a time. This means that upon receiving a new connection, the server will terminate
and old one if it exists.
* the filecache: a struct that allows communication with the Postgres file cache.
On startup, we connect to the filecache and hold on to the connection for the
entire monitor lifetime.
* the cgroup watcher: the `CgroupWatcher` manages the `neon-postgres` cgroup by
listening for `memory.high` events and setting its `memory.{high,max}` values.
* the runner: the runner marries the filecache and cgroup watcher together,
communicating with the agent throught the `Dispatcher`, and then calling filecache
and cgroup watcher functions as needed to upscale and downscale

View File

@@ -1,7 +1,7 @@
//! Managing the websocket connection and other signals in the monitor.
//!
//! Contains types that manage the interaction (not data interchange, see `protocol`)
//! between informant and monitor, allowing us to to process and send messages in a
//! between agent and monitor, allowing us to to process and send messages in a
//! straightforward way. The dispatcher also manages that signals that come from
//! the cgroup (requesting upscale), and the signals that go to the cgroup
//! (notifying it of upscale).
@@ -24,16 +24,16 @@ use crate::protocol::{
/// The central handler for all communications in the monitor.
///
/// The dispatcher has two purposes:
/// 1. Manage the connection to the informant, sending and receiving messages.
/// 1. Manage the connection to the agent, sending and receiving messages.
/// 2. Communicate with the cgroup manager, notifying it when upscale is received,
/// and sending a message to the informant when the cgroup manager requests
/// and sending a message to the agent when the cgroup manager requests
/// upscale.
#[derive(Debug)]
pub struct Dispatcher {
/// We read informant messages of of `source`
/// We read agent messages of of `source`
pub(crate) source: SplitStream<WebSocket>,
/// We send messages to the informant through `sink`
/// We send messages to the agent through `sink`
sink: SplitSink<WebSocket, Message>,
/// Used to notify the cgroup when we are upscaled.
@@ -43,7 +43,7 @@ pub struct Dispatcher {
/// we send an `UpscaleRequst` to the agent.
pub(crate) request_upscale_events: mpsc::Receiver<()>,
/// The protocol version we have agreed to use with the informant. This is negotiated
/// The protocol version we have agreed to use with the agent. This is negotiated
/// during the creation of the dispatcher, and should be the highest shared protocol
/// version.
///
@@ -56,9 +56,9 @@ pub struct Dispatcher {
impl Dispatcher {
/// Creates a new dispatcher using the passed-in connection.
///
/// Performs a negotiation with the informant to determine the highest protocol
/// Performs a negotiation with the agent to determine the highest protocol
/// version that both support. This consists of two steps:
/// 1. Wait for the informant to sent the range of protocols it supports.
/// 1. Wait for the agent to sent the range of protocols it supports.
/// 2. Send a protocol version that works for us as well, or an error if there
/// is no compatible version.
pub async fn new(
@@ -69,7 +69,7 @@ impl Dispatcher {
let (mut sink, mut source) = stream.split();
// Figure out the highest protocol version we both support
info!("waiting for informant to send protocol version range");
info!("waiting for agent to send protocol version range");
let Some(message) = source.next().await else {
bail!("websocket connection closed while performing protocol handshake")
};
@@ -79,7 +79,7 @@ impl Dispatcher {
let Message::Text(message_text) = message else {
// All messages should be in text form, since we don't do any
// pinging/ponging. See nhooyr/websocket's implementation and the
// informant/agent for more info
// agent for more info
bail!("received non-text message during proocol handshake: {message:?}")
};
@@ -88,32 +88,30 @@ impl Dispatcher {
max: PROTOCOL_MAX_VERSION,
};
let informant_range: ProtocolRange = serde_json::from_str(&message_text)
let agent_range: ProtocolRange = serde_json::from_str(&message_text)
.context("failed to deserialize protocol version range")?;
info!(range = ?informant_range, "received protocol version range");
info!(range = ?agent_range, "received protocol version range");
let highest_shared_version = match monitor_range.highest_shared_version(&informant_range) {
let highest_shared_version = match monitor_range.highest_shared_version(&agent_range) {
Ok(version) => {
sink.send(Message::Text(
serde_json::to_string(&ProtocolResponse::Version(version)).unwrap(),
))
.await
.context("failed to notify informant of negotiated protocol version")?;
.context("failed to notify agent of negotiated protocol version")?;
version
}
Err(e) => {
sink.send(Message::Text(
serde_json::to_string(&ProtocolResponse::Error(format!(
"Received protocol version range {} which does not overlap with {}",
informant_range, monitor_range
agent_range, monitor_range
)))
.unwrap(),
))
.await
.context(
"failed to notify informant of no overlap between protocol version ranges",
)?;
.context("failed to notify agent of no overlap between protocol version ranges")?;
Err(e).context("error determining suitable protocol version range")?
}
};
@@ -137,7 +135,7 @@ impl Dispatcher {
.context("failed to send resources and oneshot sender across channel")
}
/// Send a message to the informant.
/// Send a message to the agent.
///
/// Although this function is small, it has one major benefit: it is the only
/// way to send data accross the connection, and you can only pass in a proper

View File

@@ -59,8 +59,8 @@ pub struct FileCacheConfig {
spread_factor: f64,
}
impl Default for FileCacheConfig {
fn default() -> Self {
impl FileCacheConfig {
pub fn default_in_memory() -> Self {
Self {
in_memory: true,
// 75 %
@@ -71,9 +71,19 @@ impl Default for FileCacheConfig {
spread_factor: 0.1,
}
}
}
impl FileCacheConfig {
pub fn default_on_disk() -> Self {
Self {
in_memory: false,
resource_multiplier: 0.75,
// 256 MiB - lower than when in memory because overcommitting is safe; if we don't have
// memory, the kernel will just evict from its page cache, rather than e.g. killing
// everything.
min_remaining_after_cache: NonZeroU64::new(256 * MiB).unwrap(),
spread_factor: 0.1,
}
}
/// Make sure fields of the config are consistent.
pub fn validate(&self) -> anyhow::Result<()> {
// Single field validity

View File

@@ -39,6 +39,16 @@ pub struct Args {
#[arg(short, long)]
pub pgconnstr: Option<String>,
/// Flag to signal that the Postgres file cache is on disk (i.e. not in memory aside from the
/// kernel's page cache), and therefore should not count against available memory.
//
// NB: Ideally this flag would directly refer to whether the file cache is in memory (rather
// than a roundabout way, via whether it's on disk), but in order to be backwards compatible
// during the switch away from an in-memory file cache, we had to default to the previous
// behavior.
#[arg(long)]
pub file_cache_on_disk: bool,
/// The address we should listen on for connection requests. For the
/// agent, this is 0.0.0.0:10301. For the informant, this is 127.0.0.1:10369.
#[arg(short, long)]
@@ -146,7 +156,7 @@ pub async fn start(args: &'static Args, token: CancellationToken) -> anyhow::Res
/// Handles incoming websocket connections.
///
/// If we are already to connected to an informant, we kill that old connection
/// If we are already to connected to an agent, we kill that old connection
/// and accept the new one.
#[tracing::instrument(name = "/monitor", skip_all, fields(?args))]
pub async fn ws_handler(
@@ -196,7 +206,7 @@ async fn start_monitor(
return;
}
};
info!("connected to informant");
info!("connected to agent");
match monitor.run().await {
Ok(()) => info!("monitor was killed due to new connection"),

View File

@@ -1,13 +1,13 @@
//! Types representing protocols and actual informant-monitor messages.
//! Types representing protocols and actual agent-monitor messages.
//!
//! The pervasive use of serde modifiers throughout this module is to ease
//! serialization on the go side. Because go does not have enums (which model
//! messages well), it is harder to model messages, and we accomodate that with
//! serde.
//!
//! *Note*: the informant sends and receives messages in different ways.
//! *Note*: the agent sends and receives messages in different ways.
//!
//! The informant serializes messages in the form and then sends them. The use
//! The agent serializes messages in the form and then sends them. The use
//! of `#[serde(tag = "type", content = "content")]` allows us to use `Type`
//! to determine how to deserialize `Content`.
//! ```ignore
@@ -25,9 +25,9 @@
//! Id uint64
//! }
//! ```
//! After reading the type field, the informant will decode the entire message
//! After reading the type field, the agent will decode the entire message
//! again, this time into the correct type using the embedded fields.
//! Because the informant cannot just extract the json contained in a certain field
//! Because the agent cannot just extract the json contained in a certain field
//! (it initially deserializes to `map[string]interface{}`), we keep the fields
//! at the top level, so the entire piece of json can be deserialized into a struct,
//! such as a `DownscaleResult`, with the `Type` and `Id` fields ignored.
@@ -37,7 +37,7 @@ use std::cmp;
use serde::{de::Error, Deserialize, Serialize};
/// A Message we send to the informant.
/// A Message we send to the agent.
#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct OutboundMsg {
#[serde(flatten)]
@@ -51,31 +51,31 @@ impl OutboundMsg {
}
}
/// The different underlying message types we can send to the informant.
/// The different underlying message types we can send to the agent.
#[derive(Serialize, Deserialize, Debug, Clone)]
#[serde(tag = "type")]
pub enum OutboundMsgKind {
/// Indicates that the informant sent an invalid message, i.e, we couldn't
/// Indicates that the agent sent an invalid message, i.e, we couldn't
/// properly deserialize it.
InvalidMessage { error: String },
/// Indicates that we experienced an internal error while processing a message.
/// For example, if a cgroup operation fails while trying to handle an upscale,
/// we return `InternalError`.
InternalError { error: String },
/// Returned to the informant once we have finished handling an upscale. If the
/// Returned to the agent once we have finished handling an upscale. If the
/// handling was unsuccessful, an `InternalError` will get returned instead.
/// *Note*: this is a struct variant because of the way go serializes struct{}
UpscaleConfirmation {},
/// Indicates to the monitor that we are urgently requesting resources.
/// *Note*: this is a struct variant because of the way go serializes struct{}
UpscaleRequest {},
/// Returned to the informant once we have finished attempting to downscale. If
/// Returned to the agent once we have finished attempting to downscale. If
/// an error occured trying to do so, an `InternalError` will get returned instead.
/// However, if we are simply unsuccessful (for example, do to needing the resources),
/// that gets included in the `DownscaleResult`.
DownscaleResult {
// FIXME for the future (once the informant is deprecated)
// As of the time of writing, the informant/agent version of this struct is
// As of the time of writing, the agent/informant version of this struct is
// called api.DownscaleResult. This struct has uppercase fields which are
// serialized as such. Thus, we serialize using uppercase names so we don't
// have to make a breaking change to the agent<->informant protocol. Once
@@ -88,12 +88,12 @@ pub enum OutboundMsgKind {
status: String,
},
/// Part of the bidirectional heartbeat. The heartbeat is initiated by the
/// informant.
/// agent.
/// *Note*: this is a struct variant because of the way go serializes struct{}
HealthCheck {},
}
/// A message received form the informant.
/// A message received form the agent.
#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct InboundMsg {
#[serde(flatten)]
@@ -101,7 +101,7 @@ pub struct InboundMsg {
pub(crate) id: usize,
}
/// The different underlying message types we can receive from the informant.
/// The different underlying message types we can receive from the agent.
#[derive(Serialize, Deserialize, Debug, Clone)]
#[serde(tag = "type", content = "content")]
pub enum InboundMsgKind {
@@ -120,14 +120,14 @@ pub enum InboundMsgKind {
/// when done.
DownscaleRequest { target: Resources },
/// Part of the bidirectional heartbeat. The heartbeat is initiated by the
/// informant.
/// agent.
/// *Note*: this is a struct variant because of the way go serializes struct{}
HealthCheck {},
}
/// Represents the resources granted to a VM.
#[derive(Serialize, Deserialize, Debug, Clone, Copy)]
// Renamed because the agent/informant has multiple resources types:
// Renamed because the agent has multiple resources types:
// `Resources` (milliCPU/memory slots)
// `Allocation` (vCPU/bytes) <- what we correspond to
#[serde(rename(serialize = "Allocation", deserialize = "Allocation"))]
@@ -151,7 +151,7 @@ pub const PROTOCOL_MAX_VERSION: ProtocolVersion = ProtocolVersion::V1_0;
pub struct ProtocolVersion(u8);
impl ProtocolVersion {
/// Represents v1.0 of the informant<-> monitor protocol - the initial version
/// Represents v1.0 of the agent<-> monitor protocol - the initial version
///
/// Currently the latest version.
const V1_0: ProtocolVersion = ProtocolVersion(1);

View File

@@ -1,4 +1,4 @@
//! Exposes the `Runner`, which handles messages received from informant and
//! Exposes the `Runner`, which handles messages received from agent and
//! sends upscale requests.
//!
//! This is the "Monitor" part of the monitor binary and is the main entrypoint for
@@ -21,8 +21,8 @@ use crate::filecache::{FileCacheConfig, FileCacheState};
use crate::protocol::{InboundMsg, InboundMsgKind, OutboundMsg, OutboundMsgKind, Resources};
use crate::{bytes_to_mebibytes, get_total_system_memory, spawn_with_cancel, Args, MiB};
/// Central struct that interacts with informant, dispatcher, and cgroup to handle
/// signals from the informant.
/// Central struct that interacts with agent, dispatcher, and cgroup to handle
/// signals from the agent.
#[derive(Debug)]
pub struct Runner {
config: Config,
@@ -110,10 +110,10 @@ impl Runner {
// memory limits.
if let Some(connstr) = &args.pgconnstr {
info!("initializing file cache");
let config: FileCacheConfig = Default::default();
if !config.in_memory {
panic!("file cache not in-memory implemented")
}
let config = match args.file_cache_on_disk {
true => FileCacheConfig::default_on_disk(),
false => FileCacheConfig::default_in_memory(),
};
let mut file_cache = FileCacheState::new(connstr, config, token.clone())
.await
@@ -140,7 +140,10 @@ impl Runner {
if actual_size != new_size {
info!("file cache size actually got set to {actual_size}")
}
file_cache_reserved_bytes = actual_size;
// Mark the resources given to the file cache as reserved, but only if it's in memory.
if !args.file_cache_on_disk {
file_cache_reserved_bytes = actual_size;
}
state.filecache = Some(file_cache);
}
@@ -227,18 +230,17 @@ impl Runner {
let mut status = vec![];
let mut file_cache_mem_usage = 0;
if let Some(file_cache) = &mut self.filecache {
if !file_cache.config.in_memory {
panic!("file cache not in-memory unimplemented")
}
let actual_usage = file_cache
.set_file_cache_size(expected_file_cache_mem_usage)
.await
.context("failed to set file cache size")?;
file_cache_mem_usage = actual_usage;
if file_cache.config.in_memory {
file_cache_mem_usage = actual_usage;
}
let message = format!(
"set file cache size to {} MiB",
bytes_to_mebibytes(actual_usage)
"set file cache size to {} MiB (in memory = {})",
bytes_to_mebibytes(actual_usage),
file_cache.config.in_memory,
);
info!("downscale: {message}");
status.push(message);
@@ -289,10 +291,6 @@ impl Runner {
// Get the file cache's expected contribution to the memory usage
let mut file_cache_mem_usage = 0;
if let Some(file_cache) = &mut self.filecache {
if !file_cache.config.in_memory {
panic!("file cache not in-memory unimplemented");
}
let expected_usage = file_cache.config.calculate_cache_size(usable_system_memory);
info!(
target = bytes_to_mebibytes(expected_usage),
@@ -304,6 +302,9 @@ impl Runner {
.set_file_cache_size(expected_usage)
.await
.context("failed to set file cache size")?;
if file_cache.config.in_memory {
file_cache_mem_usage = actual_usage;
}
if actual_usage != expected_usage {
warn!(
@@ -312,7 +313,6 @@ impl Runner {
bytes_to_mebibytes(actual_usage)
)
}
file_cache_mem_usage = actual_usage;
}
if let Some(cgroup) = &self.cgroup {
@@ -371,7 +371,7 @@ impl Runner {
Ok(None)
}
InboundMsgKind::InternalError { error } => {
warn!(error, id, "informant experienced an internal error");
warn!(error, id, "agent experienced an internal error");
Ok(None)
}
InboundMsgKind::HealthCheck {} => {
@@ -405,7 +405,7 @@ impl Runner {
.await
.context("failed to send message")?;
}
// there is a message from the informant
// there is a message from the agent
msg = self.dispatcher.source.next() => {
if let Some(msg) = msg {
// Don't use 'message' as a key as the string also uses
@@ -422,7 +422,7 @@ impl Runner {
// Don't use 'message' as a key as the
// string also uses that for its key
msg = ?other,
"informant should only send text messages but received different type"
"agent should only send text messages but received different type"
);
continue
},

View File

@@ -97,7 +97,7 @@ pub(crate) fn parse_filename(name: &str) -> Option<LayerFile> {
// Finds the max_holes largest holes, ignoring any that are smaller than MIN_HOLE_LENGTH"
async fn get_holes(path: &Path, max_holes: usize) -> Result<Vec<Hole>> {
let file = FileBlockReader::new(VirtualFile::open(path)?);
let summary_blk = file.read_blk(0)?;
let summary_blk = file.read_blk(0).await?;
let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;
let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
actual_summary.index_start_blk,

View File

@@ -48,7 +48,7 @@ async fn read_delta_file(path: impl AsRef<Path>) -> Result<()> {
virtual_file::init(10);
page_cache::init(100);
let file = FileBlockReader::new(VirtualFile::open(path)?);
let summary_blk = file.read_blk(0)?;
let summary_blk = file.read_blk(0).await?;
let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;
let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
actual_summary.index_start_blk,

View File

@@ -643,23 +643,6 @@ impl PageServerConf {
.join(METADATA_FILE_NAME)
}
/// Files on the remote storage are stored with paths, relative to the workdir.
/// That path includes in itself both tenant and timeline ids, allowing to have a unique remote storage path.
///
/// Errors if the path provided does not start from pageserver's workdir.
pub fn remote_path(&self, local_path: &Path) -> anyhow::Result<RemotePath> {
local_path
.strip_prefix(&self.workdir)
.context("Failed to strip workdir prefix")
.and_then(RemotePath::new)
.with_context(|| {
format!(
"Failed to resolve remote part of path {:?} for base {:?}",
local_path, self.workdir
)
})
}
/// Turns storage remote path of a file into its local path.
pub fn local_path(&self, remote_path: &RemotePath) -> PathBuf {
remote_path.with_base(&self.workdir)

View File

@@ -60,11 +60,7 @@ use utils::serde_percent::Percent;
use crate::{
config::PageServerConf,
task_mgr::{self, TaskKind, BACKGROUND_RUNTIME},
tenant::{
self,
storage_layer::{AsLayerDesc, EvictionError, Layer},
Timeline,
},
tenant::{self, storage_layer::PersistentLayer, timeline::EvictionError, Timeline},
};
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
@@ -112,7 +108,7 @@ pub fn launch_disk_usage_global_eviction_task(
_ = background_jobs_barrier.wait() => { }
};
disk_usage_eviction_task(&state, task_config, &storage, &conf.tenants_path(), cancel)
disk_usage_eviction_task(&state, task_config, storage, &conf.tenants_path(), cancel)
.await;
Ok(())
},
@@ -125,7 +121,7 @@ pub fn launch_disk_usage_global_eviction_task(
async fn disk_usage_eviction_task(
state: &State,
task_config: &DiskUsageEvictionTaskConfig,
_storage: &GenericRemoteStorage,
storage: GenericRemoteStorage,
tenants_dir: &Path,
cancel: CancellationToken,
) {
@@ -149,8 +145,14 @@ async fn disk_usage_eviction_task(
let start = Instant::now();
async {
let res =
disk_usage_eviction_task_iteration(state, task_config, tenants_dir, &cancel).await;
let res = disk_usage_eviction_task_iteration(
state,
task_config,
&storage,
tenants_dir,
&cancel,
)
.await;
match res {
Ok(()) => {}
@@ -181,12 +183,13 @@ pub trait Usage: Clone + Copy + std::fmt::Debug {
async fn disk_usage_eviction_task_iteration(
state: &State,
task_config: &DiskUsageEvictionTaskConfig,
storage: &GenericRemoteStorage,
tenants_dir: &Path,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let usage_pre = filesystem_level_usage::get(tenants_dir, task_config)
.context("get filesystem-level disk usage before evictions")?;
let res = disk_usage_eviction_task_iteration_impl(state, usage_pre, cancel).await;
let res = disk_usage_eviction_task_iteration_impl(state, storage, usage_pre, cancel).await;
match res {
Ok(outcome) => {
debug!(?outcome, "disk_usage_eviction_iteration finished");
@@ -270,6 +273,7 @@ struct LayerCount {
pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
state: &State,
storage: &GenericRemoteStorage,
usage_pre: U,
cancel: &CancellationToken,
) -> anyhow::Result<IterationOutcome<U>> {
@@ -326,10 +330,9 @@ pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
// If we get far enough in the list that we start to evict layers that are below
// the tenant's min-resident-size threshold, print a warning, and memorize the disk
// usage at that point, in 'usage_planned_min_resident_size_respecting'.
let mut batched: HashMap<_, Vec<_>> = HashMap::new();
let mut batched: HashMap<_, Vec<Arc<dyn PersistentLayer>>> = HashMap::new();
let mut warned = None;
let mut usage_planned = usage_pre;
let mut max_batch_size = 0;
for (i, (partition, candidate)) in candidates.into_iter().enumerate() {
if !usage_planned.has_pressure() {
debug!(
@@ -346,15 +349,10 @@ pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
usage_planned.add_available_bytes(candidate.layer.layer_desc().file_size);
let batch = batched.entry(TimelineKey(candidate.timeline)).or_default();
// semaphore will later be used to limit eviction concurrency, and we can express at
// most u32 number of permits. unlikely we would have u32::MAX layers to be evicted,
// but fail gracefully by not making batches larger.
if batch.len() < u32::MAX as usize {
batch.push(candidate.layer);
max_batch_size = max_batch_size.max(batch.len());
}
batched
.entry(TimelineKey(candidate.timeline))
.or_default()
.push(candidate.layer);
}
let usage_planned = match warned {
@@ -371,101 +369,64 @@ pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
// phase2: evict victims batched by timeline
let mut js = tokio::task::JoinSet::new();
// ratelimit to 1k files or any higher max batch size
let limit = Arc::new(tokio::sync::Semaphore::new(1000.max(max_batch_size)));
// After the loop, `usage_assumed` is the post-eviction usage,
// according to internal accounting.
let mut usage_assumed = usage_pre;
let mut evictions_failed = LayerCount::default();
for (timeline, batch) in batched {
let tenant_id = timeline.tenant_id;
let timeline_id = timeline.timeline_id;
let batch_size =
u32::try_from(batch.len()).expect("batch size limited to u32::MAX during partitioning");
// I dislike naming of `available_permits` but it means current total amount of permits
// because permits can be added
assert!(batch_size as usize <= limit.available_permits());
let batch_size = batch.len();
debug!(%timeline_id, "evicting batch for timeline");
let evict = {
let limit = limit.clone();
let cancel = cancel.clone();
async move {
let mut evicted_bytes = 0;
let mut evictions_failed = LayerCount::default();
async {
let results = timeline.evict_layers(storage, &batch, cancel.clone()).await;
let Ok(_permit) = limit.acquire_many_owned(batch_size).await else {
// semaphore closing means cancelled
return (evicted_bytes, evictions_failed);
};
let results = timeline.evict_layers(&batch, &cancel).await;
match results {
Ok(results) => {
assert_eq!(results.len(), batch.len());
for (result, layer) in results.into_iter().zip(batch.iter()) {
let file_size = layer.layer_desc().file_size;
match result {
Some(Ok(())) => {
evicted_bytes += file_size;
}
Some(Err(EvictionError::NotFound | EvictionError::Downloaded)) => {
evictions_failed.file_sizes += file_size;
evictions_failed.count += 1;
}
None => {
assert!(cancel.is_cancelled());
}
match results {
Err(e) => {
warn!("failed to evict batch: {:#}", e);
}
Ok(results) => {
assert_eq!(results.len(), batch.len());
for (result, layer) in results.into_iter().zip(batch.iter()) {
let file_size = layer.layer_desc().file_size;
match result {
Some(Ok(())) => {
usage_assumed.add_available_bytes(file_size);
}
Some(Err(EvictionError::CannotEvictRemoteLayer)) => {
unreachable!("get_local_layers_for_disk_usage_eviction finds only local layers")
}
Some(Err(EvictionError::FileNotFound)) => {
evictions_failed.file_sizes += file_size;
evictions_failed.count += 1;
}
Some(Err(
e @ EvictionError::LayerNotFound(_)
| e @ EvictionError::StatFailed(_),
)) => {
let e = utils::error::report_compact_sources(&e);
warn!(%layer, "failed to evict layer: {e}");
evictions_failed.file_sizes += file_size;
evictions_failed.count += 1;
}
None => {
assert!(cancel.is_cancelled());
return;
}
}
}
Err(e) => {
warn!("failed to evict batch: {:#}", e);
}
}
(evicted_bytes, evictions_failed)
}
}
.instrument(tracing::info_span!("evict_batch", %tenant_id, %timeline_id, batch_size));
.instrument(tracing::info_span!("evict_batch", %tenant_id, %timeline_id, batch_size))
.await;
js.spawn(evict);
// spwaning multiple thousands of these is essentially blocking, so give already spawned a
// chance of making progress
tokio::task::yield_now().await;
}
let join_all = async move {
// After the evictions, `usage_assumed` is the post-eviction usage,
// according to internal accounting.
let mut usage_assumed = usage_pre;
let mut evictions_failed = LayerCount::default();
while let Some(res) = js.join_next().await {
match res {
Ok((evicted_bytes, failed)) => {
usage_assumed.add_available_bytes(evicted_bytes);
evictions_failed.file_sizes += failed.file_sizes;
evictions_failed.count += failed.count;
}
Err(je) if je.is_cancelled() => unreachable!("not used"),
Err(je) if je.is_panic() => { /* already logged */ }
Err(je) => tracing::error!("unknown JoinError: {je:?}"),
}
}
(usage_assumed, evictions_failed)
};
let (usage_assumed, evictions_failed) = tokio::select! {
tuple = join_all => { tuple },
_ = cancel.cancelled() => {
// close the semaphore to stop any pending acquires
limit.close();
if cancel.is_cancelled() {
return Ok(IterationOutcome::Cancelled);
}
};
}
Ok(IterationOutcome::Finished(IterationOutcomeFinished {
before: usage_pre,
@@ -480,7 +441,7 @@ pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
#[derive(Clone)]
struct EvictionCandidate {
timeline: Arc<Timeline>,
layer: Layer,
layer: Arc<dyn PersistentLayer>,
last_activity_ts: SystemTime,
}

View File

@@ -1028,7 +1028,7 @@ async fn timeline_compact_handler(
timeline
.compact(&cancel, &ctx)
.await
.map_err(|e| ApiError::InternalServerError(e.into()))?;
.map_err(ApiError::InternalServerError)?;
json_response(StatusCode::OK, ())
}
.instrument(info_span!("manual_compaction", %tenant_id, %timeline_id))
@@ -1053,7 +1053,7 @@ async fn timeline_checkpoint_handler(
timeline
.compact(&cancel, &ctx)
.await
.map_err(|e| ApiError::InternalServerError(e.into()))?;
.map_err(ApiError::InternalServerError)?;
json_response(StatusCode::OK, ())
}
@@ -1160,11 +1160,11 @@ async fn disk_usage_eviction_run(
let state = get_state(&r);
if state.remote_storage.as_ref().is_none() {
let Some(storage) = state.remote_storage.clone() else {
return Err(ApiError::InternalServerError(anyhow::anyhow!(
"remote storage not configured, cannot run eviction iteration"
)));
}
};
let state = state.disk_usage_eviction_state.clone();
@@ -1182,6 +1182,7 @@ async fn disk_usage_eviction_run(
async move {
let res = crate::disk_usage_eviction_task::disk_usage_eviction_task_iteration_impl(
&state,
&storage,
usage,
&child_cancel,
)

View File

@@ -75,10 +75,7 @@
use std::{
collections::{hash_map::Entry, HashMap},
convert::TryInto,
sync::{
atomic::{AtomicU64, AtomicU8, AtomicUsize, Ordering},
RwLock, RwLockReadGuard, RwLockWriteGuard, TryLockError,
},
sync::atomic::{AtomicU64, AtomicU8, AtomicUsize, Ordering},
};
use anyhow::Context;
@@ -162,7 +159,7 @@ struct Version {
}
struct Slot {
inner: RwLock<SlotInner>,
inner: tokio::sync::RwLock<SlotInner>,
usage_count: AtomicU8,
}
@@ -203,6 +200,11 @@ impl Slot {
Err(usage_count) => usage_count,
}
}
/// Sets the usage count to a specific value.
fn set_usage_count(&self, count: u8) {
self.usage_count.store(count, Ordering::Relaxed);
}
}
pub struct PageCache {
@@ -215,9 +217,9 @@ pub struct PageCache {
///
/// If you add support for caching different kinds of objects, each object kind
/// can have a separate mapping map, next to this field.
materialized_page_map: RwLock<HashMap<MaterializedPageHashKey, Vec<Version>>>,
materialized_page_map: std::sync::RwLock<HashMap<MaterializedPageHashKey, Vec<Version>>>,
immutable_page_map: RwLock<HashMap<(FileId, u32), usize>>,
immutable_page_map: std::sync::RwLock<HashMap<(FileId, u32), usize>>,
/// The actual buffers with their metadata.
slots: Box<[Slot]>,
@@ -233,7 +235,7 @@ pub struct PageCache {
/// PageReadGuard is a "lease" on a buffer, for reading. The page is kept locked
/// until the guard is dropped.
///
pub struct PageReadGuard<'i>(RwLockReadGuard<'i, SlotInner>);
pub struct PageReadGuard<'i>(tokio::sync::RwLockReadGuard<'i, SlotInner>);
impl std::ops::Deref for PageReadGuard<'_> {
type Target = [u8; PAGE_SZ];
@@ -260,9 +262,10 @@ impl AsRef<[u8; PAGE_SZ]> for PageReadGuard<'_> {
/// to initialize.
///
pub struct PageWriteGuard<'i> {
inner: RwLockWriteGuard<'i, SlotInner>,
inner: tokio::sync::RwLockWriteGuard<'i, SlotInner>,
// Are the page contents currently valid?
// Used to mark pages as invalid that are assigned but not yet filled with data.
valid: bool,
}
@@ -337,7 +340,7 @@ impl PageCache {
/// The 'lsn' is an upper bound, this will return the latest version of
/// the given block, but not newer than 'lsn'. Returns the actual LSN of the
/// returned page.
pub fn lookup_materialized_page(
pub async fn lookup_materialized_page(
&self,
tenant_id: TenantId,
timeline_id: TimelineId,
@@ -357,7 +360,7 @@ impl PageCache {
lsn,
};
if let Some(guard) = self.try_lock_for_read(&mut cache_key) {
if let Some(guard) = self.try_lock_for_read(&mut cache_key).await {
if let CacheKey::MaterializedPage {
hash_key: _,
lsn: available_lsn,
@@ -384,7 +387,7 @@ impl PageCache {
///
/// Store an image of the given page in the cache.
///
pub fn memorize_materialized_page(
pub async fn memorize_materialized_page(
&self,
tenant_id: TenantId,
timeline_id: TimelineId,
@@ -401,7 +404,7 @@ impl PageCache {
lsn,
};
match self.lock_for_write(&cache_key)? {
match self.lock_for_write(&cache_key).await? {
WriteBufResult::Found(write_guard) => {
// We already had it in cache. Another thread must've put it there
// concurrently. Check that it had the same contents that we
@@ -419,31 +422,14 @@ impl PageCache {
// Section 1.2: Public interface functions for working with immutable file pages.
pub fn read_immutable_buf(&self, file_id: FileId, blkno: u32) -> anyhow::Result<ReadBufResult> {
pub async fn read_immutable_buf(
&self,
file_id: FileId,
blkno: u32,
) -> anyhow::Result<ReadBufResult> {
let mut cache_key = CacheKey::ImmutableFilePage { file_id, blkno };
self.lock_for_read(&mut cache_key)
}
/// Immediately drop all buffers belonging to given file
pub fn drop_buffers_for_immutable(&self, drop_file_id: FileId) {
for slot_idx in 0..self.slots.len() {
let slot = &self.slots[slot_idx];
let mut inner = slot.inner.write().unwrap();
if let Some(key) = &inner.key {
match key {
CacheKey::ImmutableFilePage { file_id, blkno: _ }
if *file_id == drop_file_id =>
{
// remove mapping for old buffer
self.remove_mapping(key);
inner.key = None;
}
_ => {}
}
}
}
self.lock_for_read(&mut cache_key).await
}
//
@@ -463,14 +449,14 @@ impl PageCache {
///
/// If no page is found, returns None and *cache_key is left unmodified.
///
fn try_lock_for_read(&self, cache_key: &mut CacheKey) -> Option<PageReadGuard> {
async fn try_lock_for_read(&self, cache_key: &mut CacheKey) -> Option<PageReadGuard> {
let cache_key_orig = cache_key.clone();
if let Some(slot_idx) = self.search_mapping(cache_key) {
// The page was found in the mapping. Lock the slot, and re-check
// that it's still what we expected (because we released the mapping
// lock already, another thread could have evicted the page)
let slot = &self.slots[slot_idx];
let inner = slot.inner.read().unwrap();
let inner = slot.inner.read().await;
if inner.key.as_ref() == Some(cache_key) {
slot.inc_usage_count();
return Some(PageReadGuard(inner));
@@ -511,7 +497,7 @@ impl PageCache {
/// }
/// ```
///
fn lock_for_read(&self, cache_key: &mut CacheKey) -> anyhow::Result<ReadBufResult> {
async fn lock_for_read(&self, cache_key: &mut CacheKey) -> anyhow::Result<ReadBufResult> {
let (read_access, hit) = match cache_key {
CacheKey::MaterializedPage { .. } => {
unreachable!("Materialized pages use lookup_materialized_page")
@@ -526,7 +512,7 @@ impl PageCache {
let mut is_first_iteration = true;
loop {
// First check if the key already exists in the cache.
if let Some(read_guard) = self.try_lock_for_read(cache_key) {
if let Some(read_guard) = self.try_lock_for_read(cache_key).await {
if is_first_iteration {
hit.inc();
}
@@ -556,7 +542,7 @@ impl PageCache {
// Make the slot ready
let slot = &self.slots[slot_idx];
inner.key = Some(cache_key.clone());
slot.usage_count.store(1, Ordering::Relaxed);
slot.set_usage_count(1);
return Ok(ReadBufResult::NotFound(PageWriteGuard {
inner,
@@ -569,13 +555,13 @@ impl PageCache {
/// found, returns None.
///
/// When locking a page for writing, the search criteria is always "exact".
fn try_lock_for_write(&self, cache_key: &CacheKey) -> Option<PageWriteGuard> {
async fn try_lock_for_write(&self, cache_key: &CacheKey) -> Option<PageWriteGuard> {
if let Some(slot_idx) = self.search_mapping_for_write(cache_key) {
// The page was found in the mapping. Lock the slot, and re-check
// that it's still what we expected (because we don't released the mapping
// lock already, another thread could have evicted the page)
let slot = &self.slots[slot_idx];
let inner = slot.inner.write().unwrap();
let inner = slot.inner.write().await;
if inner.key.as_ref() == Some(cache_key) {
slot.inc_usage_count();
return Some(PageWriteGuard { inner, valid: true });
@@ -588,10 +574,10 @@ impl PageCache {
///
/// Similar to lock_for_read(), but the returned buffer is write-locked and
/// may be modified by the caller even if it's already found in the cache.
fn lock_for_write(&self, cache_key: &CacheKey) -> anyhow::Result<WriteBufResult> {
async fn lock_for_write(&self, cache_key: &CacheKey) -> anyhow::Result<WriteBufResult> {
loop {
// First check if the key already exists in the cache.
if let Some(write_guard) = self.try_lock_for_write(cache_key) {
if let Some(write_guard) = self.try_lock_for_write(cache_key).await {
return Ok(WriteBufResult::Found(write_guard));
}
@@ -617,7 +603,7 @@ impl PageCache {
// Make the slot ready
let slot = &self.slots[slot_idx];
inner.key = Some(cache_key.clone());
slot.usage_count.store(1, Ordering::Relaxed);
slot.set_usage_count(1);
return Ok(WriteBufResult::NotFound(PageWriteGuard {
inner,
@@ -772,7 +758,7 @@ impl PageCache {
/// Find a slot to evict.
///
/// On return, the slot is empty and write-locked.
fn find_victim(&self) -> anyhow::Result<(usize, RwLockWriteGuard<SlotInner>)> {
fn find_victim(&self) -> anyhow::Result<(usize, tokio::sync::RwLockWriteGuard<SlotInner>)> {
let iter_limit = self.slots.len() * 10;
let mut iters = 0;
loop {
@@ -784,10 +770,7 @@ impl PageCache {
if slot.dec_usage_count() == 0 {
let mut inner = match slot.inner.try_write() {
Ok(inner) => inner,
Err(TryLockError::Poisoned(err)) => {
anyhow::bail!("buffer lock was poisoned: {err:?}")
}
Err(TryLockError::WouldBlock) => {
Err(_err) => {
// If we have looped through the whole buffer pool 10 times
// and still haven't found a victim buffer, something's wrong.
// Maybe all the buffers were in locked. That could happen in
@@ -816,6 +799,8 @@ impl PageCache {
fn new(num_pages: usize) -> Self {
assert!(num_pages > 0, "page cache size must be > 0");
// We use Box::leak here and into_boxed_slice to avoid leaking uninitialized
// memory that Vec's might contain.
let page_buffer = Box::leak(vec![0u8; num_pages * PAGE_SZ].into_boxed_slice());
let size_metrics = &crate::metrics::PAGE_CACHE_SIZE;
@@ -829,7 +814,7 @@ impl PageCache {
let buf: &mut [u8; PAGE_SZ] = chunk.try_into().unwrap();
Slot {
inner: RwLock::new(SlotInner { key: None, buf }),
inner: tokio::sync::RwLock::new(SlotInner { key: None, buf }),
usage_count: AtomicU8::new(0),
}
})

View File

@@ -469,7 +469,9 @@ impl PageServerHandler {
// Create empty timeline
info!("creating new timeline");
let tenant = get_active_tenant_with_timeout(tenant_id, &ctx).await?;
let timeline = tenant.create_empty_timeline(timeline_id, base_lsn, pg_version, &ctx)?;
let timeline = tenant
.create_empty_timeline(timeline_id, base_lsn, pg_version, &ctx)
.await?;
// TODO mark timeline as not ready until it reaches end_lsn.
// We might have some wal to import as well, and we should prevent compute

View File

@@ -34,7 +34,6 @@ use std::fs;
use std::fs::File;
use std::fs::OpenOptions;
use std::io;
use std::io::Write;
use std::ops::Bound::Included;
use std::path::Path;
use std::path::PathBuf;
@@ -68,7 +67,7 @@ use crate::task_mgr;
use crate::task_mgr::TaskKind;
use crate::tenant::config::TenantConfOpt;
use crate::tenant::metadata::load_metadata;
use crate::tenant::remote_timeline_client::index::IndexPart;
pub use crate::tenant::remote_timeline_client::index::IndexPart;
use crate::tenant::remote_timeline_client::MaybeDeletedIndexPart;
use crate::tenant::storage_layer::DeltaLayer;
use crate::tenant::storage_layer::ImageLayer;
@@ -85,6 +84,7 @@ pub use pageserver_api::models::TenantState;
use toml_edit;
use utils::{
crashsafe,
generation::Generation,
id::{TenantId, TimelineId},
lsn::{Lsn, RecordLsn},
};
@@ -133,7 +133,9 @@ pub(crate) mod timeline;
pub mod size;
pub(crate) use timeline::span::debug_assert_current_span_has_tenant_and_timeline_id;
pub(crate) use timeline::{LogicalSizeCalculationCause, PageReconstructError, Timeline};
pub use timeline::{
LocalLayerInfoForDiskUsageEviction, LogicalSizeCalculationCause, PageReconstructError, Timeline,
};
// re-export for use in remote_timeline_client.rs
pub use crate::tenant::metadata::save_metadata;
@@ -176,6 +178,11 @@ pub struct Tenant {
tenant_conf: Arc<RwLock<TenantConfOpt>>,
tenant_id: TenantId,
/// The remote storage generation, used to protect S3 objects from split-brain.
/// Does not change over the lifetime of the [`Tenant`] object.
generation: Generation,
timelines: Mutex<HashMap<TimelineId, Arc<Timeline>>>,
// This mutex prevents creation of new timelines during GC.
// Adding yet another mutex (in addition to `timelines`) is needed because holding
@@ -440,6 +447,7 @@ impl Tenant {
up_to_date_metadata,
first_save,
)
.await
.context("save_metadata")?;
}
@@ -520,6 +528,7 @@ impl Tenant {
pub(crate) fn spawn_attach(
conf: &'static PageServerConf,
tenant_id: TenantId,
generation: Generation,
broker_client: storage_broker::BrokerClientChannel,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
remote_storage: GenericRemoteStorage,
@@ -536,6 +545,7 @@ impl Tenant {
tenant_conf,
wal_redo_manager,
tenant_id,
generation,
Some(remote_storage.clone()),
));
@@ -646,12 +656,8 @@ impl Tenant {
.as_ref()
.ok_or_else(|| anyhow::anyhow!("cannot attach without remote storage"))?;
let remote_timeline_ids = remote_timeline_client::list_remote_timelines(
remote_storage,
self.conf,
self.tenant_id,
)
.await?;
let remote_timeline_ids =
remote_timeline_client::list_remote_timelines(remote_storage, self.tenant_id).await?;
info!("found {} timelines", remote_timeline_ids.len());
@@ -663,6 +669,7 @@ impl Tenant {
self.conf,
self.tenant_id,
timeline_id,
self.generation,
);
part_downloads.spawn(
async move {
@@ -849,6 +856,7 @@ impl Tenant {
TenantConfOpt::default(),
wal_redo_manager,
tenant_id,
Generation::broken(),
None,
))
}
@@ -866,6 +874,7 @@ impl Tenant {
pub(crate) fn spawn_load(
conf: &'static PageServerConf,
tenant_id: TenantId,
generation: Generation,
resources: TenantSharedResources,
init_order: Option<InitializationOrder>,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
@@ -891,6 +900,7 @@ impl Tenant {
tenant_conf,
wal_redo_manager,
tenant_id,
generation,
remote_storage.clone(),
);
let tenant = Arc::new(tenant);
@@ -1440,7 +1450,7 @@ impl Tenant {
/// For tests, use `DatadirModification::init_empty_test_timeline` + `commit` to setup the
/// minimum amount of keys required to get a writable timeline.
/// (Without it, `put` might fail due to `repartition` failing.)
pub fn create_empty_timeline(
pub async fn create_empty_timeline(
&self,
new_timeline_id: TimelineId,
initdb_lsn: Lsn,
@@ -1452,10 +1462,10 @@ impl Tenant {
"Cannot create empty timelines on inactive tenant"
);
let timelines = self.timelines.lock().unwrap();
let timeline_uninit_mark = self.create_timeline_uninit_mark(new_timeline_id, &timelines)?;
drop(timelines);
let timeline_uninit_mark = {
let timelines = self.timelines.lock().unwrap();
self.create_timeline_uninit_mark(new_timeline_id, &timelines)?
};
let new_metadata = TimelineMetadata::new(
// Initialize disk_consistent LSN to 0, The caller must import some data to
// make it valid, before calling finish_creation()
@@ -1474,6 +1484,7 @@ impl Tenant {
initdb_lsn,
None,
)
.await
}
/// Helper for unit tests to create an empty timeline.
@@ -1489,7 +1500,9 @@ impl Tenant {
pg_version: u32,
ctx: &RequestContext,
) -> anyhow::Result<Arc<Timeline>> {
let uninit_tl = self.create_empty_timeline(new_timeline_id, initdb_lsn, pg_version, ctx)?;
let uninit_tl = self
.create_empty_timeline(new_timeline_id, initdb_lsn, pg_version, ctx)
.await?;
let tline = uninit_tl.raw_timeline().expect("we just created it");
assert_eq!(tline.get_last_record_lsn(), Lsn(0));
@@ -2272,6 +2285,7 @@ impl Tenant {
ancestor,
new_timeline_id,
self.tenant_id,
self.generation,
Arc::clone(&self.walredo_mgr),
resources,
pg_version,
@@ -2289,6 +2303,7 @@ impl Tenant {
tenant_conf: TenantConfOpt,
walredo_mgr: Arc<dyn WalRedoManager + Send + Sync>,
tenant_id: TenantId,
generation: Generation,
remote_storage: Option<GenericRemoteStorage>,
) -> Tenant {
let (state, mut rx) = watch::channel(state);
@@ -2347,6 +2362,7 @@ impl Tenant {
Tenant {
tenant_id,
generation,
conf,
// using now here is good enough approximation to catch tenants with really long
// activation times.
@@ -2408,67 +2424,67 @@ impl Tenant {
Ok(tenant_conf)
}
pub(super) fn persist_tenant_config(
#[tracing::instrument(skip_all, fields(%tenant_id))]
pub(super) async fn persist_tenant_config(
tenant_id: &TenantId,
target_config_path: &Path,
tenant_conf: TenantConfOpt,
creating_tenant: bool,
) -> anyhow::Result<()> {
let _enter = info_span!("saving tenantconf").entered();
// imitate a try-block with a closure
let do_persist = |target_config_path: &Path| -> anyhow::Result<()> {
let target_config_parent = target_config_path.parent().with_context(|| {
format!(
"Config path does not have a parent: {}",
target_config_path.display()
)
})?;
let do_persist = |target_config_path: PathBuf| -> _ {
async move {
let target_config_parent = target_config_path.parent().with_context(|| {
format!(
"Config path does not have a parent: {}",
target_config_path.display()
)
})?;
info!("persisting tenantconf to {}", target_config_path.display());
info!("persisting tenantconf to {}", target_config_path.display());
let mut conf_content = r#"# This file contains a specific per-tenant's config.
let mut conf_content = r#"# This file contains a specific per-tenant's config.
# It is read in case of pageserver restart.
[tenant_config]
"#
.to_string();
.to_string();
// Convert the config to a toml file.
conf_content += &toml_edit::ser::to_string(&tenant_conf)?;
// Convert the config to a toml file.
conf_content += &toml_edit::ser::to_string(&tenant_conf)?;
let mut target_config_file = VirtualFile::open_with_options(
target_config_path,
OpenOptions::new()
.truncate(true) // This needed for overwriting with small config files
.write(true)
.create_new(creating_tenant)
// when creating a new tenant, first_save will be true and `.create(true)` will be
// ignored (per rust std docs).
//
// later when updating the config of created tenant, or persisting config for the
// first time for attached tenant, the `.create(true)` is used.
.create(true),
)?;
let mut target_config_file = VirtualFile::open_with_options(
&target_config_path,
OpenOptions::new()
.truncate(true) // This needed for overwriting with small config files
.write(true)
.create_new(creating_tenant)
// when creating a new tenant, first_save will be true and `.create(true)` will be
// ignored (per rust std docs).
//
// later when updating the config of created tenant, or persisting config for the
// first time for attached tenant, the `.create(true)` is used.
.create(true),
)?;
target_config_file
.write(conf_content.as_bytes())
.context("write toml bytes into file")
.and_then(|_| target_config_file.sync_all().context("fsync config file"))
.context("write config file")?;
target_config_file
.write_and_fsync(conf_content.as_bytes())
.context("write tenant config toml bytes into file")?;
// fsync the parent directory to ensure the directory entry is durable.
// before this was done conditionally on creating_tenant, but these management actions are rare
// enough to just fsync it always.
// fsync the parent directory to ensure the directory entry is durable.
// before this was done conditionally on creating_tenant, but these management actions are rare
// enough to just fsync it always.
crashsafe::fsync(target_config_parent)?;
// XXX we're not fsyncing the parent dir, need to do that in case `creating_tenant`
Ok(())
crashsafe::fsync(target_config_parent)?;
// XXX we're not fsyncing the parent dir, need to do that in case `creating_tenant`
Result::<_, anyhow::Error>::Ok(())
}
};
let pb = target_config_path.to_owned();
// this function is called from creating the tenant and updating the tenant config, which
// would otherwise share this context, so keep it here in one place.
do_persist(target_config_path).with_context(|| {
do_persist(pb).await.with_context(|| {
format!(
"write tenant {tenant_id} config to {}",
target_config_path.display()
@@ -2784,13 +2800,15 @@ impl Tenant {
src_timeline.pg_version,
);
let uninitialized_timeline = self.prepare_new_timeline(
dst_id,
&metadata,
timeline_uninit_mark,
start_lsn + 1,
Some(Arc::clone(src_timeline)),
)?;
let uninitialized_timeline = self
.prepare_new_timeline(
dst_id,
&metadata,
timeline_uninit_mark,
start_lsn + 1,
Some(Arc::clone(src_timeline)),
)
.await?;
let new_timeline = uninitialized_timeline.finish_creation()?;
@@ -2868,13 +2886,15 @@ impl Tenant {
pgdata_lsn,
pg_version,
);
let raw_timeline = self.prepare_new_timeline(
timeline_id,
&new_metadata,
timeline_uninit_mark,
pgdata_lsn,
None,
)?;
let raw_timeline = self
.prepare_new_timeline(
timeline_id,
&new_metadata,
timeline_uninit_mark,
pgdata_lsn,
None,
)
.await?;
let tenant_id = raw_timeline.owning_tenant.tenant_id;
let unfinished_timeline = raw_timeline.raw_timeline()?;
@@ -2929,6 +2949,7 @@ impl Tenant {
self.conf,
self.tenant_id,
timeline_id,
self.generation,
);
Some(remote_client)
} else {
@@ -2944,7 +2965,7 @@ impl Tenant {
/// at 'disk_consistent_lsn'. After any initial data has been imported, call
/// `finish_creation` to insert the Timeline into the timelines map and to remove the
/// uninit mark file.
fn prepare_new_timeline(
async fn prepare_new_timeline(
&self,
new_timeline_id: TimelineId,
new_metadata: &TimelineMetadata,
@@ -2972,8 +2993,9 @@ impl Tenant {
timeline_struct.init_empty_layer_map(start_lsn);
if let Err(e) =
self.create_timeline_files(&uninit_mark.timeline_path, &new_timeline_id, new_metadata)
if let Err(e) = self
.create_timeline_files(&uninit_mark.timeline_path, &new_timeline_id, new_metadata)
.await
{
error!("Failed to create initial files for timeline {tenant_id}/{new_timeline_id}, cleaning up: {e:?}");
cleanup_timeline_directory(uninit_mark);
@@ -2989,7 +3011,7 @@ impl Tenant {
))
}
fn create_timeline_files(
async fn create_timeline_files(
&self,
timeline_path: &Path,
new_timeline_id: &TimelineId,
@@ -3008,6 +3030,7 @@ impl Tenant {
new_metadata,
true,
)
.await
.context("Failed to create timeline metadata")?;
Ok(())
}
@@ -3155,7 +3178,7 @@ pub(crate) enum CreateTenantFilesMode {
Attach,
}
pub(crate) fn create_tenant_files(
pub(crate) async fn create_tenant_files(
conf: &'static PageServerConf,
tenant_conf: TenantConfOpt,
tenant_id: &TenantId,
@@ -3191,7 +3214,8 @@ pub(crate) fn create_tenant_files(
mode,
&temporary_tenant_dir,
&target_tenant_directory,
);
)
.await;
if creation_result.is_err() {
error!("Failed to create directory structure for tenant {tenant_id}, cleaning tmp data");
@@ -3209,7 +3233,7 @@ pub(crate) fn create_tenant_files(
Ok(target_tenant_directory)
}
fn try_create_target_tenant_dir(
async fn try_create_target_tenant_dir(
conf: &'static PageServerConf,
tenant_conf: TenantConfOpt,
tenant_id: &TenantId,
@@ -3248,7 +3272,8 @@ fn try_create_target_tenant_dir(
)
.with_context(|| format!("resolve tenant {tenant_id} temporary config path"))?;
Tenant::persist_tenant_config(tenant_id, &temporary_tenant_config_path, tenant_conf, true)?;
Tenant::persist_tenant_config(tenant_id, &temporary_tenant_config_path, tenant_conf, true)
.await?;
crashsafe::create_dir(&temporary_tenant_timelines_dir).with_context(|| {
format!(
@@ -3452,6 +3477,7 @@ pub mod harness {
pub conf: &'static PageServerConf,
pub tenant_conf: TenantConf,
pub tenant_id: TenantId,
pub generation: Generation,
}
static LOG_HANDLE: OnceCell<()> = OnceCell::new();
@@ -3493,6 +3519,7 @@ pub mod harness {
conf,
tenant_conf,
tenant_id,
generation: Generation::new(0xdeadbeef),
})
}
@@ -3519,6 +3546,7 @@ pub mod harness {
TenantConfOpt::from(self.tenant_conf),
walredo_mgr,
self.tenant_id,
self.generation,
remote_storage,
));
tenant
@@ -3632,7 +3660,10 @@ mod tests {
.create_test_timeline(TIMELINE_ID, Lsn(0x10), DEFAULT_PG_VERSION, &ctx)
.await?;
match tenant.create_empty_timeline(TIMELINE_ID, Lsn(0x10), DEFAULT_PG_VERSION, &ctx) {
match tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0x10), DEFAULT_PG_VERSION, &ctx)
.await
{
Ok(_) => panic!("duplicate timeline creation should fail"),
Err(e) => assert_eq!(
e.to_string(),
@@ -4039,7 +4070,6 @@ mod tests {
#[tokio::test]
async fn delta_layer_dumping() -> anyhow::Result<()> {
use storage_layer::AsLayerDesc;
let (tenant, ctx) = TenantHarness::create("test_layer_dumping")?.load().await;
let tline = tenant
.create_test_timeline(TIMELINE_ID, Lsn(0x10), DEFAULT_PG_VERSION, &ctx)
@@ -4047,18 +4077,16 @@ mod tests {
make_some_layers(tline.as_ref(), Lsn(0x20)).await?;
let layer_map = tline.layers.read().await;
let level0_deltas = layer_map
.layer_map()
.get_level0_deltas()?
.into_iter()
.map(|desc| layer_map.get_from_desc(&desc))
.collect::<Vec<_>>();
let level0_deltas = layer_map.layer_map().get_level0_deltas()?;
assert!(!level0_deltas.is_empty());
for delta in level0_deltas {
let delta = layer_map.get_from_desc(&delta);
// Ensure we are dumping a delta layer here
assert!(delta.layer_desc().is_delta);
let delta = delta.downcast_delta_layer().unwrap();
delta.dump(false, &ctx).await.unwrap();
delta.dump(true, &ctx).await.unwrap();
}
@@ -4475,8 +4503,9 @@ mod tests {
.await;
let initdb_lsn = Lsn(0x20);
let utline =
tenant.create_empty_timeline(TIMELINE_ID, initdb_lsn, DEFAULT_PG_VERSION, &ctx)?;
let utline = tenant
.create_empty_timeline(TIMELINE_ID, initdb_lsn, DEFAULT_PG_VERSION, &ctx)
.await?;
let tline = utline.raw_timeline().unwrap();
// Spawn flush loop now so that we can set the `expect_initdb_optimization`
@@ -4541,8 +4570,9 @@ mod tests {
let harness = TenantHarness::create(name)?;
{
let (tenant, ctx) = harness.load().await;
let tline =
tenant.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION, &ctx)?;
let tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION, &ctx)
.await?;
// Keeps uninit mark in place
std::mem::forget(tline);
}

View File

@@ -33,7 +33,7 @@ impl<'a> BlockCursor<'a> {
let mut blknum = (offset / PAGE_SZ as u64) as u32;
let mut off = (offset % PAGE_SZ as u64) as usize;
let mut buf = self.read_blk(blknum)?;
let mut buf = self.read_blk(blknum).await?;
// peek at the first byte, to determine if it's a 1- or 4-byte length
let first_len_byte = buf[off];
@@ -49,7 +49,7 @@ impl<'a> BlockCursor<'a> {
// it is split across two pages
len_buf[..thislen].copy_from_slice(&buf[off..PAGE_SZ]);
blknum += 1;
buf = self.read_blk(blknum)?;
buf = self.read_blk(blknum).await?;
len_buf[thislen..].copy_from_slice(&buf[0..4 - thislen]);
off = 4 - thislen;
} else {
@@ -70,7 +70,7 @@ impl<'a> BlockCursor<'a> {
if page_remain == 0 {
// continue on next page
blknum += 1;
buf = self.read_blk(blknum)?;
buf = self.read_blk(blknum).await?;
off = 0;
page_remain = PAGE_SZ;
}
@@ -96,18 +96,12 @@ pub trait BlobWriter {
/// An implementation of BlobWriter to write blobs to anything that
/// implements std::io::Write.
///
pub struct WriteBlobWriter<W>
where
W: std::io::Write,
{
pub struct WriteBlobWriter<W> {
inner: W,
offset: u64,
}
impl<W> WriteBlobWriter<W>
where
W: std::io::Write,
{
impl<W> WriteBlobWriter<W> {
pub fn new(inner: W, start_offset: u64) -> Self {
WriteBlobWriter {
inner,

View File

@@ -7,9 +7,7 @@ use super::storage_layer::delta_layer::{Adapter, DeltaLayerInner};
use crate::page_cache::{self, PageReadGuard, ReadBufResult, PAGE_SZ};
use crate::virtual_file::VirtualFile;
use bytes::Bytes;
use std::fs::File;
use std::ops::{Deref, DerefMut};
use std::os::unix::fs::FileExt;
/// This is implemented by anything that can read 8 kB (PAGE_SZ)
/// blocks, using the page cache
@@ -39,7 +37,7 @@ pub enum BlockLease<'a> {
PageReadGuard(PageReadGuard<'static>),
EphemeralFileMutableTail(&'a [u8; PAGE_SZ]),
#[cfg(test)]
Rc(std::rc::Rc<[u8; PAGE_SZ]>),
Arc(std::sync::Arc<[u8; PAGE_SZ]>),
}
impl From<PageReadGuard<'static>> for BlockLease<'static> {
@@ -49,9 +47,9 @@ impl From<PageReadGuard<'static>> for BlockLease<'static> {
}
#[cfg(test)]
impl<'a> From<std::rc::Rc<[u8; PAGE_SZ]>> for BlockLease<'a> {
fn from(value: std::rc::Rc<[u8; PAGE_SZ]>) -> Self {
BlockLease::Rc(value)
impl<'a> From<std::sync::Arc<[u8; PAGE_SZ]>> for BlockLease<'a> {
fn from(value: std::sync::Arc<[u8; PAGE_SZ]>) -> Self {
BlockLease::Arc(value)
}
}
@@ -63,7 +61,7 @@ impl<'a> Deref for BlockLease<'a> {
BlockLease::PageReadGuard(v) => v.deref(),
BlockLease::EphemeralFileMutableTail(v) => v,
#[cfg(test)]
BlockLease::Rc(v) => v.deref(),
BlockLease::Arc(v) => v.deref(),
}
}
}
@@ -74,7 +72,6 @@ impl<'a> Deref for BlockLease<'a> {
/// Unlike traits, we also support the read function to be async though.
pub(crate) enum BlockReaderRef<'a> {
FileBlockReaderVirtual(&'a FileBlockReader<VirtualFile>),
FileBlockReaderFile(&'a FileBlockReader<std::fs::File>),
EphemeralFile(&'a EphemeralFile),
Adapter(Adapter<&'a DeltaLayerInner>),
#[cfg(test)]
@@ -83,13 +80,12 @@ pub(crate) enum BlockReaderRef<'a> {
impl<'a> BlockReaderRef<'a> {
#[inline(always)]
fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
use BlockReaderRef::*;
match self {
FileBlockReaderVirtual(r) => r.read_blk(blknum),
FileBlockReaderFile(r) => r.read_blk(blknum),
EphemeralFile(r) => r.read_blk(blknum),
Adapter(r) => r.read_blk(blknum),
FileBlockReaderVirtual(r) => r.read_blk(blknum).await,
EphemeralFile(r) => r.read_blk(blknum).await,
Adapter(r) => r.read_blk(blknum).await,
#[cfg(test)]
TestDisk(r) => r.read_blk(blknum),
}
@@ -134,8 +130,8 @@ impl<'a> BlockCursor<'a> {
/// access to the contents of the page. (For the page cache, the
/// lease object represents a lock on the buffer.)
#[inline(always)]
pub fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
self.reader.read_blk(blknum)
pub async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
self.reader.read_blk(blknum).await
}
}
@@ -150,31 +146,31 @@ pub struct FileBlockReader<F> {
file_id: page_cache::FileId,
}
impl<F> FileBlockReader<F>
where
F: FileExt,
{
pub fn new(file: F) -> Self {
impl FileBlockReader<VirtualFile> {
pub fn new(file: VirtualFile) -> Self {
let file_id = page_cache::next_file_id();
FileBlockReader { file_id, file }
}
/// Read a page from the underlying file into given buffer.
fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), std::io::Error> {
async fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), std::io::Error> {
assert!(buf.len() == PAGE_SZ);
self.file.read_exact_at(buf, blkno as u64 * PAGE_SZ as u64)
self.file
.read_exact_at(buf, blkno as u64 * PAGE_SZ as u64)
.await
}
/// Read a block.
///
/// Returns a "lease" object that can be used to
/// access to the contents of the page. (For the page cache, the
/// lease object represents a lock on the buffer.)
pub fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
pub async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
let cache = page_cache::get();
loop {
match cache
.read_immutable_buf(self.file_id, blknum)
.await
.map_err(|e| {
std::io::Error::new(
std::io::ErrorKind::Other,
@@ -184,7 +180,7 @@ where
ReadBufResult::Found(guard) => break Ok(guard.into()),
ReadBufResult::NotFound(mut write_guard) => {
// Read the page from disk into the buffer
self.fill_buffer(write_guard.deref_mut(), blknum)?;
self.fill_buffer(write_guard.deref_mut(), blknum).await?;
write_guard.mark_valid();
// Swap for read lock
@@ -195,12 +191,6 @@ where
}
}
impl BlockReader for FileBlockReader<File> {
fn block_cursor(&self) -> BlockCursor<'_> {
BlockCursor::new(BlockReaderRef::FileBlockReaderFile(self))
}
}
impl BlockReader for FileBlockReader<VirtualFile> {
fn block_cursor(&self) -> BlockCursor<'_> {
BlockCursor::new(BlockReaderRef::FileBlockReaderVirtual(self))

View File

@@ -262,7 +262,7 @@ where
let block_cursor = self.reader.block_cursor();
while let Some((node_blknum, opt_iter)) = stack.pop() {
// Locate the node.
let node_buf = block_cursor.read_blk(self.start_blk + node_blknum)?;
let node_buf = block_cursor.read_blk(self.start_blk + node_blknum).await?;
let node = OnDiskNode::deparse(node_buf.as_ref())?;
let prefix_len = node.prefix_len as usize;
@@ -357,7 +357,7 @@ where
let block_cursor = self.reader.block_cursor();
while let Some((blknum, path, depth, child_idx, key_off)) = stack.pop() {
let blk = block_cursor.read_blk(self.start_blk + blknum)?;
let blk = block_cursor.read_blk(self.start_blk + blknum).await?;
let buf: &[u8] = blk.as_ref();
let node = OnDiskNode::<L>::deparse(buf)?;
@@ -704,7 +704,7 @@ pub(crate) mod tests {
pub(crate) fn read_blk(&self, blknum: u32) -> io::Result<BlockLease> {
let mut buf = [0u8; PAGE_SZ];
buf.copy_from_slice(&self.blocks[blknum as usize]);
Ok(std::rc::Rc::new(buf).into())
Ok(std::sync::Arc::new(buf).into())
}
}
impl BlockReader for TestDisk {

View File

@@ -9,7 +9,6 @@ use std::cmp::min;
use std::fs::OpenOptions;
use std::io::{self, ErrorKind};
use std::ops::DerefMut;
use std::os::unix::prelude::FileExt;
use std::path::PathBuf;
use std::sync::atomic::AtomicU64;
use tracing::*;
@@ -61,13 +60,14 @@ impl EphemeralFile {
self.len
}
pub(crate) fn read_blk(&self, blknum: u32) -> Result<BlockLease, io::Error> {
pub(crate) async fn read_blk(&self, blknum: u32) -> Result<BlockLease, io::Error> {
let flushed_blknums = 0..self.len / PAGE_SZ as u64;
if flushed_blknums.contains(&(blknum as u64)) {
let cache = page_cache::get();
loop {
match cache
.read_immutable_buf(self.page_cache_file_id, blknum)
.await
.map_err(|e| {
std::io::Error::new(
std::io::ErrorKind::Other,
@@ -87,7 +87,8 @@ impl EphemeralFile {
let buf: &mut [u8] = write_guard.deref_mut();
debug_assert_eq!(buf.len(), PAGE_SZ);
self.file
.read_exact_at(&mut buf[..], blknum as u64 * PAGE_SZ as u64)?;
.read_exact_at(&mut buf[..], blknum as u64 * PAGE_SZ as u64)
.await?;
write_guard.mark_valid();
// Swap for read lock
@@ -127,18 +128,26 @@ impl EphemeralFile {
self.off += n;
src_remaining = &src_remaining[n..];
if self.off == PAGE_SZ {
match self.ephemeral_file.file.write_all_at(
&self.ephemeral_file.mutable_tail,
self.blknum as u64 * PAGE_SZ as u64,
) {
match self
.ephemeral_file
.file
.write_all_at(
&self.ephemeral_file.mutable_tail,
self.blknum as u64 * PAGE_SZ as u64,
)
.await
{
Ok(_) => {
// Pre-warm the page cache with what we just wrote.
// This isn't necessary for coherency/correctness, but it's how we've always done it.
let cache = page_cache::get();
match cache.read_immutable_buf(
self.ephemeral_file.page_cache_file_id,
self.blknum,
) {
match cache
.read_immutable_buf(
self.ephemeral_file.page_cache_file_id,
self.blknum,
)
.await
{
Ok(page_cache::ReadBufResult::Found(_guard)) => {
// This function takes &mut self, so, it shouldn't be possible to reach this point.
unreachable!("we just wrote blknum {} and this function takes &mut self, so, no concurrent read_blk is possible", self.blknum);
@@ -221,9 +230,8 @@ pub fn is_ephemeral_file(filename: &str) -> bool {
impl Drop for EphemeralFile {
fn drop(&mut self) {
// drop all pages from page cache
let cache = page_cache::get();
cache.drop_buffers_for_immutable(self.page_cache_file_id);
// There might still be pages in the [`crate::page_cache`] for this file.
// We leave them there, [`crate::page_cache::PageCache::find_victim`] will evict them when needed.
// unlink the file
let res = std::fs::remove_file(&self.file.path);

View File

@@ -639,10 +639,147 @@ impl LayerMap {
}
println!("historic_layers:");
for desc in self.iter_historic_layers() {
desc.dump();
for layer in self.iter_historic_layers() {
layer.dump(verbose, ctx)?;
}
println!("End dump LayerMap");
Ok(())
}
}
#[cfg(test)]
mod tests {
use super::LayerMap;
use crate::tenant::storage_layer::LayerFileName;
use std::str::FromStr;
use std::sync::Arc;
mod l0_delta_layers_updated {
use crate::tenant::{
storage_layer::{AsLayerDesc, PersistentLayerDesc},
timeline::layer_manager::LayerFileManager,
};
use super::*;
struct LayerObject(PersistentLayerDesc);
impl AsLayerDesc for LayerObject {
fn layer_desc(&self) -> &PersistentLayerDesc {
&self.0
}
}
impl LayerObject {
fn new(desc: PersistentLayerDesc) -> Self {
LayerObject(desc)
}
}
type TestLayerFileManager = LayerFileManager<LayerObject>;
#[test]
fn for_full_range_delta() {
// l0_delta_layers are used by compaction, and should observe all buffered updates
l0_delta_layers_updated_scenario(
"000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000053423C21-0000000053424D69",
true
)
}
#[test]
fn for_non_full_range_delta() {
// has minimal uncovered areas compared to l0_delta_layers_updated_on_insert_replace_remove_for_full_range_delta
l0_delta_layers_updated_scenario(
"000000000000000000000000000000000001-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFE__0000000053423C21-0000000053424D69",
// because not full range
false
)
}
#[test]
fn for_image() {
l0_delta_layers_updated_scenario(
"000000000000000000000000000000000000-000000000000000000000000000000010000__0000000053424D69",
// code only checks if it is a full range layer, doesn't care about images, which must
// mean we should in practice never have full range images
false
)
}
#[test]
fn replacing_missing_l0_is_notfound() {
// original impl had an oversight, and L0 was an anyhow::Error. anyhow::Error should
// however only happen for precondition failures.
let layer = "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000053423C21-0000000053424D69";
let layer = LayerFileName::from_str(layer).unwrap();
let layer = PersistentLayerDesc::from(layer);
// same skeletan construction; see scenario below
let not_found = Arc::new(LayerObject::new(layer.clone()));
let new_version = Arc::new(LayerObject::new(layer));
// after the immutable storage state refactor, the replace operation
// will not use layer map any more. We keep it here for consistency in test cases
// and can remove it in the future.
let _map = LayerMap::default();
let mut mapping = TestLayerFileManager::new();
mapping
.replace_and_verify(not_found, new_version)
.unwrap_err();
}
fn l0_delta_layers_updated_scenario(layer_name: &str, expected_l0: bool) {
let name = LayerFileName::from_str(layer_name).unwrap();
let skeleton = PersistentLayerDesc::from(name);
let remote = Arc::new(LayerObject::new(skeleton.clone()));
let downloaded = Arc::new(LayerObject::new(skeleton));
let mut map = LayerMap::default();
let mut mapping = LayerFileManager::new();
// two disjoint Arcs in different lifecycle phases. even if it seems they must be the
// same layer, we use LayerMap::compare_arced_layers as the identity of layers.
assert_eq!(remote.layer_desc(), downloaded.layer_desc());
let expected_in_counts = (1, usize::from(expected_l0));
map.batch_update()
.insert_historic(remote.layer_desc().clone());
mapping.insert(remote.clone());
assert_eq!(
count_layer_in(&map, remote.layer_desc()),
expected_in_counts
);
mapping
.replace_and_verify(remote, downloaded.clone())
.expect("name derived attributes are the same");
assert_eq!(
count_layer_in(&map, downloaded.layer_desc()),
expected_in_counts
);
map.batch_update().remove_historic(downloaded.layer_desc());
assert_eq!(count_layer_in(&map, downloaded.layer_desc()), (0, 0));
}
fn count_layer_in(map: &LayerMap, layer: &PersistentLayerDesc) -> (usize, usize) {
let historic = map
.iter_historic_layers()
.filter(|x| x.key() == layer.key())
.count();
let l0s = map
.get_level0_deltas()
.expect("why does this return a result");
let l0 = l0s.iter().filter(|x| x.key() == layer.key()).count();
(historic, l0)
}
}
}

View File

@@ -26,7 +26,7 @@
//! recovered from this file. This is tracked in
//! <https://github.com/neondatabase/neon/issues/4418>
use std::io::{self, Read, Write};
use std::io::{self, Write};
use crate::virtual_file::VirtualFile;
use anyhow::Result;
@@ -151,11 +151,13 @@ impl Manifest {
/// Load a manifest. Returns the manifest and a list of operations. If the manifest is corrupted,
/// the bool flag will be set to true and the user is responsible to reconstruct a new manifest and
/// backup the current one.
pub fn load(
mut file: VirtualFile,
pub async fn load(
file: VirtualFile,
) -> Result<(Self, Vec<Operation>, ManifestPartiallyCorrupted), ManifestLoadError> {
let mut buf = vec![];
file.read_to_end(&mut buf).map_err(ManifestLoadError::Io)?;
file.read_exact_at(&mut buf, 0)
.await
.map_err(ManifestLoadError::Io)?;
// Read manifest header
let mut buf = Bytes::from(buf);
@@ -241,8 +243,8 @@ mod tests {
use super::*;
#[test]
fn test_read_manifest() {
#[tokio::test]
async fn test_read_manifest() {
let testdir = crate::config::PageServerConf::test_repo_dir("test_read_manifest");
std::fs::create_dir_all(&testdir).unwrap();
let file = VirtualFile::create(&testdir.join("MANIFEST")).unwrap();
@@ -274,7 +276,7 @@ mod tests {
.truncate(false),
)
.unwrap();
let (mut manifest, operations, corrupted) = Manifest::load(file).unwrap();
let (mut manifest, operations, corrupted) = Manifest::load(file).await.unwrap();
assert!(!corrupted.0);
assert_eq!(operations.len(), 2);
assert_eq!(
@@ -306,7 +308,7 @@ mod tests {
.truncate(false),
)
.unwrap();
let (_manifest, operations, corrupted) = Manifest::load(file).unwrap();
let (_manifest, operations, corrupted) = Manifest::load(file).await.unwrap();
assert!(!corrupted.0);
assert_eq!(operations.len(), 3);
assert_eq!(&operations[0], &Operation::Snapshot(snapshot, Lsn::from(0)));

View File

@@ -9,9 +9,9 @@
//! [`remote_timeline_client`]: super::remote_timeline_client
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::io;
use anyhow::{bail, ensure, Context};
use anyhow::{ensure, Context};
use serde::{de::Error, Deserialize, Serialize, Serializer};
use thiserror::Error;
use tracing::info_span;
@@ -255,7 +255,7 @@ impl Serialize for TimelineMetadata {
}
/// Save timeline metadata to file
pub fn save_metadata(
pub async fn save_metadata(
conf: &'static PageServerConf,
tenant_id: &TenantId,
timeline_id: &TimelineId,
@@ -273,10 +273,7 @@ pub fn save_metadata(
let metadata_bytes = data.to_bytes().context("Failed to get metadata bytes")?;
if file.write(&metadata_bytes)? != metadata_bytes.len() {
bail!("Could not write all the metadata bytes in a single call");
}
file.sync_all()?;
file.write_and_fsync(&metadata_bytes)?;
// fsync the parent directory to ensure the directory entry is durable
if first_save {

View File

@@ -25,6 +25,7 @@ use crate::tenant::{create_tenant_files, CreateTenantFilesMode, Tenant, TenantSt
use crate::{InitializationOrder, IGNORED_TENANT_FILE_NAME};
use utils::fs_ext::PathExt;
use utils::generation::Generation;
use utils::id::{TenantId, TimelineId};
use super::delete::DeleteTenantError;
@@ -202,6 +203,7 @@ pub(crate) fn schedule_local_tenant_processing(
match Tenant::spawn_attach(
conf,
tenant_id,
Generation::none(),
resources.broker_client,
tenants,
remote_storage,
@@ -224,7 +226,15 @@ pub(crate) fn schedule_local_tenant_processing(
} else {
info!("tenant {tenant_id} is assumed to be loadable, starting load operation");
// Start loading the tenant into memory. It will initially be in Loading state.
Tenant::spawn_load(conf, tenant_id, resources, init_order, tenants, ctx)
Tenant::spawn_load(
conf,
tenant_id,
Generation::none(),
resources,
init_order,
tenants,
ctx,
)
};
Ok(tenant)
}
@@ -351,11 +361,11 @@ pub async fn create_tenant(
remote_storage: Option<GenericRemoteStorage>,
ctx: &RequestContext,
) -> Result<Arc<Tenant>, TenantMapInsertError> {
tenant_map_insert(tenant_id, || {
tenant_map_insert(tenant_id, || async {
// We're holding the tenants lock in write mode while doing local IO.
// If this section ever becomes contentious, introduce a new `TenantState::Creating`
// and do the work in that state.
let tenant_directory = super::create_tenant_files(conf, tenant_conf, &tenant_id, CreateTenantFilesMode::Create)?;
let tenant_directory = super::create_tenant_files(conf, tenant_conf, &tenant_id, CreateTenantFilesMode::Create).await?;
// TODO: tenant directory remains on disk if we bail out from here on.
// See https://github.com/neondatabase/neon/issues/4233
@@ -395,6 +405,7 @@ pub async fn set_new_tenant_config(
let tenant_config_path = conf.tenant_config_path(&tenant_id);
Tenant::persist_tenant_config(&tenant_id, &tenant_config_path, new_tenant_conf, false)
.await
.map_err(SetNewTenantConfigError::Persist)?;
tenant.set_new_tenant_config(new_tenant_conf);
Ok(())
@@ -515,7 +526,7 @@ pub async fn load_tenant(
remote_storage: Option<GenericRemoteStorage>,
ctx: &RequestContext,
) -> Result<(), TenantMapInsertError> {
tenant_map_insert(tenant_id, || {
tenant_map_insert(tenant_id, || async {
let tenant_path = conf.tenant_path(&tenant_id);
let tenant_ignore_mark = conf.tenant_ignore_mark_file_path(&tenant_id);
if tenant_ignore_mark.exists() {
@@ -596,8 +607,8 @@ pub async fn attach_tenant(
remote_storage: GenericRemoteStorage,
ctx: &RequestContext,
) -> Result<(), TenantMapInsertError> {
tenant_map_insert(tenant_id, || {
let tenant_dir = create_tenant_files(conf, tenant_conf, &tenant_id, CreateTenantFilesMode::Attach)?;
tenant_map_insert(tenant_id, || async {
let tenant_dir = create_tenant_files(conf, tenant_conf, &tenant_id, CreateTenantFilesMode::Attach).await?;
// TODO: tenant directory remains on disk if we bail out from here on.
// See https://github.com/neondatabase/neon/issues/4233
@@ -645,12 +656,13 @@ pub enum TenantMapInsertError {
///
/// NB: the closure should return quickly because the current implementation of tenants map
/// serializes access through an `RwLock`.
async fn tenant_map_insert<F>(
async fn tenant_map_insert<F, R>(
tenant_id: TenantId,
insert_fn: F,
) -> Result<Arc<Tenant>, TenantMapInsertError>
where
F: FnOnce() -> anyhow::Result<Arc<Tenant>>,
F: FnOnce() -> R,
R: std::future::Future<Output = anyhow::Result<Arc<Tenant>>>,
{
let mut guard = TENANTS.write().await;
let m = match &mut *guard {
@@ -663,7 +675,7 @@ where
tenant_id,
e.get().current_state(),
)),
hash_map::Entry::Vacant(v) => match insert_fn() {
hash_map::Entry::Vacant(v) => match insert_fn().await {
Ok(tenant) => {
v.insert(tenant.clone());
Ok(tenant)

View File

@@ -163,6 +163,8 @@
//! - download their remote [`IndexPart`]s
//! - create `Timeline` struct and a `RemoteTimelineClient`
//! - initialize the client's upload queue with its `IndexPart`
//! - create [`RemoteLayer`](super::storage_layer::RemoteLayer) instances
//! for layers that are referenced by `IndexPart` but not present locally
//! - schedule uploads for layers that are only present locally.
//! - if the remote `IndexPart`'s metadata was newer than the metadata in
//! the local filesystem, write the remote metadata to the local filesystem
@@ -214,7 +216,7 @@ use utils::backoff::{
};
use std::collections::{HashMap, VecDeque};
use std::path::Path;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Mutex};
@@ -231,9 +233,9 @@ use crate::metrics::{
};
use crate::task_mgr::shutdown_token;
use crate::tenant::debug_assert_current_span_has_tenant_and_timeline_id;
pub(crate) use crate::tenant::remote_timeline_client::index::LayerFileMetadata;
use crate::tenant::storage_layer::AsLayerDesc;
use crate::tenant::remote_timeline_client::index::LayerFileMetadata;
use crate::tenant::upload_queue::Delete;
use crate::tenant::TIMELINES_SEGMENT_NAME;
use crate::{
config::PageServerConf,
task_mgr,
@@ -249,8 +251,9 @@ use utils::id::{TenantId, TimelineId};
use self::index::IndexPart;
use super::storage_layer::{LayerFileName, ResidentLayer};
use super::storage_layer::LayerFileName;
use super::upload_queue::SetDeletedFlagProgress;
use super::Generation;
// Occasional network issues and such can cause remote operations to fail, and
// that's expected. If a download fails, we log it at info-level, and retry.
@@ -314,6 +317,7 @@ pub struct RemoteTimelineClient {
tenant_id: TenantId,
timeline_id: TimelineId,
generation: Generation,
upload_queue: Mutex<UploadQueue>,
@@ -334,12 +338,14 @@ impl RemoteTimelineClient {
conf: &'static PageServerConf,
tenant_id: TenantId,
timeline_id: TimelineId,
generation: Generation,
) -> RemoteTimelineClient {
RemoteTimelineClient {
conf,
runtime: BACKGROUND_RUNTIME.handle().to_owned(),
tenant_id,
timeline_id,
generation,
storage_impl: remote_storage,
upload_queue: Mutex::new(UploadQueue::Uninitialized),
metrics: Arc::new(RemoteTimelineClientMetrics::new(&tenant_id, &timeline_id)),
@@ -448,10 +454,10 @@ impl RemoteTimelineClient {
);
let index_part = download::download_index_part(
self.conf,
&self.storage_impl,
&self.tenant_id,
&self.timeline_id,
self.generation,
)
.measure_remote_op(
self.tenant_id,
@@ -597,25 +603,25 @@ impl RemoteTimelineClient {
///
/// Launch an upload operation in the background.
///
pub(crate) fn schedule_layer_file_upload(
pub fn schedule_layer_file_upload(
self: &Arc<Self>,
layer: ResidentLayer,
layer_file_name: &LayerFileName,
layer_metadata: &LayerFileMetadata,
) -> anyhow::Result<()> {
let mut guard = self.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut()?;
let metadata = LayerFileMetadata::new(layer.layer_desc().file_size);
upload_queue
.latest_files
.insert(layer.layer_desc().filename(), metadata.clone());
.insert(layer_file_name.clone(), layer_metadata.clone());
upload_queue.latest_files_changes_since_metadata_upload_scheduled += 1;
info!("scheduled layer file upload {layer}");
let op = UploadOp::UploadLayer(layer, metadata);
let op = UploadOp::UploadLayer(layer_file_name.clone(), layer_metadata.clone());
self.calls_unfinished_metric_begin(&op);
upload_queue.queued_operations.push_back(op);
info!("scheduled layer file upload {layer_file_name}");
// Launch the task immediately, if possible
self.launch_queued_tasks(upload_queue);
Ok(())
@@ -649,22 +655,41 @@ impl RemoteTimelineClient {
// from latest_files, but not yet scheduled for deletion. Use a closure
// to syntactically forbid ? or bail! calls here.
let no_bail_here = || {
for name in names {
if upload_queue.latest_files.remove(name).is_some() {
upload_queue.latest_files_changes_since_metadata_upload_scheduled += 1;
}
}
// Decorate our list of names with each name's generation, dropping
// makes that are unexpectedly missing from our metadata.
let with_generations: Vec<_> = names
.iter()
.filter_map(|name| {
// Remove from latest_files, learning the file's remote generation in the process
let meta = upload_queue.latest_files.remove(name);
if let Some(meta) = meta {
upload_queue.latest_files_changes_since_metadata_upload_scheduled += 1;
Some((name, meta.generation))
} else {
// This can only happen if we forgot to to schedule the file upload
// before scheduling the delete. Log it because it is a rare/strange
// situation, and in case something is misbehaving, we'd like to know which
// layers experienced this.
info!(
"Deleting layer {name} not found in latest_files list, never uploaded?"
);
None
}
})
.collect();
if upload_queue.latest_files_changes_since_metadata_upload_scheduled > 0 {
self.schedule_index_upload(upload_queue, metadata);
}
// schedule the actual deletions
for name in names {
for (name, generation) in with_generations {
let op = UploadOp::Delete(Delete {
file_kind: RemoteOpFileKind::Layer,
layer_file_name: name.clone(),
scheduled_from_timeline_delete: false,
generation,
});
self.calls_unfinished_metric_begin(&op);
upload_queue.queued_operations.push_back(op);
@@ -760,10 +785,10 @@ impl RemoteTimelineClient {
backoff::retry(
|| {
upload::upload_index_part(
self.conf,
&self.storage_impl,
&self.tenant_id,
&self.timeline_id,
self.generation,
&index_part_with_deleted_at,
)
},
@@ -821,12 +846,14 @@ impl RemoteTimelineClient {
.reserve(stopped.upload_queue_for_deletion.latest_files.len());
// schedule the actual deletions
for name in stopped.upload_queue_for_deletion.latest_files.keys() {
for (name, meta) in &stopped.upload_queue_for_deletion.latest_files {
let op = UploadOp::Delete(Delete {
file_kind: RemoteOpFileKind::Layer,
layer_file_name: name.clone(),
scheduled_from_timeline_delete: true,
generation: meta.generation,
});
self.calls_unfinished_metric_begin(&op);
stopped
.upload_queue_for_deletion
@@ -849,8 +876,7 @@ impl RemoteTimelineClient {
// Do not delete index part yet, it is needed for possible retry. If we remove it first
// and retry will arrive to different pageserver there wont be any traces of it on remote storage
let timeline_path = self.conf.timeline_path(&self.tenant_id, &self.timeline_id);
let timeline_storage_path = self.conf.remote_path(&timeline_path)?;
let timeline_storage_path = remote_timeline_path(&self.tenant_id, &self.timeline_id);
let remaining = backoff::retry(
|| async {
@@ -1053,13 +1079,18 @@ impl RemoteTimelineClient {
}
let upload_result: anyhow::Result<()> = match &task.op {
UploadOp::UploadLayer(ref layer, ref layer_metadata) => {
let path = layer.local_path();
UploadOp::UploadLayer(ref layer_file_name, ref layer_metadata) => {
let path = self
.conf
.timeline_path(&self.tenant_id, &self.timeline_id)
.join(layer_file_name.file_name());
upload::upload_timeline_layer(
self.conf,
&self.storage_impl,
path,
&path,
layer_metadata,
self.generation,
)
.measure_remote_op(
self.tenant_id,
@@ -1081,10 +1112,10 @@ impl RemoteTimelineClient {
};
let res = upload::upload_index_part(
self.conf,
&self.storage_impl,
&self.tenant_id,
&self.timeline_id,
self.generation,
index_part,
)
.measure_remote_op(
@@ -1109,7 +1140,7 @@ impl RemoteTimelineClient {
.conf
.timeline_path(&self.tenant_id, &self.timeline_id)
.join(delete.layer_file_name.file_name());
delete::delete_layer(self.conf, &self.storage_impl, path)
delete::delete_layer(self.conf, &self.storage_impl, path, delete.generation)
.measure_remote_op(
self.tenant_id,
self.timeline_id,
@@ -1356,6 +1387,71 @@ impl RemoteTimelineClient {
}
}
pub fn remote_timelines_path(tenant_id: &TenantId) -> RemotePath {
let path = format!("tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}");
RemotePath::from_string(&path).expect("Failed to construct path")
}
pub fn remote_timeline_path(tenant_id: &TenantId, timeline_id: &TimelineId) -> RemotePath {
remote_timelines_path(tenant_id).join(&PathBuf::from(timeline_id.to_string()))
}
pub fn remote_layer_path(
tenant_id: &TenantId,
timeline_id: &TimelineId,
layer_file_name: &LayerFileName,
layer_meta: &LayerFileMetadata,
) -> RemotePath {
// Generation-aware key format
let path = format!(
"tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}/{timeline_id}/{0}{1}",
layer_file_name.file_name(),
layer_meta.generation.get_suffix()
);
RemotePath::from_string(&path).expect("Failed to construct path")
}
pub fn remote_index_path(
tenant_id: &TenantId,
timeline_id: &TimelineId,
generation: Generation,
) -> RemotePath {
RemotePath::from_string(&format!(
"tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}/{timeline_id}/{0}{1}",
IndexPart::FILE_NAME,
generation.get_suffix()
))
.expect("Failed to construct path")
}
/// Files on the remote storage are stored with paths, relative to the workdir.
/// That path includes in itself both tenant and timeline ids, allowing to have a unique remote storage path.
///
/// Errors if the path provided does not start from pageserver's workdir.
pub fn remote_path(
conf: &PageServerConf,
local_path: &Path,
generation: Generation,
) -> anyhow::Result<RemotePath> {
let stripped = local_path
.strip_prefix(&conf.workdir)
.context("Failed to strip workdir prefix")?;
let suffixed = format!(
"{0}{1}",
stripped.to_string_lossy(),
generation.get_suffix()
);
RemotePath::new(&PathBuf::from(suffixed)).with_context(|| {
format!(
"to resolve remote part of path {:?} for base {:?}",
local_path, conf.workdir
)
})
}
#[cfg(test)]
mod tests {
use super::*;
@@ -1363,8 +1459,7 @@ mod tests {
context::RequestContext,
tenant::{
harness::{TenantHarness, TIMELINE_ID},
storage_layer::Layer,
Tenant, Timeline,
Generation, Tenant, Timeline,
},
DEFAULT_PG_VERSION,
};
@@ -1406,8 +1501,11 @@ mod tests {
assert_eq!(avec, bvec);
}
fn assert_remote_files(expected: &[&str], remote_path: &Path) {
let mut expected: Vec<String> = expected.iter().map(|x| String::from(*x)).collect();
fn assert_remote_files(expected: &[&str], remote_path: &Path, generation: Generation) {
let mut expected: Vec<String> = expected
.iter()
.map(|x| format!("{}{}", x, generation.get_suffix()))
.collect();
expected.sort();
let mut found: Vec<String> = Vec::new();
@@ -1458,6 +1556,8 @@ mod tests {
storage: RemoteStorageKind::LocalFs(remote_fs_dir.clone()),
};
let generation = Generation::new(0xdeadbeef);
let storage = GenericRemoteStorage::from_config(&storage_config).unwrap();
let client = Arc::new(RemoteTimelineClient {
@@ -1465,6 +1565,7 @@ mod tests {
runtime: tokio::runtime::Handle::current(),
tenant_id: harness.tenant_id,
timeline_id: TIMELINE_ID,
generation,
storage_impl: storage,
upload_queue: Mutex::new(UploadQueue::Uninitialized),
metrics: Arc::new(RemoteTimelineClientMetrics::new(
@@ -1504,7 +1605,7 @@ mod tests {
let TestSetup {
harness,
tenant: _tenant,
timeline,
timeline: _timeline,
tenant_ctx: _tenant_ctx,
remote_fs_dir,
client,
@@ -1523,30 +1624,35 @@ mod tests {
.init_upload_queue_for_empty_remote(&metadata)
.unwrap();
let generation = Generation::new(0xdeadbeef);
// Create a couple of dummy files, schedule upload for them
let layer_file_name_1: LayerFileName = "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap();
let layer_file_name_2: LayerFileName = "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D9-00000000016B5A52".parse().unwrap();
let layer_file_name_3: LayerFileName = "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59DA-00000000016B5A53".parse().unwrap();
let content_1 = dummy_contents("foo");
let content_2 = dummy_contents("bar");
let content_3 = dummy_contents("baz");
let layers = [
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), dummy_contents("foo")),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D9-00000000016B5A52".parse().unwrap(), dummy_contents("bar")),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59DA-00000000016B5A53".parse().unwrap(), dummy_contents("baz"))
]
.into_iter()
.map(|(name, contents): (LayerFileName, Vec<u8>)| {
std::fs::write(timeline_path.join(name.file_name()), &contents).unwrap();
Layer::for_resident(
harness.conf,
&timeline,
name,
LayerFileMetadata::new(contents.len() as u64),
)
}).collect::<Vec<_>>();
for (filename, content) in [
(&layer_file_name_1, &content_1),
(&layer_file_name_2, &content_2),
(&layer_file_name_3, &content_3),
] {
std::fs::write(timeline_path.join(filename.file_name()), content).unwrap();
}
client
.schedule_layer_file_upload(layers[0].clone())
.schedule_layer_file_upload(
&layer_file_name_1,
&LayerFileMetadata::new(content_1.len() as u64, generation),
)
.unwrap();
client
.schedule_layer_file_upload(layers[1].clone())
.schedule_layer_file_upload(
&layer_file_name_2,
&LayerFileMetadata::new(content_2.len() as u64, generation),
)
.unwrap();
// Check that they are started immediately, not queued
@@ -1599,18 +1705,21 @@ mod tests {
.map(|f| f.to_owned())
.collect(),
&[
&layers[0].layer_desc().filename().file_name(),
&layers[1].layer_desc().filename().file_name(),
&layer_file_name_1.file_name(),
&layer_file_name_2.file_name(),
],
);
assert_eq!(index_part.metadata, metadata);
// Schedule upload and then a deletion. Check that the deletion is queued
client
.schedule_layer_file_upload(layers[2].clone())
.schedule_layer_file_upload(
&layer_file_name_3,
&LayerFileMetadata::new(content_3.len() as u64, generation),
)
.unwrap();
client
.schedule_layer_file_deletion(&[layers[0].layer_desc().filename()])
.schedule_layer_file_deletion(&[layer_file_name_1.clone()])
.unwrap();
{
let mut guard = client.upload_queue.lock().unwrap();
@@ -1625,11 +1734,12 @@ mod tests {
}
assert_remote_files(
&[
&layers[0].layer_desc().filename().file_name(),
&layers[1].layer_desc().filename().file_name(),
&layer_file_name_1.file_name(),
&layer_file_name_2.file_name(),
"index_part.json",
],
&remote_timeline_dir,
generation,
);
// Finish them
@@ -1637,11 +1747,12 @@ mod tests {
assert_remote_files(
&[
&layers[1].layer_desc().filename().file_name(),
&layers[2].layer_desc().filename().file_name(),
&layer_file_name_2.file_name(),
&layer_file_name_3.file_name(),
"index_part.json",
],
&remote_timeline_dir,
generation,
);
}
@@ -1652,7 +1763,7 @@ mod tests {
let TestSetup {
harness,
tenant: _tenant,
timeline,
timeline: _timeline,
client,
..
} = TestSetup::new("metrics").await.unwrap();
@@ -1672,13 +1783,6 @@ mod tests {
)
.unwrap();
let layer_file_1 = Layer::for_resident(
harness.conf,
&timeline,
layer_file_name_1.clone(),
LayerFileMetadata::new(content_1.len() as u64),
);
#[derive(Debug, PartialEq)]
struct BytesStartedFinished {
started: Option<usize>,
@@ -1701,10 +1805,15 @@ mod tests {
// Test
let generation = Generation::new(0xdeadbeef);
let init = get_bytes_started_stopped();
client
.schedule_layer_file_upload(layer_file_1.clone())
.schedule_layer_file_upload(
&layer_file_name_1,
&LayerFileMetadata::new(content_1.len() as u64, generation),
)
.unwrap();
let pre = get_bytes_started_stopped();

View File

@@ -5,25 +5,30 @@ use tracing::debug;
use remote_storage::GenericRemoteStorage;
use crate::config::PageServerConf;
use crate::{
config::PageServerConf,
tenant::{remote_timeline_client::remote_path, Generation},
};
pub(super) async fn delete_layer<'a>(
conf: &'static PageServerConf,
storage: &'a GenericRemoteStorage,
local_layer_path: &'a Path,
generation: Generation,
) -> anyhow::Result<()> {
fail::fail_point!("before-delete-layer", |_| {
anyhow::bail!("failpoint before-delete-layer")
});
debug!("Deleting layer from remote storage: {local_layer_path:?}",);
let path_to_delete = conf.remote_path(local_layer_path)?;
let path_to_delete = remote_path(conf, local_layer_path, generation)?;
// We don't want to print an error if the delete failed if the file has
// already been deleted. Thankfully, in this situation S3 already
// does not yield an error. While OS-provided local file system APIs do yield
// errors, we avoid them in the `LocalFs` wrapper.
storage.delete(&path_to_delete).await.with_context(|| {
format!("Failed to delete remote layer from storage at {path_to_delete:?}")
})
storage
.delete(&path_to_delete)
.await
.with_context(|| format!("delete remote layer from storage at {path_to_delete:?}"))
}

View File

@@ -15,14 +15,16 @@ use tokio_util::sync::CancellationToken;
use utils::{backoff, crashsafe};
use crate::config::PageServerConf;
use crate::tenant::remote_timeline_client::{remote_layer_path, remote_timelines_path};
use crate::tenant::storage_layer::LayerFileName;
use crate::tenant::timeline::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::tenant::Generation;
use remote_storage::{DownloadError, GenericRemoteStorage};
use utils::crashsafe::path_with_suffix_extension;
use utils::id::{TenantId, TimelineId};
use super::index::{IndexPart, LayerFileMetadata};
use super::{FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_REMOTE_OP_RETRIES};
use super::{remote_index_path, FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_REMOTE_OP_RETRIES};
static MAX_DOWNLOAD_DURATION: Duration = Duration::from_secs(120);
@@ -41,13 +43,11 @@ pub async fn download_layer_file<'a>(
) -> Result<u64, DownloadError> {
debug_assert_current_span_has_tenant_and_timeline_id();
let timeline_path = conf.timeline_path(&tenant_id, &timeline_id);
let local_path = conf
.timeline_path(&tenant_id, &timeline_id)
.join(layer_file_name.file_name());
let local_path = timeline_path.join(layer_file_name.file_name());
let remote_path = conf
.remote_path(&local_path)
.map_err(DownloadError::Other)?;
let remote_path = remote_layer_path(&tenant_id, &timeline_id, layer_file_name, layer_metadata);
// Perform a rename inspired by durable_rename from file_utils.c.
// The sequence:
@@ -64,33 +64,43 @@ pub async fn download_layer_file<'a>(
let (mut destination_file, bytes_amount) = download_retry(
|| async {
// TODO: this doesn't use the cached fd for some reason?
let mut destination_file = fs::File::create(&temp_file_path).await.with_context(|| {
format!(
"create a destination file for layer '{}'",
temp_file_path.display()
)
})
.map_err(DownloadError::Other)?;
let mut download = storage.download(&remote_path).await.with_context(|| {
format!(
let mut destination_file = fs::File::create(&temp_file_path)
.await
.with_context(|| {
format!(
"create a destination file for layer '{}'",
temp_file_path.display()
)
})
.map_err(DownloadError::Other)?;
let mut download = storage
.download(&remote_path)
.await
.with_context(|| {
format!(
"open a download stream for layer with remote storage path '{remote_path:?}'"
)
})
.map_err(DownloadError::Other)?;
let bytes_amount = tokio::time::timeout(MAX_DOWNLOAD_DURATION, tokio::io::copy(&mut download.download_stream, &mut destination_file))
.await
.map_err(|e| DownloadError::Other(anyhow::anyhow!("Timed out {:?}", e)))?
.with_context(|| {
format!("Failed to download layer with remote storage path '{remote_path:?}' into file {temp_file_path:?}")
})
.map_err(DownloadError::Other)?;
Ok((destination_file, bytes_amount))
let bytes_amount = tokio::time::timeout(
MAX_DOWNLOAD_DURATION,
tokio::io::copy(&mut download.download_stream, &mut destination_file),
)
.await
.map_err(|e| DownloadError::Other(anyhow::anyhow!("Timed out {:?}", e)))?
.with_context(|| {
format!(
"download layer at remote path '{remote_path:?}' into file {temp_file_path:?}"
)
})
.map_err(DownloadError::Other)?;
Ok((destination_file, bytes_amount))
},
&format!("download {remote_path:?}"),
).await?;
)
.await?;
// Tokio doc here: https://docs.rs/tokio/1.17.0/tokio/fs/struct.File.html states that:
// A file will not be closed immediately when it goes out of scope if there are any IO operations
@@ -103,12 +113,7 @@ pub async fn download_layer_file<'a>(
destination_file
.flush()
.await
.with_context(|| {
format!(
"failed to flush source file at {}",
temp_file_path.display()
)
})
.with_context(|| format!("flush source file at {}", temp_file_path.display()))
.map_err(DownloadError::Other)?;
let expected = layer_metadata.file_size();
@@ -139,17 +144,12 @@ pub async fn download_layer_file<'a>(
fs::rename(&temp_file_path, &local_path)
.await
.with_context(|| {
format!(
"Could not rename download layer file to {}",
local_path.display(),
)
})
.with_context(|| format!("rename download layer file to {}", local_path.display(),))
.map_err(DownloadError::Other)?;
crashsafe::fsync_async(&local_path)
.await
.with_context(|| format!("Could not fsync layer file {}", local_path.display(),))
.with_context(|| format!("fsync layer file {}", local_path.display(),))
.map_err(DownloadError::Other)?;
tracing::debug!("download complete: {}", local_path.display());
@@ -173,21 +173,19 @@ pub fn is_temp_download_file(path: &Path) -> bool {
}
/// List timelines of given tenant in remote storage
pub async fn list_remote_timelines<'a>(
storage: &'a GenericRemoteStorage,
conf: &'static PageServerConf,
pub async fn list_remote_timelines(
storage: &GenericRemoteStorage,
tenant_id: TenantId,
) -> anyhow::Result<HashSet<TimelineId>> {
let tenant_path = conf.timelines_path(&tenant_id);
let tenant_storage_path = conf.remote_path(&tenant_path)?;
let remote_path = remote_timelines_path(&tenant_id);
fail::fail_point!("storage-sync-list-remote-timelines", |_| {
anyhow::bail!("storage-sync-list-remote-timelines");
});
let timelines = download_retry(
|| storage.list_prefixes(Some(&tenant_storage_path)),
&format!("list prefixes for {tenant_path:?}"),
|| storage.list_prefixes(Some(&remote_path)),
&format!("list prefixes for {tenant_id}"),
)
.await?;
@@ -202,9 +200,9 @@ pub async fn list_remote_timelines<'a>(
anyhow::anyhow!("failed to get timeline id for remote tenant {tenant_id}")
})?;
let timeline_id: TimelineId = object_name.parse().with_context(|| {
format!("failed to parse object name into timeline id '{object_name}'")
})?;
let timeline_id: TimelineId = object_name
.parse()
.with_context(|| format!("parse object name into timeline id '{object_name}'"))?;
// list_prefixes is assumed to return unique names. Ensure this here.
// NB: it's safer to bail out than warn-log this because the pageserver
@@ -222,21 +220,16 @@ pub async fn list_remote_timelines<'a>(
}
pub(super) async fn download_index_part(
conf: &'static PageServerConf,
storage: &GenericRemoteStorage,
tenant_id: &TenantId,
timeline_id: &TimelineId,
generation: Generation,
) -> Result<IndexPart, DownloadError> {
let index_part_path = conf
.metadata_path(tenant_id, timeline_id)
.with_file_name(IndexPart::FILE_NAME);
let part_storage_path = conf
.remote_path(&index_part_path)
.map_err(DownloadError::BadInput)?;
let remote_path = remote_index_path(tenant_id, timeline_id, generation);
let index_part_bytes = download_retry(
|| async {
let mut index_part_download = storage.download(&part_storage_path).await?;
let mut index_part_download = storage.download(&remote_path).await?;
let mut index_part_bytes = Vec::new();
tokio::io::copy(
@@ -244,20 +237,16 @@ pub(super) async fn download_index_part(
&mut index_part_bytes,
)
.await
.with_context(|| {
format!("Failed to download an index part into file {index_part_path:?}")
})
.with_context(|| format!("download index part at {remote_path:?}"))
.map_err(DownloadError::Other)?;
Ok(index_part_bytes)
},
&format!("download {part_storage_path:?}"),
&format!("download {remote_path:?}"),
)
.await?;
let index_part: IndexPart = serde_json::from_slice(&index_part_bytes)
.with_context(|| {
format!("Failed to deserialize index part file into file {index_part_path:?}")
})
.with_context(|| format!("download index part file at {remote_path:?}"))
.map_err(DownloadError::Other)?;
Ok(index_part)

View File

@@ -2,7 +2,7 @@
//! Able to restore itself from the storage index parts, that are located in every timeline's remote directory and contain all data about
//! remote timeline layers and its metadata.
use std::collections::{HashMap, HashSet};
use std::collections::HashMap;
use chrono::NaiveDateTime;
use serde::{Deserialize, Serialize};
@@ -12,6 +12,7 @@ use utils::bin_ser::SerializeError;
use crate::tenant::metadata::TimelineMetadata;
use crate::tenant::storage_layer::LayerFileName;
use crate::tenant::upload_queue::UploadQueueInitialized;
use crate::tenant::Generation;
use utils::lsn::Lsn;
@@ -20,22 +21,28 @@ use utils::lsn::Lsn;
/// Fields have to be `Option`s because remote [`IndexPart`]'s can be from different version, which
/// might have less or more metadata depending if upgrading or rolling back an upgrade.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
#[cfg_attr(test, derive(Default))]
//#[cfg_attr(test, derive(Default))]
pub struct LayerFileMetadata {
file_size: u64,
pub(crate) generation: Generation,
}
impl From<&'_ IndexLayerMetadata> for LayerFileMetadata {
fn from(other: &IndexLayerMetadata) -> Self {
LayerFileMetadata {
file_size: other.file_size,
generation: other.generation,
}
}
}
impl LayerFileMetadata {
pub fn new(file_size: u64) -> Self {
LayerFileMetadata { file_size }
pub fn new(file_size: u64, generation: Generation) -> Self {
LayerFileMetadata {
file_size,
generation,
}
}
pub fn file_size(&self) -> u64 {
@@ -62,10 +69,6 @@ pub struct IndexPart {
#[serde(skip_serializing_if = "Option::is_none")]
pub deleted_at: Option<NaiveDateTime>,
/// Legacy field: equal to the keys of `layer_metadata`, only written out for forward compat
#[serde(default, skip_deserializing)]
timeline_layers: HashSet<LayerFileName>,
/// Per layer file name metadata, which can be present for a present or missing layer file.
///
/// Older versions of `IndexPart` will not have this property or have only a part of metadata
@@ -91,7 +94,12 @@ impl IndexPart {
/// - 2: added `deleted_at`
/// - 3: no longer deserialize `timeline_layers` (serialized format is the same, but timeline_layers
/// is always generated from the keys of `layer_metadata`)
const LATEST_VERSION: usize = 3;
/// - 4: timeline_layers is fully removed.
const LATEST_VERSION: usize = 4;
// Versions we may see when reading from a bucket.
pub const KNOWN_VERSIONS: &[usize] = &[1, 2, 3, 4];
pub const FILE_NAME: &'static str = "index_part.json";
pub fn new(
@@ -99,24 +107,30 @@ impl IndexPart {
disk_consistent_lsn: Lsn,
metadata: TimelineMetadata,
) -> Self {
let mut timeline_layers = HashSet::with_capacity(layers_and_metadata.len());
let mut layer_metadata = HashMap::with_capacity(layers_and_metadata.len());
for (remote_name, metadata) in &layers_and_metadata {
timeline_layers.insert(remote_name.to_owned());
let metadata = IndexLayerMetadata::from(metadata);
layer_metadata.insert(remote_name.to_owned(), metadata);
}
// Transform LayerFileMetadata into IndexLayerMetadata
let layer_metadata = layers_and_metadata
.into_iter()
.map(|(k, v)| (k, IndexLayerMetadata::from(v)))
.collect();
Self {
version: Self::LATEST_VERSION,
timeline_layers,
layer_metadata,
disk_consistent_lsn,
metadata,
deleted_at: None,
}
}
pub fn get_version(&self) -> usize {
self.version
}
/// If you want this under normal operations, read it from self.metadata:
/// this method is just for the scrubber to use when validating an index.
pub fn get_disk_consistent_lsn(&self) -> Lsn {
self.disk_consistent_lsn
}
}
impl TryFrom<&UploadQueueInitialized> for IndexPart {
@@ -135,15 +149,20 @@ impl TryFrom<&UploadQueueInitialized> for IndexPart {
}
/// Serialized form of [`LayerFileMetadata`].
#[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize, Default)]
#[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize)]
pub struct IndexLayerMetadata {
pub(super) file_size: u64,
pub file_size: u64,
#[serde(default = "Generation::none")]
#[serde(skip_serializing_if = "Generation::is_none")]
pub(super) generation: Generation,
}
impl From<&'_ LayerFileMetadata> for IndexLayerMetadata {
fn from(other: &'_ LayerFileMetadata) -> Self {
impl From<LayerFileMetadata> for IndexLayerMetadata {
fn from(other: LayerFileMetadata) -> Self {
IndexLayerMetadata {
file_size: other.file_size,
generation: other.generation,
}
}
}
@@ -168,15 +187,16 @@ mod tests {
let expected = IndexPart {
// note this is not verified, could be anything, but exists for humans debugging.. could be the git version instead?
version: 1,
timeline_layers: HashSet::new(),
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
file_size: 25600000,
generation: Generation::none()
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
generation: Generation::none()
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -205,15 +225,16 @@ mod tests {
let expected = IndexPart {
// note this is not verified, could be anything, but exists for humans debugging.. could be the git version instead?
version: 1,
timeline_layers: HashSet::new(),
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
file_size: 25600000,
generation: Generation::none()
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
generation: Generation::none()
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -243,15 +264,16 @@ mod tests {
let expected = IndexPart {
// note this is not verified, could be anything, but exists for humans debugging.. could be the git version instead?
version: 2,
timeline_layers: HashSet::new(),
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
file_size: 25600000,
generation: Generation::none()
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
generation: Generation::none()
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -276,7 +298,6 @@ mod tests {
let expected = IndexPart {
version: 1,
timeline_layers: HashSet::new(),
layer_metadata: HashMap::new(),
disk_consistent_lsn: "0/2532648".parse::<Lsn>().unwrap(),
metadata: TimelineMetadata::from_bytes(&[
@@ -309,4 +330,41 @@ mod tests {
assert_eq!(empty_layers_parsed, expected);
}
#[test]
fn v4_indexpart_is_parsed() {
let example = r#"{
"version":4,
"layer_metadata":{
"000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9": { "file_size": 25600000 },
"000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51": { "file_size": 9007199254741001 }
},
"disk_consistent_lsn":"0/16960E8",
"metadata_bytes":[113,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
"deleted_at": "2023-07-31T09:00:00.123"
}"#;
let expected = IndexPart {
version: 4,
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
file_size: 25600000,
generation: Generation::none()
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
generation: Generation::none()
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
metadata: TimelineMetadata::from_bytes(&[113,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]).unwrap(),
deleted_at: Some(chrono::NaiveDateTime::parse_from_str(
"2023-07-31T09:00:00.123000000", "%Y-%m-%dT%H:%M:%S.%f").unwrap())
};
let part = serde_json::from_str::<IndexPart>(example).unwrap();
assert_eq!(part, expected);
}
}

View File

@@ -5,7 +5,11 @@ use fail::fail_point;
use std::{io::ErrorKind, path::Path};
use tokio::fs;
use crate::{config::PageServerConf, tenant::remote_timeline_client::index::IndexPart};
use super::Generation;
use crate::{
config::PageServerConf,
tenant::remote_timeline_client::{index::IndexPart, remote_index_path, remote_path},
};
use remote_storage::GenericRemoteStorage;
use utils::id::{TenantId, TimelineId};
@@ -15,10 +19,10 @@ use tracing::info;
/// Serializes and uploads the given index part data to the remote storage.
pub(super) async fn upload_index_part<'a>(
conf: &'static PageServerConf,
storage: &'a GenericRemoteStorage,
tenant_id: &TenantId,
timeline_id: &TimelineId,
generation: Generation,
index_part: &'a IndexPart,
) -> anyhow::Result<()> {
tracing::trace!("uploading new index part");
@@ -27,20 +31,16 @@ pub(super) async fn upload_index_part<'a>(
bail!("failpoint before-upload-index")
});
let index_part_bytes = serde_json::to_vec(&index_part)
.context("Failed to serialize index part file into bytes")?;
let index_part_bytes =
serde_json::to_vec(&index_part).context("serialize index part file into bytes")?;
let index_part_size = index_part_bytes.len();
let index_part_bytes = tokio::io::BufReader::new(std::io::Cursor::new(index_part_bytes));
let index_part_path = conf
.metadata_path(tenant_id, timeline_id)
.with_file_name(IndexPart::FILE_NAME);
let storage_path = conf.remote_path(&index_part_path)?;
let remote_path = remote_index_path(tenant_id, timeline_id, generation);
storage
.upload_storage_object(Box::new(index_part_bytes), index_part_size, &storage_path)
.upload_storage_object(Box::new(index_part_bytes), index_part_size, &remote_path)
.await
.with_context(|| format!("Failed to upload index part for '{tenant_id} / {timeline_id}'"))
.with_context(|| format!("upload index part for '{tenant_id} / {timeline_id}'"))
}
/// Attempts to upload given layer files.
@@ -52,12 +52,13 @@ pub(super) async fn upload_timeline_layer<'a>(
storage: &'a GenericRemoteStorage,
source_path: &'a Path,
known_metadata: &'a LayerFileMetadata,
generation: Generation,
) -> anyhow::Result<()> {
fail_point!("before-upload-layer", |_| {
bail!("failpoint before-upload-layer")
});
let storage_path = conf.remote_path(source_path)?;
let storage_path = remote_path(conf, source_path, generation)?;
let source_file_res = fs::File::open(&source_path).await;
let source_file = match source_file_res {
Ok(source_file) => source_file,
@@ -67,21 +68,18 @@ pub(super) async fn upload_timeline_layer<'a>(
// upload. However, a nonexistent file can also be indicative of
// something worse, like when a file is scheduled for upload before
// it has been written to disk yet.
//
// This is tested against `test_compaction_delete_before_upload`
info!(path = %source_path.display(), "File to upload doesn't exist. Likely the file has been deleted and an upload is not required any more.");
return Ok(());
}
Err(e) => Err(e)
.with_context(|| format!("Failed to open a source file for layer {source_path:?}"))?,
Err(e) => {
Err(e).with_context(|| format!("open a source file for layer {source_path:?}"))?
}
};
let fs_size = source_file
.metadata()
.await
.with_context(|| {
format!("Failed to get the source file metadata for layer {source_path:?}")
})?
.with_context(|| format!("get the source file metadata for layer {source_path:?}"))?
.len();
let metadata_size = known_metadata.file_size();
@@ -89,19 +87,13 @@ pub(super) async fn upload_timeline_layer<'a>(
bail!("File {source_path:?} has its current FS size {fs_size} diferent from initially determined {metadata_size}");
}
let fs_size = usize::try_from(fs_size).with_context(|| {
format!("File {source_path:?} size {fs_size} could not be converted to usize")
})?;
let fs_size = usize::try_from(fs_size)
.with_context(|| format!("convert {source_path:?} size {fs_size} usize"))?;
storage
.upload(source_file, fs_size, &storage_path, None)
.await
.with_context(|| {
format!(
"Failed to upload a layer from local path '{}'",
source_path.display()
)
})?;
.with_context(|| format!("upload layer from local path '{}'", source_path.display()))?;
Ok(())
}

View File

@@ -4,21 +4,26 @@ pub mod delta_layer;
mod filename;
mod image_layer;
mod inmemory_layer;
mod layer;
mod layer_desc;
mod remote_layer;
use crate::config::PageServerConf;
use crate::context::{AccessStatsBehavior, RequestContext};
use crate::repository::Key;
use crate::task_mgr::TaskKind;
use crate::walrecord::NeonWalRecord;
use anyhow::Result;
use bytes::Bytes;
use enum_map::EnumMap;
use enumset::EnumSet;
use once_cell::sync::Lazy;
use pageserver_api::models::LayerAccessKind;
use pageserver_api::models::{
LayerAccessKind, LayerResidenceEvent, LayerResidenceEventReason, LayerResidenceStatus,
HistoricLayerInfo, LayerResidenceEvent, LayerResidenceEventReason, LayerResidenceStatus,
};
use std::ops::Range;
use std::sync::Mutex;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use std::time::{Duration, SystemTime, UNIX_EPOCH};
use tracing::warn;
use utils::history_buffer::HistoryBufferWithDropCounter;
@@ -34,8 +39,7 @@ pub use filename::{DeltaFileName, ImageFileName, LayerFileName};
pub use image_layer::{ImageLayer, ImageLayerWriter};
pub use inmemory_layer::InMemoryLayer;
pub use layer_desc::{PersistentLayerDesc, PersistentLayerKey};
pub(crate) use layer::{EvictionError, Layer, ResidentLayer};
pub use remote_layer::RemoteLayer;
pub fn range_overlaps<T>(a: &Range<T>, b: &Range<T>) -> bool
where
@@ -70,7 +74,7 @@ pub struct ValueReconstructState {
pub img: Option<(Lsn, Bytes)>,
}
/// Return value from [`Layer::get_value_reconstruct_data`]
/// Return value from Layer::get_page_reconstruct_data
#[derive(Clone, Copy, Debug)]
pub enum ValueReconstructResult {
/// Got all the data needed to reconstruct the requested page
@@ -175,6 +179,26 @@ impl LayerAccessStats {
new
}
/// Creates a clone of `self` and records `new_status` in the clone.
///
/// The `new_status` is not recorded in `self`.
///
/// See [`record_residence_event`] for why you need to do this while holding the layer map lock.
///
/// [`record_residence_event`]: Self::record_residence_event
pub(crate) fn clone_for_residence_change(
&self,
new_status: LayerResidenceStatus,
) -> LayerAccessStats {
let clone = {
let inner = self.0.lock().unwrap();
inner.clone()
};
let new = LayerAccessStats(Mutex::new(clone));
new.record_residence_event(new_status, LayerResidenceEventReason::ResidenceChange);
new
}
/// Record a change in layer residency.
///
/// Recording the event must happen while holding the layer map lock to
@@ -297,12 +321,95 @@ impl LayerAccessStats {
}
}
/// Supertrait of the [`Layer`] trait that captures the bare minimum interface
/// required by [`LayerMap`](super::layer_map::LayerMap).
///
/// All layers should implement a minimal `std::fmt::Debug` without tenant or
/// timeline names, because those are known in the context of which the layers
/// are used in (timeline).
#[async_trait::async_trait]
pub trait Layer: std::fmt::Debug + std::fmt::Display + Send + Sync + 'static {
///
/// Return data needed to reconstruct given page at LSN.
///
/// It is up to the caller to collect more data from previous layer and
/// perform WAL redo, if necessary.
///
/// See PageReconstructResult for possible return values. The collected data
/// is appended to reconstruct_data; the caller should pass an empty struct
/// on first call, or a struct with a cached older image of the page if one
/// is available. If this returns ValueReconstructResult::Continue, look up
/// the predecessor layer and call again with the same 'reconstruct_data' to
/// collect more data.
async fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
reconstruct_data: &mut ValueReconstructState,
ctx: &RequestContext,
) -> Result<ValueReconstructResult>;
}
/// Get a layer descriptor from a layer.
pub trait AsLayerDesc {
/// Get the layer descriptor.
fn layer_desc(&self) -> &PersistentLayerDesc;
}
/// A Layer contains all data in a "rectangle" consisting of a range of keys and
/// range of LSNs.
///
/// There are two kinds of layers, in-memory and on-disk layers. In-memory
/// layers are used to ingest incoming WAL, and provide fast access to the
/// recent page versions. On-disk layers are stored as files on disk, and are
/// immutable. This trait presents the common functionality of in-memory and
/// on-disk layers.
///
/// Furthermore, there are two kinds of on-disk layers: delta and image layers.
/// A delta layer contains all modifications within a range of LSNs and keys.
/// An image layer is a snapshot of all the data in a key-range, at a single
/// LSN.
pub trait PersistentLayer: Layer + AsLayerDesc {
/// File name used for this layer, both in the pageserver's local filesystem
/// state as well as in the remote storage.
fn filename(&self) -> LayerFileName {
self.layer_desc().filename()
}
// Path to the layer file in the local filesystem.
// `None` for `RemoteLayer`.
fn local_path(&self) -> Option<PathBuf>;
/// Permanently remove this layer from disk.
fn delete_resident_layer_file(&self) -> Result<()>;
fn downcast_remote_layer(self: Arc<Self>) -> Option<std::sync::Arc<RemoteLayer>> {
None
}
fn downcast_delta_layer(self: Arc<Self>) -> Option<std::sync::Arc<DeltaLayer>> {
None
}
fn is_remote_layer(&self) -> bool {
false
}
fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo;
fn access_stats(&self) -> &LayerAccessStats;
}
pub fn downcast_remote_layer(
layer: &Arc<dyn PersistentLayer>,
) -> Option<std::sync::Arc<RemoteLayer>> {
if layer.is_remote_layer() {
Arc::clone(layer).downcast_remote_layer()
} else {
None
}
}
pub mod tests {
use super::*;
@@ -340,6 +447,19 @@ pub mod tests {
}
}
/// Helper enum to hold a PageServerConf, or a path
///
/// This is used by DeltaLayer and ImageLayer. Normally, this holds a reference to the
/// global config, and paths to layer files are constructed using the tenant/timeline
/// path from the config. But in the 'pagectl' binary, we need to construct a Layer
/// struct for a file on disk, without having a page server running, so that we have no
/// config. In that case, we use the Path variant to hold the full path to the file on
/// disk.
enum PathOrConf {
Path(PathBuf),
Conf(&'static PageServerConf),
}
/// Range wrapping newtype, which uses display to render Debug.
///
/// Useful with `Key`, which has too verbose `{:?}` for printing multiple layers.

View File

@@ -34,18 +34,19 @@ use crate::repository::{Key, Value, KEY_SIZE};
use crate::tenant::blob_io::{BlobWriter, WriteBlobWriter};
use crate::tenant::block_io::{BlockBuf, BlockCursor, BlockLease, BlockReader, FileBlockReader};
use crate::tenant::disk_btree::{DiskBtreeBuilder, DiskBtreeReader, VisitDirection};
use crate::tenant::storage_layer::{Layer, ValueReconstructResult, ValueReconstructState};
use crate::tenant::Timeline;
use crate::tenant::storage_layer::{
PersistentLayer, ValueReconstructResult, ValueReconstructState,
};
use crate::virtual_file::VirtualFile;
use crate::{walrecord, TEMP_FILE_SUFFIX};
use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION};
use anyhow::{bail, ensure, Context, Result};
use pageserver_api::models::LayerAccessKind;
use pageserver_api::models::{HistoricLayerInfo, LayerAccessKind};
use rand::{distributions::Alphanumeric, Rng};
use serde::{Deserialize, Serialize};
use std::fs::File;
use std::fs::{self, File};
use std::io::SeekFrom;
use std::io::{BufWriter, Write};
use std::io::{Seek, SeekFrom};
use std::ops::Range;
use std::os::unix::fs::FileExt;
use std::path::{Path, PathBuf};
@@ -59,7 +60,10 @@ use utils::{
lsn::Lsn,
};
use super::{AsLayerDesc, LayerAccessStats, PersistentLayerDesc, ResidentLayer};
use super::{
AsLayerDesc, DeltaFileName, Layer, LayerAccessStats, LayerAccessStatsReset, PathOrConf,
PersistentLayerDesc,
};
///
/// Header stored in the beginning of the file
@@ -179,12 +183,20 @@ impl DeltaKey {
}
}
/// This is used only from `pagectl`. Within pageserver, all layers are
/// [`crate::tenant::storage_layer::Layer`], which can hold a [`DeltaLayerInner`].
/// DeltaLayer is the in-memory data structure associated with an on-disk delta
/// file.
///
/// We keep a DeltaLayer in memory for each file, in the LayerMap. If a layer
/// is in "loaded" state, we have a copy of the index in memory, in 'inner'.
/// Otherwise the struct is just a placeholder for a file that exists on disk,
/// and it needs to be loaded before using it in queries.
pub struct DeltaLayer {
path: PathBuf,
path_or_conf: PathOrConf,
pub desc: PersistentLayerDesc,
access_stats: LayerAccessStats,
inner: OnceCell<Arc<DeltaLayerInner>>,
}
@@ -201,8 +213,6 @@ impl std::fmt::Debug for DeltaLayer {
}
}
/// `DeltaLayerInner` is the in-memory data structure associated with an on-disk delta
/// file.
pub struct DeltaLayerInner {
// values copied from summary
index_start_blk: u32,
@@ -212,6 +222,12 @@ pub struct DeltaLayerInner {
file: FileBlockReader<VirtualFile>,
}
impl AsRef<DeltaLayerInner> for DeltaLayerInner {
fn as_ref(&self) -> &DeltaLayerInner {
self
}
}
impl std::fmt::Debug for DeltaLayerInner {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("DeltaLayerInner")
@@ -221,6 +237,19 @@ impl std::fmt::Debug for DeltaLayerInner {
}
}
#[async_trait::async_trait]
impl Layer for DeltaLayer {
async fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
reconstruct_state: &mut ValueReconstructState,
ctx: &RequestContext,
) -> anyhow::Result<ValueReconstructResult> {
self.get_value_reconstruct_data(key, lsn_range, reconstruct_state, ctx)
.await
}
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
impl std::fmt::Display for DeltaLayer {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
@@ -234,9 +263,40 @@ impl AsLayerDesc for DeltaLayer {
}
}
impl PersistentLayer for DeltaLayer {
fn downcast_delta_layer(self: Arc<Self>) -> Option<std::sync::Arc<DeltaLayer>> {
Some(self)
}
fn local_path(&self) -> Option<PathBuf> {
self.local_path()
}
fn delete_resident_layer_file(&self) -> Result<()> {
self.delete_resident_layer_file()
}
fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo {
self.info(reset)
}
fn access_stats(&self) -> &LayerAccessStats {
self.access_stats()
}
}
impl DeltaLayer {
pub(crate) async fn dump(&self, verbose: bool, ctx: &RequestContext) -> Result<()> {
self.desc.dump();
println!(
"----- delta layer for ten {} tli {} keys {}-{} lsn {}-{} size {} ----",
self.desc.tenant_id,
self.desc.timeline_id,
self.desc.key_range.start,
self.desc.key_range.end,
self.desc.lsn_range.start,
self.desc.lsn_range.end,
self.desc.file_size,
);
if !verbose {
return Ok(());
@@ -244,7 +304,119 @@ impl DeltaLayer {
let inner = self.load(LayerAccessKind::Dump, ctx).await?;
inner.dump().await
println!(
"index_start_blk: {}, root {}",
inner.index_start_blk, inner.index_root_blk
);
let file = &inner.file;
let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
inner.index_start_blk,
inner.index_root_blk,
file,
);
tree_reader.dump().await?;
let keys = DeltaLayerInner::load_keys(&inner).await?;
// A subroutine to dump a single blob
async fn dump_blob(val: ValueRef<'_>) -> Result<String> {
let buf = val.reader.read_blob(val.blob_ref.pos()).await?;
let val = Value::des(&buf)?;
let desc = match val {
Value::Image(img) => {
format!(" img {} bytes", img.len())
}
Value::WalRecord(rec) => {
let wal_desc = walrecord::describe_wal_record(&rec)?;
format!(
" rec {} bytes will_init: {} {}",
buf.len(),
rec.will_init(),
wal_desc
)
}
};
Ok(desc)
}
for entry in keys {
let DeltaEntry { key, lsn, val, .. } = entry;
let desc = match dump_blob(val).await {
Ok(desc) => desc,
Err(err) => {
let err: anyhow::Error = err;
format!("ERROR: {err}")
}
};
println!(" key {key} at {lsn}: {desc}");
}
Ok(())
}
pub(crate) async fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
reconstruct_state: &mut ValueReconstructState,
ctx: &RequestContext,
) -> anyhow::Result<ValueReconstructResult> {
ensure!(lsn_range.start >= self.desc.lsn_range.start);
ensure!(self.desc.key_range.contains(&key));
let inner = self
.load(LayerAccessKind::GetValueReconstructData, ctx)
.await?;
inner
.get_value_reconstruct_data(key, lsn_range, reconstruct_state)
.await
}
pub(crate) fn local_path(&self) -> Option<PathBuf> {
Some(self.path())
}
pub(crate) fn delete_resident_layer_file(&self) -> Result<()> {
// delete underlying file
fs::remove_file(self.path())?;
Ok(())
}
pub(crate) fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo {
let layer_file_name = self.layer_desc().filename().file_name();
let lsn_range = self.layer_desc().lsn_range.clone();
let access_stats = self.access_stats.as_api_model(reset);
HistoricLayerInfo::Delta {
layer_file_name,
layer_file_size: self.desc.file_size,
lsn_start: lsn_range.start,
lsn_end: lsn_range.end,
remote: false,
access_stats,
}
}
pub(crate) fn access_stats(&self) -> &LayerAccessStats {
&self.access_stats
}
fn path_for(
path_or_conf: &PathOrConf,
tenant_id: &TenantId,
timeline_id: &TimelineId,
fname: &DeltaFileName,
) -> PathBuf {
match path_or_conf {
PathOrConf::Path(path) => path.clone(),
PathOrConf::Conf(conf) => conf
.timeline_path(tenant_id, timeline_id)
.join(fname.to_string()),
}
}
fn temp_path_for(
@@ -290,22 +462,52 @@ impl DeltaLayer {
async fn load_inner(&self) -> Result<Arc<DeltaLayerInner>> {
let path = self.path();
let loaded = DeltaLayerInner::load(&path, None)?;
let summary = match &self.path_or_conf {
PathOrConf::Conf(_) => Some(Summary::from(self)),
PathOrConf::Path(_) => None,
};
// not production code
let loaded = DeltaLayerInner::load(&path, summary).await?;
let actual_filename = self.path.file_name().unwrap().to_str().unwrap().to_owned();
let expected_filename = self.layer_desc().filename().file_name();
if let PathOrConf::Path(ref path) = self.path_or_conf {
// not production code
if actual_filename != expected_filename {
println!("warning: filename does not match what is expected from in-file summary");
println!("actual: {:?}", actual_filename);
println!("expected: {:?}", expected_filename);
let actual_filename = path.file_name().unwrap().to_str().unwrap().to_owned();
let expected_filename = self.filename().file_name();
if actual_filename != expected_filename {
println!("warning: filename does not match what is expected from in-file summary");
println!("actual: {:?}", actual_filename);
println!("expected: {:?}", expected_filename);
}
}
Ok(Arc::new(loaded))
}
/// Create a DeltaLayer struct representing an existing file on disk.
pub fn new(
conf: &'static PageServerConf,
timeline_id: TimelineId,
tenant_id: TenantId,
filename: &DeltaFileName,
file_size: u64,
access_stats: LayerAccessStats,
) -> DeltaLayer {
DeltaLayer {
path_or_conf: PathOrConf::Conf(conf),
desc: PersistentLayerDesc::new_delta(
tenant_id,
timeline_id,
filename.key_range.clone(),
filename.lsn_range.clone(),
file_size,
),
access_stats,
inner: OnceCell::new(),
}
}
/// Create a DeltaLayer struct representing an existing file on disk.
///
/// This variant is only used for debugging purposes, by the 'pagectl' binary.
@@ -320,7 +522,7 @@ impl DeltaLayer {
.context("get file metadata to determine size")?;
Ok(DeltaLayer {
path: path.to_path_buf(),
path_or_conf: PathOrConf::Path(path.to_path_buf()),
desc: PersistentLayerDesc::new_delta(
summary.tenant_id,
summary.timeline_id,
@@ -333,9 +535,29 @@ impl DeltaLayer {
})
}
/// Path to the layer file
fn path(&self) -> PathBuf {
self.path.clone()
fn layer_name(&self) -> DeltaFileName {
self.desc.delta_file_name()
}
/// Path to the layer file in pageserver workdir.
pub fn path(&self) -> PathBuf {
Self::path_for(
&self.path_or_conf,
&self.desc.tenant_id,
&self.desc.timeline_id,
&self.layer_name(),
)
}
/// Loads all keys stored in the layer. Returns key, lsn, value size and value reference.
///
/// The value can be obtained via the [`ValueRef::load`] function.
pub(crate) async fn load_keys(&self, ctx: &RequestContext) -> Result<Vec<DeltaEntry<'_>>> {
let inner = self
.load(LayerAccessKind::KeyIter, ctx)
.await
.context("load delta layer keys")?;
DeltaLayerInner::load_keys(inner)
.await
.context("Layer index is corrupted")
}
}
@@ -440,7 +662,7 @@ impl DeltaLayerWriterInner {
///
/// Finish writing the delta layer.
///
fn finish(self, key_end: Key, timeline: &Arc<Timeline>) -> anyhow::Result<ResidentLayer> {
fn finish(self, key_end: Key) -> anyhow::Result<DeltaLayer> {
let index_start_blk =
((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32;
@@ -486,21 +708,37 @@ impl DeltaLayerWriterInner {
// Note: Because we opened the file in write-only mode, we cannot
// reuse the same VirtualFile for reading later. That's why we don't
// set inner.file here. The first read will have to re-open it.
let desc = PersistentLayerDesc::new_delta(
self.tenant_id,
self.timeline_id,
self.key_start..key_end,
self.lsn_range.clone(),
metadata.len(),
);
let layer = DeltaLayer {
path_or_conf: PathOrConf::Conf(self.conf),
desc: PersistentLayerDesc::new_delta(
self.tenant_id,
self.timeline_id,
self.key_start..key_end,
self.lsn_range.clone(),
metadata.len(),
),
access_stats: LayerAccessStats::empty_will_record_residence_event_later(),
inner: OnceCell::new(),
};
// fsync the file
file.sync_all()?;
// Rename the file to its final name
//
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let final_path = DeltaLayer::path_for(
&PathOrConf::Conf(self.conf),
&self.tenant_id,
&self.timeline_id,
&DeltaFileName {
key_range: self.key_start..key_end,
lsn_range: self.lsn_range,
},
);
std::fs::rename(self.path, &final_path)?;
let layer = Layer::finish_creating(self.conf, timeline, desc, &self.path)?;
trace!("created delta layer {}", layer.local_path().display());
trace!("created delta layer {}", final_path.display());
Ok(layer)
}
@@ -583,12 +821,8 @@ impl DeltaLayerWriter {
///
/// Finish writing the delta layer.
///
pub(crate) fn finish(
mut self,
key_end: Key,
timeline: &Arc<Timeline>,
) -> anyhow::Result<ResidentLayer> {
self.inner.take().unwrap().finish(key_end, timeline)
pub fn finish(mut self, key_end: Key) -> anyhow::Result<DeltaLayer> {
self.inner.take().unwrap().finish(key_end)
}
}
@@ -607,12 +841,15 @@ impl Drop for DeltaLayerWriter {
}
impl DeltaLayerInner {
pub(super) fn load(path: &std::path::Path, summary: Option<Summary>) -> anyhow::Result<Self> {
pub(super) async fn load(
path: &std::path::Path,
summary: Option<Summary>,
) -> anyhow::Result<Self> {
let file = VirtualFile::open(path)
.with_context(|| format!("Failed to open file '{}'", path.display()))?;
let file = FileBlockReader::new(file);
let summary_blk = file.read_blk(0)?;
let summary_blk = file.read_blk(0).await?;
let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;
if let Some(mut expected_summary) = summary {
@@ -715,14 +952,14 @@ impl DeltaLayerInner {
}
}
pub(super) async fn load_keys(&self) -> Result<Vec<DeltaEntry<'_>>> {
let file = &self.file;
pub(super) async fn load_keys<T: AsRef<DeltaLayerInner> + Clone>(
this: &T,
) -> Result<Vec<DeltaEntry<'_>>> {
let dl = this.as_ref();
let file = &dl.file;
let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
self.index_start_blk,
self.index_root_blk,
file,
);
let tree_reader =
DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(dl.index_start_blk, dl.index_root_blk, file);
let mut all_keys: Vec<DeltaEntry<'_>> = Vec::new();
@@ -735,7 +972,7 @@ impl DeltaLayerInner {
let val_ref = ValueRef {
blob_ref: BlobRef(value),
reader: BlockCursor::new(crate::tenant::block_io::BlockReaderRef::Adapter(
Adapter(self),
Adapter(dl),
)),
};
let pos = BlobRef(value).pos();
@@ -759,61 +996,10 @@ impl DeltaLayerInner {
if let Some(last) = all_keys.last_mut() {
// Last key occupies all space till end of value storage,
// which corresponds to beginning of the index
last.size = self.index_start_blk as u64 * PAGE_SZ as u64 - last.size;
last.size = dl.index_start_blk as u64 * PAGE_SZ as u64 - last.size;
}
Ok(all_keys)
}
pub(super) async fn dump(&self) -> anyhow::Result<()> {
println!(
"index_start_blk: {}, root {}",
self.index_start_blk, self.index_root_blk
);
let file = &self.file;
let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
self.index_start_blk,
self.index_root_blk,
file,
);
tree_reader.dump().await?;
let keys = self.load_keys().await?;
async fn dump_blob(val: ValueRef<'_>) -> anyhow::Result<String> {
let buf = val.reader.read_blob(val.blob_ref.pos()).await?;
let val = Value::des(&buf)?;
let desc = match val {
Value::Image(img) => {
format!(" img {} bytes", img.len())
}
Value::WalRecord(rec) => {
let wal_desc = walrecord::describe_wal_record(&rec)?;
format!(
" rec {} bytes will_init: {} {}",
buf.len(),
rec.will_init(),
wal_desc
)
}
};
Ok(desc)
}
for entry in keys {
let DeltaEntry { key, lsn, val, .. } = entry;
let desc = match dump_blob(val).await {
Ok(desc) => desc,
Err(err) => {
format!("ERROR: {err}")
}
};
println!(" key {key} at {lsn}: {desc}");
}
Ok(())
}
}
/// A set of data associated with a delta layer key and its value
@@ -845,13 +1031,7 @@ impl<'a> ValueRef<'a> {
pub(crate) struct Adapter<T>(T);
impl<T: AsRef<DeltaLayerInner>> Adapter<T> {
pub(crate) fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
self.0.as_ref().file.read_blk(blknum)
}
}
impl AsRef<DeltaLayerInner> for DeltaLayerInner {
fn as_ref(&self) -> &DeltaLayerInner {
self
pub(crate) async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
self.0.as_ref().file.read_blk(blknum).await
}
}

View File

@@ -212,7 +212,7 @@ pub enum LayerFileName {
}
impl LayerFileName {
pub(crate) fn file_name(&self) -> String {
pub fn file_name(&self) -> String {
self.to_string()
}

View File

@@ -31,24 +31,22 @@ use crate::tenant::blob_io::{BlobWriter, WriteBlobWriter};
use crate::tenant::block_io::{BlockBuf, BlockReader, FileBlockReader};
use crate::tenant::disk_btree::{DiskBtreeBuilder, DiskBtreeReader, VisitDirection};
use crate::tenant::storage_layer::{
LayerAccessStats, ValueReconstructResult, ValueReconstructState,
LayerAccessStats, PersistentLayer, ValueReconstructResult, ValueReconstructState,
};
use crate::tenant::Timeline;
use crate::virtual_file::VirtualFile;
use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION, TEMP_FILE_SUFFIX};
use anyhow::{bail, ensure, Context, Result};
use bytes::Bytes;
use hex;
use pageserver_api::models::LayerAccessKind;
use pageserver_api::models::{HistoricLayerInfo, LayerAccessKind};
use rand::{distributions::Alphanumeric, Rng};
use serde::{Deserialize, Serialize};
use std::fs::File;
use std::fs::{self, File};
use std::io::SeekFrom;
use std::io::Write;
use std::io::{Seek, SeekFrom};
use std::ops::Range;
use std::os::unix::prelude::FileExt;
use std::path::{Path, PathBuf};
use std::sync::Arc;
use tokio::sync::OnceCell;
use tracing::*;
@@ -59,7 +57,7 @@ use utils::{
};
use super::filename::ImageFileName;
use super::{AsLayerDesc, Layer, PersistentLayerDesc, ResidentLayer};
use super::{AsLayerDesc, Layer, LayerAccessStatsReset, PathOrConf, PersistentLayerDesc};
///
/// Header stored in the beginning of the file
@@ -117,14 +115,22 @@ impl Summary {
}
}
/// This is used only from `pagectl`. Within pageserver, all layers are
/// [`crate::tenant::storage_layer::Layer`], which can hold an [`ImageLayerInner`].
/// ImageLayer is the in-memory data structure associated with an on-disk image
/// file.
///
/// We keep an ImageLayer in memory for each file, in the LayerMap. If a layer
/// is in "loaded" state, we have a copy of the index in memory, in 'inner'.
/// Otherwise the struct is just a placeholder for a file that exists on disk,
/// and it needs to be loaded before using it in queries.
pub struct ImageLayer {
path: PathBuf,
path_or_conf: PathOrConf,
pub desc: PersistentLayerDesc,
// This entry contains an image of all pages as of this LSN, should be the same as desc.lsn
pub lsn: Lsn,
access_stats: LayerAccessStats,
inner: OnceCell<ImageLayerInner>,
}
@@ -141,8 +147,6 @@ impl std::fmt::Debug for ImageLayer {
}
}
/// ImageLayer is the in-memory data structure associated with an on-disk image
/// file.
pub struct ImageLayerInner {
// values copied from summary
index_start_blk: u32,
@@ -163,22 +167,18 @@ impl std::fmt::Debug for ImageLayerInner {
}
}
impl ImageLayerInner {
pub(super) async fn dump(&self) -> anyhow::Result<()> {
let file = &self.file;
let tree_reader =
DiskBtreeReader::<_, KEY_SIZE>::new(self.index_start_blk, self.index_root_blk, file);
tree_reader.dump().await?;
tree_reader
.visit(&[0u8; KEY_SIZE], VisitDirection::Forwards, |key, value| {
println!("key: {} offset {}", hex::encode(key), value);
true
})
.await?;
Ok(())
#[async_trait::async_trait]
impl Layer for ImageLayer {
/// Look up given page in the file
async fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
reconstruct_state: &mut ValueReconstructState,
ctx: &RequestContext,
) -> anyhow::Result<ValueReconstructResult> {
self.get_value_reconstruct_data(key, lsn_range, reconstruct_state, ctx)
.await
}
}
@@ -195,21 +195,120 @@ impl AsLayerDesc for ImageLayer {
}
}
impl PersistentLayer for ImageLayer {
fn local_path(&self) -> Option<PathBuf> {
self.local_path()
}
fn delete_resident_layer_file(&self) -> Result<()> {
self.delete_resident_layer_file()
}
fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo {
self.info(reset)
}
fn access_stats(&self) -> &LayerAccessStats {
self.access_stats()
}
}
impl ImageLayer {
pub(crate) async fn dump(&self, verbose: bool, ctx: &RequestContext) -> Result<()> {
self.desc.dump();
println!(
"----- image layer for ten {} tli {} key {}-{} at {} is_incremental {} size {} ----",
self.desc.tenant_id,
self.desc.timeline_id,
self.desc.key_range.start,
self.desc.key_range.end,
self.lsn,
self.desc.is_incremental(),
self.desc.file_size
);
if !verbose {
return Ok(());
}
let inner = self.load(LayerAccessKind::Dump, ctx).await?;
let file = &inner.file;
let tree_reader =
DiskBtreeReader::<_, KEY_SIZE>::new(inner.index_start_blk, inner.index_root_blk, file);
inner.dump().await?;
tree_reader.dump().await?;
tree_reader
.visit(&[0u8; KEY_SIZE], VisitDirection::Forwards, |key, value| {
println!("key: {} offset {}", hex::encode(key), value);
true
})
.await?;
Ok(())
}
pub(crate) async fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
reconstruct_state: &mut ValueReconstructState,
ctx: &RequestContext,
) -> anyhow::Result<ValueReconstructResult> {
assert!(self.desc.key_range.contains(&key));
assert!(lsn_range.start >= self.lsn);
assert!(lsn_range.end >= self.lsn);
let inner = self
.load(LayerAccessKind::GetValueReconstructData, ctx)
.await?;
inner
.get_value_reconstruct_data(key, reconstruct_state)
.await
// FIXME: makes no sense to dump paths
.with_context(|| format!("read {}", self.path().display()))
}
pub(crate) fn local_path(&self) -> Option<PathBuf> {
Some(self.path())
}
pub(crate) fn delete_resident_layer_file(&self) -> Result<()> {
// delete underlying file
fs::remove_file(self.path())?;
Ok(())
}
pub(crate) fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo {
let layer_file_name = self.layer_desc().filename().file_name();
let lsn_start = self.layer_desc().image_layer_lsn();
HistoricLayerInfo::Image {
layer_file_name,
layer_file_size: self.desc.file_size,
lsn_start,
remote: false,
access_stats: self.access_stats.as_api_model(reset),
}
}
pub(crate) fn access_stats(&self) -> &LayerAccessStats {
&self.access_stats
}
fn path_for(
path_or_conf: &PathOrConf,
timeline_id: TimelineId,
tenant_id: TenantId,
fname: &ImageFileName,
) -> PathBuf {
match path_or_conf {
PathOrConf::Path(path) => path.to_path_buf(),
PathOrConf::Conf(conf) => conf
.timeline_path(&tenant_id, &timeline_id)
.join(fname.to_string()),
}
}
fn temp_path_for(
conf: &PageServerConf,
timeline_id: TimelineId,
@@ -245,21 +344,53 @@ impl ImageLayer {
async fn load_inner(&self) -> Result<ImageLayerInner> {
let path = self.path();
let loaded = ImageLayerInner::load(&path, self.desc.image_layer_lsn(), None)?;
let expected_summary = match &self.path_or_conf {
PathOrConf::Conf(_) => Some(Summary::from(self)),
PathOrConf::Path(_) => None,
};
// not production code
let actual_filename = self.path.file_name().unwrap().to_str().unwrap().to_owned();
let expected_filename = self.layer_desc().filename().file_name();
let loaded =
ImageLayerInner::load(&path, self.desc.image_layer_lsn(), expected_summary).await?;
if actual_filename != expected_filename {
println!("warning: filename does not match what is expected from in-file summary");
println!("actual: {:?}", actual_filename);
println!("expected: {:?}", expected_filename);
if let PathOrConf::Path(ref path) = self.path_or_conf {
// not production code
let actual_filename = path.file_name().unwrap().to_str().unwrap().to_owned();
let expected_filename = self.filename().file_name();
if actual_filename != expected_filename {
println!("warning: filename does not match what is expected from in-file summary");
println!("actual: {:?}", actual_filename);
println!("expected: {:?}", expected_filename);
}
}
Ok(loaded)
}
/// Create an ImageLayer struct representing an existing file on disk
pub fn new(
conf: &'static PageServerConf,
timeline_id: TimelineId,
tenant_id: TenantId,
filename: &ImageFileName,
file_size: u64,
access_stats: LayerAccessStats,
) -> ImageLayer {
ImageLayer {
path_or_conf: PathOrConf::Conf(conf),
desc: PersistentLayerDesc::new_img(
tenant_id,
timeline_id,
filename.key_range.clone(),
filename.lsn,
file_size,
), // Now we assume image layer ALWAYS covers the full range. This may change in the future.
lsn: filename.lsn,
access_stats,
inner: OnceCell::new(),
}
}
/// Create an ImageLayer struct representing an existing file on disk.
///
/// This variant is only used for debugging purposes, by the 'pagectl' binary.
@@ -272,7 +403,7 @@ impl ImageLayer {
.metadata()
.context("get file metadata to determine size")?;
Ok(ImageLayer {
path: path.to_path_buf(),
path_or_conf: PathOrConf::Path(path.to_path_buf()),
desc: PersistentLayerDesc::new_img(
summary.tenant_id,
summary.timeline_id,
@@ -286,14 +417,23 @@ impl ImageLayer {
})
}
fn layer_name(&self) -> ImageFileName {
self.desc.image_file_name()
}
/// Path to the layer file in pageserver workdir.
fn path(&self) -> PathBuf {
self.path.clone()
pub fn path(&self) -> PathBuf {
Self::path_for(
&self.path_or_conf,
self.desc.timeline_id,
self.desc.tenant_id,
&self.layer_name(),
)
}
}
impl ImageLayerInner {
pub(super) fn load(
pub(super) async fn load(
path: &std::path::Path,
lsn: Lsn,
summary: Option<Summary>,
@@ -301,7 +441,7 @@ impl ImageLayerInner {
let file = VirtualFile::open(path)
.with_context(|| format!("Failed to open file '{}'", path.display()))?;
let file = FileBlockReader::new(file);
let summary_blk = file.read_blk(0)?;
let summary_blk = file.read_blk(0).await?;
let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;
if let Some(mut expected_summary) = summary {
@@ -443,7 +583,7 @@ impl ImageLayerWriterInner {
///
/// Finish writing the image layer.
///
fn finish(self, timeline: &Arc<Timeline>) -> anyhow::Result<ResidentLayer> {
fn finish(self) -> anyhow::Result<ImageLayer> {
let index_start_blk =
((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32;
@@ -485,13 +625,33 @@ impl ImageLayerWriterInner {
// Note: Because we open the file in write-only mode, we cannot
// reuse the same VirtualFile for reading later. That's why we don't
// set inner.file here. The first read will have to re-open it.
let layer = ImageLayer {
path_or_conf: PathOrConf::Conf(self.conf),
desc,
lsn: self.lsn,
access_stats: LayerAccessStats::empty_will_record_residence_event_later(),
inner: OnceCell::new(),
};
// fsync the file
file.sync_all()?;
let layer = Layer::finish_creating(self.conf, timeline, desc, &self.path)?;
// Rename the file to its final name
//
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let final_path = ImageLayer::path_for(
&PathOrConf::Conf(self.conf),
self.timeline_id,
self.tenant_id,
&ImageFileName {
key_range: self.key_range.clone(),
lsn: self.lsn,
},
);
std::fs::rename(self.path, final_path)?;
trace!("created image layer {}", layer.local_path().display());
trace!("created image layer {}", layer.path().display());
Ok(layer)
}
@@ -557,11 +717,8 @@ impl ImageLayerWriter {
///
/// Finish writing the image layer.
///
pub(crate) fn finish(
mut self,
timeline: &Arc<Timeline>,
) -> anyhow::Result<super::ResidentLayer> {
self.inner.take().unwrap().finish(timeline)
pub fn finish(mut self) -> anyhow::Result<ImageLayer> {
self.inner.take().unwrap().finish()
}
}

View File

@@ -10,12 +10,11 @@ use crate::repository::{Key, Value};
use crate::tenant::block_io::BlockReader;
use crate::tenant::ephemeral_file::EphemeralFile;
use crate::tenant::storage_layer::{ValueReconstructResult, ValueReconstructState};
use crate::tenant::Timeline;
use crate::walrecord;
use anyhow::{ensure, Result};
use pageserver_api::models::InMemoryLayerInfo;
use std::collections::HashMap;
use std::sync::{Arc, OnceLock};
use std::sync::OnceLock;
use tracing::*;
use utils::{
bin_ser::BeSer,
@@ -29,7 +28,7 @@ use std::fmt::Write as _;
use std::ops::Range;
use tokio::sync::RwLock;
use super::{DeltaLayerWriter, ResidentLayer};
use super::{DeltaLayer, DeltaLayerWriter, Layer};
pub struct InMemoryLayer {
conf: &'static PageServerConf,
@@ -204,6 +203,20 @@ impl InMemoryLayer {
}
}
#[async_trait::async_trait]
impl Layer for InMemoryLayer {
async fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
reconstruct_data: &mut ValueReconstructState,
ctx: &RequestContext,
) -> Result<ValueReconstructResult> {
self.get_value_reconstruct_data(key, lsn_range, reconstruct_data, ctx)
.await
}
}
impl std::fmt::Display for InMemoryLayer {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
let end_lsn = self.end_lsn_or_max();
@@ -212,13 +225,17 @@ impl std::fmt::Display for InMemoryLayer {
}
impl InMemoryLayer {
///
/// Get layer size.
///
pub async fn size(&self) -> Result<u64> {
let inner = self.inner.read().await;
Ok(inner.file.len())
}
///
/// Create a new, empty, in-memory layer
///
pub fn create(
conf: &'static PageServerConf,
timeline_id: TimelineId,
@@ -296,7 +313,7 @@ impl InMemoryLayer {
/// Write this frozen in-memory layer to disk.
///
/// Returns a new delta layer with all the same data as this in-memory layer
pub(crate) async fn write_to_disk(&self, timeline: &Arc<Timeline>) -> Result<ResidentLayer> {
pub(crate) async fn write_to_disk(&self) -> Result<DeltaLayer> {
// Grab the lock in read-mode. We hold it over the I/O, but because this
// layer is not writeable anymore, no one should be trying to acquire the
// write lock on it, so we shouldn't block anyone. There's one exception
@@ -335,7 +352,7 @@ impl InMemoryLayer {
}
}
let delta_layer = delta_layer_writer.finish(Key::MAX, timeline)?;
let delta_layer = delta_layer_writer.finish(Key::MAX)?;
Ok(delta_layer)
}
}

File diff suppressed because it is too large Load Diff

View File

@@ -1,3 +1,4 @@
use anyhow::Result;
use core::fmt::Display;
use std::ops::Range;
use utils::{
@@ -5,7 +6,7 @@ use utils::{
lsn::Lsn,
};
use crate::repository::Key;
use crate::{context::RequestContext, repository::Key};
use super::{DeltaFileName, ImageFileName, LayerFileName};
@@ -99,22 +100,6 @@ impl PersistentLayerDesc {
}
}
pub fn from_filename(
tenant_id: TenantId,
timeline_id: TimelineId,
filename: LayerFileName,
file_size: u64,
) -> Self {
match filename {
LayerFileName::Image(i) => {
Self::new_img(tenant_id, timeline_id, i.key_range, i.lsn, file_size)
}
LayerFileName::Delta(d) => {
Self::new_delta(tenant_id, timeline_id, d.key_range, d.lsn_range, file_size)
}
}
}
/// Get the LSN that the image layer covers.
pub fn image_layer_lsn(&self) -> Lsn {
assert!(!self.is_delta);
@@ -188,31 +173,21 @@ impl PersistentLayerDesc {
self.is_delta
}
pub fn dump(&self) {
if self.is_delta {
println!(
"----- delta layer for ten {} tli {} keys {}-{} lsn {}-{} is_incremental {} size {} ----",
self.tenant_id,
self.timeline_id,
self.key_range.start,
self.key_range.end,
self.lsn_range.start,
self.lsn_range.end,
self.is_incremental(),
self.file_size,
);
} else {
println!(
"----- image layer for ten {} tli {} key {}-{} at {} is_incremental {} size {} ----",
self.tenant_id,
self.timeline_id,
self.key_range.start,
self.key_range.end,
self.image_layer_lsn(),
self.is_incremental(),
self.file_size
);
}
pub fn dump(&self, _verbose: bool, _ctx: &RequestContext) -> Result<()> {
println!(
"----- layer for ten {} tli {} keys {}-{} lsn {}-{} is_delta {} is_incremental {} size {} ----",
self.tenant_id,
self.timeline_id,
self.key_range.start,
self.key_range.end,
self.lsn_range.start,
self.lsn_range.end,
self.is_delta,
self.is_incremental(),
self.file_size,
);
Ok(())
}
pub fn file_size(&self) -> u64 {

View File

@@ -0,0 +1,216 @@
//! A RemoteLayer is an in-memory placeholder for a layer file that exists
//! in remote storage.
//!
use crate::config::PageServerConf;
use crate::context::RequestContext;
use crate::repository::Key;
use crate::tenant::remote_timeline_client::index::LayerFileMetadata;
use crate::tenant::storage_layer::{Layer, ValueReconstructResult, ValueReconstructState};
use crate::tenant::timeline::layer_manager::LayerManager;
use anyhow::{bail, Result};
use pageserver_api::models::HistoricLayerInfo;
use std::ops::Range;
use std::path::PathBuf;
use std::sync::Arc;
use utils::{
id::{TenantId, TimelineId},
lsn::Lsn,
};
use super::filename::{DeltaFileName, ImageFileName};
use super::{
AsLayerDesc, DeltaLayer, ImageLayer, LayerAccessStats, LayerAccessStatsReset,
LayerResidenceStatus, PersistentLayer, PersistentLayerDesc,
};
/// RemoteLayer is a not yet downloaded [`ImageLayer`] or
/// [`DeltaLayer`](super::DeltaLayer).
///
/// RemoteLayer might be downloaded on-demand during operations which are
/// allowed download remote layers and during which, it gets replaced with a
/// concrete `DeltaLayer` or `ImageLayer`.
///
/// See: [`crate::context::RequestContext`] for authorization to download
pub struct RemoteLayer {
pub desc: PersistentLayerDesc,
pub layer_metadata: LayerFileMetadata,
access_stats: LayerAccessStats,
pub(crate) ongoing_download: Arc<tokio::sync::Semaphore>,
/// Has `LayerMap::replace` failed for this (true) or not (false).
///
/// Used together with [`ongoing_download`] semaphore in `Timeline::download_remote_layer`.
/// The field is used to mark a RemoteLayer permanently (until restart or ignore+load)
/// unprocessable, because a LayerMap::replace failed.
///
/// It is very unlikely to accumulate these in the Timeline's LayerMap, but having this avoids
/// a possible fast loop between `Timeline::get_reconstruct_data` and
/// `Timeline::download_remote_layer`, which also logs.
///
/// [`ongoing_download`]: Self::ongoing_download
pub(crate) download_replacement_failure: std::sync::atomic::AtomicBool,
}
impl std::fmt::Debug for RemoteLayer {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("RemoteLayer")
.field("file_name", &self.desc.filename())
.field("layer_metadata", &self.layer_metadata)
.field("is_incremental", &self.desc.is_incremental())
.finish()
}
}
#[async_trait::async_trait]
impl Layer for RemoteLayer {
async fn get_value_reconstruct_data(
&self,
_key: Key,
_lsn_range: Range<Lsn>,
_reconstruct_state: &mut ValueReconstructState,
_ctx: &RequestContext,
) -> Result<ValueReconstructResult> {
bail!("layer {self} needs to be downloaded");
}
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
impl std::fmt::Display for RemoteLayer {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}", self.layer_desc().short_id())
}
}
impl AsLayerDesc for RemoteLayer {
fn layer_desc(&self) -> &PersistentLayerDesc {
&self.desc
}
}
impl PersistentLayer for RemoteLayer {
fn local_path(&self) -> Option<PathBuf> {
None
}
fn delete_resident_layer_file(&self) -> Result<()> {
bail!("remote layer has no layer file");
}
fn downcast_remote_layer<'a>(self: Arc<Self>) -> Option<std::sync::Arc<RemoteLayer>> {
Some(self)
}
fn is_remote_layer(&self) -> bool {
true
}
fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo {
let layer_file_name = self.layer_desc().filename().file_name();
let lsn_range = self.layer_desc().lsn_range.clone();
if self.desc.is_delta {
HistoricLayerInfo::Delta {
layer_file_name,
layer_file_size: self.layer_metadata.file_size(),
lsn_start: lsn_range.start,
lsn_end: lsn_range.end,
remote: true,
access_stats: self.access_stats.as_api_model(reset),
}
} else {
HistoricLayerInfo::Image {
layer_file_name,
layer_file_size: self.layer_metadata.file_size(),
lsn_start: lsn_range.start,
remote: true,
access_stats: self.access_stats.as_api_model(reset),
}
}
}
fn access_stats(&self) -> &LayerAccessStats {
&self.access_stats
}
}
impl RemoteLayer {
pub fn new_img(
tenantid: TenantId,
timelineid: TimelineId,
fname: &ImageFileName,
layer_metadata: &LayerFileMetadata,
access_stats: LayerAccessStats,
) -> RemoteLayer {
RemoteLayer {
desc: PersistentLayerDesc::new_img(
tenantid,
timelineid,
fname.key_range.clone(),
fname.lsn,
layer_metadata.file_size(),
),
layer_metadata: layer_metadata.clone(),
ongoing_download: Arc::new(tokio::sync::Semaphore::new(1)),
download_replacement_failure: std::sync::atomic::AtomicBool::default(),
access_stats,
}
}
pub fn new_delta(
tenantid: TenantId,
timelineid: TimelineId,
fname: &DeltaFileName,
layer_metadata: &LayerFileMetadata,
access_stats: LayerAccessStats,
) -> RemoteLayer {
RemoteLayer {
desc: PersistentLayerDesc::new_delta(
tenantid,
timelineid,
fname.key_range.clone(),
fname.lsn_range.clone(),
layer_metadata.file_size(),
),
layer_metadata: layer_metadata.clone(),
ongoing_download: Arc::new(tokio::sync::Semaphore::new(1)),
download_replacement_failure: std::sync::atomic::AtomicBool::default(),
access_stats,
}
}
/// Create a Layer struct representing this layer, after it has been downloaded.
pub(crate) fn create_downloaded_layer(
&self,
_layer_map_lock_held_witness: &LayerManager,
conf: &'static PageServerConf,
file_size: u64,
) -> Arc<dyn PersistentLayer> {
if self.desc.is_delta {
let fname = self.desc.delta_file_name();
Arc::new(DeltaLayer::new(
conf,
self.desc.timeline_id,
self.desc.tenant_id,
&fname,
file_size,
self.access_stats
.clone_for_residence_change(LayerResidenceStatus::Resident),
))
} else {
let fname = self.desc.image_file_name();
Arc::new(ImageLayer::new(
conf,
self.desc.timeline_id,
self.desc.tenant_id,
&fname,
file_size,
self.access_stats
.clone_for_residence_change(LayerResidenceStatus::Resident),
))
}
}
}

File diff suppressed because it is too large Load Diff

View File

@@ -29,6 +29,7 @@ use crate::{
task_mgr::{self, TaskKind, BACKGROUND_RUNTIME},
tenant::{
config::{EvictionPolicy, EvictionPolicyLayerAccessThreshold},
storage_layer::PersistentLayer,
timeline::EvictionError,
LogicalSizeCalculationCause, Tenant,
},
@@ -193,26 +194,15 @@ impl Timeline {
// NB: all the checks can be invalidated as soon as we release the layer map lock.
// We don't want to hold the layer map lock during eviction.
// So, we just need to deal with this.
let candidates: Vec<_> = {
let candidates: Vec<Arc<dyn PersistentLayer>> = {
let guard = self.layers.read().await;
let layers = guard.layer_map();
let mut candidates = Vec::new();
for hist_layer in layers.iter_historic_layers() {
let hist_layer = guard.get_from_desc(&hist_layer);
// guard against eviction while we inspect it; it might be that eviction_task and
// disk_usage_eviction_task both select the same layers to be evicted, and
// seemingly free up double the space. both succeeding is of no consequence.
let guard = match hist_layer.keep_resident().await {
Ok(Some(l)) => l,
Ok(None) => continue,
Err(e) => {
// these should not happen, but we cannot make them statically impossible right
// now.
tracing::warn!(layer=%hist_layer, "failed to keep the layer resident: {e:#}");
continue;
}
};
if hist_layer.is_remote_layer() {
continue;
}
let last_activity_ts = hist_layer.access_stats().latest_activity().unwrap_or_else(|| {
// We only use this fallback if there's an implementation error.
@@ -243,7 +233,7 @@ impl Timeline {
}
};
if no_activity_for > p.threshold {
candidates.push(guard.drop_eviction_guard())
candidates.push(hist_layer)
}
}
candidates
@@ -262,7 +252,7 @@ impl Timeline {
};
let results = match self
.evict_layer_batch(remote_client, &candidates, cancel)
.evict_layer_batch(remote_client, &candidates[..], cancel.clone())
.await
{
Err(pre_err) => {
@@ -273,7 +263,7 @@ impl Timeline {
Ok(results) => results,
};
assert_eq!(results.len(), candidates.len());
for result in results {
for (l, result) in candidates.iter().zip(results) {
match result {
None => {
stats.skipped_for_shutdown += 1;
@@ -281,10 +271,20 @@ impl Timeline {
Some(Ok(())) => {
stats.evicted += 1;
}
Some(Err(EvictionError::NotFound | EvictionError::Downloaded)) => {
Some(Err(EvictionError::CannotEvictRemoteLayer)) => {
stats.not_evictable += 1;
}
Some(Err(EvictionError::FileNotFound)) => {
// compaction/gc removed the file while we were waiting on layer_removal_cs
stats.not_evictable += 1;
}
Some(Err(
e @ EvictionError::LayerNotFound(_) | e @ EvictionError::StatFailed(_),
)) => {
let e = utils::error::report_compact_sources(&e);
warn!(layer = %l, "failed to evict layer: {e}");
stats.not_evictable += 1;
}
}
}
if stats.candidates == stats.not_evictable {

View File

@@ -7,6 +7,7 @@ use crate::{
index::{IndexPart, LayerFileMetadata},
},
storage_layer::LayerFileName,
Generation,
},
METADATA_FILE_NAME,
};
@@ -104,6 +105,7 @@ pub(super) fn reconcile(
discovered: Vec<(LayerFileName, u64)>,
index_part: Option<&IndexPart>,
disk_consistent_lsn: Lsn,
generation: Generation,
) -> Vec<(LayerFileName, Result<Decision, FutureLayer>)> {
use Decision::*;
@@ -112,7 +114,15 @@ pub(super) fn reconcile(
let mut discovered = discovered
.into_iter()
.map(|(name, file_size)| (name, (Some(LayerFileMetadata::new(file_size)), None)))
.map(|(name, file_size)| {
(
name,
// The generation here will be corrected to match IndexPart in the merge below, unless
// it is not in IndexPart, in which case using our current generation makes sense
// because it will be uploaded in this generation.
(Some(LayerFileMetadata::new(file_size, generation)), None),
)
})
.collect::<Collected>();
// merge any index_part information, when available
@@ -137,7 +147,11 @@ pub(super) fn reconcile(
Err(FutureLayer { local })
} else {
Ok(match (local, remote) {
(Some(local), Some(remote)) if local != remote => UseRemote { local, remote },
(Some(local), Some(remote)) if local != remote => {
assert_eq!(local.generation, remote.generation);
UseRemote { local, remote }
}
(Some(x), Some(_)) => UseLocal(x),
(None, Some(x)) => Evicted(x),
(Some(x), None) => NeedsUpload(x),

View File

@@ -8,19 +8,21 @@ use utils::{
use crate::{
config::PageServerConf,
metrics::TimelineMetrics,
tenant::{
layer_map::{BatchedUpdates, LayerMap},
storage_layer::{
AsLayerDesc, InMemoryLayer, Layer, PersistentLayerDesc, PersistentLayerKey,
ResidentLayer,
AsLayerDesc, DeltaLayer, ImageLayer, InMemoryLayer, PersistentLayer,
PersistentLayerDesc, PersistentLayerKey,
},
timeline::compare_arced_layers,
},
};
/// Provides semantic APIs to manipulate the layer map.
pub(crate) struct LayerManager {
layer_map: LayerMap,
layer_fmgr: LayerFileManager<Layer>,
layer_fmgr: LayerFileManager,
}
/// After GC, the layer map changes will not be applied immediately. Users should manually apply the changes after
@@ -41,7 +43,7 @@ impl LayerManager {
}
}
pub(crate) fn get_from_desc(&self, desc: &PersistentLayerDesc) -> Layer {
pub(crate) fn get_from_desc(&self, desc: &PersistentLayerDesc) -> Arc<dyn PersistentLayer> {
self.layer_fmgr.get_from_desc(desc)
}
@@ -53,12 +55,21 @@ impl LayerManager {
&self.layer_map
}
/// Replace layers in the layer file manager, used in evictions and layer downloads.
pub(crate) fn replace_and_verify(
&mut self,
expected: Arc<dyn PersistentLayer>,
new: Arc<dyn PersistentLayer>,
) -> Result<()> {
self.layer_fmgr.replace_and_verify(expected, new)
}
/// Called from `load_layer_map`. Initialize the layer manager with:
/// 1. all on-disk layers
/// 2. next open layer (with disk disk_consistent_lsn LSN)
pub(crate) fn initialize_local_layers(
&mut self,
on_disk_layers: Vec<Layer>,
on_disk_layers: Vec<Arc<dyn PersistentLayer>>,
next_open_layer_at: Lsn,
) {
let mut updates = self.layer_map.batch_update();
@@ -153,10 +164,10 @@ impl LayerManager {
}
/// Add image layers to the layer map, called from `create_image_layers`.
pub(crate) fn track_new_image_layers(&mut self, image_layers: &[ResidentLayer]) {
pub(crate) fn track_new_image_layers(&mut self, image_layers: Vec<ImageLayer>) {
let mut updates = self.layer_map.batch_update();
for layer in image_layers {
Self::insert_historic_layer(layer.as_ref().clone(), &mut updates, &mut self.layer_fmgr);
Self::insert_historic_layer(Arc::new(layer), &mut updates, &mut self.layer_fmgr);
}
updates.flush();
}
@@ -164,47 +175,46 @@ impl LayerManager {
/// Flush a frozen layer and add the written delta layer to the layer map.
pub(crate) fn finish_flush_l0_layer(
&mut self,
delta_layer: Option<&ResidentLayer>,
delta_layer: Option<DeltaLayer>,
frozen_layer_for_check: &Arc<InMemoryLayer>,
) {
let inmem = self
.layer_map
.frozen_layers
.pop_front()
.expect("there must be a inmem layer to flush");
let l = self.layer_map.frozen_layers.pop_front();
let mut updates = self.layer_map.batch_update();
// Only one task may call this function at a time (for this
// timeline). If two tasks tried to flush the same frozen
// Only one thread may call this function at a time (for this
// timeline). If two threads tried to flush the same frozen
// layer to disk at the same time, that would not work.
assert_eq!(Arc::as_ptr(&inmem), Arc::as_ptr(frozen_layer_for_check));
assert!(compare_arced_layers(&l.unwrap(), frozen_layer_for_check));
if let Some(l) = delta_layer {
let mut updates = self.layer_map.batch_update();
Self::insert_historic_layer(l.as_ref().clone(), &mut updates, &mut self.layer_fmgr);
updates.flush();
if let Some(delta_layer) = delta_layer {
Self::insert_historic_layer(Arc::new(delta_layer), &mut updates, &mut self.layer_fmgr);
}
updates.flush();
}
/// Called when compaction is completed.
pub(crate) fn finish_compact_l0(
&mut self,
layer_removal_cs: &Arc<tokio::sync::OwnedMutexGuard<()>>,
compact_from: Vec<Layer>,
compact_to: &[ResidentLayer],
duplicates: &[(ResidentLayer, ResidentLayer)],
layer_removal_cs: Arc<tokio::sync::OwnedMutexGuard<()>>,
compact_from: Vec<Arc<dyn PersistentLayer>>,
compact_to: Vec<Arc<dyn PersistentLayer>>,
metrics: &TimelineMetrics,
) -> Result<()> {
let mut updates = self.layer_map.batch_update();
for l in compact_to {
Self::insert_historic_layer(l.as_ref().clone(), &mut updates, &mut self.layer_fmgr);
Self::insert_historic_layer(l, &mut updates, &mut self.layer_fmgr);
}
for l in compact_from {
// NB: the layer file identified by descriptor `l` is guaranteed to be present
// in the LayerFileManager because compaction kept holding `layer_removal_cs` the entire
// time, even though we dropped `Timeline::layers` inbetween.
Self::delete_historic_layer(layer_removal_cs, l, &mut updates, &mut self.layer_fmgr)?;
}
for (old, new) in duplicates {
self.layer_fmgr.replace(old.as_ref(), new.as_ref().clone());
Self::delete_historic_layer(
layer_removal_cs.clone(),
l,
&mut updates,
metrics,
&mut self.layer_fmgr,
)?;
}
updates.flush();
Ok(())
@@ -213,26 +223,28 @@ impl LayerManager {
/// Called when garbage collect the timeline. Returns a guard that will apply the updates to the layer map.
pub(crate) fn finish_gc_timeline(
&mut self,
layer_removal_cs: &Arc<tokio::sync::OwnedMutexGuard<()>>,
gc_layers: Vec<Layer>,
layer_removal_cs: Arc<tokio::sync::OwnedMutexGuard<()>>,
gc_layers: Vec<Arc<dyn PersistentLayer>>,
metrics: &TimelineMetrics,
) -> Result<ApplyGcResultGuard> {
let mut updates = self.layer_map.batch_update();
for doomed_layer in gc_layers {
Self::delete_historic_layer(
layer_removal_cs,
layer_removal_cs.clone(),
doomed_layer,
&mut updates,
metrics,
&mut self.layer_fmgr,
)?;
)?; // FIXME: schedule succeeded deletions in timeline.rs `gc_timeline` instead of in batch?
}
Ok(ApplyGcResultGuard(updates))
}
/// Helper function to insert a layer into the layer map and file manager.
fn insert_historic_layer(
layer: Layer,
layer: Arc<dyn PersistentLayer>,
updates: &mut BatchedUpdates<'_>,
mapping: &mut LayerFileManager<Layer>,
mapping: &mut LayerFileManager,
) {
updates.insert_historic(layer.layer_desc().clone());
mapping.insert(layer);
@@ -242,12 +254,17 @@ impl LayerManager {
/// Remote storage is not affected by this operation.
fn delete_historic_layer(
// we cannot remove layers otherwise, since gc and compaction will race
_layer_removal_cs: &Arc<tokio::sync::OwnedMutexGuard<()>>,
layer: Layer,
_layer_removal_cs: Arc<tokio::sync::OwnedMutexGuard<()>>,
layer: Arc<dyn PersistentLayer>,
updates: &mut BatchedUpdates<'_>,
mapping: &mut LayerFileManager<Layer>,
metrics: &TimelineMetrics,
mapping: &mut LayerFileManager,
) -> anyhow::Result<()> {
let desc = layer.layer_desc();
if !layer.is_remote_layer() {
layer.delete_resident_layer_file()?;
metrics.resident_physical_size_gauge.sub(desc.file_size);
}
// TODO Removing from the bottom of the layer map is expensive.
// Maybe instead discard all layer map historic versions that
@@ -255,21 +272,22 @@ impl LayerManager {
// and mark what we can't delete yet as deleted from the layer
// map index without actually rebuilding the index.
updates.remove_historic(desc);
mapping.remove(&layer);
layer.garbage_collect_on_drop();
mapping.remove(layer);
Ok(())
}
pub(crate) fn contains(&self, layer: &Layer) -> bool {
pub(crate) fn contains(&self, layer: &Arc<dyn PersistentLayer>) -> bool {
self.layer_fmgr.contains(layer)
}
}
pub(crate) struct LayerFileManager<T>(HashMap<PersistentLayerKey, T>);
pub(crate) struct LayerFileManager<T: AsLayerDesc + ?Sized = dyn PersistentLayer>(
HashMap<PersistentLayerKey, Arc<T>>,
);
impl<T: AsLayerDesc + Clone + PartialEq + std::fmt::Debug> LayerFileManager<T> {
fn get_from_desc(&self, desc: &PersistentLayerDesc) -> T {
impl<T: AsLayerDesc + ?Sized> LayerFileManager<T> {
fn get_from_desc(&self, desc: &PersistentLayerDesc) -> Arc<T> {
// The assumption for the `expect()` is that all code maintains the following invariant:
// A layer's descriptor is present in the LayerMap => the LayerFileManager contains a layer for the descriptor.
self.0
@@ -279,14 +297,14 @@ impl<T: AsLayerDesc + Clone + PartialEq + std::fmt::Debug> LayerFileManager<T> {
.clone()
}
pub(crate) fn insert(&mut self, layer: T) {
pub(crate) fn insert(&mut self, layer: Arc<T>) {
let present = self.0.insert(layer.layer_desc().key(), layer.clone());
if present.is_some() && cfg!(debug_assertions) {
panic!("overwriting a layer: {:?}", layer.layer_desc())
}
}
pub(crate) fn contains(&self, layer: &T) -> bool {
pub(crate) fn contains(&self, layer: &Arc<T>) -> bool {
self.0.contains_key(&layer.layer_desc().key())
}
@@ -294,7 +312,7 @@ impl<T: AsLayerDesc + Clone + PartialEq + std::fmt::Debug> LayerFileManager<T> {
Self(HashMap::new())
}
pub(crate) fn remove(&mut self, layer: &T) {
pub(crate) fn remove(&mut self, layer: Arc<T>) {
let present = self.0.remove(&layer.layer_desc().key());
if present.is_none() && cfg!(debug_assertions) {
panic!(
@@ -304,13 +322,38 @@ impl<T: AsLayerDesc + Clone + PartialEq + std::fmt::Debug> LayerFileManager<T> {
}
}
pub(crate) fn replace(&mut self, old: &T, new: T) {
let key = old.layer_desc().key();
assert_eq!(key, new.layer_desc().key());
pub(crate) fn replace_and_verify(&mut self, expected: Arc<T>, new: Arc<T>) -> Result<()> {
let key = expected.layer_desc().key();
let other = new.layer_desc().key();
if let Some(existing) = self.0.get_mut(&key) {
assert_eq!(existing, old);
*existing = new;
let expected_l0 = LayerMap::is_l0(expected.layer_desc());
let new_l0 = LayerMap::is_l0(new.layer_desc());
fail::fail_point!("layermap-replace-notfound", |_| anyhow::bail!(
"layermap-replace-notfound"
));
anyhow::ensure!(
key == other,
"expected and new layer have different keys: {key:?} != {other:?}"
);
anyhow::ensure!(
expected_l0 == new_l0,
"one layer is l0 while the other is not: {expected_l0} != {new_l0}"
);
if let Some(layer) = self.0.get_mut(&key) {
anyhow::ensure!(
compare_arced_layers(&expected, layer),
"another layer was found instead of expected, expected={expected:?}, new={new:?}",
expected = Arc::as_ptr(&expected),
new = Arc::as_ptr(layer),
);
*layer = new;
Ok(())
} else {
anyhow::bail!("layer was not found");
}
}
}

View File

@@ -31,10 +31,11 @@ use storage_broker::Streaming;
use tokio::select;
use tracing::*;
use postgres_connection::{parse_host_port, PgConnectionConfig};
use postgres_connection::PgConnectionConfig;
use utils::backoff::{
exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS,
};
use utils::postgres_client::wal_stream_connection_config;
use utils::{
id::{NodeId, TenantTimelineId},
lsn::Lsn,
@@ -879,33 +880,6 @@ impl ReconnectReason {
}
}
fn wal_stream_connection_config(
TenantTimelineId {
tenant_id,
timeline_id,
}: TenantTimelineId,
listen_pg_addr_str: &str,
auth_token: Option<&str>,
availability_zone: Option<&str>,
) -> anyhow::Result<PgConnectionConfig> {
let (host, port) =
parse_host_port(listen_pg_addr_str).context("Unable to parse listen_pg_addr_str")?;
let port = port.unwrap_or(5432);
let mut connstr = PgConnectionConfig::new_host_port(host, port)
.extend_options([
"-c".to_owned(),
format!("timeline_id={}", timeline_id),
format!("tenant_id={}", tenant_id),
])
.set_password(auth_token.map(|s| s.to_owned()));
if let Some(availability_zone) = availability_zone {
connstr = connstr.extend_options([format!("availability_zone={}", availability_zone)]);
}
Ok(connstr)
}
#[cfg(test)]
mod tests {
use super::*;
@@ -921,6 +895,7 @@ mod tests {
timeline: SafekeeperTimelineInfo {
safekeeper_id: 0,
tenant_timeline_id: None,
term: 0,
last_log_term: 0,
flush_lsn: 0,
commit_lsn,
@@ -929,6 +904,7 @@ mod tests {
peer_horizon_lsn: 0,
local_start_lsn: 0,
safekeeper_connstr: safekeeper_connstr.to_owned(),
http_connstr: safekeeper_connstr.to_owned(),
availability_zone: None,
},
latest_update,

View File

@@ -1,7 +1,7 @@
use crate::metrics::RemoteOpFileKind;
use super::storage_layer::LayerFileName;
use super::storage_layer::ResidentLayer;
use super::Generation;
use crate::tenant::metadata::TimelineMetadata;
use crate::tenant::remote_timeline_client::index::IndexPart;
use crate::tenant::remote_timeline_client::index::LayerFileMetadata;
@@ -206,12 +206,13 @@ pub(crate) struct Delete {
pub(crate) file_kind: RemoteOpFileKind,
pub(crate) layer_file_name: LayerFileName,
pub(crate) scheduled_from_timeline_delete: bool,
pub(crate) generation: Generation,
}
#[derive(Debug)]
pub(crate) enum UploadOp {
/// Upload a layer file
UploadLayer(ResidentLayer, LayerFileMetadata),
UploadLayer(LayerFileName, LayerFileMetadata),
/// Upload the metadata file
UploadMetadata(IndexPart, Lsn),
@@ -226,15 +227,24 @@ pub(crate) enum UploadOp {
impl std::fmt::Display for UploadOp {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
match self {
UploadOp::UploadLayer(layer, metadata) => {
write!(f, "UploadLayer({}, size={:?})", layer, metadata.file_size())
UploadOp::UploadLayer(path, metadata) => {
write!(
f,
"UploadLayer({}, size={:?}, gen={:?})",
path.file_name(),
metadata.file_size(),
metadata.generation,
)
}
UploadOp::UploadMetadata(_, lsn) => {
write!(f, "UploadMetadata(lsn: {})", lsn)
}
UploadOp::UploadMetadata(_, lsn) => write!(f, "UploadMetadata(lsn: {})", lsn),
UploadOp::Delete(delete) => write!(
f,
"Delete(path: {}, scheduled_from_timeline_delete: {})",
"Delete(path: {}, scheduled_from_timeline_delete: {}, gen: {:?})",
delete.layer_file_name.file_name(),
delete.scheduled_from_timeline_delete
delete.scheduled_from_timeline_delete,
delete.generation
),
UploadOp::Barrier(_) => write!(f, "Barrier"),
}

View File

@@ -13,7 +13,7 @@
use crate::metrics::{STORAGE_IO_SIZE, STORAGE_IO_TIME};
use once_cell::sync::OnceCell;
use std::fs::{self, File, OpenOptions};
use std::io::{Error, ErrorKind, Read, Seek, SeekFrom, Write};
use std::io::{Error, ErrorKind, Seek, SeekFrom, Write};
use std::os::unix::fs::FileExt;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
@@ -321,54 +321,8 @@ impl VirtualFile {
drop(self);
std::fs::remove_file(path).expect("failed to remove the virtual file");
}
}
impl Drop for VirtualFile {
/// If a VirtualFile is dropped, close the underlying file if it was open.
fn drop(&mut self) {
let handle = self.handle.get_mut().unwrap();
// We could check with a read-lock first, to avoid waiting on an
// unrelated I/O.
let slot = &get_open_files().slots[handle.index];
let mut slot_guard = slot.inner.write().unwrap();
if slot_guard.tag == handle.tag {
slot.recently_used.store(false, Ordering::Relaxed);
// there is also operation "close-by-replace" for closes done on eviction for
// comparison.
STORAGE_IO_TIME
.with_label_values(&["close"])
.observe_closure_duration(|| drop(slot_guard.file.take()));
}
}
}
impl Read for VirtualFile {
fn read(&mut self, buf: &mut [u8]) -> Result<usize, Error> {
let pos = self.pos;
let n = self.read_at(buf, pos)?;
self.pos += n as u64;
Ok(n)
}
}
impl Write for VirtualFile {
fn write(&mut self, buf: &[u8]) -> Result<usize, std::io::Error> {
let pos = self.pos;
let n = self.write_at(buf, pos)?;
self.pos += n as u64;
Ok(n)
}
fn flush(&mut self) -> Result<(), std::io::Error> {
// flush is no-op for File (at least on unix), so we don't need to do
// anything here either.
Ok(())
}
}
impl Seek for VirtualFile {
fn seek(&mut self, pos: SeekFrom) -> Result<u64, Error> {
pub fn seek(&mut self, pos: SeekFrom) -> Result<u64, Error> {
match pos {
SeekFrom::Start(offset) => {
self.pos = offset;
@@ -392,10 +346,64 @@ impl Seek for VirtualFile {
}
Ok(self.pos)
}
}
impl FileExt for VirtualFile {
fn read_at(&self, buf: &mut [u8], offset: u64) -> Result<usize, Error> {
// Copied from https://doc.rust-lang.org/1.72.0/src/std/os/unix/fs.rs.html#117-135
pub async fn read_exact_at(&self, mut buf: &mut [u8], mut offset: u64) -> Result<(), Error> {
while !buf.is_empty() {
match self.read_at(buf, offset).await {
Ok(0) => {
return Err(Error::new(
std::io::ErrorKind::UnexpectedEof,
"failed to fill whole buffer",
))
}
Ok(n) => {
buf = &mut buf[n..];
offset += n as u64;
}
Err(ref e) if e.kind() == std::io::ErrorKind::Interrupted => {}
Err(e) => return Err(e),
}
}
Ok(())
}
// Copied from https://doc.rust-lang.org/1.72.0/src/std/os/unix/fs.rs.html#219-235
pub async fn write_all_at(&self, mut buf: &[u8], mut offset: u64) -> Result<(), Error> {
while !buf.is_empty() {
match self.write_at(buf, offset) {
Ok(0) => {
return Err(Error::new(
std::io::ErrorKind::WriteZero,
"failed to write whole buffer",
));
}
Ok(n) => {
buf = &buf[n..];
offset += n as u64;
}
Err(ref e) if e.kind() == std::io::ErrorKind::Interrupted => {}
Err(e) => return Err(e),
}
}
Ok(())
}
/// Write the given buffer (which has to be below the kernel's internal page size) and fsync
///
/// This ensures some level of atomicity (not a good one, but it's the best we have).
pub fn write_and_fsync(&mut self, buf: &[u8]) -> Result<(), Error> {
if self.write(buf)? != buf.len() {
return Err(Error::new(
std::io::ErrorKind::Other,
"Could not write all the bytes in a single call",
));
}
self.sync_all()?;
Ok(())
}
async fn read_at(&self, buf: &mut [u8], offset: u64) -> Result<usize, Error> {
let result = self.with_file("read", |file| file.read_at(buf, offset))?;
if let Ok(size) = result {
STORAGE_IO_SIZE
@@ -405,7 +413,7 @@ impl FileExt for VirtualFile {
result
}
fn write_at(&self, buf: &[u8], offset: u64) -> Result<usize, Error> {
pub fn write_at(&self, buf: &[u8], offset: u64) -> Result<usize, Error> {
let result = self.with_file("write", |file| file.write_at(buf, offset))?;
if let Ok(size) = result {
STORAGE_IO_SIZE
@@ -416,6 +424,41 @@ impl FileExt for VirtualFile {
}
}
impl Drop for VirtualFile {
/// If a VirtualFile is dropped, close the underlying file if it was open.
fn drop(&mut self) {
let handle = self.handle.get_mut().unwrap();
// We could check with a read-lock first, to avoid waiting on an
// unrelated I/O.
let slot = &get_open_files().slots[handle.index];
let mut slot_guard = slot.inner.write().unwrap();
if slot_guard.tag == handle.tag {
slot.recently_used.store(false, Ordering::Relaxed);
// there is also operation "close-by-replace" for closes done on eviction for
// comparison.
STORAGE_IO_TIME
.with_label_values(&["close"])
.observe_closure_duration(|| drop(slot_guard.file.take()));
}
}
}
impl Write for VirtualFile {
fn write(&mut self, buf: &[u8]) -> Result<usize, std::io::Error> {
let pos = self.pos;
let n = self.write_at(buf, pos)?;
self.pos += n as u64;
Ok(n)
}
fn flush(&mut self) -> Result<(), std::io::Error> {
// flush is no-op for File (at least on unix), so we don't need to do
// anything here either.
Ok(())
}
}
impl OpenFiles {
fn new(num_slots: usize) -> OpenFiles {
let mut slots = Box::new(Vec::with_capacity(num_slots));
@@ -470,32 +513,68 @@ mod tests {
use rand::thread_rng;
use rand::Rng;
use std::sync::Arc;
use std::thread;
// Helper function to slurp contents of a file, starting at the current position,
// into a string
fn read_string<FD>(vfile: &mut FD) -> Result<String, Error>
where
FD: Read,
{
let mut buf = String::new();
vfile.read_to_string(&mut buf)?;
Ok(buf)
enum MaybeVirtualFile {
VirtualFile(VirtualFile),
File(File),
}
// Helper function to slurp a portion of a file into a string
fn read_string_at<FD>(vfile: &mut FD, pos: u64, len: usize) -> Result<String, Error>
where
FD: FileExt,
{
let mut buf = Vec::new();
buf.resize(len, 0);
vfile.read_exact_at(&mut buf, pos)?;
Ok(String::from_utf8(buf).unwrap())
impl MaybeVirtualFile {
async fn read_exact_at(&self, buf: &mut [u8], offset: u64) -> Result<(), Error> {
match self {
MaybeVirtualFile::VirtualFile(file) => file.read_exact_at(buf, offset).await,
MaybeVirtualFile::File(file) => file.read_exact_at(buf, offset),
}
}
async fn write_all_at(&self, buf: &[u8], offset: u64) -> Result<(), Error> {
match self {
MaybeVirtualFile::VirtualFile(file) => file.write_all_at(buf, offset).await,
MaybeVirtualFile::File(file) => file.write_all_at(buf, offset),
}
}
fn seek(&mut self, pos: SeekFrom) -> Result<u64, Error> {
match self {
MaybeVirtualFile::VirtualFile(file) => file.seek(pos),
MaybeVirtualFile::File(file) => file.seek(pos),
}
}
async fn write_all(&mut self, buf: &[u8]) -> Result<(), Error> {
match self {
MaybeVirtualFile::VirtualFile(file) => file.write_all(buf),
MaybeVirtualFile::File(file) => file.write_all(buf),
}
}
// Helper function to slurp contents of a file, starting at the current position,
// into a string
async fn read_string(&mut self) -> Result<String, Error> {
use std::io::Read;
let mut buf = String::new();
match self {
MaybeVirtualFile::VirtualFile(file) => {
let pos = file.seek(SeekFrom::Current(0))?;
let len = file.metadata()?.len().saturating_sub(pos);
let len_usize = len.try_into().unwrap();
return self.read_string_at(pos, len_usize).await;
}
MaybeVirtualFile::File(file) => {
file.read_to_string(&mut buf)?;
}
}
Ok(buf)
}
// Helper function to slurp a portion of a file into a string
async fn read_string_at(&mut self, pos: u64, len: usize) -> Result<String, Error> {
let mut buf = Vec::new();
buf.resize(len, 0);
self.read_exact_at(&mut buf, pos).await?;
Ok(String::from_utf8(buf).unwrap())
}
}
#[test]
fn test_virtual_files() -> Result<(), Error> {
#[tokio::test]
async fn test_virtual_files() -> Result<(), Error> {
// The real work is done in the test_files() helper function. This
// allows us to run the same set of tests against a native File, and
// VirtualFile. We trust the native Files and wouldn't need to test them,
@@ -504,21 +583,23 @@ mod tests {
// native files, you will run out of file descriptors if the ulimit
// is low enough.)
test_files("virtual_files", |path, open_options| {
VirtualFile::open_with_options(path, open_options)
let vf = VirtualFile::open_with_options(path, open_options)?;
Ok(MaybeVirtualFile::VirtualFile(vf))
})
.await
}
#[test]
fn test_physical_files() -> Result<(), Error> {
#[tokio::test]
async fn test_physical_files() -> Result<(), Error> {
test_files("physical_files", |path, open_options| {
open_options.open(path)
Ok(MaybeVirtualFile::File(open_options.open(path)?))
})
.await
}
fn test_files<OF, FD>(testname: &str, openfunc: OF) -> Result<(), Error>
async fn test_files<OF>(testname: &str, openfunc: OF) -> Result<(), Error>
where
FD: Read + Write + Seek + FileExt,
OF: Fn(&Path, &OpenOptions) -> Result<FD, std::io::Error>,
OF: Fn(&Path, &OpenOptions) -> Result<MaybeVirtualFile, std::io::Error>,
{
let testdir = crate::config::PageServerConf::test_repo_dir(testname);
std::fs::create_dir_all(&testdir)?;
@@ -528,36 +609,36 @@ mod tests {
&path_a,
OpenOptions::new().write(true).create(true).truncate(true),
)?;
file_a.write_all(b"foobar")?;
file_a.write_all(b"foobar").await?;
// cannot read from a file opened in write-only mode
assert!(read_string(&mut file_a).is_err());
assert!(file_a.read_string().await.is_err());
// Close the file and re-open for reading
let mut file_a = openfunc(&path_a, OpenOptions::new().read(true))?;
// cannot write to a file opened in read-only mode
assert!(file_a.write(b"bar").is_err());
assert!(file_a.write_all(b"bar").await.is_err());
// Try simple read
assert_eq!("foobar", read_string(&mut file_a)?);
assert_eq!("foobar", file_a.read_string().await?);
// It's positioned at the EOF now.
assert_eq!("", read_string(&mut file_a)?);
assert_eq!("", file_a.read_string().await?);
// Test seeks.
assert_eq!(file_a.seek(SeekFrom::Start(1))?, 1);
assert_eq!("oobar", read_string(&mut file_a)?);
assert_eq!("oobar", file_a.read_string().await?);
assert_eq!(file_a.seek(SeekFrom::End(-2))?, 4);
assert_eq!("ar", read_string(&mut file_a)?);
assert_eq!("ar", file_a.read_string().await?);
assert_eq!(file_a.seek(SeekFrom::Start(1))?, 1);
assert_eq!(file_a.seek(SeekFrom::Current(2))?, 3);
assert_eq!("bar", read_string(&mut file_a)?);
assert_eq!("bar", file_a.read_string().await?);
assert_eq!(file_a.seek(SeekFrom::Current(-5))?, 1);
assert_eq!("oobar", read_string(&mut file_a)?);
assert_eq!("oobar", file_a.read_string().await?);
// Test erroneous seeks to before byte 0
assert!(file_a.seek(SeekFrom::End(-7)).is_err());
@@ -565,7 +646,7 @@ mod tests {
assert!(file_a.seek(SeekFrom::Current(-2)).is_err());
// the erroneous seek should have left the position unchanged
assert_eq!("oobar", read_string(&mut file_a)?);
assert_eq!("oobar", file_a.read_string().await?);
// Create another test file, and try FileExt functions on it.
let path_b = testdir.join("file_b");
@@ -577,10 +658,10 @@ mod tests {
.create(true)
.truncate(true),
)?;
file_b.write_all_at(b"BAR", 3)?;
file_b.write_all_at(b"FOO", 0)?;
file_b.write_all_at(b"BAR", 3).await?;
file_b.write_all_at(b"FOO", 0).await?;
assert_eq!(read_string_at(&mut file_b, 2, 3)?, "OBA");
assert_eq!(file_b.read_string_at(2, 3).await?, "OBA");
// Open a lot of files, enough to cause some evictions. (Or to be precise,
// open the same file many times. The effect is the same.)
@@ -591,7 +672,7 @@ mod tests {
let mut vfiles = Vec::new();
for _ in 0..100 {
let mut vfile = openfunc(&path_b, OpenOptions::new().read(true))?;
assert_eq!("FOOBAR", read_string(&mut vfile)?);
assert_eq!("FOOBAR", vfile.read_string().await?);
vfiles.push(vfile);
}
@@ -600,13 +681,13 @@ mod tests {
// The underlying file descriptor for 'file_a' should be closed now. Try to read
// from it again. We left the file positioned at offset 1 above.
assert_eq!("oobar", read_string(&mut file_a)?);
assert_eq!("oobar", file_a.read_string().await?);
// Check that all the other FDs still work too. Use them in random order for
// good measure.
vfiles.as_mut_slice().shuffle(&mut thread_rng());
for vfile in vfiles.iter_mut() {
assert_eq!("OOBAR", read_string_at(vfile, 1, 5)?);
assert_eq!("OOBAR", vfile.read_string_at(1, 5).await?);
}
Ok(())
@@ -641,28 +722,22 @@ mod tests {
let files = Arc::new(files);
// Launch many threads, and use the virtual files concurrently in random order.
let mut threads = Vec::new();
for threadno in 0..THREADS {
let builder =
thread::Builder::new().name(format!("test_vfile_concurrency thread {}", threadno));
let rt = tokio::runtime::Builder::new_multi_thread()
.worker_threads(THREADS)
.thread_name("test_vfile_concurrency thread")
.build()
.unwrap();
for _threadno in 0..THREADS {
let files = files.clone();
let thread = builder
.spawn(move || {
let mut buf = [0u8; SIZE];
let mut rng = rand::thread_rng();
for _ in 1..1000 {
let f = &files[rng.gen_range(0..files.len())];
f.read_exact_at(&mut buf, 0).unwrap();
assert!(buf == SAMPLE);
}
})
.unwrap();
threads.push(thread);
}
for thread in threads {
thread.join().unwrap();
rt.spawn(async move {
let mut buf = [0u8; SIZE];
let mut rng = rand::rngs::OsRng;
for _ in 1..1000 {
let f = &files[rng.gen_range(0..files.len())];
f.read_exact_at(&mut buf, 0).await.unwrap();
assert!(buf == SAMPLE);
}
});
}
Ok(())

View File

@@ -12,13 +12,19 @@ pub struct PasswordHackPayload {
impl PasswordHackPayload {
pub fn parse(bytes: &[u8]) -> Option<Self> {
// The format is `project=<utf-8>;<password-bytes>`.
let mut iter = bytes.splitn_str(2, ";");
let endpoint = iter.next()?.to_str().ok()?;
let endpoint = parse_endpoint_param(endpoint)?.to_owned();
let password = iter.next()?.to_owned();
// The format is `project=<utf-8>;<password-bytes>` or `project=<utf-8>$<password-bytes>`.
let separators = [";", "$"];
for sep in separators {
if let Some((endpoint, password)) = bytes.split_once_str(sep) {
let endpoint = endpoint.to_str().ok()?;
return Some(Self {
endpoint: parse_endpoint_param(endpoint)?.to_owned(),
password: password.to_owned(),
});
}
}
Some(Self { endpoint, password })
None
}
}
@@ -91,4 +97,23 @@ mod tests {
assert_eq!(payload.endpoint, "foobar");
assert_eq!(payload.password, b"pass;word");
}
#[test]
fn parse_password_hack_payload_dollar() {
let bytes = b"";
assert!(PasswordHackPayload::parse(bytes).is_none());
let bytes = b"endpoint=";
assert!(PasswordHackPayload::parse(bytes).is_none());
let bytes = b"endpoint=$";
let payload = PasswordHackPayload::parse(bytes).expect("parsing failed");
assert_eq!(payload.endpoint, "");
assert_eq!(payload.password, b"");
let bytes = b"endpoint=foobar$pass$word";
let payload = PasswordHackPayload::parse(bytes).expect("parsing failed");
assert_eq!(payload.endpoint, "foobar");
assert_eq!(payload.password, b"pass$word");
}
}

View File

@@ -63,8 +63,8 @@ pub mod errors {
format!("{REQUEST_FAILED}: endpoint is disabled")
}
http::StatusCode::LOCKED => {
// Status 423: project might be in maintenance mode (or bad state).
format!("{REQUEST_FAILED}: endpoint is temporary unavailable")
// Status 423: project might be in maintenance mode (or bad state), or quotas exceeded.
format!("{REQUEST_FAILED}: endpoint is temporary unavailable. check your quotas and/or contract our support")
}
_ => REQUEST_FAILED.to_owned(),
},
@@ -81,9 +81,15 @@ pub mod errors {
// retry some temporary failures because the compute was in a bad state
// (bad request can be returned when the endpoint was in transition)
Self::Console {
status: http::StatusCode::BAD_REQUEST | http::StatusCode::LOCKED,
status: http::StatusCode::BAD_REQUEST,
..
} => true,
// locked can be returned when the endpoint was in transition
// or when quotas are exceeded. don't retry when quotas are exceeded
Self::Console {
status: http::StatusCode::LOCKED,
ref text,
} => !text.contains("quota"),
// retry server errors
Self::Console { status, .. } if status.is_server_error() => true,
_ => false,

View File

@@ -16,12 +16,21 @@ use tracing::{error, info, info_span, warn, Instrument};
pub struct Api {
endpoint: http::Endpoint,
caches: &'static ApiCaches,
jwt: String,
}
impl Api {
/// Construct an API object containing the auth parameters.
pub fn new(endpoint: http::Endpoint, caches: &'static ApiCaches) -> Self {
Self { endpoint, caches }
let jwt: String = match std::env::var("NEON_PROXY_TO_CONTROLPLANE_TOKEN") {
Ok(v) => v,
Err(_) => "".to_string(),
};
Self {
endpoint,
caches,
jwt,
}
}
pub fn url(&self) -> &str {
@@ -39,6 +48,7 @@ impl Api {
.endpoint
.get("proxy_get_role_secret")
.header("X-Request-ID", &request_id)
.header("Authorization", &self.jwt)
.query(&[("session_id", extra.session_id)])
.query(&[
("application_name", extra.application_name),
@@ -83,6 +93,7 @@ impl Api {
.endpoint
.get("proxy_wake_compute")
.header("X-Request-ID", &request_id)
.header("Authorization", &self.jwt)
.query(&[("session_id", extra.session_id)])
.query(&[
("application_name", extra.application_name),

View File

@@ -2,6 +2,7 @@ use crate::{
cancellation::CancelMap,
config::ProxyConfig,
error::io_error,
protocol2::{ProxyProtocolAccept, WithClientIp},
proxy::{handle_client, ClientMode},
};
use bytes::{Buf, Bytes};
@@ -292,6 +293,9 @@ pub async fn task_main(
let mut addr_incoming = AddrIncoming::from_listener(ws_listener)?;
let _ = addr_incoming.set_nodelay(true);
let addr_incoming = ProxyProtocolAccept {
incoming: addr_incoming,
};
let tls_listener = TlsListener::new(tls_acceptor, addr_incoming).filter(|conn| {
if let Err(err) = conn {
@@ -302,9 +306,11 @@ pub async fn task_main(
}
});
let make_svc =
hyper::service::make_service_fn(|stream: &tokio_rustls::server::TlsStream<AddrStream>| {
let sni_name = stream.get_ref().1.sni_hostname().map(|s| s.to_string());
let make_svc = hyper::service::make_service_fn(
|stream: &tokio_rustls::server::TlsStream<WithClientIp<AddrStream>>| {
let (io, tls) = stream.get_ref();
let peer_addr = io.client_addr().unwrap_or(io.inner.remote_addr());
let sni_name = tls.server_name().map(|s| s.to_string());
let conn_pool = conn_pool.clone();
async move {
@@ -319,13 +325,15 @@ pub async fn task_main(
ws_handler(req, config, conn_pool, cancel_map, session_id, sni_name)
.instrument(info_span!(
"ws-client",
session = %session_id
session = %session_id,
%peer_addr,
))
.await
}
}))
}
});
},
);
hyper::Server::builder(accept::from_stream(tls_listener))
.serve(make_svc)

View File

@@ -16,6 +16,7 @@ pub mod http;
pub mod logging;
pub mod metrics;
pub mod parse;
pub mod protocol2;
pub mod proxy;
pub mod sasl;
pub mod scram;

479
proxy/src/protocol2.rs Normal file
View File

@@ -0,0 +1,479 @@
//! Proxy Protocol V2 implementation
use std::{
future::poll_fn,
future::Future,
io,
net::SocketAddr,
pin::{pin, Pin},
task::{ready, Context, Poll},
};
use bytes::{Buf, BytesMut};
use hyper::server::conn::{AddrIncoming, AddrStream};
use pin_project_lite::pin_project;
use tls_listener::AsyncAccept;
use tokio::io::{AsyncRead, AsyncReadExt, AsyncWrite, ReadBuf};
pub struct ProxyProtocolAccept {
pub incoming: AddrIncoming,
}
pin_project! {
pub struct WithClientIp<T> {
#[pin]
pub inner: T,
buf: BytesMut,
tlv_bytes: u16,
state: ProxyParse,
}
}
#[derive(Clone, PartialEq, Debug)]
enum ProxyParse {
NotStarted,
Finished(SocketAddr),
None,
}
impl<T: AsyncWrite> AsyncWrite for WithClientIp<T> {
#[inline]
fn poll_write(
self: Pin<&mut Self>,
cx: &mut Context<'_>,
buf: &[u8],
) -> Poll<Result<usize, io::Error>> {
self.project().inner.poll_write(cx, buf)
}
#[inline]
fn poll_flush(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), io::Error>> {
self.project().inner.poll_flush(cx)
}
#[inline]
fn poll_shutdown(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), io::Error>> {
self.project().inner.poll_shutdown(cx)
}
#[inline]
fn poll_write_vectored(
self: Pin<&mut Self>,
cx: &mut Context<'_>,
bufs: &[io::IoSlice<'_>],
) -> Poll<Result<usize, io::Error>> {
self.project().inner.poll_write_vectored(cx, bufs)
}
#[inline]
fn is_write_vectored(&self) -> bool {
self.inner.is_write_vectored()
}
}
impl<T> WithClientIp<T> {
pub fn new(inner: T) -> Self {
WithClientIp {
inner,
buf: BytesMut::with_capacity(128),
tlv_bytes: 0,
state: ProxyParse::NotStarted,
}
}
pub fn client_addr(&self) -> Option<SocketAddr> {
match self.state {
ProxyParse::Finished(socket) => Some(socket),
_ => None,
}
}
}
impl<T: AsyncRead + Unpin> WithClientIp<T> {
pub async fn wait_for_addr(&mut self) -> io::Result<Option<SocketAddr>> {
match self.state {
ProxyParse::NotStarted => {
let mut pin = Pin::new(&mut *self);
let addr = poll_fn(|cx| pin.as_mut().poll_client_ip(cx)).await?;
match addr {
Some(addr) => self.state = ProxyParse::Finished(addr),
None => self.state = ProxyParse::None,
}
Ok(addr)
}
ProxyParse::Finished(addr) => Ok(Some(addr)),
ProxyParse::None => Ok(None),
}
}
}
/// Proxy Protocol Version 2 Header
const HEADER: [u8; 12] = [
0x0D, 0x0A, 0x0D, 0x0A, 0x00, 0x0D, 0x0A, 0x51, 0x55, 0x49, 0x54, 0x0A,
];
impl<T: AsyncRead> WithClientIp<T> {
/// implementation of <https://www.haproxy.org/download/2.4/doc/proxy-protocol.txt>
/// Version 2 (Binary Format)
fn poll_client_ip(
mut self: Pin<&mut Self>,
cx: &mut Context<'_>,
) -> Poll<io::Result<Option<SocketAddr>>> {
// The binary header format starts with a constant 12 bytes block containing the protocol signature :
// \x0D \x0A \x0D \x0A \x00 \x0D \x0A \x51 \x55 \x49 \x54 \x0A
while self.buf.len() < 16 {
let mut this = self.as_mut().project();
let bytes_read = pin!(this.inner.read_buf(this.buf)).poll(cx)?;
// exit for bad header
let len = usize::min(self.buf.len(), HEADER.len());
if self.buf[..len] != HEADER[..len] {
return Poll::Ready(Ok(None));
}
// if no more bytes available then exit
if ready!(bytes_read) == 0 {
return Poll::Ready(Ok(None));
};
}
// The next byte (the 13th one) is the protocol version and command.
// The highest four bits contains the version. As of this specification, it must
// always be sent as \x2 and the receiver must only accept this value.
let vc = self.buf[12];
let version = vc >> 4;
let command = vc & 0b1111;
if version != 2 {
return Poll::Ready(Err(io::Error::new(
io::ErrorKind::Other,
"invalid proxy protocol version. expected version 2",
)));
}
match command {
// the connection was established on purpose by the proxy
// without being relayed. The connection endpoints are the sender and the
// receiver. Such connections exist when the proxy sends health-checks to the
// server. The receiver must accept this connection as valid and must use the
// real connection endpoints and discard the protocol block including the
// family which is ignored.
0 => {}
// the connection was established on behalf of another node,
// and reflects the original connection endpoints. The receiver must then use
// the information provided in the protocol block to get original the address.
1 => {}
// other values are unassigned and must not be emitted by senders. Receivers
// must drop connections presenting unexpected values here.
_ => {
return Poll::Ready(Err(io::Error::new(
io::ErrorKind::Other,
"invalid proxy protocol command. expected local (0) or proxy (1)",
)))
}
};
// The 14th byte contains the transport protocol and address family. The highest 4
// bits contain the address family, the lowest 4 bits contain the protocol.
let ft = self.buf[13];
let address_length = match ft {
// - \x11 : TCP over IPv4 : the forwarded connection uses TCP over the AF_INET
// protocol family. Address length is 2*4 + 2*2 = 12 bytes.
// - \x12 : UDP over IPv4 : the forwarded connection uses UDP over the AF_INET
// protocol family. Address length is 2*4 + 2*2 = 12 bytes.
0x11 | 0x12 => 12,
// - \x21 : TCP over IPv6 : the forwarded connection uses TCP over the AF_INET6
// protocol family. Address length is 2*16 + 2*2 = 36 bytes.
// - \x22 : UDP over IPv6 : the forwarded connection uses UDP over the AF_INET6
// protocol family. Address length is 2*16 + 2*2 = 36 bytes.
0x21 | 0x22 => 36,
// unspecified or unix stream. ignore the addresses
_ => 0,
};
// The 15th and 16th bytes is the address length in bytes in network endian order.
// It is used so that the receiver knows how many address bytes to skip even when
// it does not implement the presented protocol. Thus the length of the protocol
// header in bytes is always exactly 16 + this value. When a sender presents a
// LOCAL connection, it should not present any address so it sets this field to
// zero. Receivers MUST always consider this field to skip the appropriate number
// of bytes and must not assume zero is presented for LOCAL connections. When a
// receiver accepts an incoming connection showing an UNSPEC address family or
// protocol, it may or may not decide to log the address information if present.
let remaining_length = u16::from_be_bytes(self.buf[14..16].try_into().unwrap());
if remaining_length < address_length {
return Poll::Ready(Err(io::Error::new(
io::ErrorKind::Other,
"invalid proxy protocol length. not enough to fit requested IP addresses",
)));
}
while self.buf.len() < 16 + address_length as usize {
let mut this = self.as_mut().project();
if ready!(pin!(this.inner.read_buf(this.buf)).poll(cx)?) == 0 {
return Poll::Ready(Err(io::Error::new(
io::ErrorKind::UnexpectedEof,
"stream closed while waiting for proxy protocol addresses",
)));
}
}
let this = self.as_mut().project();
// we are sure this is a proxy protocol v2 entry and we have read all the bytes we need
// discard the header we have parsed
this.buf.advance(16);
// Starting from the 17th byte, addresses are presented in network byte order.
// The address order is always the same :
// - source layer 3 address in network byte order
// - destination layer 3 address in network byte order
// - source layer 4 address if any, in network byte order (port)
// - destination layer 4 address if any, in network byte order (port)
let addresses = this.buf.split_to(address_length as usize);
let socket = match address_length {
12 => {
let src_addr: [u8; 4] = addresses[0..4].try_into().unwrap();
let src_port = u16::from_be_bytes(addresses[8..10].try_into().unwrap());
Some(SocketAddr::from((src_addr, src_port)))
}
36 => {
let src_addr: [u8; 16] = addresses[0..16].try_into().unwrap();
let src_port = u16::from_be_bytes(addresses[32..34].try_into().unwrap());
Some(SocketAddr::from((src_addr, src_port)))
}
_ => None,
};
*this.tlv_bytes = remaining_length - address_length;
self.as_mut().skip_tlv_inner();
Poll::Ready(Ok(socket))
}
#[cold]
fn read_ip(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
let ip = ready!(self.as_mut().poll_client_ip(cx)?);
match ip {
Some(x) => *self.as_mut().project().state = ProxyParse::Finished(x),
None => *self.as_mut().project().state = ProxyParse::None,
}
Poll::Ready(Ok(()))
}
#[cold]
fn skip_tlv(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
let mut this = self.as_mut().project();
// we know that this.buf is empty
debug_assert_eq!(this.buf.len(), 0);
this.buf.reserve((*this.tlv_bytes).clamp(0, 1024) as usize);
ready!(pin!(this.inner.read_buf(this.buf)).poll(cx)?);
self.skip_tlv_inner();
Poll::Ready(Ok(()))
}
fn skip_tlv_inner(self: Pin<&mut Self>) {
let tlv_bytes_read = match u16::try_from(self.buf.len()) {
// we read more than u16::MAX therefore we must have read the full tlv_bytes
Err(_) => self.tlv_bytes,
// we might not have read the full tlv bytes yet
Ok(n) => u16::min(n, self.tlv_bytes),
};
let this = self.project();
*this.tlv_bytes -= tlv_bytes_read;
this.buf.advance(tlv_bytes_read as usize);
}
}
impl<T: AsyncRead> AsyncRead for WithClientIp<T> {
#[inline]
fn poll_read(
mut self: Pin<&mut Self>,
cx: &mut Context<'_>,
buf: &mut ReadBuf<'_>,
) -> Poll<io::Result<()>> {
// I'm assuming these 3 comparisons will be easy to branch predict.
// especially with the cold attributes
// which should make this read wrapper almost invisible
if let ProxyParse::NotStarted = self.state {
ready!(self.as_mut().read_ip(cx)?);
}
while self.tlv_bytes > 0 {
ready!(self.as_mut().skip_tlv(cx)?)
}
let this = self.project();
if this.buf.is_empty() {
this.inner.poll_read(cx, buf)
} else {
// we know that tlv_bytes is 0
debug_assert_eq!(*this.tlv_bytes, 0);
let write = usize::min(this.buf.len(), buf.remaining());
let slice = this.buf.split_to(write).freeze();
buf.put_slice(&slice);
// reset the allocation so it can be freed
if this.buf.is_empty() {
*this.buf = BytesMut::new();
}
Poll::Ready(Ok(()))
}
}
}
impl AsyncAccept for ProxyProtocolAccept {
type Connection = WithClientIp<AddrStream>;
type Error = io::Error;
fn poll_accept(
mut self: Pin<&mut Self>,
cx: &mut Context<'_>,
) -> Poll<Option<Result<Self::Connection, Self::Error>>> {
let conn = ready!(Pin::new(&mut self.incoming).poll_accept(cx)?);
let Some(conn) = conn else {
return Poll::Ready(None);
};
Poll::Ready(Some(Ok(WithClientIp::new(conn))))
}
}
#[cfg(test)]
mod tests {
use std::pin::pin;
use tokio::io::AsyncReadExt;
use crate::protocol2::{ProxyParse, WithClientIp};
#[tokio::test]
async fn test_ipv4() {
let header = super::HEADER
// Proxy command, IPV4 | TCP
.chain([(2 << 4) | 1, (1 << 4) | 1].as_slice())
// 12 + 3 bytes
.chain([0, 15].as_slice())
// src ip
.chain([127, 0, 0, 1].as_slice())
// dst ip
.chain([192, 168, 0, 1].as_slice())
// src port
.chain([255, 255].as_slice())
// dst port
.chain([1, 1].as_slice())
// TLV
.chain([1, 2, 3].as_slice());
let extra_data = [0x55; 256];
let mut read = pin!(WithClientIp::new(header.chain(extra_data.as_slice())));
let mut bytes = vec![];
read.read_to_end(&mut bytes).await.unwrap();
assert_eq!(bytes, extra_data);
assert_eq!(
read.state,
ProxyParse::Finished(([127, 0, 0, 1], 65535).into())
);
}
#[tokio::test]
async fn test_ipv6() {
let header = super::HEADER
// Proxy command, IPV6 | UDP
.chain([(2 << 4) | 1, (2 << 4) | 2].as_slice())
// 36 + 3 bytes
.chain([0, 39].as_slice())
// src ip
.chain([15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0].as_slice())
// dst ip
.chain([0, 15, 1, 14, 2, 13, 3, 12, 4, 11, 5, 10, 6, 9, 7, 8].as_slice())
// src port
.chain([1, 1].as_slice())
// dst port
.chain([255, 255].as_slice())
// TLV
.chain([1, 2, 3].as_slice());
let extra_data = [0x55; 256];
let mut read = pin!(WithClientIp::new(header.chain(extra_data.as_slice())));
let mut bytes = vec![];
read.read_to_end(&mut bytes).await.unwrap();
assert_eq!(bytes, extra_data);
assert_eq!(
read.state,
ProxyParse::Finished(
([15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0], 257).into()
)
);
}
#[tokio::test]
async fn test_invalid() {
let data = [0x55; 256];
let mut read = pin!(WithClientIp::new(data.as_slice()));
let mut bytes = vec![];
read.read_to_end(&mut bytes).await.unwrap();
assert_eq!(bytes, data);
assert_eq!(read.state, ProxyParse::None);
}
#[tokio::test]
async fn test_short() {
let data = [0x55; 10];
let mut read = pin!(WithClientIp::new(data.as_slice()));
let mut bytes = vec![];
read.read_to_end(&mut bytes).await.unwrap();
assert_eq!(bytes, data);
assert_eq!(read.state, ProxyParse::None);
}
#[tokio::test]
async fn test_large_tlv() {
let tlv = vec![0x55; 32768];
let len = (12 + tlv.len() as u16).to_be_bytes();
let header = super::HEADER
// Proxy command, Inet << 4 | Stream
.chain([(2 << 4) | 1, (1 << 4) | 1].as_slice())
// 12 + 3 bytes
.chain(len.as_slice())
// src ip
.chain([55, 56, 57, 58].as_slice())
// dst ip
.chain([192, 168, 0, 1].as_slice())
// src port
.chain([255, 255].as_slice())
// dst port
.chain([1, 1].as_slice())
// TLV
.chain(tlv.as_slice());
let extra_data = [0xaa; 256];
let mut read = pin!(WithClientIp::new(header.chain(extra_data.as_slice())));
let mut bytes = vec![];
read.read_to_end(&mut bytes).await.unwrap();
assert_eq!(bytes, extra_data);
assert_eq!(
read.state,
ProxyParse::Finished(([55, 56, 57, 58], 65535).into())
);
}
}

View File

@@ -7,6 +7,7 @@ use crate::{
compute::{self, PostgresConnection},
config::{ProxyConfig, TlsConfig},
console::{self, errors::WakeComputeError, messages::MetricsAuxInfo, Api},
protocol2::WithClientIp,
stream::{PqStream, Stream},
};
use anyhow::{bail, Context};
@@ -100,7 +101,7 @@ pub async fn task_main(
loop {
tokio::select! {
accept_result = listener.accept() => {
let (socket, peer_addr) = accept_result?;
let (socket, _) = accept_result?;
let session_id = uuid::Uuid::new_v4();
let cancel_map = Arc::clone(&cancel_map);
@@ -108,13 +109,19 @@ pub async fn task_main(
async move {
info!("accepted postgres client connection");
let mut socket = WithClientIp::new(socket);
if let Some(ip) = socket.wait_for_addr().await? {
tracing::Span::current().record("peer_addr", &tracing::field::display(ip));
}
socket
.inner
.set_nodelay(true)
.context("failed to set socket option")?;
handle_client(config, &cancel_map, session_id, socket, ClientMode::Tcp).await
}
.instrument(info_span!("handle_client", ?session_id, %peer_addr))
.instrument(info_span!("handle_client", ?session_id, peer_addr = tracing::field::Empty))
.unwrap_or_else(move |e| {
// Acknowledge that the task has finished with an error.
error!(?session_id, "per-client task finished with an error: {e:#}");

View File

@@ -137,6 +137,7 @@ async fn dummy_proxy(
auth: impl TestAuth + Send,
) -> anyhow::Result<()> {
let cancel_map = CancelMap::default();
let client = WithClientIp::new(client);
let (mut stream, _params) = handshake(client, tls.as_ref(), &cancel_map)
.await?
.context("handshake failed")?;

View File

@@ -141,7 +141,7 @@ impl<S> Stream<S> {
pub fn sni_hostname(&self) -> Option<&str> {
match self {
Stream::Raw { .. } => None,
Stream::Tls { tls } => tls.get_ref().1.sni_hostname(),
Stream::Tls { tls } => tls.get_ref().1.server_name(),
}
}
}

39
s3_scrubber/Cargo.toml Normal file
View File

@@ -0,0 +1,39 @@
[package]
name = "s3_scrubber"
version = "0.1.0"
edition.workspace = true
license.workspace = true
[dependencies]
aws-sdk-s3.workspace = true
aws-smithy-http.workspace = true
aws-types.workspace = true
either.workspace = true
tokio-rustls.workspace = true
anyhow.workspace = true
hex.workspace = true
thiserror.workspace = true
rand.workspace = true
bytes.workspace = true
bincode.workspace = true
crc32c.workspace = true
serde.workspace = true
serde_json.workspace = true
serde_with.workspace = true
workspace_hack.workspace = true
utils.workspace = true
tokio = { workspace = true, features = ["macros", "rt-multi-thread"] }
chrono = { workspace = true, default-features = false, features = ["clock", "serde"] }
reqwest = { workspace = true, default-features = false, features = ["rustls-tls", "json"] }
aws-config = { workspace = true, default-features = false, features = ["rustls", "credentials-sso"] }
pageserver = {path="../pageserver"}
tracing.workspace = true
tracing-subscriber.workspace = true
clap.workspace = true
atty = "0.2"
tracing-appender = "0.2"

93
s3_scrubber/README.md Normal file
View File

@@ -0,0 +1,93 @@
# Neon S3 scrubber
This tool directly accesses the S3 buckets used by the Neon `pageserver`
and `safekeeper`, and does housekeeping such as cleaning up objects for tenants & timelines that no longer exist.
## Usage
### Generic Parameters
#### S3
Do `aws sso login --profile dev` to get the SSO access to the bucket to clean, get the SSO_ACCOUNT_ID for your profile (`cat ~/.aws/config` may help).
- `SSO_ACCOUNT_ID`: Credentials id to use for accessing S3 buckets
- `REGION`: A region where the bucket is located at.
- `BUCKET`: Bucket name
#### Console API
_This section is only relevant if using a command that requires access to Neon's internal control plane_
- `CLOUD_ADMIN_API_URL`: The URL base to use for checking tenant/timeline for existence via the Cloud API. e.g. `https://<admin host>/admin`
- `CLOUD_ADMIN_API_TOKEN`: The token to provide when querying the admin API. Get one on the corresponding console page, e.g. `https://<admin host>/app/settings/api-keys`
### Commands
#### `tidy`
Iterate over S3 buckets for storage nodes, checking their contents and removing the data not present in the console. Node S3 data that's not removed is then further checked for discrepancies and, sometimes, validated.
Unless the global `--delete` argument is provided, this command only dry-runs and logs
what it would have deleted.
```
tidy --node-kind=<safekeeper|pageserver> [--depth=<tenant|timeline>] [--skip-validation]
```
- `--node-kind`: whether to inspect safekeeper or pageserver bucket prefix
- `--depth`: whether to only search for deletable tenants, or also search for
deletable timelines within active tenants. Default: `tenant`
- `--skip-validation`: skip additional post-deletion checks. Default: `false`
For a selected S3 path, the tool lists the S3 bucket given for either tenants or both tenants and timelines — for every found entry, console API is queried: any deleted or missing in the API entity is scheduled for deletion from S3.
If validation is enabled, only the non-deleted tenants' ones are checked.
For pageserver, timelines' index_part.json on S3 is also checked for various discrepancies: no files are removed, even if there are "extra" S3 files not present in index_part.json: due to the way pageserver updates the remote storage, it's better to do such removals manually, stopping the corresponding tenant first.
Command examples:
`env SSO_ACCOUNT_ID=369495373322 REGION=eu-west-1 BUCKET=neon-dev-storage-eu-west-1 CLOUD_ADMIN_API_TOKEN=${NEON_CLOUD_ADMIN_API_STAGING_KEY} CLOUD_ADMIN_API_URL=[url] cargo run --release -- tidy --node-kind=safekeeper`
`env SSO_ACCOUNT_ID=369495373322 REGION=us-east-2 BUCKET=neon-staging-storage-us-east-2 CLOUD_ADMIN_API_TOKEN=${NEON_CLOUD_ADMIN_API_STAGING_KEY} CLOUD_ADMIN_API_URL=[url] cargo run --release -- tidy --node-kind=pageserver --depth=timeline`
When dry run stats look satisfying, use `-- --delete` before the `tidy` command to
disable dry run and run the binary with deletion enabled.
See these lines (and lines around) in the logs for the final stats:
- `Finished listing the bucket for tenants`
- `Finished active tenant and timeline validation`
- `Total tenant deletion stats`
- `Total timeline deletion stats`
## Current implementation details
- The tool does not have any peristent state currently: instead, it creates very verbose logs, with every S3 delete request logged, every tenant/timeline id check, etc.
Worse, any panic or early errored tasks might force the tool to exit without printing the final summary — all affected ids will still be in the logs though. The tool has retries inside it, so it's error-resistant up to some extent, and recent runs showed no traces of errors/panics.
- Instead of checking non-deleted tenants' timelines instantly, the tool attempts to create separate tasks (futures) for that,
complicating the logic and slowing down the process, this should be fixed and done in one "task".
- The tool does uses only publicly available remote resources (S3, console) and does not access pageserver/safekeeper nodes themselves.
Yet, its S3 set up should be prepared for running on any pageserver/safekeeper node, using node's S3 credentials, so the node API access logic could be implemented relatively simply on top.
## Cleanup procedure:
### Pageserver preparations
If S3 state is altered first manually, pageserver in-memory state will contain wrong data about S3 state, and tenants/timelines may get recreated on S3 (due to any layer upload due to compaction, pageserver restart, etc.). So before proceeding, for tenants/timelines which are already deleted in the console, we must remove these from pageservers.
First, we need to group pageservers by buckets, `https://<admin host>/admin/pageservers`` can be used for all env nodes, then `cat /storage/pageserver/data/pageserver.toml` on every node will show the bucket names and regions needed.
Per bucket, for every pageserver id related, find deleted tenants:
`curl -X POST "https://<admin_host>/admin/check_pageserver/{id}" -H "Accept: application/json" -H "Authorization: Bearer ${NEON_CLOUD_ADMIN_API_STAGING_KEY}" | jq`
use `?check_timelines=true` to find deleted timelines, but the check runs a separate query on every alive tenant, so that could be long and time out for big pageservers.
Note that some tenants/timelines could be marked as deleted in console, but console might continue querying the node later to fully remove the tenant/timeline: wait for some time before ensuring that the "extra" tenant/timeline is not going away by itself.
When all IDs are collected, manually go to every pageserver and detach/delete the tenant/timeline.
In future, the cleanup tool may access pageservers directly, but now it's only console and S3 it has access to.

438
s3_scrubber/src/checks.rs Normal file
View File

@@ -0,0 +1,438 @@
use std::collections::{hash_map, HashMap, HashSet};
use std::sync::Arc;
use std::time::Duration;
use anyhow::Context;
use aws_sdk_s3::Client;
use tokio::io::AsyncReadExt;
use tokio::task::JoinSet;
use tracing::{error, info, info_span, warn, Instrument};
use crate::cloud_admin_api::{BranchData, CloudAdminApiClient, ProjectId};
use crate::delete_batch_producer::DeleteProducerStats;
use crate::{list_objects_with_retries, RootTarget, MAX_RETRIES};
use pageserver::tenant::storage_layer::LayerFileName;
use pageserver::tenant::IndexPart;
use utils::id::TenantTimelineId;
pub async fn validate_pageserver_active_tenant_and_timelines(
s3_client: Arc<Client>,
s3_root: RootTarget,
admin_client: Arc<CloudAdminApiClient>,
batch_producer_stats: DeleteProducerStats,
) -> anyhow::Result<BranchCheckStats> {
let Some(timeline_stats) = batch_producer_stats.timeline_stats else {
info!("No tenant-only checks, exiting");
return Ok(BranchCheckStats::default());
};
let s3_active_projects = batch_producer_stats
.tenant_stats
.active_entries
.into_iter()
.map(|project| (project.id.clone(), project))
.collect::<HashMap<_, _>>();
info!("Validating {} active tenants", s3_active_projects.len());
let mut s3_active_branches_per_project = HashMap::<ProjectId, Vec<BranchData>>::new();
let mut s3_blob_data = HashMap::<TenantTimelineId, S3TimelineBlobData>::new();
for active_branch in timeline_stats.active_entries {
let active_project_id = active_branch.project_id.clone();
let active_branch_id = active_branch.id.clone();
let active_timeline_id = active_branch.timeline_id;
s3_active_branches_per_project
.entry(active_project_id.clone())
.or_default()
.push(active_branch);
let Some(active_project) = s3_active_projects.get(&active_project_id) else {
error!("Branch {:?} for project {:?} has no such project in the active projects", active_branch_id, active_project_id);
continue;
};
let id = TenantTimelineId::new(active_project.tenant, active_timeline_id);
s3_blob_data.insert(
id,
list_timeline_blobs(&s3_client, id, &s3_root)
.await
.with_context(|| format!("List timeline {id} blobs"))?,
);
}
let mut branch_checks = JoinSet::new();
for (_, s3_active_project) in s3_active_projects {
let project_id = &s3_active_project.id;
let tenant_id = s3_active_project.tenant;
let mut console_active_branches =
branches_for_project_with_retries(&admin_client, project_id)
.await
.with_context(|| {
format!("Client API branches for project {project_id:?} retrieval")
})?
.into_iter()
.map(|branch| (branch.id.clone(), branch))
.collect::<HashMap<_, _>>();
let active_branches = s3_active_branches_per_project
.remove(project_id)
.unwrap_or_default();
info!(
"Spawning tasks for {} tenant {} active timelines",
active_branches.len(),
tenant_id
);
for s3_active_branch in active_branches {
let console_branch = console_active_branches.remove(&s3_active_branch.id);
let timeline_id = s3_active_branch.timeline_id;
let id = TenantTimelineId::new(tenant_id, timeline_id);
let s3_data = s3_blob_data.remove(&id);
let s3_root = s3_root.clone();
branch_checks.spawn(
async move {
let check_errors = branch_cleanup_and_check_errors(
id,
&s3_root,
&s3_active_branch,
console_branch,
s3_data,
)
.await;
(id, check_errors)
}
.instrument(info_span!("check_timeline", id = %id)),
);
}
}
let mut total_stats = BranchCheckStats::default();
while let Some((id, branch_check_errors)) = branch_checks
.join_next()
.await
.transpose()
.context("branch check task join")?
{
total_stats.add(id, branch_check_errors);
}
Ok(total_stats)
}
async fn branches_for_project_with_retries(
admin_client: &CloudAdminApiClient,
project_id: &ProjectId,
) -> anyhow::Result<Vec<BranchData>> {
for _ in 0..MAX_RETRIES {
match admin_client.branches_for_project(project_id, false).await {
Ok(branches) => return Ok(branches),
Err(e) => {
error!("admin list branches for project {project_id:?} query failed: {e}");
tokio::time::sleep(Duration::from_secs(1)).await;
}
}
}
anyhow::bail!("Failed to list branches for project {project_id:?} {MAX_RETRIES} times")
}
#[derive(Debug, Default)]
pub struct BranchCheckStats {
pub timelines_with_errors: HashMap<TenantTimelineId, Vec<String>>,
pub normal_timelines: HashSet<TenantTimelineId>,
}
impl BranchCheckStats {
pub fn add(&mut self, id: TenantTimelineId, check_errors: Vec<String>) {
if check_errors.is_empty() {
if !self.normal_timelines.insert(id) {
panic!("Checking branch with timeline {id} more than once")
}
} else {
match self.timelines_with_errors.entry(id) {
hash_map::Entry::Occupied(_) => {
panic!("Checking branch with timeline {id} more than once")
}
hash_map::Entry::Vacant(v) => {
v.insert(check_errors);
}
}
}
}
}
async fn branch_cleanup_and_check_errors(
id: TenantTimelineId,
s3_root: &RootTarget,
s3_active_branch: &BranchData,
console_branch: Option<BranchData>,
s3_data: Option<S3TimelineBlobData>,
) -> Vec<String> {
info!(
"Checking timeline for branch branch {:?}/{:?}",
s3_active_branch.project_id, s3_active_branch.id
);
let mut branch_check_errors = Vec::new();
match console_branch {
Some(console_active_branch) => {
if console_active_branch.deleted {
branch_check_errors.push(format!("Timeline has deleted branch data in the console (id = {:?}, project_id = {:?}), recheck whether if it got removed during the check",
s3_active_branch.id, s3_active_branch.project_id))
}
},
None => branch_check_errors.push(format!("Timeline has no branch data in the console (id = {:?}, project_id = {:?}), recheck whether if it got removed during the check",
s3_active_branch.id, s3_active_branch.project_id))
}
let mut keys_to_remove = Vec::new();
match s3_data {
Some(s3_data) => {
keys_to_remove.extend(s3_data.keys_to_remove);
match s3_data.blob_data {
BlobDataParseResult::Parsed {
index_part,
mut s3_layers,
} => {
if !IndexPart::KNOWN_VERSIONS.contains(&index_part.get_version()) {
branch_check_errors.push(format!(
"index_part.json version: {}",
index_part.get_version()
))
}
if index_part.metadata.disk_consistent_lsn()
!= index_part.get_disk_consistent_lsn()
{
branch_check_errors.push(format!(
"Mismatching disk_consistent_lsn in TimelineMetadata ({}) and in the index_part ({})",
index_part.metadata.disk_consistent_lsn(),
index_part.get_disk_consistent_lsn(),
))
}
if index_part.layer_metadata.is_empty() {
// not an error, can happen for branches with zero writes, but notice that
info!("index_part.json has no layers");
}
for (layer, metadata) in index_part.layer_metadata {
if metadata.file_size == 0 {
branch_check_errors.push(format!(
"index_part.json contains a layer {} that has 0 size in its layer metadata", layer.file_name(),
))
}
if !s3_layers.remove(&layer) {
branch_check_errors.push(format!(
"index_part.json contains a layer {} that is not present in S3",
layer.file_name(),
))
}
}
if !s3_layers.is_empty() {
branch_check_errors.push(format!(
"index_part.json does not contain layers from S3: {:?}",
s3_layers
.iter()
.map(|layer_name| layer_name.file_name())
.collect::<Vec<_>>(),
));
keys_to_remove.extend(s3_layers.iter().map(|layer_name| {
let mut key = s3_root.timeline_root(id).prefix_in_bucket;
let delimiter = s3_root.delimiter();
if !key.ends_with(delimiter) {
key.push_str(delimiter);
}
key.push_str(&layer_name.file_name());
key
}));
}
}
BlobDataParseResult::Incorrect(parse_errors) => branch_check_errors.extend(
parse_errors
.into_iter()
.map(|error| format!("parse error: {error}")),
),
}
}
None => branch_check_errors.push("Timeline has no data on S3 at all".to_string()),
}
if branch_check_errors.is_empty() {
info!("No check errors found");
} else {
warn!("Found check errors: {branch_check_errors:?}");
}
if !keys_to_remove.is_empty() {
error!("The following keys should be removed from S3: {keys_to_remove:?}")
}
branch_check_errors
}
#[derive(Debug)]
struct S3TimelineBlobData {
blob_data: BlobDataParseResult,
keys_to_remove: Vec<String>,
}
#[derive(Debug)]
enum BlobDataParseResult {
Parsed {
index_part: IndexPart,
s3_layers: HashSet<LayerFileName>,
},
Incorrect(Vec<String>),
}
async fn list_timeline_blobs(
s3_client: &Client,
id: TenantTimelineId,
s3_root: &RootTarget,
) -> anyhow::Result<S3TimelineBlobData> {
let mut s3_layers = HashSet::new();
let mut index_part_object = None;
let timeline_dir_target = s3_root.timeline_root(id);
let mut continuation_token = None;
let mut errors = Vec::new();
let mut keys_to_remove = Vec::new();
loop {
let fetch_response =
list_objects_with_retries(s3_client, &timeline_dir_target, continuation_token.clone())
.await?;
let subdirectories = fetch_response.common_prefixes().unwrap_or_default();
if !subdirectories.is_empty() {
errors.push(format!(
"S3 list response should not contain any subdirectories, but got {subdirectories:?}"
));
}
for (object, key) in fetch_response
.contents()
.unwrap_or_default()
.iter()
.filter_map(|object| Some((object, object.key()?)))
{
let blob_name = key.strip_prefix(&timeline_dir_target.prefix_in_bucket);
match blob_name {
Some("index_part.json") => index_part_object = Some(object.clone()),
Some(maybe_layer_name) => match maybe_layer_name.parse::<LayerFileName>() {
Ok(new_layer) => {
s3_layers.insert(new_layer);
}
Err(e) => {
errors.push(
format!("S3 list response got an object with key {key} that is not a layer name: {e}"),
);
keys_to_remove.push(key.to_string());
}
},
None => {
errors.push(format!("S3 list response got an object with odd key {key}"));
keys_to_remove.push(key.to_string());
}
}
}
match fetch_response.next_continuation_token {
Some(new_token) => continuation_token = Some(new_token),
None => break,
}
}
if index_part_object.is_none() {
errors.push("S3 list response got no index_part.json file".to_string());
}
if let Some(index_part_object_key) = index_part_object.as_ref().and_then(|object| object.key())
{
let index_part_bytes = download_object_with_retries(
s3_client,
&timeline_dir_target.bucket_name,
index_part_object_key,
)
.await
.context("index_part.json download")?;
match serde_json::from_slice(&index_part_bytes) {
Ok(index_part) => {
return Ok(S3TimelineBlobData {
blob_data: BlobDataParseResult::Parsed {
index_part,
s3_layers,
},
keys_to_remove,
})
}
Err(index_parse_error) => errors.push(format!(
"index_part.json body parsing error: {index_parse_error}"
)),
}
} else {
errors.push(format!(
"Index part object {index_part_object:?} has no key"
));
}
if errors.is_empty() {
errors.push(
"Unexpected: no errors did not lead to a successfully parsed blob return".to_string(),
);
}
Ok(S3TimelineBlobData {
blob_data: BlobDataParseResult::Incorrect(errors),
keys_to_remove,
})
}
async fn download_object_with_retries(
s3_client: &Client,
bucket_name: &str,
key: &str,
) -> anyhow::Result<Vec<u8>> {
for _ in 0..MAX_RETRIES {
let mut body_buf = Vec::new();
let response_stream = match s3_client
.get_object()
.bucket(bucket_name)
.key(key)
.send()
.await
{
Ok(response) => response,
Err(e) => {
error!("Failed to download object for key {key}: {e}");
tokio::time::sleep(Duration::from_secs(1)).await;
continue;
}
};
match response_stream
.body
.into_async_read()
.read_to_end(&mut body_buf)
.await
{
Ok(bytes_read) => {
info!("Downloaded {bytes_read} bytes for object object with key {key}");
return Ok(body_buf);
}
Err(e) => {
error!("Failed to stream object body for key {key}: {e}");
tokio::time::sleep(Duration::from_secs(1)).await;
}
}
}
anyhow::bail!("Failed to download objects with key {key} {MAX_RETRIES} times")
}

View File

@@ -0,0 +1,418 @@
#![allow(unused)]
use chrono::{DateTime, Utc};
use reqwest::{header, Client, Url};
use tokio::sync::Semaphore;
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
#[derive(Debug)]
pub struct Error {
context: String,
kind: ErrorKind,
}
impl Error {
fn new(context: String, kind: ErrorKind) -> Self {
Self { context, kind }
}
}
impl std::fmt::Display for Error {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match &self.kind {
ErrorKind::RequestSend(e) => write!(
f,
"Failed to send a request. Context: {}, error: {}",
self.context, e
),
ErrorKind::BodyRead(e) => {
write!(
f,
"Failed to read a request body. Context: {}, error: {}",
self.context, e
)
}
ErrorKind::UnexpectedState => write!(f, "Unexpected state: {}", self.context),
}
}
}
#[derive(Debug, Clone, serde::Deserialize, Hash, PartialEq, Eq)]
#[serde(transparent)]
pub struct ProjectId(pub String);
#[derive(Clone, Debug, serde::Deserialize, Hash, PartialEq, Eq)]
#[serde(transparent)]
pub struct BranchId(pub String);
impl std::error::Error for Error {}
#[derive(Debug)]
pub enum ErrorKind {
RequestSend(reqwest::Error),
BodyRead(reqwest::Error),
UnexpectedState,
}
pub struct CloudAdminApiClient {
request_limiter: Semaphore,
token: String,
base_url: Url,
http_client: Client,
}
#[derive(Debug, serde::Deserialize)]
struct AdminApiResponse<T> {
data: T,
total: Option<usize>,
}
#[derive(Debug, serde::Deserialize)]
pub struct PageserverData {
pub id: u64,
pub created_at: DateTime<Utc>,
pub updated_at: DateTime<Utc>,
pub region_id: String,
pub version: i64,
pub instance_id: String,
pub port: u16,
pub http_host: String,
pub http_port: u16,
pub active: bool,
pub projects_count: usize,
pub availability_zone_id: String,
}
#[derive(Debug, Clone, serde::Deserialize)]
pub struct SafekeeperData {
pub id: u64,
pub created_at: DateTime<Utc>,
pub updated_at: DateTime<Utc>,
pub region_id: String,
pub version: i64,
pub instance_id: String,
pub active: bool,
pub host: String,
pub port: u16,
pub projects_count: usize,
pub availability_zone_id: String,
}
#[serde_with::serde_as]
#[derive(Debug, Clone, serde::Deserialize)]
pub struct ProjectData {
pub id: ProjectId,
pub name: String,
pub region_id: String,
pub platform_id: String,
pub user_id: String,
pub pageserver_id: u64,
#[serde_as(as = "serde_with::DisplayFromStr")]
pub tenant: TenantId,
pub safekeepers: Vec<SafekeeperData>,
pub deleted: bool,
pub created_at: DateTime<Utc>,
pub updated_at: DateTime<Utc>,
pub pg_version: u32,
pub max_project_size: u64,
pub remote_storage_size: u64,
pub resident_size: u64,
pub synthetic_storage_size: u64,
pub compute_time: u64,
pub data_transfer: u64,
pub data_storage: u64,
pub maintenance_set: Option<String>,
}
#[serde_with::serde_as]
#[derive(Debug, serde::Deserialize)]
pub struct BranchData {
pub id: BranchId,
pub created_at: DateTime<Utc>,
pub updated_at: DateTime<Utc>,
pub name: String,
pub project_id: ProjectId,
#[serde_as(as = "serde_with::DisplayFromStr")]
pub timeline_id: TimelineId,
#[serde(default)]
pub parent_id: Option<BranchId>,
#[serde(default)]
#[serde_as(as = "Option<serde_with::DisplayFromStr>")]
pub parent_lsn: Option<Lsn>,
pub default: bool,
pub deleted: bool,
pub logical_size: Option<u64>,
pub physical_size: Option<u64>,
pub written_size: Option<u64>,
}
impl CloudAdminApiClient {
pub fn new(token: String, base_url: Url) -> Self {
Self {
token,
base_url,
request_limiter: Semaphore::new(200),
http_client: Client::new(), // TODO timeout configs at least
}
}
pub async fn find_tenant_project(
&self,
tenant_id: TenantId,
) -> Result<Option<ProjectData>, Error> {
let _permit = self
.request_limiter
.acquire()
.await
.expect("Semaphore is not closed");
let response = self
.http_client
.get(self.append_url("/projects"))
.query(&[
("tenant_id", tenant_id.to_string()),
("show_deleted", "true".to_string()),
])
.header(header::ACCEPT, "application/json")
.bearer_auth(&self.token)
.send()
.await
.map_err(|e| {
Error::new(
"Find project for tenant".to_string(),
ErrorKind::RequestSend(e),
)
})?;
let response: AdminApiResponse<Vec<ProjectData>> = response.json().await.map_err(|e| {
Error::new(
"Find project for tenant".to_string(),
ErrorKind::BodyRead(e),
)
})?;
match response.data.len() {
0 => Ok(None),
1 => Ok(Some(
response
.data
.into_iter()
.next()
.expect("Should have exactly one element"),
)),
too_many => Err(Error::new(
format!("Find project for tenant returned {too_many} projects instead of 0 or 1"),
ErrorKind::UnexpectedState,
)),
}
}
pub async fn find_timeline_branch(
&self,
timeline_id: TimelineId,
) -> Result<Option<BranchData>, Error> {
let _permit = self
.request_limiter
.acquire()
.await
.expect("Semaphore is not closed");
let response = self
.http_client
.get(self.append_url("/branches"))
.query(&[
("timeline_id", timeline_id.to_string()),
("show_deleted", "true".to_string()),
])
.header(header::ACCEPT, "application/json")
.bearer_auth(&self.token)
.send()
.await
.map_err(|e| {
Error::new(
"Find branch for timeline".to_string(),
ErrorKind::RequestSend(e),
)
})?;
let response: AdminApiResponse<Vec<BranchData>> = response.json().await.map_err(|e| {
Error::new(
"Find branch for timeline".to_string(),
ErrorKind::BodyRead(e),
)
})?;
match response.data.len() {
0 => Ok(None),
1 => Ok(Some(
response
.data
.into_iter()
.next()
.expect("Should have exactly one element"),
)),
too_many => Err(Error::new(
format!("Find branch for timeline returned {too_many} branches instead of 0 or 1"),
ErrorKind::UnexpectedState,
)),
}
}
pub async fn list_pageservers(&self) -> Result<Vec<PageserverData>, Error> {
let _permit = self
.request_limiter
.acquire()
.await
.expect("Semaphore is not closed");
let response = self
.http_client
.get(self.append_url("/pageservers"))
.header(header::ACCEPT, "application/json")
.bearer_auth(&self.token)
.send()
.await
.map_err(|e| Error::new("List pageservers".to_string(), ErrorKind::RequestSend(e)))?;
let response: AdminApiResponse<Vec<PageserverData>> = response
.json()
.await
.map_err(|e| Error::new("List pageservers".to_string(), ErrorKind::BodyRead(e)))?;
Ok(response.data)
}
pub async fn list_safekeepers(&self) -> Result<Vec<SafekeeperData>, Error> {
let _permit = self
.request_limiter
.acquire()
.await
.expect("Semaphore is not closed");
let response = self
.http_client
.get(self.append_url("/safekeepers"))
.header(header::ACCEPT, "application/json")
.bearer_auth(&self.token)
.send()
.await
.map_err(|e| Error::new("List safekeepers".to_string(), ErrorKind::RequestSend(e)))?;
let response: AdminApiResponse<Vec<SafekeeperData>> = response
.json()
.await
.map_err(|e| Error::new("List safekeepers".to_string(), ErrorKind::BodyRead(e)))?;
Ok(response.data)
}
pub async fn projects_for_pageserver(
&self,
pageserver_id: u64,
show_deleted: bool,
) -> Result<Vec<ProjectData>, Error> {
let _permit = self
.request_limiter
.acquire()
.await
.expect("Semaphore is not closed");
let response = self
.http_client
.get(self.append_url("/projects"))
.query(&[
("pageserver_id", &pageserver_id.to_string()),
("show_deleted", &show_deleted.to_string()),
])
.header(header::ACCEPT, "application/json")
.bearer_auth(&self.token)
.send()
.await
.map_err(|e| Error::new("Project for tenant".to_string(), ErrorKind::RequestSend(e)))?;
let response: AdminApiResponse<Vec<ProjectData>> = response
.json()
.await
.map_err(|e| Error::new("Project for tenant".to_string(), ErrorKind::BodyRead(e)))?;
Ok(response.data)
}
pub async fn project_for_tenant(
&self,
tenant_id: TenantId,
show_deleted: bool,
) -> Result<Option<ProjectData>, Error> {
let _permit = self
.request_limiter
.acquire()
.await
.expect("Semaphore is not closed");
let response = self
.http_client
.get(self.append_url("/projects"))
.query(&[
("search", &tenant_id.to_string()),
("show_deleted", &show_deleted.to_string()),
])
.header(header::ACCEPT, "application/json")
.bearer_auth(&self.token)
.send()
.await
.map_err(|e| Error::new("Project for tenant".to_string(), ErrorKind::RequestSend(e)))?;
let response: AdminApiResponse<Vec<ProjectData>> = response
.json()
.await
.map_err(|e| Error::new("Project for tenant".to_string(), ErrorKind::BodyRead(e)))?;
match response.data.as_slice() {
[] => Ok(None),
[_single] => Ok(Some(response.data.into_iter().next().unwrap())),
multiple => Err(Error::new(
format!("Got more than one project for tenant {tenant_id} : {multiple:?}"),
ErrorKind::UnexpectedState,
)),
}
}
pub async fn branches_for_project(
&self,
project_id: &ProjectId,
show_deleted: bool,
) -> Result<Vec<BranchData>, Error> {
let _permit = self
.request_limiter
.acquire()
.await
.expect("Semaphore is not closed");
let response = self
.http_client
.get(self.append_url("/branches"))
.query(&[
("project_id", &project_id.0),
("show_deleted", &show_deleted.to_string()),
])
.header(header::ACCEPT, "application/json")
.bearer_auth(&self.token)
.send()
.await
.map_err(|e| Error::new("Project for tenant".to_string(), ErrorKind::RequestSend(e)))?;
let response: AdminApiResponse<Vec<BranchData>> = response
.json()
.await
.map_err(|e| Error::new("Project for tenant".to_string(), ErrorKind::BodyRead(e)))?;
Ok(response.data)
}
fn append_url(&self, subpath: &str) -> Url {
// TODO fugly, but `.join` does not work when called
(self.base_url.to_string() + subpath)
.parse()
.unwrap_or_else(|e| panic!("Could not append {subpath} to base url: {e}"))
}
}

View File

@@ -0,0 +1,354 @@
mod tenant_batch;
mod timeline_batch;
use std::future::Future;
use std::str::FromStr;
use std::sync::Arc;
use std::time::Duration;
use anyhow::Context;
use aws_sdk_s3::Client;
use either::Either;
use tokio::sync::mpsc::UnboundedReceiver;
use tokio::sync::Mutex;
use tokio::task::{JoinHandle, JoinSet};
use tracing::{error, info, info_span, Instrument};
use crate::cloud_admin_api::{BranchData, CloudAdminApiClient, ProjectData};
use crate::{list_objects_with_retries, RootTarget, S3Target, TraversingDepth, MAX_RETRIES};
use utils::id::{TenantId, TenantTimelineId};
/// Typical tenant to remove contains 1 layer and 1 index_part.json blobs
/// Also, there are some non-standard tenants to remove, having more layers.
/// delete_objects request allows up to 1000 keys, so be on a safe side and allow most
/// batch processing tasks to do 1 delete objects request only.
///
/// Every batch item will be additionally S3 LS'ed later, so keep the batch size
/// even lower to allow multiple concurrent tasks do the LS requests.
const BATCH_SIZE: usize = 100;
pub struct DeleteBatchProducer {
delete_tenants_sender_task: JoinHandle<anyhow::Result<ProcessedS3List<TenantId, ProjectData>>>,
delete_timelines_sender_task:
JoinHandle<anyhow::Result<ProcessedS3List<TenantTimelineId, BranchData>>>,
delete_batch_creator_task: JoinHandle<()>,
delete_batch_receiver: Arc<Mutex<UnboundedReceiver<DeleteBatch>>>,
}
pub struct DeleteProducerStats {
pub tenant_stats: ProcessedS3List<TenantId, ProjectData>,
pub timeline_stats: Option<ProcessedS3List<TenantTimelineId, BranchData>>,
}
impl DeleteProducerStats {
pub fn tenants_checked(&self) -> usize {
self.tenant_stats.entries_total
}
pub fn active_tenants(&self) -> usize {
self.tenant_stats.active_entries.len()
}
pub fn timelines_checked(&self) -> usize {
self.timeline_stats
.as_ref()
.map(|stats| stats.entries_total)
.unwrap_or(0)
}
}
#[derive(Debug, Default, Clone)]
pub struct DeleteBatch {
pub tenants: Vec<TenantId>,
pub timelines: Vec<TenantTimelineId>,
}
impl DeleteBatch {
pub fn merge(&mut self, other: Self) {
self.tenants.extend(other.tenants);
self.timelines.extend(other.timelines);
}
pub fn len(&self) -> usize {
self.tenants.len() + self.timelines.len()
}
pub fn is_empty(&self) -> bool {
self.len() == 0
}
}
impl DeleteBatchProducer {
pub fn start(
admin_client: Arc<CloudAdminApiClient>,
s3_client: Arc<Client>,
s3_root_target: RootTarget,
traversing_depth: TraversingDepth,
) -> Self {
let (delete_elements_sender, mut delete_elements_receiver) =
tokio::sync::mpsc::unbounded_channel();
let delete_elements_sender = Arc::new(delete_elements_sender);
let admin_client = Arc::new(admin_client);
let (projects_to_check_sender, mut projects_to_check_receiver) =
tokio::sync::mpsc::unbounded_channel();
let delete_tenants_root_target = s3_root_target.clone();
let delete_tenants_client = Arc::clone(&s3_client);
let delete_tenants_admin_client = Arc::clone(&admin_client);
let delete_sender = Arc::clone(&delete_elements_sender);
let delete_tenants_sender_task = tokio::spawn(
async move {
tenant_batch::schedule_cleanup_deleted_tenants(
&delete_tenants_root_target,
&delete_tenants_client,
&delete_tenants_admin_client,
projects_to_check_sender,
delete_sender,
traversing_depth,
)
.await
}
.instrument(info_span!("delete_tenants_sender")),
);
let delete_timelines_sender_task = tokio::spawn(async move {
timeline_batch::schedule_cleanup_deleted_timelines(
&s3_root_target,
&s3_client,
&admin_client,
&mut projects_to_check_receiver,
delete_elements_sender,
)
.in_current_span()
.await
});
let (delete_batch_sender, delete_batch_receiver) = tokio::sync::mpsc::unbounded_channel();
let delete_batch_creator_task = tokio::spawn(
async move {
'outer: loop {
let mut delete_batch = DeleteBatch::default();
while delete_batch.len() < BATCH_SIZE {
match delete_elements_receiver.recv().await {
Some(new_task) => match new_task {
Either::Left(tenant_id) => delete_batch.tenants.push(tenant_id),
Either::Right(timeline_id) => {
delete_batch.timelines.push(timeline_id)
}
},
None => {
info!("Task finished: sender dropped");
delete_batch_sender.send(delete_batch).ok();
break 'outer;
}
}
}
if !delete_batch.is_empty() {
delete_batch_sender.send(delete_batch).ok();
}
}
}
.instrument(info_span!("delete batch creator")),
);
Self {
delete_tenants_sender_task,
delete_timelines_sender_task,
delete_batch_creator_task,
delete_batch_receiver: Arc::new(Mutex::new(delete_batch_receiver)),
}
}
pub fn subscribe(&self) -> Arc<Mutex<UnboundedReceiver<DeleteBatch>>> {
self.delete_batch_receiver.clone()
}
pub async fn join(self) -> anyhow::Result<DeleteProducerStats> {
let (delete_tenants_task_result, delete_timelines_task_result, batch_task_result) = tokio::join!(
self.delete_tenants_sender_task,
self.delete_timelines_sender_task,
self.delete_batch_creator_task,
);
let tenant_stats = match delete_tenants_task_result {
Ok(Ok(stats)) => stats,
Ok(Err(tenant_deletion_error)) => return Err(tenant_deletion_error),
Err(join_error) => {
anyhow::bail!("Failed to join the delete tenant producing task: {join_error}")
}
};
let timeline_stats = match delete_timelines_task_result {
Ok(Ok(stats)) => Some(stats),
Ok(Err(timeline_deletion_error)) => return Err(timeline_deletion_error),
Err(join_error) => {
anyhow::bail!("Failed to join the delete timeline producing task: {join_error}")
}
};
match batch_task_result {
Ok(()) => (),
Err(join_error) => anyhow::bail!("Failed to join the batch forming task: {join_error}"),
};
Ok(DeleteProducerStats {
tenant_stats,
timeline_stats,
})
}
}
pub struct ProcessedS3List<I, A> {
pub entries_total: usize,
pub entries_to_delete: Vec<I>,
pub active_entries: Vec<A>,
}
impl<I, A> Default for ProcessedS3List<I, A> {
fn default() -> Self {
Self {
entries_total: 0,
entries_to_delete: Vec::new(),
active_entries: Vec::new(),
}
}
}
impl<I, A> ProcessedS3List<I, A> {
fn merge(&mut self, other: Self) {
self.entries_total += other.entries_total;
self.entries_to_delete.extend(other.entries_to_delete);
self.active_entries.extend(other.active_entries);
}
fn change_ids<NewI>(self, transform: impl Fn(I) -> NewI) -> ProcessedS3List<NewI, A> {
ProcessedS3List {
entries_total: self.entries_total,
entries_to_delete: self.entries_to_delete.into_iter().map(transform).collect(),
active_entries: self.active_entries,
}
}
}
async fn process_s3_target_recursively<F, Fut, I, E, A>(
s3_client: &Client,
target: &S3Target,
find_active_and_deleted_entries: F,
) -> anyhow::Result<ProcessedS3List<I, A>>
where
I: FromStr<Err = E> + Send + Sync,
E: Send + Sync + std::error::Error + 'static,
F: FnOnce(Vec<I>) -> Fut + Clone,
Fut: Future<Output = anyhow::Result<ProcessedS3List<I, A>>>,
{
let mut continuation_token = None;
let mut total_entries = ProcessedS3List::default();
loop {
let fetch_response =
list_objects_with_retries(s3_client, target, continuation_token.clone()).await?;
let new_entry_ids = fetch_response
.common_prefixes()
.unwrap_or_default()
.iter()
.filter_map(|prefix| prefix.prefix())
.filter_map(|prefix| -> Option<&str> {
prefix
.strip_prefix(&target.prefix_in_bucket)?
.strip_suffix('/')
})
.map(|entry_id_str| {
entry_id_str
.parse()
.with_context(|| format!("Incorrect entry id str: {entry_id_str}"))
})
.collect::<anyhow::Result<Vec<I>>>()
.context("list and parse bucket's entry ids")?;
total_entries.merge(
(find_active_and_deleted_entries.clone())(new_entry_ids)
.await
.context("filter active and deleted entry ids")?,
);
match fetch_response.next_continuation_token {
Some(new_token) => continuation_token = Some(new_token),
None => break,
}
}
Ok(total_entries)
}
enum FetchResult<A> {
Found(A),
Deleted,
Absent,
}
async fn split_to_active_and_deleted_entries<I, A, F, Fut>(
new_entry_ids: Vec<I>,
find_active_entry: F,
) -> anyhow::Result<ProcessedS3List<I, A>>
where
I: std::fmt::Display + Send + Sync + 'static + Copy,
A: Send + 'static,
F: FnOnce(I) -> Fut + Send + Sync + 'static + Clone,
Fut: Future<Output = anyhow::Result<FetchResult<A>>> + Send,
{
let entries_total = new_entry_ids.len();
let mut check_tasks = JoinSet::new();
let mut active_entries = Vec::with_capacity(entries_total);
let mut entries_to_delete = Vec::with_capacity(entries_total);
for new_entry_id in new_entry_ids {
let check_closure = find_active_entry.clone();
check_tasks.spawn(
async move {
(
new_entry_id,
async {
for _ in 0..MAX_RETRIES {
let closure_clone = check_closure.clone();
match closure_clone(new_entry_id).await {
Ok(active_entry) => return Ok(active_entry),
Err(e) => {
error!("find active entry admin API call failed: {e}");
tokio::time::sleep(Duration::from_secs(1)).await;
}
}
}
anyhow::bail!("Failed to check entry {new_entry_id} {MAX_RETRIES} times")
}
.await,
)
}
.instrument(info_span!("filter_active_entries")),
);
}
while let Some(task_result) = check_tasks.join_next().await {
let (entry_id, entry_data_fetch_result) = task_result.context("task join")?;
match entry_data_fetch_result.context("entry data fetch")? {
FetchResult::Found(active_entry) => {
info!("Entry {entry_id} is alive, cannot delete");
active_entries.push(active_entry);
}
FetchResult::Deleted => {
info!("Entry {entry_id} deleted in the admin data, can safely delete");
entries_to_delete.push(entry_id);
}
FetchResult::Absent => {
info!("Entry {entry_id} absent in the admin data, can safely delete");
entries_to_delete.push(entry_id);
}
}
}
Ok(ProcessedS3List {
entries_total,
entries_to_delete,
active_entries,
})
}

View File

@@ -0,0 +1,87 @@
use std::sync::Arc;
use anyhow::Context;
use aws_sdk_s3::Client;
use either::Either;
use tokio::sync::mpsc::UnboundedSender;
use tracing::info;
use crate::cloud_admin_api::{CloudAdminApiClient, ProjectData};
use crate::delete_batch_producer::FetchResult;
use crate::{RootTarget, TraversingDepth};
use utils::id::{TenantId, TenantTimelineId};
use super::ProcessedS3List;
pub async fn schedule_cleanup_deleted_tenants(
s3_root_target: &RootTarget,
s3_client: &Arc<Client>,
admin_client: &Arc<CloudAdminApiClient>,
projects_to_check_sender: UnboundedSender<ProjectData>,
delete_sender: Arc<UnboundedSender<Either<TenantId, TenantTimelineId>>>,
traversing_depth: TraversingDepth,
) -> anyhow::Result<ProcessedS3List<TenantId, ProjectData>> {
info!(
"Starting to list the bucket from root {}",
s3_root_target.bucket_name()
);
s3_client
.head_bucket()
.bucket(s3_root_target.bucket_name())
.send()
.await
.with_context(|| format!("bucket {} was not found", s3_root_target.bucket_name()))?;
let check_client = Arc::clone(admin_client);
let tenant_stats = super::process_s3_target_recursively(
s3_client,
s3_root_target.tenants_root(),
|s3_tenants| async move {
let another_client = Arc::clone(&check_client);
super::split_to_active_and_deleted_entries(s3_tenants, move |tenant_id| async move {
let project_data = another_client
.find_tenant_project(tenant_id)
.await
.with_context(|| format!("Tenant {tenant_id} project admin check"))?;
Ok(if let Some(console_project) = project_data {
if console_project.deleted {
delete_sender.send(Either::Left(tenant_id)).ok();
FetchResult::Deleted
} else {
if traversing_depth == TraversingDepth::Timeline {
projects_to_check_sender.send(console_project.clone()).ok();
}
FetchResult::Found(console_project)
}
} else {
delete_sender.send(Either::Left(tenant_id)).ok();
FetchResult::Absent
})
})
.await
},
)
.await
.context("tenant batch processing")?;
info!(
"Among {} tenants, found {} tenants to delete and {} active ones",
tenant_stats.entries_total,
tenant_stats.entries_to_delete.len(),
tenant_stats.active_entries.len(),
);
let tenant_stats = match traversing_depth {
TraversingDepth::Tenant => {
info!("Finished listing the bucket for tenants only");
tenant_stats
}
TraversingDepth::Timeline => {
info!("Finished listing the bucket for tenants and sent {} active tenants to check for timelines", tenant_stats.active_entries.len());
tenant_stats
}
};
Ok(tenant_stats)
}

View File

@@ -0,0 +1,102 @@
use std::sync::Arc;
use anyhow::Context;
use aws_sdk_s3::Client;
use either::Either;
use tokio::sync::mpsc::{UnboundedReceiver, UnboundedSender};
use tracing::{info, info_span, Instrument};
use crate::cloud_admin_api::{BranchData, CloudAdminApiClient, ProjectData};
use crate::delete_batch_producer::{FetchResult, ProcessedS3List};
use crate::RootTarget;
use utils::id::{TenantId, TenantTimelineId};
pub async fn schedule_cleanup_deleted_timelines(
s3_root_target: &RootTarget,
s3_client: &Arc<Client>,
admin_client: &Arc<CloudAdminApiClient>,
projects_to_check_receiver: &mut UnboundedReceiver<ProjectData>,
delete_elements_sender: Arc<UnboundedSender<Either<TenantId, TenantTimelineId>>>,
) -> anyhow::Result<ProcessedS3List<TenantTimelineId, BranchData>> {
info!(
"Starting to list the bucket from root {}",
s3_root_target.bucket_name()
);
s3_client
.head_bucket()
.bucket(s3_root_target.bucket_name())
.send()
.await
.with_context(|| format!("bucket {} was not found", s3_root_target.bucket_name()))?;
let mut timeline_stats = ProcessedS3List::default();
while let Some(project_to_check) = projects_to_check_receiver.recv().await {
let check_client = Arc::clone(admin_client);
let check_s3_client = Arc::clone(s3_client);
let check_delete_sender = Arc::clone(&delete_elements_sender);
let check_root = s3_root_target.clone();
let new_stats = async move {
let tenant_id_to_check = project_to_check.tenant;
let check_target = check_root.timelines_root(tenant_id_to_check);
let stats = super::process_s3_target_recursively(
&check_s3_client,
&check_target,
|s3_timelines| async move {
let another_client = check_client.clone();
super::split_to_active_and_deleted_entries(
s3_timelines,
move |timeline_id| async move {
let console_branch = another_client
.find_timeline_branch(timeline_id)
.await
.map_err(|e| {
anyhow::anyhow!(
"Timeline {timeline_id} branch admin check: {e}"
)
})?;
let id = TenantTimelineId::new(tenant_id_to_check, timeline_id);
Ok(match console_branch {
Some(console_branch) => {
if console_branch.deleted {
check_delete_sender.send(Either::Right(id)).ok();
FetchResult::Deleted
} else {
FetchResult::Found(console_branch)
}
}
None => {
check_delete_sender.send(Either::Right(id)).ok();
FetchResult::Absent
}
})
},
)
.await
},
)
.await
.with_context(|| format!("tenant {tenant_id_to_check} timeline batch processing"))?
.change_ids(|timeline_id| TenantTimelineId::new(tenant_id_to_check, timeline_id));
Ok::<_, anyhow::Error>(stats)
}
.instrument(info_span!("delete_timelines_sender", tenant = %project_to_check.tenant))
.await?;
timeline_stats.merge(new_stats);
}
info!(
"Among {} timelines, found {} timelines to delete and {} active ones",
timeline_stats.entries_total,
timeline_stats.entries_to_delete.len(),
timeline_stats.active_entries.len(),
);
Ok(timeline_stats)
}

204
s3_scrubber/src/lib.rs Normal file
View File

@@ -0,0 +1,204 @@
pub mod checks;
pub mod cloud_admin_api;
pub mod delete_batch_producer;
mod s3_deletion;
use std::env;
use std::fmt::Display;
use std::time::Duration;
use aws_config::environment::EnvironmentVariableCredentialsProvider;
use aws_config::imds::credentials::ImdsCredentialsProvider;
use aws_config::meta::credentials::CredentialsProviderChain;
use aws_config::sso::SsoCredentialsProvider;
use aws_sdk_s3::config::Region;
use aws_sdk_s3::{Client, Config};
pub use s3_deletion::S3Deleter;
use tracing::error;
use tracing_appender::non_blocking::WorkerGuard;
use tracing_subscriber::{fmt, prelude::*, EnvFilter};
use utils::id::{TenantId, TenantTimelineId};
const MAX_RETRIES: usize = 20;
const CLOUD_ADMIN_API_TOKEN_ENV_VAR: &str = "CLOUD_ADMIN_API_TOKEN";
#[derive(Debug, Clone)]
pub struct S3Target {
pub bucket_name: String,
pub prefix_in_bucket: String,
pub delimiter: String,
}
#[derive(clap::ValueEnum, Debug, Clone, Copy, PartialEq, Eq)]
pub enum TraversingDepth {
Tenant,
Timeline,
}
impl Display for TraversingDepth {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.write_str(match self {
Self::Tenant => "tenant",
Self::Timeline => "timeline",
})
}
}
impl S3Target {
pub fn with_sub_segment(&self, new_segment: &str) -> Self {
let mut new_self = self.clone();
let _ = new_self.prefix_in_bucket.pop();
new_self.prefix_in_bucket =
[&new_self.prefix_in_bucket, new_segment, ""].join(&new_self.delimiter);
new_self
}
}
#[derive(Clone)]
pub enum RootTarget {
Pageserver(S3Target),
Safekeeper(S3Target),
}
impl RootTarget {
pub fn tenants_root(&self) -> &S3Target {
match self {
Self::Pageserver(root) => root,
Self::Safekeeper(root) => root,
}
}
pub fn tenant_root(&self, tenant_id: TenantId) -> S3Target {
self.tenants_root().with_sub_segment(&tenant_id.to_string())
}
pub fn timelines_root(&self, tenant_id: TenantId) -> S3Target {
match self {
Self::Pageserver(_) => self.tenant_root(tenant_id).with_sub_segment("timelines"),
Self::Safekeeper(_) => self.tenant_root(tenant_id),
}
}
pub fn timeline_root(&self, id: TenantTimelineId) -> S3Target {
self.timelines_root(id.tenant_id)
.with_sub_segment(&id.timeline_id.to_string())
}
pub fn bucket_name(&self) -> &str {
match self {
Self::Pageserver(root) => &root.bucket_name,
Self::Safekeeper(root) => &root.bucket_name,
}
}
pub fn delimiter(&self) -> &str {
match self {
Self::Pageserver(root) => &root.delimiter,
Self::Safekeeper(root) => &root.delimiter,
}
}
}
pub fn get_cloud_admin_api_token_or_exit() -> String {
match env::var(CLOUD_ADMIN_API_TOKEN_ENV_VAR) {
Ok(token) => token,
Err(env::VarError::NotPresent) => {
error!("{CLOUD_ADMIN_API_TOKEN_ENV_VAR} env variable is not present");
std::process::exit(1);
}
Err(env::VarError::NotUnicode(not_unicode_string)) => {
error!("{CLOUD_ADMIN_API_TOKEN_ENV_VAR} env variable's value is not a valid unicode string: {not_unicode_string:?}");
std::process::exit(1);
}
}
}
pub fn init_logging(binary_name: &str, dry_run: bool, node_kind: &str) -> WorkerGuard {
let file_name = if dry_run {
format!(
"{}_{}_{}__dry.log",
binary_name,
node_kind,
chrono::Utc::now().format("%Y_%m_%d__%H_%M_%S")
)
} else {
format!(
"{}_{}_{}.log",
binary_name,
node_kind,
chrono::Utc::now().format("%Y_%m_%d__%H_%M_%S")
)
};
let (file_writer, guard) =
tracing_appender::non_blocking(tracing_appender::rolling::never("./logs/", file_name));
let file_logs = fmt::Layer::new()
.with_target(false)
.with_ansi(false)
.with_writer(file_writer);
let stdout_logs = fmt::Layer::new()
.with_target(false)
.with_ansi(atty::is(atty::Stream::Stdout))
.with_writer(std::io::stdout);
tracing_subscriber::registry()
.with(EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")))
.with(file_logs)
.with(stdout_logs)
.init();
guard
}
pub fn init_s3_client(account_id: String, bucket_region: Region) -> Client {
let credentials_provider = {
// uses "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"
CredentialsProviderChain::first_try("env", EnvironmentVariableCredentialsProvider::new())
// uses sso
.or_else(
"sso",
SsoCredentialsProvider::builder()
.account_id(account_id)
.role_name("PowerUserAccess")
.start_url("https://neondb.awsapps.com/start")
.region(Region::from_static("eu-central-1"))
.build(),
)
// uses imds v2
.or_else("imds", ImdsCredentialsProvider::builder().build())
};
let config = Config::builder()
.region(bucket_region)
.credentials_provider(credentials_provider)
.build();
Client::from_conf(config)
}
async fn list_objects_with_retries(
s3_client: &Client,
s3_target: &S3Target,
continuation_token: Option<String>,
) -> anyhow::Result<aws_sdk_s3::operation::list_objects_v2::ListObjectsV2Output> {
for _ in 0..MAX_RETRIES {
match s3_client
.list_objects_v2()
.bucket(&s3_target.bucket_name)
.prefix(&s3_target.prefix_in_bucket)
.delimiter(&s3_target.delimiter)
.set_continuation_token(continuation_token.clone())
.send()
.await
{
Ok(response) => return Ok(response),
Err(e) => {
error!("list_objects_v2 query failed: {e}");
tokio::time::sleep(Duration::from_secs(1)).await;
}
}
}
anyhow::bail!("Failed to list objects {MAX_RETRIES} times")
}

268
s3_scrubber/src/main.rs Normal file
View File

@@ -0,0 +1,268 @@
use std::collections::HashMap;
use std::env;
use std::fmt::Display;
use std::num::NonZeroUsize;
use std::sync::Arc;
use anyhow::Context;
use aws_sdk_s3::config::Region;
use reqwest::Url;
use s3_scrubber::cloud_admin_api::CloudAdminApiClient;
use s3_scrubber::delete_batch_producer::DeleteBatchProducer;
use s3_scrubber::{
checks, get_cloud_admin_api_token_or_exit, init_logging, init_s3_client, RootTarget, S3Deleter,
S3Target, TraversingDepth,
};
use tracing::{info, info_span, warn};
use clap::{Parser, Subcommand, ValueEnum};
#[derive(Parser)]
#[command(author, version, about, long_about = None)]
#[command(arg_required_else_help(true))]
struct Cli {
#[command(subcommand)]
command: Command,
#[arg(short, long, default_value_t = false)]
delete: bool,
}
#[derive(ValueEnum, Clone, Copy, Eq, PartialEq)]
enum NodeKind {
Safekeeper,
Pageserver,
}
impl NodeKind {
fn as_str(&self) -> &'static str {
match self {
Self::Safekeeper => "safekeeper",
Self::Pageserver => "pageserver",
}
}
}
impl Display for NodeKind {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.write_str(self.as_str())
}
}
#[derive(Subcommand)]
enum Command {
Tidy {
#[arg(short, long)]
node_kind: NodeKind,
#[arg(short, long, default_value_t=TraversingDepth::Tenant)]
depth: TraversingDepth,
#[arg(short, long, default_value_t = false)]
skip_validation: bool,
},
}
struct BucketConfig {
region: String,
bucket: String,
sso_account_id: String,
}
impl Display for BucketConfig {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}/{}/{}", self.sso_account_id, self.region, self.bucket)
}
}
impl BucketConfig {
fn from_env() -> anyhow::Result<Self> {
let sso_account_id =
env::var("SSO_ACCOUNT_ID").context("'SSO_ACCOUNT_ID' param retrieval")?;
let region = env::var("REGION").context("'REGION' param retrieval")?;
let bucket = env::var("BUCKET").context("'BUCKET' param retrieval")?;
Ok(Self {
region,
bucket,
sso_account_id,
})
}
}
struct ConsoleConfig {
admin_api_url: Url,
}
impl ConsoleConfig {
fn from_env() -> anyhow::Result<Self> {
let admin_api_url: Url = env::var("CLOUD_ADMIN_API_URL")
.context("'CLOUD_ADMIN_API_URL' param retrieval")?
.parse()
.context("'CLOUD_ADMIN_API_URL' param parsing")?;
Ok(Self { admin_api_url })
}
}
async fn tidy(
cli: &Cli,
bucket_config: BucketConfig,
console_config: ConsoleConfig,
node_kind: NodeKind,
depth: TraversingDepth,
skip_validation: bool,
) -> anyhow::Result<()> {
let binary_name = env::args()
.next()
.context("binary name in not the first argument")?;
let dry_run = !cli.delete;
let _guard = init_logging(&binary_name, dry_run, node_kind.as_str());
let _main_span = info_span!("tidy", binary = %binary_name, %dry_run).entered();
if dry_run {
info!("Dry run, not removing items for real");
} else {
warn!("Dry run disabled, removing bucket items for real");
}
info!("skip_validation={skip_validation}");
info!("Starting extra S3 removal in {bucket_config} for node kind '{node_kind}', traversing depth: {depth:?}");
info!("Starting extra tenant S3 removal in {bucket_config} for node kind '{node_kind}'");
let cloud_admin_api_client = Arc::new(CloudAdminApiClient::new(
get_cloud_admin_api_token_or_exit(),
console_config.admin_api_url,
));
let bucket_region = Region::new(bucket_config.region);
let delimiter = "/".to_string();
let s3_client = Arc::new(init_s3_client(bucket_config.sso_account_id, bucket_region));
let s3_root = match node_kind {
NodeKind::Pageserver => RootTarget::Pageserver(S3Target {
bucket_name: bucket_config.bucket,
prefix_in_bucket: ["pageserver", "v1", "tenants", ""].join(&delimiter),
delimiter,
}),
NodeKind::Safekeeper => RootTarget::Safekeeper(S3Target {
bucket_name: bucket_config.bucket,
prefix_in_bucket: ["safekeeper", "v1", "wal", ""].join(&delimiter),
delimiter,
}),
};
let delete_batch_producer = DeleteBatchProducer::start(
Arc::clone(&cloud_admin_api_client),
Arc::clone(&s3_client),
s3_root.clone(),
depth,
);
let s3_deleter = S3Deleter::new(
dry_run,
NonZeroUsize::new(15).unwrap(),
Arc::clone(&s3_client),
delete_batch_producer.subscribe(),
s3_root.clone(),
);
let (deleter_task_result, batch_producer_task_result) =
tokio::join!(s3_deleter.remove_all(), delete_batch_producer.join());
let deletion_stats = deleter_task_result.context("s3 deletion")?;
info!(
"Deleted {} tenants ({} keys) and {} timelines ({} keys) total. Dry run: {}",
deletion_stats.deleted_tenant_keys.len(),
deletion_stats.deleted_tenant_keys.values().sum::<usize>(),
deletion_stats.deleted_timeline_keys.len(),
deletion_stats.deleted_timeline_keys.values().sum::<usize>(),
dry_run,
);
info!(
"Total tenant deletion stats: {:?}",
deletion_stats
.deleted_tenant_keys
.into_iter()
.map(|(id, key)| (id.to_string(), key))
.collect::<HashMap<_, _>>()
);
info!(
"Total timeline deletion stats: {:?}",
deletion_stats
.deleted_timeline_keys
.into_iter()
.map(|(id, key)| (id.to_string(), key))
.collect::<HashMap<_, _>>()
);
let batch_producer_stats = batch_producer_task_result.context("delete batch producer join")?;
info!(
"Total bucket tenants listed: {}; for {} active tenants, timelines checked: {}",
batch_producer_stats.tenants_checked(),
batch_producer_stats.active_tenants(),
batch_producer_stats.timelines_checked()
);
if node_kind == NodeKind::Pageserver {
info!("node_kind != pageserver, finish without performing validation step");
return Ok(());
}
if skip_validation {
info!("--skip-validation is set, exiting");
return Ok(());
}
info!("validating active tenants and timelines for pageserver S3 data");
// TODO kb real stats for validation + better stats for every place: add and print `min`, `max`, `mean` values at least
let validation_stats = checks::validate_pageserver_active_tenant_and_timelines(
s3_client,
s3_root,
cloud_admin_api_client,
batch_producer_stats,
)
.await
.context("active tenant and timeline validation")?;
info!("Finished active tenant and timeline validation, correct timelines: {}, timeline validation errors: {}",
validation_stats.normal_timelines.len(), validation_stats.timelines_with_errors.len());
if !validation_stats.timelines_with_errors.is_empty() {
warn!(
"Validation errors: {:#?}",
validation_stats
.timelines_with_errors
.into_iter()
.map(|(id, errors)| (id.to_string(), format!("{errors:?}")))
.collect::<HashMap<_, _>>()
);
}
info!("Done");
Ok(())
}
#[tokio::main]
async fn main() -> anyhow::Result<()> {
let cli = Cli::parse();
let bucket_config = BucketConfig::from_env()?;
match &cli.command {
Command::Tidy {
node_kind,
depth,
skip_validation,
} => {
let console_config = ConsoleConfig::from_env()?;
tidy(
&cli,
bucket_config,
console_config,
*node_kind,
*depth,
*skip_validation,
)
.await
}
}
}

View File

@@ -0,0 +1,434 @@
use std::collections::BTreeMap;
use std::num::NonZeroUsize;
use std::sync::Arc;
use std::time::Duration;
use anyhow::Context;
use aws_sdk_s3::types::{Delete, ObjectIdentifier};
use aws_sdk_s3::Client;
use tokio::sync::mpsc::error::TryRecvError;
use tokio::sync::mpsc::UnboundedReceiver;
use tokio::sync::Mutex;
use tokio::task::JoinSet;
use tracing::{debug, error, info, info_span, Instrument};
use crate::delete_batch_producer::DeleteBatch;
use crate::{list_objects_with_retries, RootTarget, S3Target, TenantId, MAX_RETRIES};
use utils::id::TenantTimelineId;
pub struct S3Deleter {
dry_run: bool,
concurrent_tasks_count: NonZeroUsize,
delete_batch_receiver: Arc<Mutex<UnboundedReceiver<DeleteBatch>>>,
s3_client: Arc<Client>,
s3_target: RootTarget,
}
impl S3Deleter {
pub fn new(
dry_run: bool,
concurrent_tasks_count: NonZeroUsize,
s3_client: Arc<Client>,
delete_batch_receiver: Arc<Mutex<UnboundedReceiver<DeleteBatch>>>,
s3_target: RootTarget,
) -> Self {
Self {
dry_run,
concurrent_tasks_count,
delete_batch_receiver,
s3_client,
s3_target,
}
}
pub async fn remove_all(self) -> anyhow::Result<DeletionStats> {
let mut deletion_tasks = JoinSet::new();
for id in 0..self.concurrent_tasks_count.get() {
let closure_client = Arc::clone(&self.s3_client);
let closure_s3_target = self.s3_target.clone();
let closure_batch_receiver = Arc::clone(&self.delete_batch_receiver);
let dry_run = self.dry_run;
deletion_tasks.spawn(
async move {
info!("Task started");
(
id,
async move {
let mut task_stats = DeletionStats::default();
loop {
let mut guard = closure_batch_receiver.lock().await;
let receiver_result = guard.try_recv();
drop(guard);
match receiver_result {
Ok(batch) => {
let stats = delete_batch(
&closure_client,
&closure_s3_target,
batch,
dry_run,
)
.await
.context("batch deletion")?;
debug!(
"Batch processed, number of objects deleted per tenant in the batch is: {}, per timeline — {}",
stats.deleted_tenant_keys.len(),
stats.deleted_timeline_keys.len(),
);
task_stats.merge(stats);
}
Err(TryRecvError::Empty) => {
debug!("No tasks yet, waiting");
tokio::time::sleep(Duration::from_secs(1)).await;
continue;
}
Err(TryRecvError::Disconnected) => {
info!("Task finished: sender dropped");
return Ok(task_stats);
}
}
}
}
.in_current_span()
.await,
)
}
.instrument(info_span!("deletion_task", %id)),
);
}
let mut total_stats = DeletionStats::default();
while let Some(task_result) = deletion_tasks.join_next().await {
match task_result {
Ok((id, Ok(task_stats))) => {
info!("Task {id} completed");
total_stats.merge(task_stats);
}
Ok((id, Err(e))) => {
error!("Task {id} failed: {e:#}");
return Err(e);
}
Err(join_error) => anyhow::bail!("Failed to join on a task: {join_error:?}"),
}
}
Ok(total_stats)
}
}
/// S3 delete_objects allows up to 1000 keys to be passed in a single request.
/// Yet if you pass too many key requests, apparently S3 could return with OK and
/// actually delete nothing, so keep the number lower.
const MAX_ITEMS_TO_DELETE: usize = 200;
#[derive(Debug, Default)]
pub struct DeletionStats {
pub deleted_tenant_keys: BTreeMap<TenantId, usize>,
pub deleted_timeline_keys: BTreeMap<TenantTimelineId, usize>,
}
impl DeletionStats {
fn merge(&mut self, other: Self) {
self.deleted_tenant_keys.extend(other.deleted_tenant_keys);
self.deleted_timeline_keys
.extend(other.deleted_timeline_keys);
}
}
async fn delete_batch(
s3_client: &Client,
s3_target: &RootTarget,
batch: DeleteBatch,
dry_run: bool,
) -> anyhow::Result<DeletionStats> {
let (deleted_tenant_keys, deleted_timeline_keys) = tokio::join!(
delete_tenants_batch(batch.tenants, s3_target, s3_client, dry_run),
delete_timelines_batch(batch.timelines, s3_target, s3_client, dry_run),
);
Ok(DeletionStats {
deleted_tenant_keys: deleted_tenant_keys.context("tenant batch deletion")?,
deleted_timeline_keys: deleted_timeline_keys.context("timeline batch deletion")?,
})
}
async fn delete_tenants_batch(
batched_tenants: Vec<TenantId>,
s3_target: &RootTarget,
s3_client: &Client,
dry_run: bool,
) -> Result<BTreeMap<TenantId, usize>, anyhow::Error> {
info!("Deleting tenants batch of size {}", batched_tenants.len());
info!("Tenant ids to remove: {batched_tenants:?}");
let deleted_keys = delete_elements(
&batched_tenants,
s3_target,
s3_client,
dry_run,
|root_target, tenant_to_delete| root_target.tenant_root(tenant_to_delete),
)
.await?;
if !dry_run {
let mut last_err = None;
for _ in 0..MAX_RETRIES {
match ensure_tenant_batch_deleted(s3_client, s3_target, &batched_tenants).await {
Ok(()) => {
last_err = None;
break;
}
Err(e) => {
error!("Failed to ensure the tenant batch is deleted: {e}");
last_err = Some(e);
}
}
}
if let Some(e) = last_err {
anyhow::bail!(
"Failed to ensure that tenant batch is deleted {MAX_RETRIES} times: {e:?}"
);
}
}
Ok(deleted_keys)
}
async fn delete_timelines_batch(
batched_timelines: Vec<TenantTimelineId>,
s3_target: &RootTarget,
s3_client: &Client,
dry_run: bool,
) -> Result<BTreeMap<TenantTimelineId, usize>, anyhow::Error> {
info!(
"Deleting timelines batch of size {}",
batched_timelines.len()
);
info!(
"Timeline ids to remove: {:?}",
batched_timelines
.iter()
.map(|id| id.to_string())
.collect::<Vec<_>>()
);
let deleted_keys = delete_elements(
&batched_timelines,
s3_target,
s3_client,
dry_run,
|root_target, timeline_to_delete| root_target.timeline_root(timeline_to_delete),
)
.await?;
if !dry_run {
let mut last_err = None;
for _ in 0..MAX_RETRIES {
match ensure_timeline_batch_deleted(s3_client, s3_target, &batched_timelines).await {
Ok(()) => {
last_err = None;
break;
}
Err(e) => {
error!("Failed to ensure the timelines batch is deleted: {e}");
last_err = Some(e);
}
}
}
if let Some(e) = last_err {
anyhow::bail!(
"Failed to ensure that timeline batch is deleted {MAX_RETRIES} times: {e:?}"
);
}
}
Ok(deleted_keys)
}
async fn delete_elements<I>(
batched_ids: &Vec<I>,
s3_target: &RootTarget,
s3_client: &Client,
dry_run: bool,
target_producer: impl Fn(&RootTarget, I) -> S3Target,
) -> Result<BTreeMap<I, usize>, anyhow::Error>
where
I: Ord + PartialOrd + Copy,
{
let mut deleted_keys = BTreeMap::new();
let mut object_ids_to_delete = Vec::with_capacity(MAX_ITEMS_TO_DELETE);
for &id_to_delete in batched_ids {
let mut continuation_token = None;
let mut subtargets = vec![target_producer(s3_target, id_to_delete)];
while let Some(current_target) = subtargets.pop() {
loop {
let fetch_response = list_objects_with_retries(
s3_client,
&current_target,
continuation_token.clone(),
)
.await?;
for object_id in fetch_response
.contents()
.unwrap_or_default()
.iter()
.filter_map(|object| object.key())
.map(|key| ObjectIdentifier::builder().key(key).build())
{
if object_ids_to_delete.len() >= MAX_ITEMS_TO_DELETE {
let object_ids_for_request = std::mem::replace(
&mut object_ids_to_delete,
Vec::with_capacity(MAX_ITEMS_TO_DELETE),
);
send_delete_request(
s3_client,
s3_target.bucket_name(),
object_ids_for_request,
dry_run,
)
.await
.context("object ids deletion")?;
}
object_ids_to_delete.push(object_id);
*deleted_keys.entry(id_to_delete).or_default() += 1;
}
subtargets.extend(
fetch_response
.common_prefixes()
.unwrap_or_default()
.iter()
.filter_map(|common_prefix| common_prefix.prefix())
.map(|prefix| {
let mut new_target = current_target.clone();
new_target.prefix_in_bucket = prefix.to_string();
new_target
}),
);
match fetch_response.next_continuation_token {
Some(new_token) => continuation_token = Some(new_token),
None => break,
}
}
}
}
if !object_ids_to_delete.is_empty() {
info!("Removing last objects of the batch");
send_delete_request(
s3_client,
s3_target.bucket_name(),
object_ids_to_delete,
dry_run,
)
.await
.context("Last object ids deletion")?;
}
Ok(deleted_keys)
}
pub async fn send_delete_request(
s3_client: &Client,
bucket_name: &str,
ids: Vec<ObjectIdentifier>,
dry_run: bool,
) -> anyhow::Result<()> {
info!("Removing {} object ids from S3", ids.len());
info!("Object ids to remove: {ids:?}");
let delete_request = s3_client
.delete_objects()
.bucket(bucket_name)
.delete(Delete::builder().set_objects(Some(ids)).build());
if dry_run {
info!("Dry run, skipping the actual removal");
Ok(())
} else {
let original_request = delete_request.clone();
for _ in 0..MAX_RETRIES {
match delete_request
.clone()
.send()
.await
.context("delete request processing")
{
Ok(delete_response) => {
info!("Delete response: {delete_response:?}");
match delete_response.errors() {
Some(delete_errors) => {
error!("Delete request returned errors: {delete_errors:?}");
tokio::time::sleep(Duration::from_secs(1)).await;
}
None => {
info!("Successfully removed an object batch from S3");
return Ok(());
}
}
}
Err(e) => {
error!("Failed to send a delete request: {e:#}");
tokio::time::sleep(Duration::from_secs(1)).await;
}
}
}
error!("Failed to do deletion, request: {original_request:?}");
anyhow::bail!("Failed to run deletion request {MAX_RETRIES} times");
}
}
async fn ensure_tenant_batch_deleted(
s3_client: &Client,
s3_target: &RootTarget,
batch: &[TenantId],
) -> anyhow::Result<()> {
let mut not_deleted_tenants = Vec::with_capacity(batch.len());
for &tenant_id in batch {
let fetch_response =
list_objects_with_retries(s3_client, &s3_target.tenant_root(tenant_id), None).await?;
if fetch_response.is_truncated()
|| fetch_response.contents().is_some()
|| fetch_response.common_prefixes().is_some()
{
error!(
"Tenant {tenant_id} should be deleted, but its list response is {fetch_response:?}"
);
not_deleted_tenants.push(tenant_id);
}
}
anyhow::ensure!(
not_deleted_tenants.is_empty(),
"Failed to delete all tenants in a batch. Tenants {not_deleted_tenants:?} should be deleted."
);
Ok(())
}
async fn ensure_timeline_batch_deleted(
s3_client: &Client,
s3_target: &RootTarget,
batch: &[TenantTimelineId],
) -> anyhow::Result<()> {
let mut not_deleted_timelines = Vec::with_capacity(batch.len());
for &id in batch {
let fetch_response =
list_objects_with_retries(s3_client, &s3_target.timeline_root(id), None).await?;
if fetch_response.is_truncated()
|| fetch_response.contents().is_some()
|| fetch_response.common_prefixes().is_some()
{
error!("Timeline {id} should be deleted, but its list response is {fetch_response:?}");
not_deleted_timelines.push(id);
}
}
anyhow::ensure!(
not_deleted_timelines.is_empty(),
"Failed to delete all timelines in a batch"
);
Ok(())
}

View File

@@ -341,21 +341,35 @@ async fn start_safekeeper(conf: SafeKeeperConf) -> Result<()> {
let (wal_backup_launcher_tx, wal_backup_launcher_rx) = mpsc::channel(100);
// Load all timelines from disk to memory.
GlobalTimelines::init(conf.clone(), wal_backup_launcher_tx)?;
// Keep handles to main tasks to die if any of them disappears.
let mut tasks_handles: FuturesUnordered<BoxFuture<(String, JoinTaskRes)>> =
FuturesUnordered::new();
// Start wal backup launcher before loading timelines as we'll notify it
// through the channel about timelines which need offloading, not draining
// the channel would cause deadlock.
let current_thread_rt = conf
.current_thread_runtime
.then(|| Handle::try_current().expect("no runtime in main"));
let conf_ = conf.clone();
let wal_backup_handle = current_thread_rt
.as_ref()
.unwrap_or_else(|| WAL_BACKUP_RUNTIME.handle())
.spawn(wal_backup::wal_backup_launcher_task_main(
conf_,
wal_backup_launcher_rx,
))
.map(|res| ("WAL backup launcher".to_owned(), res));
tasks_handles.push(Box::pin(wal_backup_handle));
// Load all timelines from disk to memory.
GlobalTimelines::init(conf.clone(), wal_backup_launcher_tx).await?;
let conf_ = conf.clone();
// Run everything in current thread rt, if asked.
if conf.current_thread_runtime {
info!("running in current thread runtime");
}
let current_thread_rt = conf
.current_thread_runtime
.then(|| Handle::try_current().expect("no runtime in main"));
let wal_service_handle = current_thread_rt
.as_ref()
@@ -408,17 +422,6 @@ async fn start_safekeeper(conf: SafeKeeperConf) -> Result<()> {
.map(|res| ("WAL remover".to_owned(), res));
tasks_handles.push(Box::pin(wal_remover_handle));
let conf_ = conf.clone();
let wal_backup_handle = current_thread_rt
.as_ref()
.unwrap_or_else(|| WAL_BACKUP_RUNTIME.handle())
.spawn(wal_backup::wal_backup_launcher_task_main(
conf_,
wal_backup_launcher_rx,
))
.map(|res| ("WAL backup launcher".to_owned(), res));
tasks_handles.push(Box::pin(wal_backup_handle));
set_build_info_metric(GIT_VERSION);
// TODO: update tokio-stream, convert to real async Stream with

View File

@@ -1,7 +1,6 @@
//! Code to deal with safekeeper control file upgrades
use crate::safekeeper::{
AcceptorState, PersistedPeers, PgUuid, SafeKeeperState, ServerInfo, Term, TermHistory,
TermSwitchEntry,
AcceptorState, PersistedPeers, PgUuid, SafeKeeperState, ServerInfo, Term, TermHistory, TermLsn,
};
use anyhow::{bail, Result};
use pq_proto::SystemId;
@@ -145,7 +144,7 @@ pub fn upgrade_control_file(buf: &[u8], version: u32) -> Result<SafeKeeperState>
let oldstate = SafeKeeperStateV1::des(&buf[..buf.len()])?;
let ac = AcceptorState {
term: oldstate.acceptor_state.term,
term_history: TermHistory(vec![TermSwitchEntry {
term_history: TermHistory(vec![TermLsn {
term: oldstate.acceptor_state.epoch,
lsn: Lsn(0),
}]),

View File

@@ -19,6 +19,7 @@ use crate::receive_wal::WalReceiverState;
use crate::safekeeper::ServerInfo;
use crate::safekeeper::Term;
use crate::send_wal::WalSenderState;
use crate::timeline::PeerInfo;
use crate::{debug_dump, pull_timeline};
use crate::timelines_global_map::TimelineDeleteForceResult;
@@ -101,6 +102,7 @@ pub struct TimelineStatus {
pub peer_horizon_lsn: Lsn,
#[serde_as(as = "DisplayFromStr")]
pub remote_consistent_lsn: Lsn,
pub peers: Vec<PeerInfo>,
pub walsenders: Vec<WalSenderState>,
pub walreceivers: Vec<WalReceiverState>,
}
@@ -140,6 +142,7 @@ async fn timeline_status_handler(request: Request<Body>) -> Result<Response<Body
term_history,
};
let conf = get_conf(&request);
// Note: we report in memory values which can be lost.
let status = TimelineStatus {
tenant_id: ttid.tenant_id,
@@ -153,6 +156,7 @@ async fn timeline_status_handler(request: Request<Body>) -> Result<Response<Body
backup_lsn: inmem.backup_lsn,
peer_horizon_lsn: inmem.peer_horizon_lsn,
remote_consistent_lsn: tli.get_walsenders().get_remote_consistent_lsn(),
peers: tli.get_peers(conf).await,
walsenders: tli.get_walsenders().get_all(),
walreceivers: tli.get_walreceivers().get_all(),
};
@@ -282,12 +286,14 @@ async fn record_safekeeper_info(mut request: Request<Body>) -> Result<Response<B
tenant_id: ttid.tenant_id.as_ref().to_owned(),
timeline_id: ttid.timeline_id.as_ref().to_owned(),
}),
term: sk_info.term.unwrap_or(0),
last_log_term: sk_info.last_log_term.unwrap_or(0),
flush_lsn: sk_info.flush_lsn.0,
commit_lsn: sk_info.commit_lsn.0,
remote_consistent_lsn: sk_info.remote_consistent_lsn.0,
peer_horizon_lsn: sk_info.peer_horizon_lsn.0,
safekeeper_connstr: sk_info.safekeeper_connstr.unwrap_or_else(|| "".to_owned()),
http_connstr: sk_info.http_connstr.unwrap_or_else(|| "".to_owned()),
backup_lsn: sk_info.backup_lsn.0,
local_start_lsn: sk_info.local_start_lsn.0,
availability_zone: None,

View File

@@ -21,7 +21,7 @@ use crate::safekeeper::{AcceptorProposerMessage, AppendResponse, ServerInfo};
use crate::safekeeper::{
AppendRequest, AppendRequestHeader, ProposerAcceptorMessage, ProposerElected,
};
use crate::safekeeper::{SafeKeeperState, Term, TermHistory, TermSwitchEntry};
use crate::safekeeper::{SafeKeeperState, Term, TermHistory, TermLsn};
use crate::timeline::Timeline;
use crate::GlobalTimelines;
use postgres_backend::PostgresBackend;
@@ -119,7 +119,7 @@ async fn send_proposer_elected(tli: &Arc<Timeline>, term: Term, lsn: Lsn) -> any
let history = tli.get_state().await.1.acceptor_state.term_history;
let history = history.up_to(lsn.checked_sub(1u64).unwrap());
let mut history_entries = history.0;
history_entries.push(TermSwitchEntry { term, lsn });
history_entries.push(TermLsn { term, lsn });
let history = TermHistory(history_entries);
let proposer_elected_request = ProposerAcceptorMessage::Elected(ProposerElected {

View File

@@ -19,6 +19,7 @@ pub mod json_ctrl;
pub mod metrics;
pub mod pull_timeline;
pub mod receive_wal;
pub mod recovery;
pub mod remove_wal;
pub mod safekeeper;
pub mod send_wal;

View File

@@ -227,7 +227,9 @@ async fn pull_timeline(status: TimelineStatus, host: String) -> Result<Response>
tokio::fs::create_dir_all(conf.tenant_dir(&ttid.tenant_id)).await?;
tokio::fs::rename(tli_dir_path, &timeline_path).await?;
let tli = GlobalTimelines::load_timeline(ttid).context("Failed to load timeline after copy")?;
let tli = GlobalTimelines::load_timeline(ttid)
.await
.context("Failed to load timeline after copy")?;
info!(
"Loaded timeline {}, flush_lsn={}",

View File

@@ -0,0 +1,40 @@
//! This module implements pulling WAL from peer safekeepers if compute can't
//! provide it, i.e. safekeeper lags too much.
use std::sync::Arc;
use tokio::{select, time::sleep, time::Duration};
use tracing::{info, instrument};
use crate::{timeline::Timeline, SafeKeeperConf};
/// Entrypoint for per timeline task which always runs, checking whether
/// recovery for this safekeeper is needed and starting it if so.
#[instrument(name = "recovery task", skip_all, fields(ttid = %tli.ttid))]
pub async fn recovery_main(tli: Arc<Timeline>, _conf: SafeKeeperConf) {
info!("started");
let mut cancellation_rx = match tli.get_cancellation_rx() {
Ok(rx) => rx,
Err(_) => {
info!("timeline canceled during task start");
return;
}
};
select! {
_ = recovery_main_loop(tli) => { unreachable!() }
_ = cancellation_rx.changed() => {
info!("stopped");
}
}
}
const CHECK_INTERVAL_MS: u64 = 2000;
/// Check regularly whether we need to start recovery.
async fn recovery_main_loop(_tli: Arc<Timeline>) {
let check_duration = Duration::from_millis(CHECK_INTERVAL_MS);
loop {
sleep(check_duration).await;
}
}

View File

@@ -34,22 +34,33 @@ pub const UNKNOWN_SERVER_VERSION: u32 = 0;
/// Consensus logical timestamp.
pub type Term = u64;
const INVALID_TERM: Term = 0;
pub const INVALID_TERM: Term = 0;
#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
pub struct TermSwitchEntry {
#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq, Eq, PartialOrd, Ord)]
pub struct TermLsn {
pub term: Term,
pub lsn: Lsn,
}
// Creation from tuple provides less typing (e.g. for unit tests).
impl From<(Term, Lsn)> for TermLsn {
fn from(pair: (Term, Lsn)) -> TermLsn {
TermLsn {
term: pair.0,
lsn: pair.1,
}
}
}
#[derive(Clone, Serialize, Deserialize)]
pub struct TermHistory(pub Vec<TermSwitchEntry>);
pub struct TermHistory(pub Vec<TermLsn>);
impl TermHistory {
pub fn empty() -> TermHistory {
TermHistory(Vec::new())
}
// Parse TermHistory as n_entries followed by TermSwitchEntry pairs
// Parse TermHistory as n_entries followed by TermLsn pairs
pub fn from_bytes(bytes: &mut Bytes) -> Result<TermHistory> {
if bytes.remaining() < 4 {
bail!("TermHistory misses len");
@@ -60,7 +71,7 @@ impl TermHistory {
if bytes.remaining() < 16 {
bail!("TermHistory is incomplete");
}
res.push(TermSwitchEntry {
res.push(TermLsn {
term: bytes.get_u64_le(),
lsn: bytes.get_u64_le().into(),
})
@@ -557,12 +568,17 @@ where
.up_to(self.flush_lsn())
}
/// Get current term.
pub fn get_term(&self) -> Term {
self.state.acceptor_state.term
}
pub fn get_epoch(&self) -> Term {
self.state.acceptor_state.get_epoch(self.flush_lsn())
}
/// wal_store wrapper avoiding commit_lsn <= flush_lsn violation when we don't have WAL yet.
fn flush_lsn(&self) -> Lsn {
pub fn flush_lsn(&self) -> Lsn {
max(self.wal_store.flush_lsn(), self.state.timeline_start_lsn)
}
@@ -1138,7 +1154,7 @@ mod tests {
let pem = ProposerElected {
term: 1,
start_streaming_at: Lsn(1),
term_history: TermHistory(vec![TermSwitchEntry {
term_history: TermHistory(vec![TermLsn {
term: 1,
lsn: Lsn(3),
}]),

View File

@@ -2,12 +2,12 @@
//! with the "START_REPLICATION" message, and registry of walsenders.
use crate::handler::SafekeeperPostgresHandler;
use crate::safekeeper::Term;
use crate::safekeeper::{Term, TermLsn};
use crate::timeline::Timeline;
use crate::wal_service::ConnectionId;
use crate::wal_storage::WalReader;
use crate::GlobalTimelines;
use anyhow::Context as AnyhowContext;
use anyhow::{bail, Context as AnyhowContext};
use bytes::Bytes;
use parking_lot::Mutex;
use postgres_backend::PostgresBackend;
@@ -390,26 +390,25 @@ impl SafekeeperPostgresHandler {
self.appname.clone(),
));
let commit_lsn_watch_rx = tli.get_commit_lsn_watch_rx();
// Walproposer gets special handling: safekeeper must give proposer all
// local WAL till the end, whether committed or not (walproposer will
// hang otherwise). That's because walproposer runs the consensus and
// synchronizes safekeepers on the most advanced one.
// Walsender can operate in one of two modes which we select by
// application_name: give only committed WAL (used by pageserver) or all
// existing WAL (up to flush_lsn, used by walproposer or peer recovery).
// The second case is always driven by a consensus leader which term
// must generally be also supplied. However we're sloppy to do this in
// walproposer recovery which will be removed soon. So TODO is to make
// it not Option'al then.
//
// There is a small risk of this WAL getting concurrently garbaged if
// another compute rises which collects majority and starts fixing log
// on this safekeeper itself. That's ok as (old) proposer will never be
// able to commit such WAL.
let stop_pos: Option<Lsn> = if self.is_walproposer_recovery() {
let wal_end = tli.get_flush_lsn().await;
Some(wal_end)
// Fetching WAL without term in recovery creates a small risk of this
// WAL getting concurrently garbaged if another compute rises which
// collects majority and starts fixing log on this safekeeper itself.
// That's ok as (old) proposer will never be able to commit such WAL.
let end_watch = if self.is_walproposer_recovery() {
EndWatch::Flush(tli.get_term_flush_lsn_watch_rx())
} else {
None
EndWatch::Commit(tli.get_commit_lsn_watch_rx())
};
// take the latest commit_lsn if don't have stop_pos
let end_pos = stop_pos.unwrap_or(*commit_lsn_watch_rx.borrow());
// we don't check term here; it will be checked on first waiting/WAL reading anyway.
let end_pos = end_watch.get();
if end_pos < start_pos {
warn!(
@@ -419,8 +418,10 @@ impl SafekeeperPostgresHandler {
}
info!(
"starting streaming from {:?} till {:?}, available WAL ends at {}",
start_pos, stop_pos, end_pos
"starting streaming from {:?}, available WAL ends at {}, recovery={}",
start_pos,
end_pos,
matches!(end_watch, EndWatch::Flush(_))
);
// switch to copy
@@ -445,9 +446,8 @@ impl SafekeeperPostgresHandler {
appname,
start_pos,
end_pos,
stop_pos,
term,
commit_lsn_watch_rx,
end_watch,
ws_guard: ws_guard.clone(),
wal_reader,
send_buf: [0; MAX_SEND_SIZE],
@@ -466,6 +466,32 @@ impl SafekeeperPostgresHandler {
}
}
/// Walsender streams either up to commit_lsn (normally) or flush_lsn in the
/// given term (recovery by walproposer or peer safekeeper).
enum EndWatch {
Commit(Receiver<Lsn>),
Flush(Receiver<TermLsn>),
}
impl EndWatch {
/// Get current end of WAL.
fn get(&self) -> Lsn {
match self {
EndWatch::Commit(r) => *r.borrow(),
EndWatch::Flush(r) => r.borrow().lsn,
}
}
/// Wait for the update.
async fn changed(&mut self) -> anyhow::Result<()> {
match self {
EndWatch::Commit(r) => r.changed().await?,
EndWatch::Flush(r) => r.changed().await?,
}
Ok(())
}
}
/// A half driving sending WAL.
struct WalSender<'a, IO> {
pgb: &'a mut PostgresBackend<IO>,
@@ -480,14 +506,12 @@ struct WalSender<'a, IO> {
// We send this LSN to the receiver as wal_end, so that it knows how much
// WAL this safekeeper has. This LSN should be as fresh as possible.
end_pos: Lsn,
// If present, terminate after reaching this position; used by walproposer
// in recovery.
stop_pos: Option<Lsn>,
/// When streaming uncommitted part, the term the client acts as the leader
/// in. Streaming is stopped if local term changes to a different (higher)
/// value.
term: Option<Term>,
commit_lsn_watch_rx: Receiver<Lsn>,
/// Watch channel receiver to learn end of available WAL (and wait for its advancement).
end_watch: EndWatch,
ws_guard: Arc<WalSenderGuard>,
wal_reader: WalReader,
// buffer for readling WAL into to send it
@@ -497,29 +521,20 @@ struct WalSender<'a, IO> {
impl<IO: AsyncRead + AsyncWrite + Unpin> WalSender<'_, IO> {
/// Send WAL until
/// - an error occurs
/// - if we are streaming to walproposer, we've streamed until stop_pos
/// (recovery finished)
/// - receiver is caughtup and there is no computes
/// - receiver is caughtup and there is no computes (if streaming up to commit_lsn)
///
/// Err(CopyStreamHandlerEnd) is always returned; Result is used only for ?
/// convenience.
async fn run(&mut self) -> Result<(), CopyStreamHandlerEnd> {
loop {
// If we are streaming to walproposer, check it is time to stop.
if let Some(stop_pos) = self.stop_pos {
if self.start_pos >= stop_pos {
// recovery finished
return Err(CopyStreamHandlerEnd::ServerInitiated(format!(
"ending streaming to walproposer at {}, recovery finished",
self.start_pos
)));
}
} else {
// Wait for the next portion if it is not there yet, or just
// update our end of WAL available for sending value, we
// communicate it to the receiver.
self.wait_wal().await?;
}
// Wait for the next portion if it is not there yet, or just
// update our end of WAL available for sending value, we
// communicate it to the receiver.
self.wait_wal().await?;
assert!(
self.end_pos > self.start_pos,
"nothing to send after waiting for WAL"
);
// try to send as much as available, capped by MAX_SEND_SIZE
let mut send_size = self
@@ -567,7 +582,7 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> WalSender<'_, IO> {
/// exit in the meanwhile
async fn wait_wal(&mut self) -> Result<(), CopyStreamHandlerEnd> {
loop {
self.end_pos = *self.commit_lsn_watch_rx.borrow();
self.end_pos = self.end_watch.get();
if self.end_pos > self.start_pos {
// We have something to send.
trace!("got end_pos {:?}, streaming", self.end_pos);
@@ -575,27 +590,31 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> WalSender<'_, IO> {
}
// Wait for WAL to appear, now self.end_pos == self.start_pos.
if let Some(lsn) = wait_for_lsn(&mut self.commit_lsn_watch_rx, self.start_pos).await? {
if let Some(lsn) = wait_for_lsn(&mut self.end_watch, self.term, self.start_pos).await? {
self.end_pos = lsn;
trace!("got end_pos {:?}, streaming", self.end_pos);
return Ok(());
}
// Timed out waiting for WAL, check for termination and send KA
if let Some(remote_consistent_lsn) = self
.ws_guard
.walsenders
.get_ws_remote_consistent_lsn(self.ws_guard.id)
{
if self.tli.should_walsender_stop(remote_consistent_lsn).await {
// Terminate if there is nothing more to send.
// Note that "ending streaming" part of the string is used by
// pageserver to identify WalReceiverError::SuccessfulCompletion,
// do not change this string without updating pageserver.
return Err(CopyStreamHandlerEnd::ServerInitiated(format!(
// Timed out waiting for WAL, check for termination and send KA.
// Check for termination only if we are streaming up to commit_lsn
// (to pageserver).
if let EndWatch::Commit(_) = self.end_watch {
if let Some(remote_consistent_lsn) = self
.ws_guard
.walsenders
.get_ws_remote_consistent_lsn(self.ws_guard.id)
{
if self.tli.should_walsender_stop(remote_consistent_lsn).await {
// Terminate if there is nothing more to send.
// Note that "ending streaming" part of the string is used by
// pageserver to identify WalReceiverError::SuccessfulCompletion,
// do not change this string without updating pageserver.
return Err(CopyStreamHandlerEnd::ServerInitiated(format!(
"ending streaming to {:?} at {}, receiver is caughtup and there is no computes",
self.appname, self.start_pos,
)));
}
}
}
@@ -663,22 +682,32 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> ReplyReader<IO> {
const POLL_STATE_TIMEOUT: Duration = Duration::from_secs(1);
/// Wait until we have commit_lsn > lsn or timeout expires. Returns
/// - Ok(Some(commit_lsn)) if needed lsn is successfully observed;
/// Wait until we have available WAL > start_pos or timeout expires. Returns
/// - Ok(Some(end_pos)) if needed lsn is successfully observed;
/// - Ok(None) if timeout expired;
/// - Err in case of error (if watch channel is in trouble, shouldn't happen).
async fn wait_for_lsn(rx: &mut Receiver<Lsn>, lsn: Lsn) -> anyhow::Result<Option<Lsn>> {
/// - Err in case of error -- only if 1) term changed while fetching in recovery
/// mode 2) watch channel closed, which must never happen.
async fn wait_for_lsn(
rx: &mut EndWatch,
client_term: Option<Term>,
start_pos: Lsn,
) -> anyhow::Result<Option<Lsn>> {
let res = timeout(POLL_STATE_TIMEOUT, async move {
let mut commit_lsn;
loop {
rx.changed().await?;
commit_lsn = *rx.borrow();
if commit_lsn > lsn {
break;
let end_pos = rx.get();
if end_pos > start_pos {
return Ok(end_pos);
}
if let EndWatch::Flush(rx) = rx {
let curr_term = rx.borrow().term;
if let Some(client_term) = client_term {
if curr_term != client_term {
bail!("term changed: requested {}, now {}", client_term, curr_term);
}
}
}
rx.changed().await?;
}
Ok(commit_lsn)
})
.await;

View File

@@ -3,8 +3,11 @@
use anyhow::{anyhow, bail, Result};
use postgres_ffi::XLogSegNo;
use serde::{Deserialize, Serialize};
use serde_with::serde_as;
use tokio::fs;
use serde_with::DisplayFromStr;
use std::cmp::max;
use std::path::PathBuf;
use std::sync::Arc;
@@ -24,9 +27,10 @@ use storage_broker::proto::SafekeeperTimelineInfo;
use storage_broker::proto::TenantTimelineId as ProtoTenantTimelineId;
use crate::receive_wal::WalReceivers;
use crate::recovery::recovery_main;
use crate::safekeeper::{
AcceptorProposerMessage, ProposerAcceptorMessage, SafeKeeper, SafeKeeperState,
SafekeeperMemState, ServerInfo, Term,
SafekeeperMemState, ServerInfo, Term, TermLsn, INVALID_TERM,
};
use crate::send_wal::WalSenders;
use crate::{control_file, safekeeper::UNKNOWN_SERVER_VERSION};
@@ -37,18 +41,25 @@ use crate::SafeKeeperConf;
use crate::{debug_dump, wal_storage};
/// Things safekeeper should know about timeline state on peers.
#[derive(Debug, Clone)]
#[serde_as]
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PeerInfo {
pub sk_id: NodeId,
/// Term of the last entry.
_last_log_term: Term,
/// LSN of the last record.
#[serde_as(as = "DisplayFromStr")]
_flush_lsn: Lsn,
#[serde_as(as = "DisplayFromStr")]
pub commit_lsn: Lsn,
/// Since which LSN safekeeper has WAL. TODO: remove this once we fill new
/// sk since backup_lsn.
#[serde_as(as = "DisplayFromStr")]
pub local_start_lsn: Lsn,
/// When info was received.
/// When info was received. Serde annotations are not very useful but make
/// the code compile -- we don't rely on this field externally.
#[serde(skip)]
#[serde(default = "Instant::now")]
ts: Instant,
}
@@ -237,8 +248,9 @@ impl SharedState {
tenant_id: ttid.tenant_id.as_ref().to_owned(),
timeline_id: ttid.timeline_id.as_ref().to_owned(),
}),
term: self.sk.state.acceptor_state.term,
last_log_term: self.sk.get_epoch(),
flush_lsn: self.sk.wal_store.flush_lsn().0,
flush_lsn: self.sk.flush_lsn().0,
// note: this value is not flushed to control file yet and can be lost
commit_lsn: self.sk.inmem.commit_lsn.0,
remote_consistent_lsn: remote_consistent_lsn.0,
@@ -247,6 +259,7 @@ impl SharedState {
.advertise_pg_addr
.to_owned()
.unwrap_or(conf.listen_pg_addr.clone()),
http_connstr: conf.listen_http_addr.to_owned(),
backup_lsn: self.sk.inmem.backup_lsn.0,
local_start_lsn: self.sk.state.local_start_lsn.0,
availability_zone: conf.availability_zone.clone(),
@@ -296,6 +309,13 @@ pub struct Timeline {
commit_lsn_watch_tx: watch::Sender<Lsn>,
commit_lsn_watch_rx: watch::Receiver<Lsn>,
/// Broadcasts (current term, flush_lsn) updates, walsender is interested in
/// them when sending in recovery mode (to walproposer or peers). Note: this
/// is just a notification, WAL reading should always done with lock held as
/// term can change otherwise.
term_flush_lsn_watch_tx: watch::Sender<TermLsn>,
term_flush_lsn_watch_rx: watch::Receiver<TermLsn>,
/// Safekeeper and other state, that should remain consistent and
/// synchronized with the disk. This is tokio mutex as we write WAL to disk
/// while holding it, ensuring that consensus checks are in order.
@@ -317,16 +337,20 @@ pub struct Timeline {
impl Timeline {
/// Load existing timeline from disk.
pub fn load_timeline(
conf: SafeKeeperConf,
conf: &SafeKeeperConf,
ttid: TenantTimelineId,
wal_backup_launcher_tx: Sender<TenantTimelineId>,
) -> Result<Timeline> {
let _enter = info_span!("load_timeline", timeline = %ttid.timeline_id).entered();
let shared_state = SharedState::restore(&conf, &ttid)?;
let shared_state = SharedState::restore(conf, &ttid)?;
let rcl = shared_state.sk.state.remote_consistent_lsn;
let (commit_lsn_watch_tx, commit_lsn_watch_rx) =
watch::channel(shared_state.sk.state.commit_lsn);
let (term_flush_lsn_watch_tx, term_flush_lsn_watch_rx) = watch::channel(TermLsn::from((
shared_state.sk.get_term(),
shared_state.sk.flush_lsn(),
)));
let (cancellation_tx, cancellation_rx) = watch::channel(false);
Ok(Timeline {
@@ -334,6 +358,8 @@ impl Timeline {
wal_backup_launcher_tx,
commit_lsn_watch_tx,
commit_lsn_watch_rx,
term_flush_lsn_watch_tx,
term_flush_lsn_watch_rx,
mutex: Mutex::new(shared_state),
walsenders: WalSenders::new(rcl),
walreceivers: WalReceivers::new(),
@@ -345,7 +371,7 @@ impl Timeline {
/// Create a new timeline, which is not yet persisted to disk.
pub fn create_empty(
conf: SafeKeeperConf,
conf: &SafeKeeperConf,
ttid: TenantTimelineId,
wal_backup_launcher_tx: Sender<TenantTimelineId>,
server_info: ServerInfo,
@@ -353,6 +379,8 @@ impl Timeline {
local_start_lsn: Lsn,
) -> Result<Timeline> {
let (commit_lsn_watch_tx, commit_lsn_watch_rx) = watch::channel(Lsn::INVALID);
let (term_flush_lsn_watch_tx, term_flush_lsn_watch_rx) =
watch::channel(TermLsn::from((INVALID_TERM, Lsn::INVALID)));
let (cancellation_tx, cancellation_rx) = watch::channel(false);
let state = SafeKeeperState::new(&ttid, server_info, vec![], commit_lsn, local_start_lsn);
@@ -361,7 +389,9 @@ impl Timeline {
wal_backup_launcher_tx,
commit_lsn_watch_tx,
commit_lsn_watch_rx,
mutex: Mutex::new(SharedState::create_new(&conf, &ttid, state)?),
term_flush_lsn_watch_tx,
term_flush_lsn_watch_rx,
mutex: Mutex::new(SharedState::create_new(conf, &ttid, state)?),
walsenders: WalSenders::new(Lsn(0)),
walreceivers: WalReceivers::new(),
cancellation_rx,
@@ -370,12 +400,16 @@ impl Timeline {
})
}
/// Initialize fresh timeline on disk and start background tasks. If bootstrap
/// Initialize fresh timeline on disk and start background tasks. If init
/// fails, timeline is cancelled and cannot be used anymore.
///
/// Bootstrap is transactional, so if it fails, created files will be deleted,
/// Init is transactional, so if it fails, created files will be deleted,
/// and state on disk should remain unchanged.
pub async fn bootstrap(&self, shared_state: &mut MutexGuard<'_, SharedState>) -> Result<()> {
pub async fn init_new(
self: &Arc<Timeline>,
shared_state: &mut MutexGuard<'_, SharedState>,
conf: &SafeKeeperConf,
) -> Result<()> {
match fs::metadata(&self.timeline_dir).await {
Ok(_) => {
// Timeline directory exists on disk, we should leave state unchanged
@@ -391,7 +425,7 @@ impl Timeline {
// Create timeline directory.
fs::create_dir_all(&self.timeline_dir).await?;
// Write timeline to disk and TODO: start background tasks.
// Write timeline to disk and start background tasks.
if let Err(e) = shared_state.sk.persist().await {
// Bootstrap failed, cancel timeline and remove timeline directory.
self.cancel(shared_state);
@@ -405,12 +439,16 @@ impl Timeline {
return Err(e);
}
// TODO: add more initialization steps here
self.update_status(shared_state);
self.bootstrap(conf);
Ok(())
}
/// Bootstrap new or existing timeline starting background stasks.
pub fn bootstrap(self: &Arc<Timeline>, conf: &SafeKeeperConf) {
// Start recovery task which always runs on the timeline.
tokio::spawn(recovery_main(self.clone(), conf.clone()));
}
/// Delete timeline from disk completely, by removing timeline directory. Background
/// timeline activities will stop eventually.
pub async fn delete_from_disk(
@@ -444,6 +482,16 @@ impl Timeline {
*self.cancellation_rx.borrow()
}
/// Returns watch channel which gets value when timeline is cancelled. It is
/// guaranteed to have not cancelled value observed (errors otherwise).
pub fn get_cancellation_rx(&self) -> Result<watch::Receiver<bool>> {
let rx = self.cancellation_rx.clone();
if *rx.borrow() {
bail!(TimelineError::Cancelled(self.ttid));
}
Ok(rx)
}
/// Take a writing mutual exclusive lock on timeline shared_state.
pub async fn write_shared_state(&self) -> MutexGuard<SharedState> {
self.mutex.lock().await
@@ -520,6 +568,11 @@ impl Timeline {
self.commit_lsn_watch_rx.clone()
}
/// Returns term_flush_lsn watch channel.
pub fn get_term_flush_lsn_watch_rx(&self) -> watch::Receiver<TermLsn> {
self.term_flush_lsn_watch_rx.clone()
}
/// Pass arrived message to the safekeeper.
pub async fn process_msg(
&self,
@@ -531,6 +584,7 @@ impl Timeline {
let mut rmsg: Option<AcceptorProposerMessage>;
let commit_lsn: Lsn;
let term_flush_lsn: TermLsn;
{
let mut shared_state = self.write_shared_state().await;
rmsg = shared_state.sk.process_msg(msg).await?;
@@ -544,8 +598,11 @@ impl Timeline {
}
commit_lsn = shared_state.sk.inmem.commit_lsn;
term_flush_lsn =
TermLsn::from((shared_state.sk.get_term(), shared_state.sk.flush_lsn()));
}
self.commit_lsn_watch_tx.send(commit_lsn)?;
self.term_flush_lsn_watch_tx.send(term_flush_lsn)?;
Ok(rmsg)
}

View File

@@ -11,7 +11,7 @@ use serde::Serialize;
use std::collections::HashMap;
use std::path::PathBuf;
use std::str::FromStr;
use std::sync::{Arc, Mutex, MutexGuard};
use std::sync::{Arc, Mutex};
use tokio::sync::mpsc::Sender;
use tracing::*;
use utils::id::{TenantId, TenantTimelineId, TimelineId};
@@ -71,19 +71,23 @@ pub struct GlobalTimelines;
impl GlobalTimelines {
/// Inject dependencies needed for the timeline constructors and load all timelines to memory.
pub fn init(
pub async fn init(
conf: SafeKeeperConf,
wal_backup_launcher_tx: Sender<TenantTimelineId>,
) -> Result<()> {
let mut state = TIMELINES_STATE.lock().unwrap();
assert!(state.wal_backup_launcher_tx.is_none());
state.wal_backup_launcher_tx = Some(wal_backup_launcher_tx);
state.conf = Some(conf);
// clippy isn't smart enough to understand that drop(state) releases the
// lock, so use explicit block
let tenants_dir = {
let mut state = TIMELINES_STATE.lock().unwrap();
assert!(state.wal_backup_launcher_tx.is_none());
state.wal_backup_launcher_tx = Some(wal_backup_launcher_tx);
state.conf = Some(conf);
// Iterate through all directories and load tenants for all directories
// named as a valid tenant_id.
// Iterate through all directories and load tenants for all directories
// named as a valid tenant_id.
state.get_conf().workdir.clone()
};
let mut tenant_count = 0;
let tenants_dir = state.get_conf().workdir.clone();
for tenants_dir_entry in std::fs::read_dir(&tenants_dir)
.with_context(|| format!("failed to list tenants dir {}", tenants_dir.display()))?
{
@@ -93,7 +97,7 @@ impl GlobalTimelines {
TenantId::from_str(tenants_dir_entry.file_name().to_str().unwrap_or(""))
{
tenant_count += 1;
GlobalTimelines::load_tenant_timelines(&mut state, tenant_id)?;
GlobalTimelines::load_tenant_timelines(tenant_id).await?;
}
}
Err(e) => error!(
@@ -108,7 +112,7 @@ impl GlobalTimelines {
info!(
"found {} tenants directories, successfully loaded {} timelines",
tenant_count,
state.timelines.len()
TIMELINES_STATE.lock().unwrap().timelines.len()
);
Ok(())
}
@@ -116,17 +120,21 @@ impl GlobalTimelines {
/// Loads all timelines for the given tenant to memory. Returns fs::read_dir
/// errors if any.
///
/// Note: This function (and all reading/loading below) is sync because
/// timelines are loaded while holding GlobalTimelinesState lock. Which is
/// fine as this is called only from single threaded main runtime on boot,
/// but clippy complains anyway, and suppressing that isn't trivial as async
/// is the keyword, ha. That only other user is pull_timeline.rs for which
/// being blocked is not that bad, and we can do spawn_blocking.
fn load_tenant_timelines(
state: &mut MutexGuard<'_, GlobalTimelinesState>,
tenant_id: TenantId,
) -> Result<()> {
let timelines_dir = state.get_conf().tenant_dir(&tenant_id);
/// It is async for update_status_notify sake. Since TIMELINES_STATE lock is
/// sync and there is no important reason to make it async (it is always
/// held for a short while) we just lock and unlock it for each timeline --
/// this function is called during init when nothing else is running, so
/// this is fine.
async fn load_tenant_timelines(tenant_id: TenantId) -> Result<()> {
let (conf, wal_backup_launcher_tx) = {
let state = TIMELINES_STATE.lock().unwrap();
(
state.get_conf().clone(),
state.wal_backup_launcher_tx.as_ref().unwrap().clone(),
)
};
let timelines_dir = conf.tenant_dir(&tenant_id);
for timelines_dir_entry in std::fs::read_dir(&timelines_dir)
.with_context(|| format!("failed to list timelines dir {}", timelines_dir.display()))?
{
@@ -136,13 +144,16 @@ impl GlobalTimelines {
TimelineId::from_str(timeline_dir_entry.file_name().to_str().unwrap_or(""))
{
let ttid = TenantTimelineId::new(tenant_id, timeline_id);
match Timeline::load_timeline(
state.get_conf().clone(),
ttid,
state.wal_backup_launcher_tx.as_ref().unwrap().clone(),
) {
match Timeline::load_timeline(&conf, ttid, wal_backup_launcher_tx.clone()) {
Ok(timeline) => {
state.timelines.insert(ttid, Arc::new(timeline));
let tli = Arc::new(timeline);
TIMELINES_STATE
.lock()
.unwrap()
.timelines
.insert(ttid, tli.clone());
tli.bootstrap(&conf);
tli.update_status_notify().await.unwrap();
}
// If we can't load a timeline, it's most likely because of a corrupted
// directory. We will log an error and won't allow to delete/recreate
@@ -168,18 +179,22 @@ impl GlobalTimelines {
}
/// Load timeline from disk to the memory.
pub fn load_timeline(ttid: TenantTimelineId) -> Result<Arc<Timeline>> {
pub async fn load_timeline(ttid: TenantTimelineId) -> Result<Arc<Timeline>> {
let (conf, wal_backup_launcher_tx) = TIMELINES_STATE.lock().unwrap().get_dependencies();
match Timeline::load_timeline(conf, ttid, wal_backup_launcher_tx) {
match Timeline::load_timeline(&conf, ttid, wal_backup_launcher_tx) {
Ok(timeline) => {
let tli = Arc::new(timeline);
// TODO: prevent concurrent timeline creation/loading
TIMELINES_STATE
.lock()
.unwrap()
.timelines
.insert(ttid, tli.clone());
tli.bootstrap(&conf);
Ok(tli)
}
// If we can't load a timeline, it's bad. Caller will figure it out.
@@ -217,7 +232,7 @@ impl GlobalTimelines {
info!("creating new timeline {}", ttid);
let timeline = Arc::new(Timeline::create_empty(
conf,
&conf,
ttid,
wal_backup_launcher_tx,
server_info,
@@ -240,23 +255,24 @@ impl GlobalTimelines {
// Write the new timeline to the disk and start background workers.
// Bootstrap is transactional, so if it fails, the timeline will be deleted,
// and the state on disk should remain unchanged.
if let Err(e) = timeline.bootstrap(&mut shared_state).await {
// Note: the most likely reason for bootstrap failure is that the timeline
if let Err(e) = timeline.init_new(&mut shared_state, &conf).await {
// Note: the most likely reason for init failure is that the timeline
// directory already exists on disk. This happens when timeline is corrupted
// and wasn't loaded from disk on startup because of that. We want to preserve
// the timeline directory in this case, for further inspection.
// TODO: this is an unusual error, perhaps we should send it to sentry
// TODO: compute will try to create timeline every second, we should add backoff
error!("failed to bootstrap timeline {}: {}", ttid, e);
error!("failed to init new timeline {}: {}", ttid, e);
// Timeline failed to bootstrap, it cannot be used. Remove it from the map.
// Timeline failed to init, it cannot be used. Remove it from the map.
TIMELINES_STATE.lock().unwrap().timelines.remove(&ttid);
return Err(e);
}
// We are done with bootstrap, release the lock, return the timeline.
// {} block forces release before .await
}
timeline.update_status_notify().await?;
timeline.wal_backup_launcher_tx.send(timeline.ttid).await?;
Ok(timeline)
}

View File

@@ -12,25 +12,26 @@ import psycopg2.extras
# We call the test "flaky" if it failed at least once on the main branch in the last N=10 days.
FLAKY_TESTS_QUERY = """
SELECT
DISTINCT parent_suite, suite, test
DISTINCT parent_suite, suite, REGEXP_REPLACE(test, '(release|debug)-pg(\\d+)-?', '') as deparametrized_test
FROM
(
SELECT
revision,
jsonb_array_elements(data -> 'children') -> 'name' as parent_suite,
jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'name' as suite,
jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') -> 'name' as test,
jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') -> 'status' as status,
jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') -> 'retriesStatusChange' as retries_status_change,
to_timestamp((jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') -> 'time' -> 'start')::bigint / 1000)::date as timestamp
reference,
jsonb_array_elements(data -> 'children') ->> 'name' as parent_suite,
jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') ->> 'name' as suite,
jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') ->> 'name' as test,
jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') ->> 'status' as status,
jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') ->> 'retriesStatusChange' as retries_status_change,
to_timestamp((jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') -> 'time' ->> 'start')::bigint / 1000)::date as timestamp
FROM
regress_test_results
WHERE
reference = 'refs/heads/main'
) data
WHERE
timestamp > CURRENT_DATE - INTERVAL '%s' day
AND (status::text IN ('"failed"', '"broken"') OR retries_status_change::boolean)
AND (
(status IN ('failed', 'broken') AND reference = 'refs/heads/main')
OR retries_status_change::boolean
)
;
"""
@@ -40,6 +41,9 @@ def main(args: argparse.Namespace):
interval_days = args.days
output = args.output
build_type = args.build_type
pg_version = args.pg_version
res: DefaultDict[str, DefaultDict[str, Dict[str, bool]]]
res = defaultdict(lambda: defaultdict(dict))
@@ -55,8 +59,21 @@ def main(args: argparse.Namespace):
rows = []
for row in rows:
logging.info(f"\t{row['parent_suite'].replace('.', '/')}/{row['suite']}.py::{row['test']}")
res[row["parent_suite"]][row["suite"]][row["test"]] = True
# We don't want to automatically rerun tests in a performance suite
if row["parent_suite"] != "test_runner.regress":
continue
deparametrized_test = row["deparametrized_test"]
dash_if_needed = "" if deparametrized_test.endswith("[]") else "-"
parametrized_test = deparametrized_test.replace(
"[",
f"[{build_type}-pg{pg_version}{dash_if_needed}",
)
res[row["parent_suite"]][row["suite"]][parametrized_test] = True
logging.info(
f"\t{row['parent_suite'].replace('.', '/')}/{row['suite']}.py::{parametrized_test}"
)
logging.info(f"saving results to {output.name}")
json.dump(res, output, indent=2)
@@ -77,6 +94,18 @@ if __name__ == "__main__":
type=int,
help="how many days to look back for flaky tests (default: 10)",
)
parser.add_argument(
"--build-type",
required=True,
type=str,
help="for which build type to create list of flaky tests (debug or release)",
)
parser.add_argument(
"--pg-version",
required=True,
type=int,
help="for which Postgres version to create list of flaky tests (14, 15, etc.)",
)
parser.add_argument(
"connstr",
help="connection string to the test results database",

View File

@@ -125,6 +125,7 @@ async fn publish(client: Option<BrokerClientChannel>, n_keys: u64) {
tenant_id: vec![0xFF; 16],
timeline_id: tli_from_u64(counter % n_keys),
}),
term: 0,
last_log_term: 0,
flush_lsn: counter,
commit_lsn: 2,
@@ -132,6 +133,7 @@ async fn publish(client: Option<BrokerClientChannel>, n_keys: u64) {
remote_consistent_lsn: 4,
peer_horizon_lsn: 5,
safekeeper_connstr: "zenith-1-sk-1.local:7676".to_owned(),
http_connstr: "zenith-1-sk-1.local:7677".to_owned(),
local_start_lsn: 0,
availability_zone: None,
};

View File

@@ -22,6 +22,8 @@ message SubscribeSafekeeperInfoRequest {
message SafekeeperTimelineInfo {
uint64 safekeeper_id = 1;
TenantTimelineId tenant_timeline_id = 2;
// Safekeeper term
uint64 term = 12;
// Term of the last entry.
uint64 last_log_term = 3;
// LSN of the last record.
@@ -36,6 +38,8 @@ message SafekeeperTimelineInfo {
uint64 local_start_lsn = 9;
// A connection string to use for WAL receiving.
string safekeeper_connstr = 10;
// HTTP endpoint connection string
string http_connstr = 13;
// Availability zone of a safekeeper.
optional string availability_zone = 11;
}

View File

@@ -519,6 +519,7 @@ mod tests {
tenant_id: vec![0x00; 16],
timeline_id,
}),
term: 0,
last_log_term: 0,
flush_lsn: 1,
commit_lsn: 2,
@@ -526,6 +527,7 @@ mod tests {
remote_consistent_lsn: 4,
peer_horizon_lsn: 5,
safekeeper_connstr: "neon-1-sk-1.local:7676".to_owned(),
http_connstr: "neon-1-sk-1.local:7677".to_owned(),
local_start_lsn: 0,
availability_zone: None,
}

View File

@@ -1524,7 +1524,7 @@ class NeonPageserver(PgProtocol):
".*wait for layer upload ops to complete.*", # .*Caused by:.*wait_completion aborted because upload queue was stopped
".*gc_loop.*Gc failed, retrying in.*timeline is Stopping", # When gc checks timeline state after acquiring layer_removal_cs
".*gc_loop.*Gc failed, retrying in.*: Cannot run GC iteration on inactive tenant", # Tenant::gc precondition
".*compaction_loop.*Compaction failed, retrying in.*timeline or pageserver is shutting down", # When compaction checks timeline state after acquiring layer_removal_cs
".*compaction_loop.*Compaction failed, retrying in.*timeline is Stopping", # When compaction checks timeline state after acquiring layer_removal_cs
".*query handler for 'pagestream.*failed: Timeline .* was not found", # postgres reconnects while timeline_delete doesn't hold the tenant's timelines.lock()
".*query handler for 'pagestream.*failed: Timeline .* is not active", # timeline delete in progress
".*task iteration took longer than the configured period.*",

View File

@@ -233,10 +233,19 @@ if TYPE_CHECKING:
def assert_prefix_empty(neon_env_builder: "NeonEnvBuilder", prefix: Optional[str] = None):
response = list_prefix(neon_env_builder, prefix)
objects = response.get("Contents")
assert (
response["KeyCount"] == 0
), f"remote dir with prefix {prefix} is not empty after deletion: {objects}"
keys = response["KeyCount"]
objects = response.get("Contents", [])
if keys != 0 and len(objects) == 0:
# this has been seen in one case with mock_s3:
# https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4938/6000769714/index.html#suites/3556ed71f2d69272a7014df6dcb02317/ca01e4f4d8d9a11f
# looking at moto impl, it might be there's a race with common prefix (sub directory) not going away with deletes
common_prefixes = response.get("CommonPrefixes", [])
log.warn(
f"contradicting ListObjectsV2 response with KeyCount={keys} and Contents={objects}, CommonPrefixes={common_prefixes}"
)
assert keys == 0, f"remote dir with prefix {prefix} is not empty after deletion: {objects}"
def assert_prefix_not_empty(neon_env_builder: "NeonEnvBuilder", prefix: Optional[str] = None):

View File

@@ -22,7 +22,7 @@ def positive_env(neon_env_builder: NeonEnvBuilder) -> NeonEnv:
# eviction might be the first one after an attach to access the layers
env.pageserver.allowed_errors.append(
".*unexpectedly on-demand downloading remote layer .* for task kind Eviction"
".*unexpectedly on-demand downloading remote layer remote.* for task kind Eviction"
)
assert isinstance(env.remote_storage, LocalFsStorage)
return env

Some files were not shown because too many files have changed in this diff Show More