Compare commits

..

67 Commits

Author SHA1 Message Date
Konstantin Knizhnik
93b0f44a06 Set lfc_size_limit as Min(lfc_size_limit, lfc_max_mem) 2023-08-24 17:54:22 +03:00
Konstantin Knizhnik
fc206e60d2 Add lfc_max_mem GUC 2023-08-24 17:54:22 +03:00
Konstantin Knizhnik
7b8ccee8db Restore neon.lfs_file_parth GUC 2023-08-24 17:54:22 +03:00
Konstantin Knizhnik
d8d38f2c42 Monitor avaiable memory size in local file cache and shrink cache if watermark is reached 2023-08-24 17:54:22 +03:00
Chengpeng Yan
fa74d5649e rename EphmeralFile::size to EphemeralFile::len (#5076)
## Problem
close https://github.com/neondatabase/neon/issues/5034

## Summary of changes
Based on the
[comment](https://github.com/neondatabase/neon/pull/4994#discussion_r1297277922).
Just rename the `EphmeralFile::size` to `EphemeralFile::len`.
2023-08-24 16:41:57 +02:00
Joonas Koivunen
f70871dfd0 internal-devx: pageserver future layers (#5092)
I've personally forgotten why/how can we have future layers during
reconciliation. Adds `#[cfg(feature = "testing")]` logging when we
upload such index_part.json, with a cross reference to where the cleanup
happens.

Latest private slack thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1692879032573809?thread_ts=1692792276.173979&cid=C033RQ5SPDH

Builds upon #5074. Should had been considered on #4837.
2023-08-24 17:22:36 +03:00
Alek Westover
99a1be6c4e remove upload step from neon, it is in private repo now (#5085) 2023-08-24 17:14:40 +03:00
Joonas Koivunen
76aa01c90f refactor: single phase Timeline::load_layer_map (#5074)
Current implementation first calls `load_layer_map`, which loads all
local layers, cleans up files, leave cleaning up stuff to "second
function". Then the "second function" is finally called, it does not do
the cleanup and some of the first functions setup can torn down. "Second
function" is actually both `reconcile_with_remote` and
`create_remote_layers`.

This change makes it a bit more verbose but in one phase with the
following sub-steps:
1. scan the timeline directory
2. delete extra files
    - now including on-demand download files
    - fixes #3660
3. recoincile the two sources of layers (directory, index_part)
4. rename_to_backup future layers, short layers
5. create the remaining as layers

Needed by #4938.

It was also noticed that this is blocking code in an `async fn` so just
do it in a `spawn_blocking`, which should be healthy for our startup
times. Other effects includes hopefully halving of `stat` calls; extra
calls which were not done previously are now done for the future layers.

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
2023-08-24 16:07:40 +03:00
John Spray
3e2f0ffb11 libs: make backoff::retry() take a cancellation token (#5065)
## Problem

Currently, anything that uses backoff::retry will delay the join of its
task by however long its backoff sleep is, multiplied by its max
retries.

Whenever we call a function that sleeps, we should be passing in a
CancellationToken.

## Summary of changes

- Add a `Cancel` type to backoff::retry that wraps a CancellationToken
and an error `Fn` to generate an error if the cancellation token fires.
- In call sites that already run in a `task_mgr` task, use
`shutdown_token()` to provide the token. In other locations, use a dead
`CancellationToken` to satisfy the interface, and leave a TODO to fix it
up when we broaden the use of explicit cancellation tokens.
2023-08-24 14:54:46 +03:00
Arseny Sher
d597e6d42b Track list of walreceivers and their voting/streaming state in shmem.
Also add both walsenders and walreceivers to TimelineStatus (available under
v1/tenant/xxx/timeline/yyy).

Prepares for
https://github.com/neondatabase/neon/pull/4875
2023-08-23 16:04:08 +03:00
Christian Schwarz
71ccb07a43 ci: fix upload-postgres-extensions-to-s3 job (#5063)
This is cherry-picked-then-improved version of release branch commit
4204960942 PR #4861)

The commit

	commit 5f8fd640bf
	Author: Alek Westover <alek.westover@gmail.com>
	Date:   Wed Jul 26 08:24:03 2023 -0400

	    Upload Test Remote Extensions (#4792)

switched to using the release tag instead of `latest`, but,
the `promote-images` job only uploads `latest` to the prod ECR.

The switch to using release tag was good in principle, but,
it broke the release pipeline. So, switch release pipeline
back to using `latest`.

Note that a proper fix should abandon use of `:latest` tag
at all: currently, if a `main` pipeline runs concurrently
with a `release` pipeline, the `release` pipeline may end
up using the `main` pipeline's images.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-08-22 22:45:25 +03:00
Joonas Koivunen
ad8d777c1c refactor: remove is_incremental=true for ImageLayers footgun (#5061)
Accidentially giving is_incremental=true for ImageLayers costs a lot of
debugging time. Removes all API which would allow to do that. They can
easily be restored later *when needed*.

Split off from #4938.
2023-08-22 22:12:05 +03:00
Joonas Koivunen
2f97b43315 build: update tar, get rid of duplicate xattr (#5071)
`tar` recently pushed to 0.4.40. No big changes, but less Cargo.lock and
one less nagging from `cargo-deny`.

The diff:
https://github.com/alexcrichton/tar-rs/compare/0.4.38...0.4.40.
2023-08-22 21:21:44 +03:00
Joonas Koivunen
533a92636c refactor: pre-cleanup Layer, PersistentLayer and impls (#5059)
Remove pub but dead code, move trait methods as inherent methods, remove
unnecessary. Split off from #4938.
2023-08-22 21:14:28 +03:00
Alek Westover
bf303a6575 Trigger workflow in remote (private) repo to build and upload private extensions (#4944) 2023-08-22 13:32:29 -04:00
Christian Schwarz
8cd20485f8 metrics: smgr query time: add a pre-aggregated histogram (#5064)
When doing global queries in VictoriaMetrics, the per-timeline
histograms make us run into cardinality limits.

We don't want to give them up just yet because we don't
have an alternative for drilling down on timeline-specific
performance issues.

So, add a pre-aggregated histogram and add observations to it
whenever we add observations to the per-timeline histogram.

While we're at it, switch to using a strummed enum for the operation
type names.
2023-08-22 20:08:31 +03:00
Joonas Koivunen
933a869f00 refactor: compaction becomes async again (#5058)
#4938 will make on-demand download of layers in compaction possible, so
it's not suitable for our "policy" of no `spawn_blocking(|| ...
Handle::block_on(async { spawn_blocking(...).await })` because this
poses a clear deadlock risk. Nested spawn_blockings are because of the
download using `tokio::fs::File`.

- Remove `spawn_blocking` from caller of `compact_level0_phase1`
- Remove `Handle::block_on` from `compact_level0_phase1` (indentation
change)
- Revert to `AsLayerDesc::layer_desc` usage temporarily (until it
becomes field access in #4938)
2023-08-22 20:03:14 +03:00
Conrad Ludgate
8c6541fea9 chore: add supported targets to deny (#5070)
## Problem

many duplicate windows crates pollute the cargo deny output

## Summary of changes

we don't build those crates, so remove those targets from being checked
2023-08-22 19:44:31 +03:00
Alek Westover
5cf75d92d8 Fix cargo deny errors (#5068)
## Problem
cargo deny lint broken

Links to the CVEs:

[rustsec.org/advisories/RUSTSEC-2023-0052](https://rustsec.org/advisories/RUSTSEC-2023-0052)

[rustsec.org/advisories/RUSTSEC-2023-0053](https://rustsec.org/advisories/RUSTSEC-2023-0053)
One is fixed, the other one isn't so we allow it (for now), to unbreak
CI. Then later we'll try to get rid of webpki in favour of the rustls
fork.

## Summary of changes
```
+ignore = ["RUSTSEC-2023-0052"]
```
2023-08-22 18:41:32 +03:00
Conrad Ludgate
0b001a0001 proxy: remove connections on shutdown (#5051)
## Problem

On shutdown, proxy connections are staying open.

## Summary of changes

Remove the connections on shutdown
2023-08-21 19:20:58 +01:00
Felix Prasanna
4a8bd866f6 bump vm-builder version to v0.16.3 (#5055)
This change to autoscaling allows agents to connect directly to the
monitor, completely removing the informant.
2023-08-21 13:29:16 -04:00
John Spray
615a490239 pageserver: refactor Tenant/Timeline args into structs (#5053)
## Problem

There are some common types that we pass into tenants and timelines as
we construct them, such as remote storage and the broker client.
Currently the list is small, but this is likely to grow -- the deletion
queue PR (#4960) pushed some methods to the point of clippy complaining
they had too many args, because of the extra deletion queue client being
passed around.

There are some shared objects that currently aren't passed around
explicitly because they use a static `once_cell` (e.g.
CONCURRENT_COMPACTIONS), but as we add more resource management and
concurreny control over time, it will be more readable & testable to
pass a type around in the respective Resources object, rather than to
coordinate via static objects. The `Resources` structures in this PR
will make it easier to add references to central coordination functions,
without having to rely on statics.

## Summary of changes

- For `Tenant`, the `broker_client` and `remote_storage` are bundled
into `TenantSharedResources`
- For `Timeline`, the `remote_client` is wrapped into
`TimelineResources`.

Both of these structures will get an additional deletion queue member in
#4960.
2023-08-21 17:30:28 +01:00
John Spray
b95addddd5 pageserver: do not read redundant timeline_layers from IndexPart, so that we can remove it later (#4972)
## Problem

IndexPart contains two redundant lists of layer names: a set of the
names, and then a map of name to metadata.

We already required that all the layers in `timeline_layers` are also in
`layers_metadata`, in `initialize_with_current_remote_index_part`, so if
there were any index_part.json files in the field that relied on these
sets being different, they would already be broken.

## Summary of changes

`timeline_layers` is made private and no longer read at runtime. It is
still serialized, but not deserialized.

`disk_consistent_lsn` is also made private, as this field only exists
for convenience of humans reading the serialized JSON.

This prepares us to entirely remove `timeline_layers` in a future
release, once this change is fully deployed, and therefore no
pageservers are trying to read the field.
2023-08-21 14:29:36 +03:00
Joonas Koivunen
130ccb4b67 Remove initial timeline id troubles (#5044)
I made a mistake when I adding `env.initial_timeline:
Optional[TimelineId]` in the #3839, should had just generated it and
used it to create a specific timeline. This PR fixes those mistakes, and
some extra calling into psql which must be slower than python field
access.
2023-08-20 12:33:19 +03:00
Dmitry Rodionov
9140a950f4 Resume tenant deletion on attach (#5039)
I'm still a bit nervous about attach -> crash case. But it should work.
(unlike case with timeline). Ideally would be cool to cover this with
test.

This continues tradition of adding bool flags for Tenant::set_stopping.
Probably lifecycle project will help with fixing it.
2023-08-20 12:28:50 +03:00
Arpad Müller
a23b0773f1 Fix DeltaLayer dumping (#5045)
## Problem

Before, DeltaLayer dumping (via `cargo run --release -p pagectl --
print-layer-file` ) would crash as one can't call `Handle::block_on` in
an async executor thread.

## Summary of changes

Avoid the problem by using `DeltaLayerInner::load_keys` to load the keys
into RAM (which we already do during compaction), and then load the
values one by one during dumping.
2023-08-19 00:56:03 +02:00
Joonas Koivunen
368ee6c8ca refactor: failpoint support (#5033)
- move them to pageserver which is the only dependant on the crate fail
- "move" the exported macro to the new module
- support at init time the same failpoints as runtime

Found while debugging test failures and making tests more repeatable by
allowing "exit" from pageserver start via environment variables. Made
those changes to `test_gc_cutoff.py`.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-08-19 01:01:44 +03:00
Felix Prasanna
5c6a692cf1 bump VM_BUILDER_VERSION to v0.16.2 (#5031)
A very slight change that allows us to configure the UID of the
neon-postgres cgroup owner. We start postgres in this cgroup so we can
scale it with the cgroups v2 api. Currently, the control plane
overwrites the entrypoint set by `vm-builder`, so `compute_ctl` (and
thus postgres), is not started in the neon-postgres cgroup. Having
`compute_ctl` start postgres in the cgroup should fix this. However, at
the moment appears like it does not have the correct permissions.
Configuring the neon-postgres UID to `postgres` (which is the UID
`compute_ctl` runs under) should hopefully fix this.


See #4920 - the PR to modify `compute_ctl` to start postgres in the
cgorup.
See: neondatabase/autoscaling#480, neondatabase/autoscaling#477. Both
these PR's are part of an effort to increase `vm-builder`'s
configurability and allow us to adjust it as we integrate in the
monitor.
2023-08-18 14:29:20 -04:00
Dmitry Rodionov
30888a24d9 Avoid flakiness in test_timeline_delete_fail_before_local_delete (#5032)
The problem was that timeline detail can return timelines in not only
active state. And by the time request comes timeline deletion can still
be in progress if we're unlucky (test execution happened to be slower
for some reason)

Reference for failed test run

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5022/5891420105/index.html#suites/f588e0a787c49e67b29490359c589fae/dab036e9bd673274

The error was `Exception: detail succeeded (it should return 404)`

reported by @koivunej
2023-08-18 20:49:11 +03:00
Dmitry Rodionov
f6c671c140 resume timeline deletions on attach (#5030)
closes [#5036](https://github.com/neondatabase/neon/issues/5036)
2023-08-18 20:48:33 +03:00
Christian Schwarz
ed5bce7cba rfcs: archive my MVCC S3 Notion Proposal (#5040)
This is a copy from the [original Notion
page](https://www.notion.so/neondatabase/Proposal-Pageserver-MVCC-S3-Storage-8a424c0c7ec5459e89d3e3f00e87657c?pvs=4),
taken on 2023-08-16.

This is for archival mostly.
The RFC that we're likely to go with is
https://github.com/neondatabase/neon/pull/4919.
2023-08-18 19:34:29 +02:00
Christian Schwarz
7a63685cde simplify page-caching of EphemeralFile (#4994)
(This PR is the successor of https://github.com/neondatabase/neon/pull/4984 )

## Summary

The current way in which `EphemeralFile` uses `PageCache` complicates
the Pageserver code base to a degree that isn't worth it.
This PR refactors how we cache `EphemeralFile` contents, by exploiting
the append-only nature of `EphemeralFile`.

The result is that `PageCache` only holds `ImmutableFilePage` and
`MaterializedPage`.
These types of pages are read-only and evictable without write-back.
This allows us to remove the writeback code from `PageCache`, also
eliminating an entire failure mode.

Futher, many great open-source libraries exist to solve the problem of a
read-only cache,
much better than our `page_cache.rs` (e.g., better replacement policy,
less global locking).
With this PR, we can now explore using them.

## Problem & Analysis

Before this PR, `PageCache` had three types of pages:

* `ImmutableFilePage`: caches Delta / Image layer file contents
* `MaterializedPage`: caches results of Timeline::get (page
materialization)
* `EphemeralPage`: caches `EphemeralFile` contents

`EphemeralPage` is quite different from `ImmutableFilePage` and
`MaterializedPage`:

* Immutable and materialized pages are for the acceleration of (future)
reads of the same data using `PAGE_CACHE_SIZE * PAGE_SIZE` bytes of
DRAM.
* Ephemeral pages are a write-back cache of `EphemeralFile` contents,
i.e., if there is pressure in the page cache, we spill `EphemeralFile`
contents to disk.

`EphemeralFile` is only used by `InMemoryLayer`, for the following
purposes:
* **write**: when filling up the `InMemoryLayer`, via `impl BlobWriter
for EphemeralFile`
* **read**: when doing **page reconstruction** for a page@lsn that isn't
written to disk
* **read**: when writing L0 layer files, we re-read the `InMemoryLayer`
and put the contents into the L0 delta writer
(**`create_delta_layer`**). This happens every 10min or when
InMemoryLayer reaches 256MB in size.

The access patterns of the `InMemoryLayer` use case are as follows:

* **write**: via `BlobWriter`, strictly append-only
* **read for page reconstruction**: via `BlobReader`, random
* **read for `create_delta_layer`**: via `BlobReader`, dependent on
data, but generally random. Why?
* in classical LSM terms, this function is what writes the
memory-resident `C0` tree into the disk-resident `C1` tree
* in our system, though, the values of InMemoryLayer are stored in an
EphemeralFile, and hence they are not guaranteed to be memory-resident
* the function reads `Value`s in `Key, LSN` order, which is `!=` insert
order

What do these `EphemeralFile`-level access patterns mean for the page
cache?

* **write**:
* the common case is that `Value` is a WAL record, and if it isn't a
full-page-image WAL record, then it's smaller than `PAGE_SIZE`
* So, the `EphemeralPage` pages act as a buffer for these `< PAGE_CACHE`
sized writes.
* If there's no page cache eviction between subsequent
`InMemoryLayer::put_value` calls, the `EphemeralPage` is still resident,
so the page cache avoids doing a `write` system call.
* In practice, a busy page server will have page cache evictions because
we only configure 64MB of page cache size.
* **reads for page reconstruction**: read acceleration, just as for the
other page types.
* **reads for `create_delta_layer`**:
* The `Value` reads happen through a `BlockCursor`, which optimizes the
case of repeated reads from the same page.
* So, the best case is that subsequent values are located on the same
page; hence `BlockCursor`s buffer is maximally effective.
* The worst case is that each `Value` is on a different page; hence the
`BlockCursor`'s 1-page-sized buffer is ineffective.
* The best case translates into `256MB/PAGE_SIZE` page cache accesses,
one per page.
    * the worst case translates into `#Values` page cache accesses
* again, the page cache accesses must be assumed to be random because
the `Value`s aren't accessed in insertion order but `Key, LSN` order.

## Summary of changes

Preliminaries for this PR were:

- #5003
- #5004 
- #5005 
  - uncommitted microbenchmark in #5011 

Based on the observations outlined above, this PR makes the following
changes:

* Rip out `EphemeralPage` from `page_cache.rs`
* Move the `block_io::FileId` to `page_cache::FileId`
* Add a `PAGE_SIZE`d buffer to the `EphemeralPage` struct.
  It's called `mutable_tail`.
* Change `write_blob` to use `mutable_tail` for the write buffering
instead of a page cache page.
* if `mutable_tail` is full, it writes it out to disk, zeroes it out,
and re-uses it.
* There is explicitly no double-buffering, so that memory allocation per
`EphemeralFile` instance is fixed.
* Change `read_blob` to return different `BlockLease` variants depending
on `blknum`
* for the `blknum` that corresponds to the `mutable_tail`, return a ref
to it
* Rust borrowing rules prevent `write_blob` calls while refs are
outstanding.
  * for all non-tail blocks, return a page-cached `ImmutablePage`
* It is safe to page-cache these as ImmutablePage because EphemeralFile
is append-only.

## Performance 

How doe the changes above affect performance?
M claim is: not significantly.

* **write path**:
* before this PR, the `EphemeralFile::write_blob` didn't issue its own
`write` system calls.
* If there were enough free pages, it didn't issue *any* `write` system
calls.
* If it had to evict other `EphemeralPage`s to get pages a page for its
writes (`get_buf_for_write`), the page cache code would implicitly issue
the writeback of victim pages as needed.
* With this PR, `EphemeralFile::write_blob` *always* issues *all* of its
*own* `write` system calls.
* Also, the writes are explicit instead of implicit through page cache
write back, which will help #4743
* The perf impact of always doing the writes is the CPU overhead and
syscall latency.
* Before this PR, we might have never issued them if there were enough
free pages.
* We don't issue `fsync` and can expect the writes to only hit the
kernel page cache.
* There is also an advantage in issuing the writes directly: the perf
impact is paid by the tenant that caused the writes, instead of whatever
tenant evicts the `EphemeralPage`.
* **reads for page reconstruction**: no impact.
* The `write_blob` function pre-warms the page cache when it writes the
`mutable_tail` to disk.
* So, the behavior is the same as with the EphemeralPages before this
PR.
* **reads for `create_delta_layer`**: no impact.
  * Same argument as for page reconstruction.
  * Note for the future:
* going through the page cache likely causes read amplification here.
Why?
* Due to the `Key,Lsn`-ordered access pattern, we don't read all the
values in the page before moving to the next page. In the worst case, we
might read the same page multiple times to read different `Values` from
it.
    * So, it might be better to bypass the page cache here.
    * Idea drafts:
      * bypass PS page cache + prefetch pipeline + iovec-based IO
* bypass PS page cache + use `copy_file_range` to copy from ephemeral
file into the L0 delta file, without going through user space
2023-08-18 20:31:03 +03:00
Joonas Koivunen
0a082aee77 test: allow race with flush and stopped queue (#5027)
A lucky race can happen with the shutdown order I guess right now. Seen
in [test_tenant_delete_smoke].

The message is not the greatest to match against.

[test_tenant_delete_smoke]:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/5892262320/index.html#suites/3556ed71f2d69272a7014df6dcb02317/189a0d1245fb5a8c
2023-08-18 19:36:25 +03:00
Arthur Petukhovsky
0b90411380 Fix safekeeper recovery with auth (#5035)
Fix missing a password in walrcv_connect for a safekeeper recovery. Add
a test which restarts endpoint and triggers a recovery.
2023-08-18 16:48:55 +01:00
Arpad Müller
f4da010aee Make the compaction warning more tolerant (#5024)
## Problem

The performance benchmark in `test_runner/performance/test_layer_map.py`
is currently failing due to the warning added in #4888.

## Summary of changes

The test mentioned has a `compaction_target_size` of 8192, which is just
one page size. This is an unattainable goal, as we generate at least
three pages: one for the header, one for the b-tree (minimally sized
ones have just the root node in a single page), one for the data.

Therefore, we add two pages to the warning limit. The warning text
becomes a bit less accurate but I think this is okay.
2023-08-18 16:36:31 +02:00
Conrad Ludgate
ec10838aa4 proxy: pool connection logs (#5020)
## Problem

Errors and notices that happen during a pooled connection lifecycle have
no session identifiers

## Summary of changes

Using a watch channel, we set the session ID whenever it changes. This
way we can see the status of a connection for that session

Also, adding a connection id to be able to search the entire connection
lifecycle
2023-08-18 11:44:08 +01:00
Joonas Koivunen
67af24191e test: cleanup remote_timeline_client tests (#5013)
I will have to change these as I change remote_timeline_client api in
#4938. So a bit of cleanup, handle my comments which were just resolved
during initial review.

Cleanup:
- use unwrap in tests instead of mixed `?` and `unwrap`
- use `Handle` instead of `&'static Reactor` to make the
RemoteTimelineClient more natural
- use arrays in tests
- use plain `#[tokio::test]`
2023-08-17 19:27:30 +03:00
Joonas Koivunen
6af5f9bfe0 fix: format context (#5022)
We return an error with unformatted `{timeline_id}`.
2023-08-17 14:30:25 +00:00
Dmitry Rodionov
64fc7eafcd Increase timeout once again. (#5021)
When failpoint is early in deletion process it takes longer to complete
after failpoint is removed.

Example was: https://neon-github-public-dev.s3.amazonaws.com/reports/main/5889544346/index.html#suites/3556ed71f2d69272a7014df6dcb02317/49826c68ce8492b1
2023-08-17 15:37:28 +03:00
Conrad Ludgate
3e4710c59e proxy: add more sasl logs (#5012)
## Problem

A customer is having trouble connecting to neon from their production
environment. The logs show a mix of "Internal error" and "authentication
protocol violation" but not the full error

## Summary of changes

Make sure we don't miss any logs during SASL/SCRAM
2023-08-17 12:05:54 +01:00
Dmitry Rodionov
d8b0a298b7 Do not attach deleted tenants (#5008)
Rather temporary solution before proper:
https://github.com/neondatabase/neon/issues/5006

It requires more plumbing so lets not attach deleted tenants first and
then implement resume.

Additionally fix `assert_prefix_empty`. It had a buggy prefix calculation,
and since we always asserted for absence of stuff it worked. Here I
started to assert for presence of stuff too and it failed. Added more
"presence" asserts to other places to be confident that it works.

Resolves [#5016](https://github.com/neondatabase/neon/issues/5016)
2023-08-17 13:46:49 +03:00
Alexander Bayandin
c8094ee51e test_compatibility: run amcheck unconditionally (#4985)
## Problem

The previous version of neon (that we use in the forward compatibility test)
has installed `amcheck` extension now. We can run `pg_amcheck`
unconditionally.

## Summary of changes
- Run `pg_amcheck` in compatibility tests unconditionally
2023-08-17 11:46:00 +01:00
Christian Schwarz
957af049c2 ephemeral file: refactor write_blob impl to concentrate mutable state (#5004)
Before this patch, we had the `off` and `blknum` as function-wide
mutable state. Now it's contained in the `Writer` struct.

The use of `push_bytes` instead of index-based filling of the buffer
also makes it easier to reason about what's going on.

This is prep for https://github.com/neondatabase/neon/pull/4994
2023-08-17 13:07:25 +03:00
Anastasia Lubennikova
786c7b3708 Refactor remote extensions index download.
Don't download ext_index.json from s3, but instead receive it as a part of spec from control plane.
This eliminates s3 access for most compute starts,
and also allows us to update extensions spec on the fly
2023-08-17 12:48:33 +03:00
Joonas Koivunen
d3612ce266 delta_layer: Restore generic from last week (#5014)
Restores #4937 work relating to the ability to use `ResidentDeltaLayer`
(which is an Arc wrapper) in #4938 for the ValueRef's by removing the
borrow from `ValueRef` and providing it from an upper layer.

This should not have any functional changes, most importantly, the
`main` will continue to use the borrowed `DeltaLayerInner`. It might be
that I can change #4938 to be like this. If that is so, I'll gladly rip
out the `Ref` and move the borrow back. But I'll first want to look at
the current test failures.
2023-08-17 11:47:31 +03:00
Christian Schwarz
994411f5c2 page cache: newtype the blob_io and ephemeral_file file ids (#5005)
This makes it more explicit that these are different u64-sized
namespaces.
Re-using one in place of the other would be catastrophic.

Prep for https://github.com/neondatabase/neon/pull/4994
which will eliminate the ephemeral_file::FileId and move the
blob_io::FileId into page_cache.
It makes sense to have this preliminary commit though,
to minimize amount of new concept in #4994 and other
preliminaries that depend on that work.
2023-08-16 18:33:47 +02:00
Conrad Ludgate
25934ec1ba proxy: reduce global conn pool contention (#4747)
## Problem

As documented, the global connection pool will be high contention.

## Summary of changes

Use DashMap rather than Mutex<HashMap>.

Of note, DashMap currently uses a RwLock internally, but it's partially
sharded to reduce contention by a factor of N. We could potentially use
flurry which is a port of Java's concurrent hashmap, but I have no good
understanding of it's performance characteristics. Dashmap is at least
equivalent to hashmap but less contention.

See the read heavy benchmark to analyse our expected performance
<https://github.com/xacrimon/conc-map-bench#ready-heavy>

I also spoke with the developer of dashmap recently, and they are
working on porting the implementation to use concurrent HAMT FWIW
2023-08-16 17:20:28 +01:00
Arpad Müller
0bdbc39cb1 Compaction: unify key and value reference vecs (#4888)
## Problem

PR #4839 has already reduced the number of b-tree traversals and vec
creations from 3 to 2, but as pointed out in
https://github.com/neondatabase/neon/pull/4839#discussion_r1279167815 ,
we would ideally just traverse the b-tree once during compaction.

Afer #4836, the two vecs created are one for the list of keys, lsns and
sizes, and one for the list of `(key, lsn, value reference)`. However,
they are not equal, as pointed out in
https://github.com/neondatabase/neon/pull/4839#issuecomment-1660418012
and the following comment: the key vec creation combines multiple
entries for which the lsn is changing but the key stays the same into
one, with the size being the sum of the sub-sizes. In SQL, this would
correspond to something like `SELECT key, lsn, SUM(size) FROM b_tree
GROUP BY key;` and `SELECT key, lsn, val_ref FROM b_tree;`. Therefore,
the join operation is non-trivial.

## Summary of changes

This PR merges the two lists of keys and value references into one. It's
not a trivial change and affects the size pattern of the resulting
files, which is why this is in a separate PR from #4839 .

The key vec is used in compaction for determining when to start a new
layer file. The loop uses various thresholds to come to this conclusion,
but the grouping via the key has led to the behaviour that regardless of
the threshold, it only starts a new file when either a new key is
encountered, or a new delta file.

The new code now does the combination after the merging and sorting of
the various keys from the delta files. This *mostly* does the same as
the old code, except for a detail: with the grouping done on a
per-delta-layer basis, the sorted and merged vec would still have
multiple entries for multiple delta files, but now, we don't have an
easy way to tell when a new input delta layer file is encountered, so we
cannot create multiple entries on that basis easily.

To prevent possibly infinite growth, our new grouping code compares the
combined size with the threshold, and if it is exceeded, it cuts a new
entry so that the downstream code can cut a new output file. Here, we
perform a tradeoff however, as if the threshold is too small, we risk
putting entries for the same key into multiple layer files, but if the
threshold is too big, we can in some instances exceed the target size.

Currently, we set the threshold to the target size, so in theory we
would stay below or roughly at double the `target_file_size`.

We also fix the way the size was calculated for the last key. The calculation
was wrong and accounted for the old layer's btree, even though we
already account for the overhead of the in-construction btree.

Builds on top of #4839 .
2023-08-16 18:27:18 +03:00
Dmitry Rodionov
96b84ace89 Correctly remove orphaned objects in RemoteTimelineClient::delete_all (#5000)
Previously list_prefixes was incorrectly used for that purpose. Change
to use list_files. Add a test.

Some drive by refactorings on python side to move helpers out of
specific test file to be widely accessible

resolves https://github.com/neondatabase/neon/issues/4499
2023-08-16 17:31:16 +03:00
Christian Schwarz
368b783ada ephemeral_file: remove FileExt impl (was only used by tests) (#5003)
Extracted from https://github.com/neondatabase/neon/pull/4994
2023-08-16 15:41:25 +02:00
Dmitry Rodionov
0f47bc03eb Fix delete_objects in UnreliableWrapper (#5002)
For `delete_objects` it was injecting failures for whole delete_objects operation
and then for every delete it contains. Make it fail once for the whole operation.
2023-08-16 14:08:53 +03:00
Arseny Sher
fdbe8dc8e0 Fix test_s3_wal_replay flakiness.
ref https://github.com/neondatabase/neon/issues/4466
2023-08-16 12:57:43 +03:00
Arthur Petukhovsky
1b97a3074c Disable neon-pool-opt-in (#4995) 2023-08-15 20:57:56 +03:00
John Spray
5c836ee5b4 tests: extend timeout in timeline deletion test (#4992)
## Problem

This was set to 5 seconds, which was very close to how long a compaction
took on my workstation, and when deletion is blocked on compaction the
test would fail.

We will fix this to make compactions drop out on deletion, but for the
moment let's stabilize the test.

## Summary of changes

Change timeout on timeline deletion in
`test_timeline_deletion_with_files_stuck_in_upload_queue` from 5 seconds
to 30 seconds.
2023-08-15 20:14:03 +03:00
Arseny Sher
4687b2e597 Test that auth on pg/http services can be enabled separately in sks.
To this end add
1) -e option to 'neon_local safekeeper start' command appending extra options
   to safekeeper invocation;
2) Allow multiple occurrences of the same option in safekeepers, the last
   value is taken.
3) Allow to specify empty string for *-auth-public-key-path opts, it
   disables auth for the service.
2023-08-15 19:31:20 +03:00
Arseny Sher
13adc83fc3 Allow to enable http/pg/pg tenant only auth separately in safekeeper.
The same option enables auth and specifies public key, so this allows to use
different public keys as well. The motivation is to
1) Allow to e.g. change pageserver key/token without replacing all compute
  tokens.
2) Enable auth gradually.
2023-08-15 19:31:20 +03:00
Dmitry Rodionov
52c2c69351 fsync directory before mark file removal (#4986)
## Problem

Deletions can be possibly reordered. Use fsync to avoid the case when
mark file doesnt exist but other tenant/timeline files do.

See added comments.

resolves #4987
2023-08-15 19:24:23 +03:00
Alexander Bayandin
207919f5eb Upload test results to DB right after generation (#4967)
## Problem

While adding new test results format, I've also changed the way we
upload Allure reports to S3
(722c7956bb)
to avoid duplicated results from previous runs. But it broke links at
earlier results (results are still available but on different URLs).

This PR fixes this (by reverting logic in
722c7956bb
changes), and moves the logic for storing test results into db to allure
generate step. It allows us to avoid test results duplicates in the db
and saves some time on extra s3 downloads that happened in a different
job before the PR.

Ref https://neondb.slack.com/archives/C059ZC138NR/p1691669522160229

## Summary of changes
- Move test results storing logic from a workflow to
`actions/allure-report-generate`
2023-08-15 15:32:30 +01:00
George MacKerron
218be9eb32 Added deferrable transaction option to http batch queries (#4993)
## Problem

HTTP batch queries currently allow us to set the isolation level and
read only, but not deferrable.

## Summary of changes

Add support for deferrable.

Echo deferrable status in response headers only if true.

Likewise, now echo read-only status in response headers only if true.
2023-08-15 14:52:00 +01:00
Joonas Koivunen
8198b865c3 Remote storage metrics follow-up (#4957)
#4942 left old metrics in place for migration purposes. It was noticed
that from new metrics the total number of deleted objects was forgotten,
add it.

While reviewing, it was noticed that the delete_object could just be
delete_objects of one.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-08-15 12:30:27 +03:00
Arpad Müller
baf395983f Turn BlockLease associated type into an enum (#4982)
## Problem

The `BlockReader` trait is not ready to be asyncified, as associated
types are not supported by asyncification strategies like via the
`async_trait` macro, or via adopting enums.

## Summary of changes

Remove the `BlockLease` associated type from the `BlockReader` trait and
turn it into an enum instead, bearing the same name. The enum has two
variants, one of which is gated by `#[cfg(test)]`. Therefore, outside of
test settings, the enum has zero overhead over just having the
`PageReadGuard`. Using the enum allows us to impl `BlockReader` without
needing the page cache.

Part of https://github.com/neondatabase/neon/issues/4743
2023-08-14 18:48:09 +02:00
Arpad Müller
ce7efbe48a Turn BlockCursor::{read_blob,read_blob_into_buf} async fn (#4905)
## Problem

The `BlockCursor::read_blob` and `BlockCursor::read_blob_into_buf`
functions are calling `read_blk` internally, so if we want to make that
function async fn, they need to be async themselves.

## Summary of changes

* We first turn `ValueRef::load` into an async fn.
* Then, we switch the `RwLock` implementation in `InMemoryLayer` to use
the one from `tokio`.
* Last, we convert the `read_blob` and `read_blob_into_buf` functions
into async fn.

In three instances we use `Handle::block_on`:

* one use is in compaction code, which currently isn't async. We put the
entire loop into an `async` block to prevent the potentially hot loop
from doing cross-thread operations.
* one use is in dumping code for `DeltaLayer`. The "proper" way to
address this would be to enable the visit function to take async
closures, but then we'd need to be generic over async fs non async,
which [isn't supported by rust right
now](https://blog.rust-lang.org/inside-rust/2022/07/27/keyword-generics.html).
The other alternative would be to do a first pass where we cache the
data into memory, and only then to dump it.
* the third use is in writing code, inside a loop that copies from one
file to another. It is is synchronous and we'd like to keep it that way
(for now?).

Part of #4743
2023-08-14 17:20:37 +02:00
Tristan Partin
ef4a76c01e Update Postgres to v15.4 and v14.9 (#4965) 2023-08-14 16:19:45 +01:00
George MacKerron
1ca08cc523 Changed batch query body to from [...] to { queries: [...] } (#4975)
## Problem

It's nice if `single query : single response :: batch query : batch
response`.

But at present, in the single case we send `{ query: '', params: [] }`
and get back a single `{ rows: [], ... }` object, while in the batch
case we send an array of `{ query: '', params: [] }` objects and get
back not an array of `{ rows: [], ... }` objects but a `{ results: [ {
rows: [] , ... }, { rows: [] , ... }, ... ] }` object instead.

## Summary of changes

With this change, the batch query body becomes `{ queries: [{ query: '',
params: [] }, ... ] }`, which restores a consistent relationship between
the request and response bodies.
2023-08-14 16:07:33 +01:00
Dmitry Rodionov
4626d89eda Harden retries on tenant/timeline deletion path. (#4973)
Originated from test failure where we got SlowDown error from s3.
The patch generalizes `download_retry` to not be download specific.
Resulting `retry` function is moved to utils crate. `download_retries`
is now a thin wrapper around this `retry` function.

To ensure that all needed retries are in place test code now uses
`test_remote_failures=1` setting.

Ref https://neondb.slack.com/archives/C059ZC138NR/p1691743624353009
2023-08-14 17:16:49 +03:00
Arseny Sher
49c57c0b13 Add neon_local to docker image.
People sometimes ask about this.

https://community.neon.tech/t/is-the-neon-local-binary-in-any-of-the-official-docker-images/360/2
2023-08-14 14:08:51 +03:00
John Spray
d3a97fdf88 pageserver: avoid incrementing access time when reading layers for compaction (#4971)
## Problem

Currently, image generation reads delta layers before writing out
subsequent image layers, which updates the access time of the delta
layers and effectively puts them at the back of the queue for eviction.
This is the opposite of what we want, because after a delta layer is
covered by a later image layer, it's likely that subsequent reads of
latest data will hit the image rather than the delta layer, so the delta
layer should be quite a good candidate for eviction.

## Summary of changes

`RequestContext` gets a new `ATimeBehavior` field, and a
`RequestContextBuilder` helper so that we can optionally add the new
field without growing `RequestContext::new` every time we add something
like this.

Request context is passed into the `record_access` function, and the
access time is not updated if `ATimeBehavior::Skip` is set.

The compaction background task constructs its request context with this
skip policy.

Closes: https://github.com/neondatabase/neon/issues/4969
2023-08-14 10:18:22 +01:00
104 changed files with 4388 additions and 2719 deletions

View File

@@ -1,6 +1,13 @@
name: 'Create Allure report'
description: 'Generate Allure report from uploaded by actions/allure-report-store tests results'
inputs:
store-test-results-into-db:
description: 'Whether to store test results into the database. TEST_RESULT_CONNSTR/TEST_RESULT_CONNSTR_NEW should be set'
type: boolean
required: false
default: false
outputs:
base-url:
description: 'Base URL for Allure report'
@@ -139,9 +146,11 @@ runs:
sed -i 's|<a href="." class=|<a href="https://'${BUCKET}'.s3.amazonaws.com/'${REPORT_PREFIX}'/latest/index.html?nocache='"'+Date.now()+'"'" class=|g' ${WORKDIR}/report/app.js
# Upload a history and the final report (in this particular order to not to have duplicated history in 2 places)
# Use sync for the final report to delete files from previous runs
time aws s3 mv --recursive --only-show-errors "${WORKDIR}/report/history" "s3://${BUCKET}/${REPORT_PREFIX}/latest/history"
time aws s3 sync --delete --only-show-errors "${WORKDIR}/report" "s3://${BUCKET}/${REPORT_PREFIX}/${GITHUB_RUN_ID}"
# Use aws s3 cp (instead of aws s3 sync) to keep files from previous runs to make old URLs work,
# and to keep files on the host to upload them to the database
time aws s3 cp --recursive --only-show-errors "${WORKDIR}/report" "s3://${BUCKET}/${REPORT_PREFIX}/${GITHUB_RUN_ID}"
# Generate redirect
cat <<EOF > ${WORKDIR}/index.html
@@ -170,6 +179,41 @@ runs:
aws s3 rm "s3://${BUCKET}/${LOCK_FILE}"
fi
- name: Store Allure test stat in the DB
if: ${{ !cancelled() && inputs.store-test-results-into-db == 'true' }}
shell: bash -euxo pipefail {0}
env:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
REPORT_JSON_URL: ${{ steps.generate-report.outputs.report-json-url }}
run: |
export DATABASE_URL=${REGRESS_TEST_RESULT_CONNSTR}
./scripts/pysync
poetry run python3 scripts/ingest_regress_test_result.py \
--revision ${COMMIT_SHA} \
--reference ${GITHUB_REF} \
--build-type unified \
--ingest ${WORKDIR}/report/data/suites.json
- name: Store Allure test stat in the DB (new)
if: ${{ !cancelled() && inputs.store-test-results-into-db == 'true' }}
shell: bash -euxo pipefail {0}
env:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
BASE_S3_URL: ${{ steps.generate-report.outputs.base-s3-url }}
run: |
export DATABASE_URL=${REGRESS_TEST_RESULT_CONNSTR_NEW}
./scripts/pysync
poetry run python3 scripts/ingest_regress_test_result-new-format.py \
--reference ${GITHUB_REF} \
--revision ${COMMIT_SHA} \
--run-id ${GITHUB_RUN_ID} \
--run-attempt ${GITHUB_RUN_ATTEMPT} \
--test-cases-dir ${WORKDIR}/report/data/test-cases
- name: Cleanup
if: always()
shell: bash -euxo pipefail {0}

View File

@@ -432,6 +432,11 @@ jobs:
if: ${{ !cancelled() }}
id: create-allure-report
uses: ./.github/actions/allure-report-generate
with:
store-test-results-into-db: true
env:
REGRESS_TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR }}
REGRESS_TEST_RESULT_CONNSTR_NEW: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
- uses: actions/github-script@v6
if: ${{ !cancelled() }}
@@ -452,45 +457,6 @@ jobs:
report,
})
- name: Store Allure test stat in the DB
if: ${{ !cancelled() && steps.create-allure-report.outputs.report-json-url }}
env:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
REPORT_JSON_URL: ${{ steps.create-allure-report.outputs.report-json-url }}
TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR }}
run: |
./scripts/pysync
curl --fail --output suites.json "${REPORT_JSON_URL}"
export BUILD_TYPE=unified
export DATABASE_URL="$TEST_RESULT_CONNSTR"
poetry run python3 scripts/ingest_regress_test_result.py \
--revision ${COMMIT_SHA} \
--reference ${GITHUB_REF} \
--build-type ${BUILD_TYPE} \
--ingest suites.json
- name: Store Allure test stat in the DB (new)
if: ${{ !cancelled() && steps.create-allure-report.outputs.report-json-url }}
env:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
REPORT_JSON_URL: ${{ steps.create-allure-report.outputs.report-json-url }}
TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
BASE_S3_URL: ${{ steps.create-allure-report.outputs.base-s3-url }}
run: |
aws s3 cp --only-show-errors --recursive ${BASE_S3_URL}/data/test-cases ./test-cases
./scripts/pysync
export DATABASE_URL="$TEST_RESULT_CONNSTR"
poetry run python3 scripts/ingest_regress_test_result-new-format.py \
--reference ${GITHUB_REF} \
--revision ${COMMIT_SHA} \
--run-id ${GITHUB_RUN_ID} \
--run-attempt ${GITHUB_RUN_ATTEMPT} \
--test-cases-dir ./test-cases
coverage-report:
runs-on: [ self-hosted, gen3, small ]
container:
@@ -771,34 +737,6 @@ jobs:
--destination neondatabase/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
--cleanup
# Due to a kaniko bug, we can't use cache for extensions image, thus it takes about the same amount of time as compute-node image to build (~10 min)
# During the transition period we need to have extensions in both places (in S3 and in compute-node image),
# so we won't build extension twice, but extract them from compute-node.
#
# For now we use extensions image only for new custom extensitons
- name: Kaniko build extensions only
run: |
# Kaniko is suposed to clean up after itself if --cleanup flag is set, but it doesn't.
# Despite some fixes were made in https://github.com/GoogleContainerTools/kaniko/pull/2504 (in kaniko v1.11.0),
# it still fails with error:
# error building image: could not save file: copying file: symlink postgres /kaniko/1/usr/local/pgsql/bin/postmaster: file exists
#
# Ref https://github.com/GoogleContainerTools/kaniko/issues/1406
find /kaniko -maxdepth 1 -mindepth 1 -type d -regex "/kaniko/[0-9]*" -exec rm -rv {} \;
/kaniko/executor --reproducible --snapshot-mode=redo --skip-unused-stages --cache=true \
--cache-repo 369495373322.dkr.ecr.eu-central-1.amazonaws.com/cache \
--context . \
--build-arg GIT_VERSION=${{ github.event.pull_request.head.sha || github.sha }} \
--build-arg PG_VERSION=${{ matrix.version }} \
--build-arg BUILD_TAG=${{needs.tag.outputs.build-tag}} \
--build-arg REPOSITORY=369495373322.dkr.ecr.eu-central-1.amazonaws.com \
--dockerfile Dockerfile.compute-node \
--destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/extensions-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
--destination neondatabase/extensions-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
--cleanup \
--target postgres-extensions
# Cleanup script fails otherwise - rm: cannot remove '/nvme/actions-runner/_work/_temp/_github_home/.ecr': Permission denied
- name: Cleanup ECR folder
run: rm -rf ~/.ecr
@@ -814,7 +752,7 @@ jobs:
run:
shell: sh -eu {0}
env:
VM_BUILDER_VERSION: v0.15.4
VM_BUILDER_VERSION: v0.16.3
steps:
- name: Checkout
@@ -835,7 +773,11 @@ jobs:
- name: Build vm image
run: |
./vm-builder -enable-file-cache -src=369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} -dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
./vm-builder \
-enable-file-cache \
-enable-monitor \
-src=369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
-dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
- name: Pushing vm-compute-node image
run: |
@@ -963,57 +905,50 @@ jobs:
- name: Cleanup ECR folder
run: rm -rf ~/.ecr
upload-postgres-extensions-to-s3:
if: |
(github.ref_name == 'main' || github.ref_name == 'release') &&
github.event_name != 'workflow_dispatch'
runs-on: ${{ github.ref_name == 'release' && fromJSON('["self-hosted", "prod", "x64"]') || fromJSON('["self-hosted", "gen3", "small"]') }}
needs: [ tag, promote-images ]
strategy:
fail-fast: false
matrix:
version: [ v14, v15 ]
env:
EXTENSIONS_IMAGE: ${{ github.ref_name == 'release' && '093970136003' || '369495373322'}}.dkr.ecr.eu-central-1.amazonaws.com/extensions-${{ matrix.version }}:${{ needs.tag.outputs.build-tag }}
AWS_ACCESS_KEY_ID: ${{ github.ref_name == 'release' && secrets.AWS_ACCESS_KEY_PROD || secrets.AWS_ACCESS_KEY_DEV }}
AWS_SECRET_ACCESS_KEY: ${{ github.ref_name == 'release' && secrets.AWS_SECRET_KEY_PROD || secrets.AWS_SECRET_KEY_DEV }}
S3_BUCKETS: ${{ github.ref_name == 'release' && vars.S3_EXTENSIONS_BUCKETS_PROD || vars.S3_EXTENSIONS_BUCKETS_DEV }}
build-private-extensions:
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/base:pinned
options: --init
needs: [ promote-images, tag ]
steps:
- name: Pull postgres-extensions image
- name: Set PR's status to pending and request a remote CI test
run: |
docker pull ${EXTENSIONS_IMAGE}
COMMIT_SHA=${{ github.event.pull_request.head.sha }}
COMMIT_SHA=${COMMIT_SHA:-${{ github.sha }}}
REMOTE_REPO="${{ github.repository_owner }}/build-custom-extensions"
- name: Create postgres-extensions container
id: create-container
run: |
EID=$(docker create ${EXTENSIONS_IMAGE} true)
echo "EID=${EID}" >> $GITHUB_OUTPUT
curl -f -X POST \
https://api.github.com/repos/${{ github.repository }}/statuses/$COMMIT_SHA \
-H "Accept: application/vnd.github.v3+json" \
--user "${{ secrets.CI_ACCESS_TOKEN }}" \
--data \
"{
\"state\": \"pending\",
\"context\": \"build-and-upload-extensions\",
\"description\": \"[$REMOTE_REPO] Remote CI job is about to start\"
}"
- name: Extract postgres-extensions from container
run: |
rm -rf ./extensions-to-upload # Just in case
mkdir -p extensions-to-upload
docker cp ${{ steps.create-container.outputs.EID }}:/extensions/ ./extensions-to-upload/
docker cp ${{ steps.create-container.outputs.EID }}:/ext_index.json ./extensions-to-upload/
- name: Upload postgres-extensions to S3
run: |
for BUCKET in $(echo ${S3_BUCKETS:-[]} | jq --raw-output '.[]'); do
aws s3 cp --recursive --only-show-errors ./extensions-to-upload s3://${BUCKET}/${{ needs.tag.outputs.build-tag }}/${{ matrix.version }}
done
- name: Cleanup
if: ${{ always() && steps.create-container.outputs.EID }}
run: |
docker rm ${{ steps.create-container.outputs.EID }} || true
curl -f -X POST \
https://api.github.com/repos/$REMOTE_REPO/actions/workflows/build_and_upload_extensions.yml/dispatches \
-H "Accept: application/vnd.github.v3+json" \
--user "${{ secrets.CI_ACCESS_TOKEN }}" \
--data \
"{
\"ref\": \"main\",
\"inputs\": {
\"ci_job_name\": \"build-and-upload-extensions\",
\"commit_hash\": \"$COMMIT_SHA\",
\"remote_repo\": \"${{ github.repository }}\",
\"compute_image_tag\": \"${{ needs.tag.outputs.build-tag }}\",
\"remote_branch_name\": \"${{ github.ref_name }}\"
}
}"
deploy:
runs-on: [ self-hosted, gen3, small ]
container: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/ansible:latest
needs: [ upload-postgres-extensions-to-s3, promote-images, tag, regress-tests ]
needs: [ promote-images, tag, regress-tests ]
if: ( github.ref_name == 'main' || github.ref_name == 'release' ) && github.event_name != 'workflow_dispatch'
steps:
- name: Fix git ownership

80
Cargo.lock generated
View File

@@ -639,6 +639,12 @@ dependencies = [
"vsimd",
]
[[package]]
name = "base64ct"
version = "1.6.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8c3c1a368f70d6cf7302d78f8f7093da241fb8e8807c05cc9e51a125895a6d5b"
[[package]]
name = "bincode"
version = "1.3.3"
@@ -886,6 +892,8 @@ version = "0.1.0"
dependencies = [
"anyhow",
"chrono",
"regex",
"remote_storage",
"serde",
"serde_json",
"serde_with",
@@ -1010,9 +1018,9 @@ checksum = "e496a50fda8aacccc86d7529e2c1e0892dbd0f898a6b5645b5561b89c3210efa"
[[package]]
name = "cpufeatures"
version = "0.2.7"
version = "0.2.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3e4c1eaa2012c47becbbad2ab175484c2a84d1185b566fb2cc5b8707343dfe58"
checksum = "a17b76ff3a4162b0b27f354a0c87015ddad39d35f9c0c36607a3bdd175dde1f1"
dependencies = [
"libc",
]
@@ -1192,15 +1200,15 @@ dependencies = [
[[package]]
name = "dashmap"
version = "5.4.0"
version = "5.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "907076dfda823b0b36d2a1bb5f90c96660a5bbcd7729e10727f07858f22c4edc"
checksum = "6943ae99c34386c84a470c499d3414f66502a41340aa895406e0d2e4a207b91d"
dependencies = [
"cfg-if",
"hashbrown 0.12.3",
"hashbrown 0.14.0",
"lock_api",
"once_cell",
"parking_lot_core 0.9.7",
"parking_lot_core 0.9.8",
]
[[package]]
@@ -1649,6 +1657,12 @@ dependencies = [
"ahash",
]
[[package]]
name = "hashbrown"
version = "0.14.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2c6201b9ff9fd90a5a3bac2e56a830d0caa509576f0e503818ee82c181b3437a"
[[package]]
name = "hashlink"
version = "0.8.2"
@@ -2073,9 +2087,9 @@ checksum = "ef53942eb7bf7ff43a617b3e2c1c4a5ecf5944a7c1bc12d7ee39bbb15e5c1519"
[[package]]
name = "lock_api"
version = "0.4.9"
version = "0.4.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "435011366fe56583b16cf956f9df0095b405b82d76425bc8981c0e22e60ec4df"
checksum = "c1cc9717a20b1bb222f333e6a92fd32f7d8a18ddc5a3191a11af45dcbf4dcd16"
dependencies = [
"autocfg",
"scopeguard",
@@ -2339,9 +2353,9 @@ dependencies = [
[[package]]
name = "once_cell"
version = "1.17.1"
version = "1.18.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b7e5500299e16ebb147ae15a00a942af264cf3688f47923b8fc2cd5858f23ad3"
checksum = "dd8b5dd2ae5ed71462c540258bedcb51965123ad7e7ccf4b9a8cafaa4a63576d"
[[package]]
name = "oorandom"
@@ -2640,7 +2654,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3742b2c103b9f06bc9fff0a37ff4912935851bee6d36f3c02bcc755bcfec228f"
dependencies = [
"lock_api",
"parking_lot_core 0.9.7",
"parking_lot_core 0.9.8",
]
[[package]]
@@ -2659,15 +2673,26 @@ dependencies = [
[[package]]
name = "parking_lot_core"
version = "0.9.7"
version = "0.9.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9069cbb9f99e3a5083476ccb29ceb1de18b9118cafa53e90c9551235de2b9521"
checksum = "93f00c865fe7cabf650081affecd3871070f26767e7b2070a3ffae14c654b447"
dependencies = [
"cfg-if",
"libc",
"redox_syscall 0.2.16",
"redox_syscall 0.3.5",
"smallvec",
"windows-sys 0.45.0",
"windows-targets 0.48.0",
]
[[package]]
name = "password-hash"
version = "0.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "346f04948ba92c43e8469c1ee6736c7563d71012b17d40745260fe106aac2166"
dependencies = [
"base64ct",
"rand_core",
"subtle",
]
[[package]]
@@ -2678,6 +2703,8 @@ checksum = "f0ca0b5a68607598bf3bad68f32227a8164f6254833f84eafaac409cd6746c31"
dependencies = [
"digest",
"hmac",
"password-hash",
"sha2",
]
[[package]]
@@ -3056,6 +3083,7 @@ dependencies = [
"chrono",
"clap",
"consumption_metrics",
"dashmap",
"futures",
"git-version",
"hashbrown 0.13.2",
@@ -3541,9 +3569,9 @@ dependencies = [
[[package]]
name = "rustls-webpki"
version = "0.100.1"
version = "0.100.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d6207cd5ed3d8dca7816f8f3725513a34609c0c765bf652b8c3cb4cfd87db46b"
checksum = "e98ff011474fa39949b7e5c0428f9b4937eda7da7848bbb947786b7be0b27dab"
dependencies = [
"ring",
"untrusted",
@@ -4112,13 +4140,13 @@ dependencies = [
[[package]]
name = "tar"
version = "0.4.38"
version = "0.4.40"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4b55807c0344e1e6c04d7c965f5289c39a8d94ae23ed5c0b57aabac549f871c6"
checksum = "b16afcea1f22891c49a00c751c7b63b2233284064f11a200fc624137c51e2ddb"
dependencies = [
"filetime",
"libc",
"xattr 0.2.3",
"xattr",
]
[[package]]
@@ -4409,7 +4437,7 @@ dependencies = [
"redox_syscall 0.3.5",
"tokio",
"tokio-stream",
"xattr 1.0.0",
"xattr",
]
[[package]]
@@ -4887,6 +4915,7 @@ dependencies = [
"thiserror",
"tokio",
"tokio-stream",
"tokio-util",
"tracing",
"tracing-error",
"tracing-subscriber",
@@ -5376,15 +5405,6 @@ dependencies = [
"time",
]
[[package]]
name = "xattr"
version = "0.2.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6d1526bbe5aaeb5eb06885f4d987bcdfa5e23187055de9b83fe00156a821fabc"
dependencies = [
"libc",
]
[[package]]
name = "xattr"
version = "1.0.0"

View File

@@ -54,6 +54,7 @@ comfy-table = "6.1"
const_format = "0.2"
crc32c = "0.6"
crossbeam-utils = "0.8.5"
dashmap = "5.5.0"
either = "1.8"
enum-map = "2.4.2"
enumset = "1.0.12"
@@ -88,7 +89,7 @@ opentelemetry = "0.19.0"
opentelemetry-otlp = { version = "0.12.0", default_features=false, features = ["http-proto", "trace", "http", "reqwest-client"] }
opentelemetry-semantic-conventions = "0.11.0"
parking_lot = "0.12"
pbkdf2 = "0.12.1"
pbkdf2 = { version = "0.12.1", features = ["simple", "std"] }
pin-project-lite = "0.2"
prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
prost = "0.11"

View File

@@ -51,6 +51,7 @@ RUN set -e \
--bin safekeeper \
--bin storage_broker \
--bin proxy \
--bin neon_local \
--locked --release \
&& cachepot -s
@@ -76,6 +77,7 @@ COPY --from=build --chown=neon:neon /home/nonroot/target/release/pagectl
COPY --from=build --chown=neon:neon /home/nonroot/target/release/safekeeper /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_broker /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/proxy /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/neon_local /usr/local/bin
COPY --from=pg-build /home/nonroot/pg_install/v14 /usr/local/v14/
COPY --from=pg-build /home/nonroot/pg_install/v15 /usr/local/v15/

View File

@@ -764,29 +764,6 @@ RUN rm -r /usr/local/pgsql/include
# if they were to be used by other libraries.
RUN rm /usr/local/pgsql/lib/lib*.a
#########################################################################################
#
# Extenstion only
#
#########################################################################################
FROM python:3.9-slim-bullseye AS generate-ext-index
ARG PG_VERSION
ARG BUILD_TAG
RUN apt update && apt install -y zstd
# copy the control files here
COPY --from=kq-imcx-pg-build /extensions/ /extensions/
COPY --from=pg-anon-pg-build /extensions/ /extensions/
COPY --from=postgis-build /extensions/ /extensions/
COPY scripts/combine_control_files.py ./combine_control_files.py
RUN python3 ./combine_control_files.py ${PG_VERSION} ${BUILD_TAG} --public_extensions="anon,postgis"
FROM scratch AS postgres-extensions
# After the transition this layer will include all extensitons.
# As for now, it's only a couple for testing purposses
COPY --from=generate-ext-index /extensions/*.tar.zst /extensions/
COPY --from=generate-ext-index /ext_index.json /ext_index.json
#########################################################################################
#
# Final layer

View File

@@ -38,7 +38,7 @@ use std::fs::File;
use std::panic;
use std::path::Path;
use std::process::exit;
use std::sync::{mpsc, Arc, Condvar, Mutex, OnceLock, RwLock};
use std::sync::{mpsc, Arc, Condvar, Mutex, RwLock};
use std::{thread, time::Duration};
use anyhow::{Context, Result};
@@ -147,6 +147,7 @@ fn main() -> Result<()> {
match spec_json {
// First, try to get cluster spec from the cli argument
Some(json) => {
info!("got spec from cli argument {}", json);
spec = Some(serde_json::from_str(json)?);
}
None => {
@@ -182,6 +183,7 @@ fn main() -> Result<()> {
if let Some(spec) = spec {
let pspec = ParsedSpec::try_from(spec).map_err(|msg| anyhow::anyhow!(msg))?;
info!("new pspec.spec: {:?}", pspec.spec);
new_state.pspec = Some(pspec);
spec_set = true;
} else {
@@ -196,9 +198,7 @@ fn main() -> Result<()> {
state: Mutex::new(new_state),
state_changed: Condvar::new(),
ext_remote_storage,
ext_remote_paths: OnceLock::new(),
ext_download_progress: RwLock::new(HashMap::new()),
library_index: OnceLock::new(),
build_tag,
};
let compute = Arc::new(compute_node);

View File

@@ -5,7 +5,7 @@ use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::{Command, Stdio};
use std::str::FromStr;
use std::sync::{Condvar, Mutex, OnceLock, RwLock};
use std::sync::{Condvar, Mutex, RwLock};
use std::time::Instant;
use anyhow::{Context, Result};
@@ -14,7 +14,6 @@ use futures::future::join_all;
use futures::stream::FuturesUnordered;
use futures::StreamExt;
use postgres::{Client, NoTls};
use regex::Regex;
use tokio;
use tokio_postgres;
use tracing::{error, info, instrument, warn};
@@ -60,10 +59,6 @@ pub struct ComputeNode {
pub state_changed: Condvar,
/// the S3 bucket that we search for extensions in
pub ext_remote_storage: Option<GenericRemoteStorage>,
// (key: extension name, value: path to extension archive in remote storage)
pub ext_remote_paths: OnceLock<HashMap<String, RemotePath>>,
// (key: library name, value: name of extension containing this library)
pub library_index: OnceLock<HashMap<String, String>>,
// key: ext_archive_name, value: started download time, download_completed?
pub ext_download_progress: RwLock<HashMap<String, (DateTime<Utc>, bool)>>,
pub build_tag: String,
@@ -75,7 +70,6 @@ pub struct RemoteExtensionMetrics {
num_ext_downloaded: u64,
largest_ext_size: u64,
total_ext_download_size: u64,
prep_extensions_ms: u64,
}
#[derive(Clone, Debug)]
@@ -745,11 +739,19 @@ impl ComputeNode {
pspec.timeline_id,
);
info!(
"start_compute spec.remote_extensions {:?}",
pspec.spec.remote_extensions
);
// This part is sync, because we need to download
// remote shared_preload_libraries before postgres start (if any)
{
if let Some(remote_extensions) = &pspec.spec.remote_extensions {
// First, create control files for all availale extensions
extension_server::create_control_files(remote_extensions, &self.pgbin);
let library_load_start_time = Utc::now();
let remote_ext_metrics = self.prepare_preload_libraries(&compute_state)?;
let remote_ext_metrics = self.prepare_preload_libraries(&pspec.spec)?;
let library_load_time = Utc::now()
.signed_duration_since(library_load_start_time)
@@ -761,7 +763,6 @@ impl ComputeNode {
state.metrics.num_ext_downloaded = remote_ext_metrics.num_ext_downloaded;
state.metrics.largest_ext_size = remote_ext_metrics.largest_ext_size;
state.metrics.total_ext_download_size = remote_ext_metrics.total_ext_download_size;
state.metrics.prep_extensions_ms = remote_ext_metrics.prep_extensions_ms;
info!(
"Loading shared_preload_libraries took {:?}ms",
library_load_time
@@ -918,38 +919,11 @@ LIMIT 100",
}
}
// If remote extension storage is configured,
// download extension control files
pub async fn prepare_external_extensions(&self, compute_state: &ComputeState) -> Result<()> {
if let Some(ref ext_remote_storage) = self.ext_remote_storage {
let pspec = compute_state.pspec.as_ref().expect("spec must be set");
let spec = &pspec.spec;
let custom_ext = spec.custom_extensions.clone().unwrap_or(Vec::new());
info!("custom extensions: {:?}", &custom_ext);
let (ext_remote_paths, library_index) = extension_server::get_available_extensions(
ext_remote_storage,
&self.pgbin,
&self.pgversion,
&custom_ext,
&self.build_tag,
)
.await?;
self.ext_remote_paths
.set(ext_remote_paths)
.expect("this is the only time we set ext_remote_paths");
self.library_index
.set(library_index)
.expect("this is the only time we set library_index");
}
Ok(())
}
// download an archive, unzip and place files in correct locations
pub async fn download_extension(
&self,
ext_name: &str,
is_library: bool,
real_ext_name: String,
ext_path: RemotePath,
) -> Result<u64, DownloadError> {
let remote_storage = self
.ext_remote_storage
@@ -958,35 +932,6 @@ LIMIT 100",
"Remote extensions storage is not configured",
)))?;
let mut real_ext_name = ext_name;
if is_library {
// sometimes library names might have a suffix like
// library.so or library.so.3. We strip this off
// because library_index is based on the name without the file extension
let strip_lib_suffix = Regex::new(r"\.so.*").unwrap();
let lib_raw_name = strip_lib_suffix.replace(real_ext_name, "").to_string();
real_ext_name = self
.library_index
.get()
.expect("must have already downloaded the library_index")
.get(&lib_raw_name)
.ok_or(DownloadError::BadInput(anyhow::anyhow!(
"library {} is not found",
lib_raw_name
)))?;
}
let ext_path = &self
.ext_remote_paths
.get()
.expect("error accessing ext_remote_paths")
.get(real_ext_name)
.ok_or(DownloadError::BadInput(anyhow::anyhow!(
"real_ext_name {} is not found",
real_ext_name
)))?;
let ext_archive_name = ext_path.object_name().expect("bad path");
let mut first_try = false;
@@ -1039,8 +984,8 @@ LIMIT 100",
info!("downloading new extension {ext_archive_name}");
let download_size = extension_server::download_extension(
real_ext_name,
ext_path,
&real_ext_name,
&ext_path,
remote_storage,
&self.pgbin,
)
@@ -1058,18 +1003,19 @@ LIMIT 100",
#[tokio::main]
pub async fn prepare_preload_libraries(
&self,
compute_state: &ComputeState,
spec: &ComputeSpec,
) -> Result<RemoteExtensionMetrics> {
if self.ext_remote_storage.is_none() {
return Ok(RemoteExtensionMetrics {
num_ext_downloaded: 0,
largest_ext_size: 0,
total_ext_download_size: 0,
prep_extensions_ms: 0,
});
}
let pspec = compute_state.pspec.as_ref().expect("spec must be set");
let spec = &pspec.spec;
let remote_extensions = spec
.remote_extensions
.as_ref()
.ok_or(anyhow::anyhow!("Remote extensions are not configured",))?;
info!("parse shared_preload_libraries from spec.cluster.settings");
let mut libs_vec = Vec::new();
@@ -1081,6 +1027,7 @@ LIMIT 100",
.collect();
}
info!("parse shared_preload_libraries from provided postgresql.conf");
// that is used in neon_local and python tests
if let Some(conf) = &spec.cluster.postgresql_conf {
let conf_lines = conf.split('\n').collect::<Vec<&str>>();
@@ -1101,30 +1048,16 @@ LIMIT 100",
libs_vec.extend(preload_libs_vec);
}
info!("Download ext_index.json, find the extension paths");
let prep_ext_start_time = Utc::now();
self.prepare_external_extensions(compute_state).await?;
let prep_ext_time_delta = Utc::now()
.signed_duration_since(prep_ext_start_time)
.to_std()
.unwrap()
.as_millis() as u64;
info!("Prepare extensions took {prep_ext_time_delta}ms");
// Don't try to download libraries that are not in the index.
// Assume that they are already present locally.
libs_vec.retain(|lib| {
self.library_index
.get()
.expect("error accessing ext_remote_paths")
.contains_key(lib)
});
libs_vec.retain(|lib| remote_extensions.library_index.contains_key(lib));
info!("Downloading to shared preload libraries: {:?}", &libs_vec);
let mut download_tasks = Vec::new();
for library in &libs_vec {
download_tasks.push(self.download_extension(library, true));
let (ext_name, ext_path) = remote_extensions.get_ext(library, true)?;
download_tasks.push(self.download_extension(ext_name, ext_path));
}
let results = join_all(download_tasks).await;
@@ -1132,7 +1065,6 @@ LIMIT 100",
num_ext_downloaded: 0,
largest_ext_size: 0,
total_ext_download_size: 0,
prep_extensions_ms: prep_ext_time_delta,
};
for result in results {
let download_size = match result {

View File

@@ -73,10 +73,9 @@ More specifically, here is an example ext_index.json
*/
use anyhow::Context;
use anyhow::{self, Result};
use futures::future::join_all;
use compute_api::spec::RemoteExtSpec;
use remote_storage::*;
use serde_json;
use std::collections::HashMap;
use std::io::Read;
use std::num::{NonZeroU32, NonZeroUsize};
use std::path::Path;
@@ -117,81 +116,6 @@ pub fn get_pg_version(pgbin: &str) -> String {
panic!("Unsuported postgres version {human_version}");
}
// download control files for enabled_extensions
// return Hashmaps converting library names to extension names (library_index)
// and specifying the remote path to the archive for each extension name
pub async fn get_available_extensions(
remote_storage: &GenericRemoteStorage,
pgbin: &str,
pg_version: &str,
custom_extensions: &[String],
build_tag: &str,
) -> Result<(HashMap<String, RemotePath>, HashMap<String, String>)> {
let local_sharedir = Path::new(&get_pg_config("--sharedir", pgbin)).join("extension");
let index_path = format!("{build_tag}/{pg_version}/ext_index.json");
let index_path = RemotePath::new(Path::new(&index_path)).context("error forming path")?;
info!("download ext_index.json from: {:?}", &index_path);
let mut download = remote_storage.download(&index_path).await?;
let mut ext_idx_buffer = Vec::new();
download
.download_stream
.read_to_end(&mut ext_idx_buffer)
.await?;
info!("ext_index downloaded");
#[derive(Debug, serde::Deserialize)]
struct Index {
public_extensions: Vec<String>,
library_index: HashMap<String, String>,
extension_data: HashMap<String, ExtensionData>,
}
#[derive(Debug, serde::Deserialize)]
struct ExtensionData {
control_data: HashMap<String, String>,
archive_path: String,
}
let ext_index_full = serde_json::from_slice::<Index>(&ext_idx_buffer)?;
let mut enabled_extensions = ext_index_full.public_extensions;
enabled_extensions.extend_from_slice(custom_extensions);
let mut library_index = ext_index_full.library_index;
let all_extension_data = ext_index_full.extension_data;
info!("library_index: {:?}", library_index);
info!("enabled_extensions: {:?}", enabled_extensions);
let mut ext_remote_paths = HashMap::new();
let mut file_create_tasks = Vec::new();
for extension in enabled_extensions {
let ext_data = &all_extension_data[&extension];
for (control_file, control_contents) in &ext_data.control_data {
let extension_name = control_file
.strip_suffix(".control")
.expect("control files must end in .control");
let control_path = local_sharedir.join(control_file);
if !control_path.exists() {
ext_remote_paths.insert(
extension_name.to_string(),
RemotePath::from_string(&ext_data.archive_path)?,
);
info!("writing file {:?}{:?}", control_path, control_contents);
file_create_tasks.push(tokio::fs::write(control_path, control_contents));
} else {
warn!("control file {:?} exists both locally and remotely. ignoring the remote version.", control_file);
// also delete this from library index
library_index.retain(|_, value| value != extension_name);
}
}
}
let results = join_all(file_create_tasks).await;
for result in results {
result?;
}
info!("ext_remote_paths {:?}", ext_remote_paths);
Ok((ext_remote_paths, library_index))
}
// download the archive for a given extension,
// unzip it, and place files in the appropriate locations (share/lib)
pub async fn download_extension(
@@ -253,6 +177,22 @@ pub async fn download_extension(
Ok(download_size)
}
// Create extension control files from spec
pub fn create_control_files(remote_extensions: &RemoteExtSpec, pgbin: &str) {
let local_sharedir = Path::new(&get_pg_config("--sharedir", pgbin)).join("extension");
for ext_data in remote_extensions.extension_data.values() {
for (control_name, control_content) in &ext_data.control_data {
let control_path = local_sharedir.join(control_name);
if !control_path.exists() {
info!("writing file {:?}{:?}", control_path, control_content);
std::fs::write(control_path, control_content).unwrap();
} else {
warn!("control file {:?} exists both locally and remotely. ignoring the remote version.", control_path);
}
}
}
}
// This function initializes the necessary structs to use remote storage
pub fn init_remote_storage(remote_ext_config: &str) -> anyhow::Result<GenericRemoteStorage> {
#[derive(Debug, serde::Deserialize)]

View File

@@ -13,7 +13,7 @@ use hyper::{Body, Method, Request, Response, Server, StatusCode};
use num_cpus;
use serde_json;
use tokio::task;
use tracing::{error, info};
use tracing::{error, info, warn};
use tracing_utils::http::OtelName;
fn status_response_from_state(state: &ComputeState) -> ComputeStatusResponse {
@@ -126,6 +126,15 @@ async fn routes(req: Request<Body>, compute: &Arc<ComputeNode>) -> Response<Body
info!("serving {:?} POST request", route);
info!("req.uri {:?}", req.uri());
// don't even try to download extensions
// if no remote storage is configured
if compute.ext_remote_storage.is_none() {
info!("no extensions remote storage configured");
let mut resp = Response::new(Body::from("no remote storage configured"));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
return resp;
}
let mut is_library = false;
if let Some(params) = req.uri().query() {
info!("serving {:?} POST request with params: {}", route, params);
@@ -137,24 +146,47 @@ async fn routes(req: Request<Body>, compute: &Arc<ComputeNode>) -> Response<Body
return resp;
}
}
let filename = route.split('/').last().unwrap().to_string();
info!("serving /extension_server POST request, filename: {filename:?} is_library: {is_library}");
// don't even try to download extensions
// if no remote storage is configured
if compute.ext_remote_storage.is_none() {
info!("no extensions remote storage configured");
let mut resp = Response::new(Body::from("no remote storage configured"));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
return resp;
}
// get ext_name and path from spec
// don't lock compute_state for too long
let ext = {
let compute_state = compute.state.lock().unwrap();
let pspec = compute_state.pspec.as_ref().expect("spec must be set");
let spec = &pspec.spec;
match compute.download_extension(&filename, is_library).await {
Ok(_) => Response::new(Body::from("OK")),
// debug only
info!("spec: {:?}", spec);
let remote_extensions = match spec.remote_extensions.as_ref() {
Some(r) => r,
None => {
info!("no remote extensions spec was provided");
let mut resp = Response::new(Body::from("no remote storage configured"));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
return resp;
}
};
remote_extensions.get_ext(&filename, is_library)
};
match ext {
Ok((ext_name, ext_path)) => {
match compute.download_extension(ext_name, ext_path).await {
Ok(_) => Response::new(Body::from("OK")),
Err(e) => {
error!("extension download failed: {}", e);
let mut resp = Response::new(Body::from(e.to_string()));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
resp
}
}
}
Err(e) => {
error!("extension download failed: {}", e);
let mut resp = Response::new(Body::from(e.to_string()));
warn!("extension download failed to find extension: {}", e);
let mut resp = Response::new(Body::from("failed to find file"));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
resp
}

View File

@@ -825,6 +825,16 @@ fn get_safekeeper(env: &local_env::LocalEnv, id: NodeId) -> Result<SafekeeperNod
}
}
// Get list of options to append to safekeeper command invocation.
fn safekeeper_extra_opts(init_match: &ArgMatches) -> Vec<String> {
init_match
.get_many::<String>("safekeeper-extra-opt")
.into_iter()
.flatten()
.map(|s| s.to_owned())
.collect()
}
fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
let (sub_name, sub_args) = match sub_match.subcommand() {
Some(safekeeper_command_data) => safekeeper_command_data,
@@ -841,7 +851,9 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
match sub_name {
"start" => {
if let Err(e) = safekeeper.start() {
let extra_opts = safekeeper_extra_opts(sub_args);
if let Err(e) = safekeeper.start(extra_opts) {
eprintln!("safekeeper start failed: {}", e);
exit(1);
}
@@ -866,7 +878,8 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
exit(1);
}
if let Err(e) = safekeeper.start() {
let extra_opts = safekeeper_extra_opts(sub_args);
if let Err(e) = safekeeper.start(extra_opts) {
eprintln!("safekeeper start failed: {}", e);
exit(1);
}
@@ -893,7 +906,7 @@ fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow
for node in env.safekeepers.iter() {
let safekeeper = SafekeeperNode::from_env(env, node);
if let Err(e) = safekeeper.start() {
if let Err(e) = safekeeper.start(vec![]) {
eprintln!("safekeeper {} start failed: {:#}", safekeeper.id, e);
try_stop_all(env, false);
exit(1);
@@ -956,6 +969,14 @@ fn cli() -> Command {
let safekeeper_id_arg = Arg::new("id").help("safekeeper id").required(false);
let safekeeper_extra_opt_arg = Arg::new("safekeeper-extra-opt")
.short('e')
.long("safekeeper-extra-opt")
.num_args(1)
.action(ArgAction::Append)
.help("Additional safekeeper invocation options, e.g. -e=--http-auth-public-key-path=foo")
.required(false);
let tenant_id_arg = Arg::new("tenant-id")
.long("tenant-id")
.help("Tenant id. Represented as a hexadecimal string 32 symbols length")
@@ -1124,6 +1145,7 @@ fn cli() -> Command {
.subcommand(Command::new("start")
.about("Start local safekeeper")
.arg(safekeeper_id_arg.clone())
.arg(safekeeper_extra_opt_arg.clone())
)
.subcommand(Command::new("stop")
.about("Stop local safekeeper")
@@ -1134,6 +1156,7 @@ fn cli() -> Command {
.about("Restart local safekeeper")
.arg(safekeeper_id_arg)
.arg(stop_mode_arg.clone())
.arg(safekeeper_extra_opt_arg)
)
)
.subcommand(

View File

@@ -493,7 +493,7 @@ impl Endpoint {
pageserver_connstring: Some(pageserver_connstring),
safekeeper_connstrings,
storage_auth_token: auth_token.clone(),
custom_extensions: Some(vec![]),
remote_extensions: None,
};
let spec_path = self.endpoint_path().join("spec.json");
std::fs::write(spec_path, serde_json::to_string_pretty(&spec)?)?;

View File

@@ -101,7 +101,7 @@ impl SafekeeperNode {
self.datadir_path().join("safekeeper.pid")
}
pub fn start(&self) -> anyhow::Result<Child> {
pub fn start(&self, extra_opts: Vec<String>) -> anyhow::Result<Child> {
print!(
"Starting safekeeper at '{}' in '{}'",
self.pg_connection_config.raw_address(),
@@ -161,17 +161,28 @@ impl SafekeeperNode {
let key_path = self.env.base_data_dir.join("auth_public_key.pem");
if self.conf.auth_enabled {
let key_path_string = key_path
.to_str()
.with_context(|| {
format!("Key path {key_path:?} cannot be represented as a unicode string")
})?
.to_owned();
args.extend([
"--auth-validation-public-key-path".to_owned(),
key_path
.to_str()
.with_context(|| {
format!("Key path {key_path:?} cannot be represented as a unicode string")
})?
.to_owned(),
"--pg-auth-public-key-path".to_owned(),
key_path_string.clone(),
]);
args.extend([
"--pg-tenant-only-auth-public-key-path".to_owned(),
key_path_string.clone(),
]);
args.extend([
"--http-auth-public-key-path".to_owned(),
key_path_string.clone(),
]);
}
args.extend(extra_opts);
background_process::start_process(
&format!("safekeeper-{id}"),
&datadir,

View File

@@ -4,7 +4,12 @@
# to your expectations and requirements.
# Root options
targets = []
targets = [
{ triple = "x86_64-unknown-linux-gnu" },
{ triple = "aarch64-unknown-linux-gnu" },
{ triple = "aarch64-apple-darwin" },
{ triple = "x86_64-apple-darwin" },
]
all-features = false
no-default-features = false
feature-depth = 1
@@ -18,7 +23,7 @@ vulnerability = "deny"
unmaintained = "warn"
yanked = "warn"
notice = "warn"
ignore = []
ignore = ["RUSTSEC-2023-0052"]
# This section is considered when running `cargo deny check licenses`
# More documentation for the licenses section can be found here:

View File

@@ -0,0 +1,316 @@
This is a copy from the [original Notion page](https://www.notion.so/neondatabase/Proposal-Pageserver-MVCC-S3-Storage-8a424c0c7ec5459e89d3e3f00e87657c?pvs=4), taken on 2023-08-16.
This is for archival mostly.
The RFC that we're likely to go with is https://github.com/neondatabase/neon/pull/4919.
---
# Proposal: Pageserver MVCC S3 Storage
tl;dr: this proposal enables Control Plane to attach a tenant to a new pageserver without being 100% certain that it has been detached from the old pageserver. This enables us to automate failover if a pageserver dies (no human in the loop).
# Problem Statement
The current Neon architecture requires the Control Plane to guarantee that a tenant is only attached to one pageserver at a time. If a tenant is attached to multiple pageservers simultaneously, the pageservers will overwrite each others changes in S3 for that tenant, resulting in data loss for that tenant.
The above imposes limitations on tenant relocation and future designs for high availability. For instance, Control Plane cannot relocate a tenant to another pageserver before it is 100% certain that the tenant is detached from the source pageserver. If the source pageserver is unresponsive, the tenant detach procedure cannot proceed, and Control Plane has no choice but to wait for either the source to become responsive again, or rely on a node failure detection mechanism to detect that the source pageserver is dead, and give permission to skip the detachment step. Either way, the tenant is unavailable for an extended period, and we have no means to improve it in the current architecture.
Note that there is no 100% correct node failure detection mechanism, and even techniques to accelerate failure detection, such as ********************************shoot-the-other-node-in-the-head,******************************** have their limits. So, we currently rely on humans as node failure detectors: they get alerted via PagerDuty, assess the situation under high stress, and make the decision. If they make the wrong call, or the apparent dead pageserver somehow resurrects later, well have data loss.
Also, by relying on humans, were [incurring needless unscalable toil](https://sre.google/sre-book/eliminating-toil/): as Neon grows, pageserver failures will become more and more frequent because our fleet grows. Each instance will need quick response time to minimize downtime for the affected tenants, which implies higher toil, higher resulting attrition, and/or higher personnel cost.
Lastly, there are foreseeable needs by operation and product such as zero-downtime relocation and automatic failover/HA. For such features, the ability to have a tenant purposefully or accidentally attached to more than one pageserver will greatly reduce risk of data loss, and improve availability.
# High-Level Idea
The core idea is to evolve the per-Tenant S3 state to an MVCC-like scheme, allowing multiple pageservers to operate on the same tenant S3 state without interference. To make changes to S3, pageservers acquire long-running transactions from Control Plane. After opening a transaction, Pageservers make PUTs directly against S3, but they keys include the transaction ID, so overwrites never happen. Periodically, pageservers talk back to Control Plane to commit their transaction. This is where Control Plane enforces strict linearizability, favoring availability over work-conservation: commit is only granted if no transaction started after the one thats requesting commit. Garbage collection is done through deadlists, and its simplified tremendously by above commit grant/reject policy.
Minimal changes are required for safekeepers to allow WAL for a single timeline be consumed by more than one pageserver without premature truncation.
**Above scheme makes it safe to attach tenants without a 100% correct node failure detection mechanism. Further, it makes it safe to interleave tenant-attachment to pageservers, unlocking new capabilities for (internal) product features:**
- **Fast, Zero-Toil Failover on Network Partitions or Instance Failure**: if a pageserver is not reachable (network partition, hardware failure, overload) we want to spread its attached tenants to new pageservers to restore availability, within the range of *seconds*. We cannot afford gracious timeouts to maximize the probability that the unreachable pageserver has ceased writing to S3. This proposal enables us to attach the tenants to the replacement pageservers, and redirect their computes, without having to wait for confirmation that the unreachable pageserver has ceased writing to S3.
- **************************************Zero-Downtime Relocation:************************************** we want to be able to relocate tenants to different pageservers with minimized availability or a latency impact. This proposal enables us to attach the relocating Tenant to the destination Pageserver before detaching it from the source Pageserver. This can help minimize downtime because we can wait for the destination to catch up on WAL processing before redirecting Computes.
# Design
The core idea is to evolve the per-Tenant S3 state to a per-tenant MVCC-like scheme.
To make S3 changes for a given tenant, Pageserver requests a transaction ID from control plane for that tenant. Without a transaction ID, Pageserver does not write to S3.
Once Pageserver received a transaction ID it is allowed to produce new objects and overwrite objects created in this transaction. Pageserver is not allowed to delete any objects; instead, it marks the object as deleted by appending the key to the transactions deadlist for later deletion. Commits of transactions are serialized through Control Plane: when Pageserver wants to commit a transaction, it sends an RPC to Control Plane. Control Plane responds with a commit grant or commit reject message. Commit grant means that the transactions changes are now visible to subsequent transactions. Commit reject means that the transactions changes are not and never will be visible to another Pageserver instance, and the rejected Pageserver is to cease further activity on that tenant.
## ****************************************************Commit grant/reject policy****************************************************
For the purposes of Pageserver, we want **linearizability** of a tenants S3 state. Since our transactions are scoped per tenant, it is sufficient for linearizability to grant commit if and only if no other transaction has been started since the commit-requesting transaction started.
For example, consider the case of a single tenant, attached to Pageserver A. Pageserver A has an open transaction but becomes unresponsive. Control Plane decides to relocate the tenant to another Pageserver B. It need *not* wait for A to be 100%-certainly down before B can start uploading to S3 for that tenant. Instead, B can start a new transaction right away, make progress, and get commit grants; What about A? The transaction is RejectPending in Control Plane until A eventually becomes responsive again, tries to commit, gets a rejection, acknowledges it, and thus its transaction becomes RejectAcknowledge. If A is definitively dead, operator can also force-transition from state RejectPending to RejectAcknowledged. But critically, Control Plane doesnt have for As transaction to become RejectAcknowledge before attaching the tenant to B.
```mermaid
sequenceDiagram
participant CP
participant A
participant S3
participant B
CP -->> A: attach tenant
activate A
A -->> CP: start txn
CP -->> A: txn=23, last_committed_txn=22
Note over CP,A: network partition
CP --x A: heartbeat
CP --x A: heartbeat
Note over CP: relocate tenant to avoid downtime
CP -->> B: attach tenant
activate B
B -->> CP: start txn
Note over CP: mark A's txn 23 as RejectPending
CP -->> B: txn=24, last-committed txn is 22
B -->> S3: PUT X.layer.24<br>PUT index_part.json.24 referencing X.layer.24
B -->> CP: request commit
CP -->> B: granted
B -->> CP: start txn
CP -->> B: txn=25, last_committed_txn=22
A -->> S3: PUT Y.layer.23 <br> PUT index_part.json.23 referencing Y.layer.23
A --x CP: request commit
A --x CP: request commit
Note over CP,A: partition is over
A -->> CP: request commit
Note over CP: most recently started txn is 25, not 23, reject
CP -->> A: reject
A -->> CP: acknowledge reject
Note over CP: mark A's txn 23 as RejectAcknowledged
deactivate A
B -->> S3: PUT 000-FFF_X-Y.layer.**************25**************<br>...
deactivate B
```
If a Pageserver gets a rejection to a commit request, it acknowledges rejection and cedes further S3 uploads for the tenant, until it receives a `/detach` request for the tenant (control plane has most likely attached the tenant to another pageserver in the meantime).
In practice, Control Plane will probably extend the commit grant/reject schema above, taking into account the pageserver to which it last attached the tenant. In the above example, Control Plane could remember that the pageserver that is supposed to host the tenant is pageserver B, and reject start-txn and commit requests from pageserver A. It would also use such requests from A as a signal that A is reachable again, and retry the `/detach` .
<aside>
💡 A commit failure causes the tenant to become effectively `Broken`. Pageserver should persist this locally so it doesnt bother ControlPlane for a new txn when Pageserver is restarted.
</aside>
## ********************Visibility********************
We mentioned earlier that once a transaction commits, its changes are visible to subsequent transactions. But how does a given transaction know where to look for the data? There is no longer a single `index_part.json` per timeline, or a single `timelines/:timeline_id` prefix to look for; theyre all multi-versioned, suffixed by the txn number.
The solution is: at transaction start, Pageserver receives the last-committed transaction ID from Control Plane (`last_committed_txn` in the diagram). last_commited_txn is the upper bound for what is visible for the current transaction. Control Plane keeps track of each open transactions last_committed_txn for purposes of garbage collection (see later paragraph).
Equipped with last_committed_txn, Pageserver then discovers
- the current index part of a timeline at `tenants/:tenant_id/timelines/:timeline_id/index_part.json.$last_committed_txn`. The `index_part.json.$last_committed_txn` has the exact same contents as the current architectures index_part.json, i.e. full list of layers.
- the list of existent timelines as part of the `attach` RPC from CP;
There is no other S3 state per tenant, so, thats all the visibility required.
An alternative to receiving the list of existent timelines from CP is to introduce a proper **********SetOfTimelines********** object in S3, and multi-version it just like above. For example, we could have a `tenants/:tenant_id/timelines.json.$txn` file that references `index_part.json.$last_committed_txn` . It can be added later if more separation between CP and PS is desired.
So, the only MVCCed object types in this proposal are LayerFile and IndexPart (=individual timeline), but not the SetOfTimelines in a given tenant. Is this a problem? For example, the Pageservers garbage collection code needs to know the full set of timelines of a tenant. Otherwise itll make incorrect decisions. What if Pageserver A knows about timelines {R,S}, but another Pageserver B created an additional branch T, so, its set of timelines is {R,S,T}. Both pageservers will run GC code, and so, PS A may decide to delete a layer thats still needed for branch T. Not a problem with this propsoal, because the effect of GC (i.e., layer deletion) is properly MVCCed.
## Longevity Of Transactions & Availability
Pageserver depends on Control Plane to start a new transaction. If ControlPlane is down, no new transactions can be started.
Pageservers commit transactions based on a maximum amount of uncommitted changes that have accumulated in S3. A lower maximum increases dependence and load on ControlPlane which decreases availability. A higher maximum risks losing more work in the event of failover; the work will have to be re-done in a new transaction on the new node.
Pageservers are persist the open txn id in local storage, so that they can resume the transaction after restart, without dependence on Control Plane.
## **Operations**
********PUTs:********
- **layer files**
- current architecture: layer files are supposed to be write-once, but actually, there are edge-cases where we PUT the same layer file name twice; namely if we PUT the file to S3 but crash before uploading the index part that references it; then detach + attach, and re-run compaction, which is non-deterministic.
- this proposal: with transactions, we can now upload layers and index_part.json concurrently, just need to make sure layer file upload is done before we request txn commit.
- **index part** upload: `index_part.json.$txn` may be created and subsequently overwritten multiple times in a transaction; it is an availability/work-loss trade-off how often to request a commit from CP.
**************DELETEs**************: for deletion, we maintain a deadlist per transaction. It is located at `tenants/:tenant_id/deadlist/deadlist.json.$txn`. It is PUT once before the pageserver requests requests commit, and not changed after sending request to commit. An object created in the current txn need not (but can) be on the deadlist — it can be DELETEd immediately because its not visible to other transactions. An example use case would be an L0 layer that gets compacted within one transaction; or, if we ever start MVCCing the set of timelines of a tenant, a short-lived branch that is created & destroyed within one transaction.
<aside>
**Deadlist Invariant:** if a an object is on a deadlist of transaction T, it is not referenced from anywhere else in the full state visible to T or any later started transaction > T.
</aside>
### Rationale For Deadlist.json
Given that this proposal only MVCCs layers and indexparts, one may ask why the deadlist isnt part of indexpart. The reason is to not lose generality: the deadlist is just a list of keys; it is not necessary to understand the data format of the versioned object to process the deadlist. This is important for garbage collection / vacuuming, which well come to in the next section.
## Garbage Collection / Vacuuming
After a transaction has reached reject-acknowledged state, Control Plane initiates a garbage collection procedure for the aborted transaction.
Control Plane is in the unique position about transaction states. Here is a sketch of the exact transaction states and what Control Plane keeps track of.
```
struct Tenant {
...
txns: HashMap<TxnId, Transaction>,
// the most recently started txn's id; only most recently sarted can win
next_winner_txn: Option<TxnId>,
}
struct Transaction {
id: TxnId, // immutable
last_committed_txn: TxnId, // immutable; the most recent txn in state `Committed`
// when self was started
pageserver_id: PageserverId,
state: enum {
Open,
Committed,
RejectPending,
RejectAcknowledged, // invariant: we know all S3 activity has ceded
GarbageCollected,
}
}
```
Object creations & deletions by a rejected transaction have never been visible to other transactions. That is true for both RejectPending and RejectAcknowledged states. The difference is that, in RejectPending, the pageserver may still be uploading to S3, whereas in RejectAcknowledged, Control Plane can be certain that all S3 activity in the name of that transaction has ceded. So, once a transaction reaches state RejectAcknowledged state, it is safe to DELETE all objects created by that transaction, and discard the transactions deadlists.
A transaction T in state Committed has subsequent transactions that may or may not reference the objects it created. None of the subsequent transaction can reference the objects on Ts deadlist, though, as per the Deadlist Invariant (see previous section).
So, for garbage collection, we need to assess transactions in state Committed and RejectAcknowledged:
- Commited: delete objects on the deadlist.
- We dont need a LIST request here, the deadlist is sufficient. So, its really cheap.
- This is **not true MVCC garbage collection**; by deleting the objects on Committed transaction T s deadlist, we might delete data referenced by other transactions that were concurrent with T, i.e., they started while T was still open. However, the fact that T is committed means that the other transactions are RejectPending or RejectAcknowledged, so, they dont matter. Pageservers executing these doomed RejectPending transactions must handle 404 for GETs gracefully, e.g., by trying to commit txn so they observe the rejection theyre destined to get anyways. 404s for RejectAcknowledged is handled below.
- RejectAcknowledged: delete all objects created in that txn, and discard deadlists.
- 404s / object-already-deleted type messages must be expected because of Committed garbage collection (see above)
- How to get this list of objects created in a txn? Open but solvable design question; Ideas:
- **Brute force**: within tenant prefix, search for all keys ending in `.$txn` and delete them.
- **WAL for PUTs**: before a txn PUTs an object, it logs to S3, or some other equivalently durable storage, that its going to do it. If we log to S3, this means we have to do an additional WAL PUT per “readl” PUT.
- ******************************LIST with reorged S3 layout (preferred one right now):****************************** layout S3 key space such that `$txn` comes first, i.e., `tenants/:tenant_id/$txn/timelines/:timeline_id/*.json.$txn` . That way, when we need to GC a RejectAcknowledged txn, we just LIST the entire `tenants/:tenant_id/$txn` prefix and delete it. The cost of GC for RejectAcknowledged transactions is thus proportional to the number of objects created in that transaction.
## Branches
This proposal only MVCCs layer files and and index_part.json, but leaves the tenant object not-MVCCed. We argued earlier that its fine to ignore this for now, because
1. Control Plane can act as source-of-truth for the set of timelines, and
2. The only operation that makes decision based on “set of timelines” is GC, which in turn only does layer deletions, and layer deletions ***are*** properly MVCCed.
Now that weve introduced garbage collection, lets elaborate a little more on (2). Recall our example from earlier: Pageserver A knows about timelines {R,S}, but another Pageserver B created an additional branch T, so, its set of timelines is {R,S,T}. Both pageservers will run GC code, and so, PS A may decide to delete a layer thats still needed for branch T.
How does the MVCCing of layer files protect us here? If A decides to delete that layer, its just on As transactions deadlist, but still present in S3 and usable by B. If A commits first, B wont be able to commit and the layers in timeline T will be vacuumed. If B commits first, As deadlist is discarded and the layer continues to exist.
## Safekeeper Changes
We need to teach the safekeepers that there can be multiple pageservers requesting WAL for the same timeline, in order to prevent premature WAL truncation.
In the current architecture, the Safekeeper service currently assumes only one Pageserver and is allowed to prune WAL older than that Pageservers `remote_consistent_lsn`. Safekeeper currently learns the `remote_consistent_lsn` through the walreceiver protocol.
So, if we have a tenant attached to two pageservers at the same time, they will both try to stream WAL and the Safekeeper will get confused about which connections `remote_consistent_lsn` to use as a basis for WAL pruning.
What do we need to change to make it work? We need to make sure that the Safekeepers only prune WAL up to the `remote_consistent_lsn` of the last-committed transaction.
The straight-forward way to get it is to re-design WAL pruning as follows:
1. Pageserver reports remote_consistent_lsn as part of transaction commit to Control Plane.
2. Control Plane makes sure transaction state update is persisted.
3. Control Plane (asynchronous to transaction commit) reconciles with Safekeepers to ensure WAL pruning happens.
The above requires non-trivial changes, but, in the light of other planned projects such as restore-tenant-from-safekeeper-wal-backups, I think Control Plane will need to get involved in WAL pruning anyways.
# How This Proposal Unlocks Future Features
Let us revisit the example from the introduction where we were thinking about handling network partitions. Network partitions need to be solved first, because theyre unavoidable in distributed systems. We did that. Now lets see how we can solve actual product problems:
## **Fast, Zero-Toil Failover on Network Partitions or Instance Failure**
The “Problem Statement” section outlined the current architectures problems with regards to network partitions or instance failure: it requires a 100% correct node-dead detector to make decisions, which doesnt exist in reality. We rely instead on human toil: an oncall engineer has to inspect the situation and make a decision, which may be incorrect and in any case take time in the order of minutes, which means equivalent downtime for users.
With this proposal, automatic failover for pageservers is trivial:
If a pageserver is unresponsive from Control Planes / Computes perspective, Control Plane does the following:
- attach all tenants of the unresponsive pageserver to new pageservers
- switch over these tenants computes immediately;
At this point, availability is restored and user pain relieved.
Whats left is to somehow close the doomed transaction of the unresponsive pageserver, so that it beomes RejectAcknowledged, and GC can make progress. Since S3 is cheap, we can afford to wait a really long time here, especially if we put a soft bound on the amount of data a transaction may produce before it must commit. Procedure:
1. Ensure the unresponsive pageserver is taken out of rotation for new attachments. That probably should happen as part of the routine above.
2. Make a human operator investigate decide what to do (next morning, NO ONCALL ALERT):
1. Inspect the instance, investigate logs, understand root cause.
2. Try to re-establish connectivity between pageserver and Control Plane so that pageserver can retry commits, get rejected, ack rejection ⇒ enable GC.
3. Use below procedure to decomission pageserver.
### Decomissioning A Pageserver (Dead or Alive-but-Unrespsonive)
The solution, enabled by this proposal:
1. Ensure that pageservers S3 credentials are revoked so that it cannot make new uploads, which wouldnt be tracked anywhere.
2. Let enough time pass for the S3 credential revocation to propagate. Amazon doesnt give a guarantee here. As stated earlier, we can easily afford to wait here.
3. Mark all Open and RejectPending transactions of that pageserver as RejectAcknowledge.
Revocation of the S3 credentials is required so that, once we transition all the transactions of that pageserver to RejectAcknowledge, once garbage-collection pass is guaranteed to delete all objects that will ever exist for that pageserver. That way, we need not check *****GarbageCollected***** transactions every again.
## Workflow: Zero-Downtime Relocation
With zero-downtime relocation, the goal is to have the target pageserver warmed up, i.e., at the same `last_record_lsn` as the source pageserver, before switching over Computes from source to target pageserver.
With this proposal, it works like so:
1. Grant source pageserver its last open transaction. This one is doomed to be rejected later, unless the relocation fails.
2. Grant target pageserver its first open transaction.
3. Have target pageserver catch up on WAL, streaming from last-committed-txns remote_consistent_lsn onwards.
4. Once target pageserver reports `last_record_lsn` close enough to source pageserver, target pageserver requests commit.
5. Drain compute traffic from source to target pageserver. (Source can still answer requests until it tries to commit and gets reject, so, this will be quite smooth).
Note that as soon as we complete step (4), the source pageservers transaction is doomed to be rejected later. Conversely, if the target cant catch up fast enough, the source will make a transaction commit earlier. This will generally happen if there is a lot of write traffic coming in. The design space to make thing smooth here is large, but well explored in other areas of computing, e.g., VM live migration. We have all the important policy levers at hand, e.g.,
- delaying source commits if we see target making progress
- slowing down source consumption (need some signalling mechanism for it)
- slowing down compute wal generation
-
It doesnt really matter, whats important is that two pageservers can overlap.
# Additional Trade-Offs / Remarks Brought Up During Peer Review
This proposal was read by and discussed @Stas and @Dmitry Rodionov prior to publishing it with the broader team. (This does not mean they endorse this proposal!).
Issues that we discussed:
1. **Frequency of transactions:** If even idle tenants commit every 10min or so, thats quite a lot of load on Control Plane. Can we minimize it by Equating Transaction Commit Period to Attachment Period? I.e. start txn on attach, commit on detach?
1. Would be nice, but, if a tenant is attached for 1 month, then PS dies, we lose 1 month of work.
2. ⇒ my solution to this problem: Adjusted this proposal to make transaction commit frequency proportional to amount of uncommitted data.
1. Its ok to spend resources on active users, they pay us money to do it!
2. The amount of work per transaction is minimal.
1. In current Control Plane, its a small database transaction that is super unlikely to conflict with other transactions.
2. I have very little concerns about scalability of the commit workload on CP side because it's trivially horizontally scalable by sharding by tenant.
3. There's no super stringent availability requirement on control plane; if a txn can't commit because it can't reach the CP, PS can continue & retry in the background, speculating that it's CP downtime and not PS-partitioned-off scenario.
4. Without stringent availability requirement, there's flexibility for future changes to CP-side-implementation.
2. ************************************************Does this proposal address mirroring / no-performance-degradation failover ?************************************************
1. No it doesnt. It only provides the building block for attaching a tenant to a new pageserver without having to worry that the tenant is detached on the old pageserver.
2. A simple scheme to build no-performance-degradation failover on top of this proposal is to have an asynchronous read-only replica of a tenant on another pageserver in the same region.
3. Another more ambitious scheme to get no-performance-degradation would be [One-Pager: Layer File Spreading (Christian)](https://www.notion.so/One-Pager-Layer-File-Spreading-Christian-eb6b64182a214e11b3fceceee688d843?pvs=21); this proposal would be used in layer file spreading for risk-free automation of TenantLeader failover, which hasnt been addressed Ithere.
4. In any way, failover would restart from an older S3 state, and need to re-ingest WAL before being able to server recently written pages.
1. Is that a show-stopper? I think not.
2. Is it suboptimal? Absolutely: if a pageserver instance fails, all its tenants will be distributed among the remaining pageservers (OK), and all these tenants will ask the safekeepers for WAL at the same time (BAD). So, pageserver instance failure will cause a load spike in safekeepers.
1. Personally I think thats an OK trade-off to make.
2. There are countless options to avoid / mitigate the load spike. E.g., pro-actively streaming WAL to the standby read-only replica.
3. ********************************************Does this proposal allow multiple writers for a tenant?********************************************
1. In abstract terms, this proposal provides a linearized history for a given S3 prefix.
2. In concrete terms, this proposal provides a linearized history per tenant.
3. There can be multiple writers at a given time, but only one of them will win to become part of the linearized history.
4. ************************************************************************************Alternative ideas mentioned during meetings that should be turned into a written prospoal like this one:************************************************************************************
1. @Dmitry Rodionov : having linearized storage of index_part.json in some database that allows serializable transactions / atomic compare-and-swap PUT
2. @Dmitry Rodionov :
3. @Stas : something like this scheme, but somehow find a way to equate attachment duration with transaction duration, without losing work if pageserver dies months after attachment.

View File

@@ -10,6 +10,9 @@ chrono.workspace = true
serde.workspace = true
serde_with.workspace = true
serde_json.workspace = true
regex.workspace = true
utils = { path = "../utils" }
remote_storage = { version = "0.1", path = "../remote_storage/" }
workspace_hack.workspace = true

View File

@@ -107,7 +107,6 @@ pub struct ComputeMetrics {
pub num_ext_downloaded: u64,
pub largest_ext_size: u64, // these are measured in bytes
pub total_ext_download_size: u64,
pub prep_extensions_ms: u64,
}
/// Response of the `/computes/{compute_id}/spec` control-plane API.

View File

@@ -3,11 +3,16 @@
//! The spec.json file is used to pass information to 'compute_ctl'. It contains
//! all the information needed to start up the right version of PostgreSQL,
//! and connect it to the storage nodes.
use std::collections::HashMap;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
use regex::Regex;
use remote_storage::RemotePath;
/// String type alias representing Postgres identifier and
/// intended to be used for DB / role names.
pub type PgIdent = String;
@@ -61,8 +66,55 @@ pub struct ComputeSpec {
/// the pageserver and safekeepers.
pub storage_auth_token: Option<String>,
// list of prefixes to search for custom extensions in remote extension storage
// information about available remote extensions
pub remote_extensions: Option<RemoteExtSpec>,
}
#[derive(Clone, Debug, Default, Deserialize, Serialize)]
pub struct RemoteExtSpec {
pub public_extensions: Option<Vec<String>>,
pub custom_extensions: Option<Vec<String>>,
pub library_index: HashMap<String, String>,
pub extension_data: HashMap<String, ExtensionData>,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct ExtensionData {
pub control_data: HashMap<String, String>,
pub archive_path: String,
}
impl RemoteExtSpec {
pub fn get_ext(
&self,
ext_name: &str,
is_library: bool,
) -> anyhow::Result<(String, RemotePath)> {
let mut real_ext_name = ext_name;
if is_library {
// sometimes library names might have a suffix like
// library.so or library.so.3. We strip this off
// because library_index is based on the name without the file extension
let strip_lib_suffix = Regex::new(r"\.so.*").unwrap();
let lib_raw_name = strip_lib_suffix.replace(real_ext_name, "").to_string();
real_ext_name = self
.library_index
.get(&lib_raw_name)
.ok_or(anyhow::anyhow!("library {} is not found", lib_raw_name))?;
}
match self.extension_data.get(real_ext_name) {
Some(ext_data) => Ok((
real_ext_name.to_string(),
RemotePath::from_string(&ext_data.archive_path)?,
)),
None => Err(anyhow::anyhow!(
"real_ext_name {} is not found",
real_ext_name
)),
}
}
}
#[serde_as]

View File

@@ -205,5 +205,43 @@
"name": "zenith new",
"new_name": "zenith \"new\""
}
]
],
"remote_extensions": {
"library_index": {
"anon": "anon",
"postgis-3": "postgis",
"libpgrouting-3.4": "postgis",
"postgis_raster-3": "postgis",
"postgis_sfcgal-3": "postgis",
"postgis_topology-3": "postgis",
"address_standardizer-3": "postgis"
},
"extension_data": {
"anon": {
"archive_path": "5834329303/v15/extensions/anon.tar.zst",
"control_data": {
"anon.control": "# PostgreSQL Anonymizer (anon) extension\ncomment = ''Data anonymization tools''\ndefault_version = ''1.1.0''\ndirectory=''extension/anon''\nrelocatable = false\nrequires = ''pgcrypto''\nsuperuser = false\nmodule_pathname = ''$libdir/anon''\ntrusted = true\n"
}
},
"postgis": {
"archive_path": "5834329303/v15/extensions/postgis.tar.zst",
"control_data": {
"postgis.control": "# postgis extension\ncomment = ''PostGIS geometry and geography spatial types and functions''\ndefault_version = ''3.3.2''\nmodule_pathname = ''$libdir/postgis-3''\nrelocatable = false\ntrusted = true\n",
"pgrouting.control": "# pgRouting Extension\ncomment = ''pgRouting Extension''\ndefault_version = ''3.4.2''\nmodule_pathname = ''$libdir/libpgrouting-3.4''\nrelocatable = true\nrequires = ''plpgsql''\nrequires = ''postgis''\ntrusted = true\n",
"postgis_raster.control": "# postgis_raster extension\ncomment = ''PostGIS raster types and functions''\ndefault_version = ''3.3.2''\nmodule_pathname = ''$libdir/postgis_raster-3''\nrelocatable = false\nrequires = postgis\ntrusted = true\n",
"postgis_sfcgal.control": "# postgis topology extension\ncomment = ''PostGIS SFCGAL functions''\ndefault_version = ''3.3.2''\nrelocatable = true\nrequires = postgis\ntrusted = true\n",
"postgis_topology.control": "# postgis topology extension\ncomment = ''PostGIS topology spatial types and functions''\ndefault_version = ''3.3.2''\nrelocatable = false\nschema = topology\nrequires = postgis\ntrusted = true\n",
"address_standardizer.control": "# address_standardizer extension\ncomment = ''Used to parse an address into constituent elements. Generally used to support geocoding address normalization step.''\ndefault_version = ''3.3.2''\nrelocatable = true\ntrusted = true\n",
"postgis_tiger_geocoder.control": "# postgis tiger geocoder extension\ncomment = ''PostGIS tiger geocoder and reverse geocoder''\ndefault_version = ''3.3.2''\nrelocatable = false\nschema = tiger\nrequires = ''postgis,fuzzystrmatch''\nsuperuser= false\ntrusted = true\n",
"address_standardizer_data_us.control": "# address standardizer us dataset\ncomment = ''Address Standardizer US dataset example''\ndefault_version = ''3.3.2''\nrelocatable = true\ntrusted = true\n"
}
}
},
"custom_extensions": [
"anon"
],
"public_extensions": [
"postgis"
]
}
}

View File

@@ -189,8 +189,6 @@ impl S3Bucket {
let kind = RequestKind::Get;
let permit = self.owned_permit(kind).await;
metrics::inc_get_object();
let started_at = start_measuring_requests(kind);
let get_object = self
@@ -205,7 +203,6 @@ impl S3Bucket {
let started_at = ScopeGuard::into_inner(started_at);
if get_object.is_err() {
metrics::inc_get_object_fail();
metrics::BUCKET_METRICS.req_seconds.observe_elapsed(
kind,
AttemptOutcome::Err,
@@ -337,7 +334,6 @@ impl RemoteStorage for S3Bucket {
loop {
let _guard = self.permit(kind).await;
metrics::inc_list_objects();
let started_at = start_measuring_requests(kind);
let fetch_response = self
@@ -350,10 +346,6 @@ impl RemoteStorage for S3Bucket {
.set_max_keys(self.max_keys_per_list_response)
.send()
.await
.map_err(|e| {
metrics::inc_list_objects_fail();
e
})
.context("Failed to list S3 prefixes")
.map_err(DownloadError::Other);
@@ -395,7 +387,6 @@ impl RemoteStorage for S3Bucket {
let mut all_files = vec![];
loop {
let _guard = self.permit(kind).await;
metrics::inc_list_objects();
let started_at = start_measuring_requests(kind);
let response = self
@@ -407,10 +398,6 @@ impl RemoteStorage for S3Bucket {
.set_max_keys(self.max_keys_per_list_response)
.send()
.await
.map_err(|e| {
metrics::inc_list_objects_fail();
e
})
.context("Failed to list files in S3 bucket");
let started_at = ScopeGuard::into_inner(started_at);
@@ -443,7 +430,6 @@ impl RemoteStorage for S3Bucket {
let kind = RequestKind::Put;
let _guard = self.permit(kind).await;
metrics::inc_put_object();
let started_at = start_measuring_requests(kind);
let body = Body::wrap_stream(ReaderStream::new(from));
@@ -458,11 +444,7 @@ impl RemoteStorage for S3Bucket {
.content_length(from_size_bytes.try_into()?)
.body(bytes_stream)
.send()
.await
.map_err(|e| {
metrics::inc_put_object_fail();
e
});
.await;
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
@@ -519,7 +501,6 @@ impl RemoteStorage for S3Bucket {
}
for chunk in delete_objects.chunks(MAX_DELETE_OBJECTS_REQUEST_SIZE) {
metrics::inc_delete_objects(chunk.len() as u64);
let started_at = start_measuring_requests(kind);
let resp = self
@@ -537,8 +518,10 @@ impl RemoteStorage for S3Bucket {
match resp {
Ok(resp) => {
metrics::BUCKET_METRICS
.deleted_objects_total
.inc_by(chunk.len() as u64);
if let Some(errors) = resp.errors {
metrics::inc_delete_objects_fail(errors.len() as u64);
return Err(anyhow::format_err!(
"Failed to delete {} objects",
errors.len()
@@ -546,7 +529,6 @@ impl RemoteStorage for S3Bucket {
}
}
Err(e) => {
metrics::inc_delete_objects_fail(chunk.len() as u64);
return Err(e.into());
}
}
@@ -555,32 +537,8 @@ impl RemoteStorage for S3Bucket {
}
async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {
let kind = RequestKind::Delete;
let _guard = self.permit(kind).await;
metrics::inc_delete_object();
let started_at = start_measuring_requests(kind);
let res = self
.client
.delete_object()
.bucket(self.bucket_name.clone())
.key(self.relative_path_to_s3_object(path))
.send()
.await
.map_err(|e| {
metrics::inc_delete_object_fail();
e
});
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &res, started_at);
res?;
Ok(())
let paths = std::array::from_ref(path);
self.delete_objects(paths).await
}
}

View File

@@ -1,4 +1,6 @@
use metrics::{register_histogram_vec, register_int_counter_vec, Histogram, IntCounter};
use metrics::{
register_histogram_vec, register_int_counter, register_int_counter_vec, Histogram, IntCounter,
};
use once_cell::sync::Lazy;
pub(super) static BUCKET_METRICS: Lazy<BucketMetrics> = Lazy::new(Default::default);
@@ -125,41 +127,22 @@ impl PassFailCancelledRequestTyped<Histogram> {
}
pub(super) struct BucketMetrics {
/// Total requests attempted
// TODO: remove after next release and migrate dashboards to `sum by (result) (remote_storage_s3_requests_count)`
requests: RequestTyped<IntCounter>,
/// Subset of attempted requests failed
// TODO: remove after next release and migrate dashboards to `remote_storage_s3_requests_count{result="err"}`
failed: RequestTyped<IntCounter>,
/// Full request duration until successful completion, error or cancellation.
pub(super) req_seconds: PassFailCancelledRequestTyped<Histogram>,
/// Total amount of seconds waited on queue.
pub(super) wait_seconds: RequestTyped<Histogram>,
/// Track how many semaphore awaits were cancelled per request type.
///
/// This is in case cancellations are happening more than expected.
pub(super) cancelled_waits: RequestTyped<IntCounter>,
/// Total amount of deleted objects in batches or single requests.
pub(super) deleted_objects_total: IntCounter,
}
impl Default for BucketMetrics {
fn default() -> Self {
let requests = register_int_counter_vec!(
"remote_storage_s3_requests_count",
"Number of s3 requests of particular type",
&["request_type"],
)
.expect("failed to define a metric");
let requests =
RequestTyped::build_with(|kind| requests.with_label_values(&[kind.as_str()]));
let failed = register_int_counter_vec!(
"remote_storage_s3_failures_count",
"Number of failed s3 requests of particular type",
&["request_type"],
)
.expect("failed to define a metric");
let failed = RequestTyped::build_with(|kind| failed.with_label_values(&[kind.as_str()]));
let buckets = [0.01, 0.10, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0];
let req_seconds = register_histogram_vec!(
@@ -192,52 +175,17 @@ impl Default for BucketMetrics {
let cancelled_waits =
RequestTyped::build_with(|kind| cancelled_waits.with_label_values(&[kind.as_str()]));
let deleted_objects_total = register_int_counter!(
"remote_storage_s3_deleted_objects_total",
"Amount of deleted objects in total",
)
.unwrap();
Self {
requests,
failed,
req_seconds,
wait_seconds,
cancelled_waits,
deleted_objects_total,
}
}
}
pub fn inc_get_object() {
BUCKET_METRICS.requests.get(Get).inc()
}
pub fn inc_get_object_fail() {
BUCKET_METRICS.failed.get(Get).inc()
}
pub fn inc_put_object() {
BUCKET_METRICS.requests.get(Put).inc()
}
pub fn inc_put_object_fail() {
BUCKET_METRICS.failed.get(Put).inc()
}
pub fn inc_delete_object() {
BUCKET_METRICS.requests.get(Delete).inc()
}
pub fn inc_delete_objects(count: u64) {
BUCKET_METRICS.requests.get(Delete).inc_by(count)
}
pub fn inc_delete_object_fail() {
BUCKET_METRICS.failed.get(Delete).inc()
}
pub fn inc_delete_objects_fail(count: u64) {
BUCKET_METRICS.failed.get(Delete).inc_by(count)
}
pub fn inc_list_objects() {
BUCKET_METRICS.requests.get(List).inc()
}
pub fn inc_list_objects_fail() {
BUCKET_METRICS.failed.get(List).inc()
}

View File

@@ -71,6 +71,13 @@ impl UnreliableWrapper {
}
}
}
async fn delete_inner(&self, path: &RemotePath, attempt: bool) -> anyhow::Result<()> {
if attempt {
self.attempt(RemoteOp::Delete(path.clone()))?;
}
self.inner.delete(path).await
}
}
#[async_trait::async_trait]
@@ -122,15 +129,15 @@ impl RemoteStorage for UnreliableWrapper {
}
async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {
self.attempt(RemoteOp::Delete(path.clone()))?;
self.inner.delete(path).await
self.delete_inner(path, true).await
}
async fn delete_objects<'a>(&self, paths: &'a [RemotePath]) -> anyhow::Result<()> {
self.attempt(RemoteOp::DeleteObjects(paths.to_vec()))?;
let mut error_counter = 0;
for path in paths {
if (self.delete(path).await).is_err() {
// Dont record attempt because it was already recorded above
if (self.delete_inner(path, false).await).is_err() {
error_counter += 1;
}
}

View File

@@ -26,6 +26,7 @@ serde_json.workspace = true
signal-hook.workspace = true
thiserror.workspace = true
tokio.workspace = true
tokio-util.workspace = true
tracing.workspace = true
tracing-error.workspace = true
tracing-subscriber = { workspace = true, features = ["json", "registry"] }

234
libs/utils/src/backoff.rs Normal file
View File

@@ -0,0 +1,234 @@
use std::fmt::{Debug, Display};
use futures::Future;
use tokio_util::sync::CancellationToken;
pub const DEFAULT_BASE_BACKOFF_SECONDS: f64 = 0.1;
pub const DEFAULT_MAX_BACKOFF_SECONDS: f64 = 3.0;
pub async fn exponential_backoff(
n: u32,
base_increment: f64,
max_seconds: f64,
cancel: &CancellationToken,
) {
let backoff_duration_seconds =
exponential_backoff_duration_seconds(n, base_increment, max_seconds);
if backoff_duration_seconds > 0.0 {
tracing::info!(
"Backoff: waiting {backoff_duration_seconds} seconds before processing with the task",
);
drop(
tokio::time::timeout(
std::time::Duration::from_secs_f64(backoff_duration_seconds),
cancel.cancelled(),
)
.await,
)
}
}
pub fn exponential_backoff_duration_seconds(n: u32, base_increment: f64, max_seconds: f64) -> f64 {
if n == 0 {
0.0
} else {
(1.0 + base_increment).powf(f64::from(n)).min(max_seconds)
}
}
/// Configure cancellation for a retried operation: when to cancel (the token), and
/// what kind of error to return on cancellation
pub struct Cancel<E, CF>
where
E: Display + Debug + 'static,
CF: Fn() -> E,
{
token: CancellationToken,
on_cancel: CF,
}
impl<E, CF> Cancel<E, CF>
where
E: Display + Debug + 'static,
CF: Fn() -> E,
{
pub fn new(token: CancellationToken, on_cancel: CF) -> Self {
Self { token, on_cancel }
}
}
/// retries passed operation until one of the following conditions are met:
/// Encountered error is considered as permanent (non-retryable)
/// Retries have been exhausted.
/// `is_permanent` closure should be used to provide distinction between permanent/non-permanent errors
/// When attempts cross `warn_threshold` function starts to emit log warnings.
/// `description` argument is added to log messages. Its value should identify the `op` is doing
/// `cancel` argument is required: any time we are looping on retry, we should be using a CancellationToken
/// to drop out promptly on shutdown.
pub async fn retry<T, O, F, E, CF>(
mut op: O,
is_permanent: impl Fn(&E) -> bool,
warn_threshold: u32,
max_retries: u32,
description: &str,
cancel: Cancel<E, CF>,
) -> Result<T, E>
where
// Not std::error::Error because anyhow::Error doesnt implement it.
// For context see https://github.com/dtolnay/anyhow/issues/63
E: Display + Debug + 'static,
O: FnMut() -> F,
F: Future<Output = Result<T, E>>,
CF: Fn() -> E,
{
let mut attempts = 0;
loop {
if cancel.token.is_cancelled() {
return Err((cancel.on_cancel)());
}
let result = op().await;
match result {
Ok(_) => {
if attempts > 0 {
tracing::info!("{description} succeeded after {attempts} retries");
}
return result;
}
// These are "permanent" errors that should not be retried.
Err(ref e) if is_permanent(e) => {
return result;
}
// Assume that any other failure might be transient, and the operation might
// succeed if we just keep trying.
Err(err) if attempts < warn_threshold => {
tracing::info!("{description} failed, will retry (attempt {attempts}): {err:#}");
}
Err(err) if attempts < max_retries => {
tracing::warn!("{description} failed, will retry (attempt {attempts}): {err:#}");
}
Err(ref err) => {
// Operation failed `max_attempts` times. Time to give up.
tracing::warn!(
"{description} still failed after {attempts} retries, giving up: {err:?}"
);
return result;
}
}
// sleep and retry
exponential_backoff(
attempts,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
&cancel.token,
)
.await;
attempts += 1;
}
}
#[cfg(test)]
mod tests {
use std::io;
use tokio::sync::Mutex;
use super::*;
#[test]
fn backoff_defaults_produce_growing_backoff_sequence() {
let mut current_backoff_value = None;
for i in 0..10_000 {
let new_backoff_value = exponential_backoff_duration_seconds(
i,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
);
if let Some(old_backoff_value) = current_backoff_value.replace(new_backoff_value) {
assert!(
old_backoff_value <= new_backoff_value,
"{i}th backoff value {new_backoff_value} is smaller than the previous one {old_backoff_value}"
)
}
}
assert_eq!(
current_backoff_value.expect("Should have produced backoff values to compare"),
DEFAULT_MAX_BACKOFF_SECONDS,
"Given big enough of retries, backoff should reach its allowed max value"
);
}
#[tokio::test(start_paused = true)]
async fn retry_always_error() {
let count = Mutex::new(0);
let err_result = retry(
|| async {
*count.lock().await += 1;
Result::<(), io::Error>::Err(io::Error::from(io::ErrorKind::Other))
},
|_e| false,
1,
1,
"work",
Cancel::new(CancellationToken::new(), || -> io::Error { unreachable!() }),
)
.await;
assert!(err_result.is_err());
assert_eq!(*count.lock().await, 2);
}
#[tokio::test(start_paused = true)]
async fn retry_ok_after_err() {
let count = Mutex::new(0);
retry(
|| async {
let mut locked = count.lock().await;
if *locked > 1 {
Ok(())
} else {
*locked += 1;
Err(io::Error::from(io::ErrorKind::Other))
}
},
|_e| false,
2,
2,
"work",
Cancel::new(CancellationToken::new(), || -> io::Error { unreachable!() }),
)
.await
.unwrap();
}
#[tokio::test(start_paused = true)]
async fn dont_retry_permanent_errors() {
let count = Mutex::new(0);
let _ = retry(
|| async {
let mut locked = count.lock().await;
if *locked > 1 {
Ok(())
} else {
*locked += 1;
Err(io::Error::from(io::ErrorKind::Other))
}
},
|_e| true,
2,
2,
"work",
Cancel::new(CancellationToken::new(), || -> io::Error { unreachable!() }),
)
.await
.unwrap_err();
assert_eq!(*count.lock().await, 1);
}
}

View File

@@ -111,6 +111,10 @@ pub fn fsync(path: &Path) -> io::Result<()> {
.map_err(|e| io::Error::new(e.kind(), format!("Failed to fsync file {path:?}: {e}")))
}
pub async fn fsync_async(path: impl AsRef<std::path::Path>) -> Result<(), std::io::Error> {
tokio::fs::File::open(path).await?.sync_all().await
}
#[cfg(test)]
mod tests {
use tempfile::tempdir;

View File

@@ -1,6 +1,8 @@
//! `utils` is intended to be a place to put code that is shared
//! between other crates in this repository.
pub mod backoff;
/// `Lsn` type implements common tasks on Log Sequence Numbers
pub mod lsn;
/// SeqWait allows waiting for a future sequence number to arrive
@@ -66,44 +68,6 @@ pub mod completion;
/// Reporting utilities
pub mod error;
mod failpoint_macro_helpers {
/// use with fail::cfg("$name", "return(2000)")
///
/// The effect is similar to a "sleep(2000)" action, i.e. we sleep for the
/// specified time (in milliseconds). The main difference is that we use async
/// tokio sleep function. Another difference is that we print lines to the log,
/// which can be useful in tests to check that the failpoint was hit.
#[macro_export]
macro_rules! failpoint_sleep_millis_async {
($name:literal) => {{
// If the failpoint is used with a "return" action, set should_sleep to the
// returned value (as string). Otherwise it's set to None.
let should_sleep = (|| {
::fail::fail_point!($name, |x| x);
::std::option::Option::None
})();
// Sleep if the action was a returned value
if let ::std::option::Option::Some(duration_str) = should_sleep {
$crate::failpoint_sleep_helper($name, duration_str).await
}
}};
}
// Helper function used by the macro. (A function has nicer scoping so we
// don't need to decorate everything with "::")
pub async fn failpoint_sleep_helper(name: &'static str, duration_str: String) {
let millis = duration_str.parse::<u64>().unwrap();
let d = std::time::Duration::from_millis(millis);
tracing::info!("failpoint {:?}: sleeping for {:?}", name, d);
tokio::time::sleep(d).await;
tracing::info!("failpoint {:?}: sleep done", name);
}
}
pub use failpoint_macro_helpers::failpoint_sleep_helper;
/// This is a shortcut to embed git sha into binaries and avoid copying the same build script to all packages
///
/// we have several cases:

View File

@@ -215,7 +215,6 @@ fn bench_sequential(c: &mut Criterion) {
TimelineId::generate(),
zero.add(10 * i32)..zero.add(10 * i32 + 1),
Lsn(i),
false,
0,
);
updates.insert_historic(layer);

View File

@@ -72,7 +72,7 @@ async fn read_delta_file(path: impl AsRef<Path>) -> Result<()> {
.await?;
let cursor = BlockCursor::new(&file);
for (k, v) in all {
let value = cursor.read_blob(v.pos())?;
let value = cursor.read_blob(v.pos()).await?;
println!("key:{} value_len:{}", k, value.len());
}
// TODO(chi): special handling for last key?

View File

@@ -6,11 +6,12 @@ use std::{env, ops::ControlFlow, path::Path, str::FromStr};
use anyhow::{anyhow, Context};
use clap::{Arg, ArgAction, Command};
use fail::FailScenario;
use metrics::launch_timestamp::{set_launch_timestamp_metric, LaunchTimestamp};
use pageserver::disk_usage_eviction_task::{self, launch_disk_usage_global_eviction_task};
use pageserver::metrics::{STARTUP_DURATION, STARTUP_IS_LOADING};
use pageserver::task_mgr::WALRECEIVER_RUNTIME;
use pageserver::tenant::TenantSharedResources;
use remote_storage::GenericRemoteStorage;
use tokio::time::Instant;
use tracing::*;
@@ -121,7 +122,7 @@ fn main() -> anyhow::Result<()> {
}
// Initialize up failpoints support
let scenario = FailScenario::setup();
let scenario = pageserver::failpoint_support::init();
// Basic initialization of things that don't change after startup
virtual_file::init(conf.max_file_descriptors);
@@ -382,8 +383,10 @@ fn start_pageserver(
BACKGROUND_RUNTIME.block_on(mgr::init_tenant_mgr(
conf,
broker_client.clone(),
remote_storage.clone(),
TenantSharedResources {
broker_client: broker_client.clone(),
remote_storage: remote_storage.clone(),
},
order,
))?;

View File

@@ -85,6 +85,7 @@
//! The solution is that all code paths are infected with precisely one
//! [`RequestContext`] argument. Functions in the middle of the call chain
//! only need to pass it on.
use crate::task_mgr::TaskKind;
// The main structure of this module, see module-level comment.
@@ -92,6 +93,7 @@ use crate::task_mgr::TaskKind;
pub struct RequestContext {
task_kind: TaskKind,
download_behavior: DownloadBehavior,
access_stats_behavior: AccessStatsBehavior,
}
/// Desired behavior if the operation requires an on-demand download
@@ -109,6 +111,67 @@ pub enum DownloadBehavior {
Error,
}
/// Whether this request should update access times used in LRU eviction
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub(crate) enum AccessStatsBehavior {
/// Update access times: this request's access to data should be taken
/// as a hint that the accessed layer is likely to be accessed again
Update,
/// Do not update access times: this request is accessing the layer
/// but does not want to indicate that the layer should be retained in cache,
/// perhaps because the requestor is a compaction routine that will soon cover
/// this layer with another.
Skip,
}
pub struct RequestContextBuilder {
inner: RequestContext,
}
impl RequestContextBuilder {
/// A new builder with default settings
pub fn new(task_kind: TaskKind) -> Self {
Self {
inner: RequestContext {
task_kind,
download_behavior: DownloadBehavior::Download,
access_stats_behavior: AccessStatsBehavior::Update,
},
}
}
pub fn extend(original: &RequestContext) -> Self {
Self {
// This is like a Copy, but avoid implementing Copy because ordinary users of
// RequestContext should always move or ref it.
inner: RequestContext {
task_kind: original.task_kind,
download_behavior: original.download_behavior,
access_stats_behavior: original.access_stats_behavior,
},
}
}
/// Configure the DownloadBehavior of the context: whether to
/// download missing layers, and/or warn on the download.
pub fn download_behavior(mut self, b: DownloadBehavior) -> Self {
self.inner.download_behavior = b;
self
}
/// Configure the AccessStatsBehavior of the context: whether layer
/// accesses should update the access time of the layer.
pub(crate) fn access_stats_behavior(mut self, b: AccessStatsBehavior) -> Self {
self.inner.access_stats_behavior = b;
self
}
pub fn build(self) -> RequestContext {
self.inner
}
}
impl RequestContext {
/// Create a new RequestContext that has no parent.
///
@@ -123,10 +186,9 @@ impl RequestContext {
/// because someone explicitly canceled it.
/// It has no parent, so it cannot inherit cancellation from there.
pub fn new(task_kind: TaskKind, download_behavior: DownloadBehavior) -> Self {
RequestContext {
task_kind,
download_behavior,
}
RequestContextBuilder::new(task_kind)
.download_behavior(download_behavior)
.build()
}
/// Create a detached child context for a task that may outlive `self`.
@@ -187,10 +249,7 @@ impl RequestContext {
}
fn child_impl(&self, task_kind: TaskKind, download_behavior: DownloadBehavior) -> Self {
RequestContext {
task_kind,
download_behavior,
}
Self::new(task_kind, download_behavior)
}
pub fn task_kind(&self) -> TaskKind {
@@ -200,4 +259,8 @@ impl RequestContext {
pub fn download_behavior(&self) -> DownloadBehavior {
self.download_behavior
}
pub(crate) fn access_stats_behavior(&self) -> AccessStatsBehavior {
self.access_stats_behavior
}
}

View File

@@ -0,0 +1,86 @@
/// use with fail::cfg("$name", "return(2000)")
///
/// The effect is similar to a "sleep(2000)" action, i.e. we sleep for the
/// specified time (in milliseconds). The main difference is that we use async
/// tokio sleep function. Another difference is that we print lines to the log,
/// which can be useful in tests to check that the failpoint was hit.
#[macro_export]
macro_rules! __failpoint_sleep_millis_async {
($name:literal) => {{
// If the failpoint is used with a "return" action, set should_sleep to the
// returned value (as string). Otherwise it's set to None.
let should_sleep = (|| {
::fail::fail_point!($name, |x| x);
::std::option::Option::None
})();
// Sleep if the action was a returned value
if let ::std::option::Option::Some(duration_str) = should_sleep {
$crate::failpoint_support::failpoint_sleep_helper($name, duration_str).await
}
}};
}
pub use __failpoint_sleep_millis_async as sleep_millis_async;
// Helper function used by the macro. (A function has nicer scoping so we
// don't need to decorate everything with "::")
#[doc(hidden)]
pub(crate) async fn failpoint_sleep_helper(name: &'static str, duration_str: String) {
let millis = duration_str.parse::<u64>().unwrap();
let d = std::time::Duration::from_millis(millis);
tracing::info!("failpoint {:?}: sleeping for {:?}", name, d);
tokio::time::sleep(d).await;
tracing::info!("failpoint {:?}: sleep done", name);
}
pub fn init() -> fail::FailScenario<'static> {
// The failpoints lib provides support for parsing the `FAILPOINTS` env var.
// We want non-default behavior for `exit`, though, so, we handle it separately.
//
// Format for FAILPOINTS is "name=actions" separated by ";".
let actions = std::env::var("FAILPOINTS");
if actions.is_ok() {
std::env::remove_var("FAILPOINTS");
} else {
// let the library handle non-utf8, or nothing for not present
}
let scenario = fail::FailScenario::setup();
if let Ok(val) = actions {
val.split(';')
.enumerate()
.map(|(i, s)| s.split_once('=').ok_or((i, s)))
.for_each(|res| {
let (name, actions) = match res {
Ok(t) => t,
Err((i, s)) => {
panic!(
"startup failpoints: missing action on the {}th failpoint; try `{s}=return`",
i + 1,
);
}
};
if let Err(e) = apply_failpoint(name, actions) {
panic!("startup failpoints: failed to apply failpoint {name}={actions}: {e}");
}
});
}
scenario
}
pub(crate) fn apply_failpoint(name: &str, actions: &str) -> Result<(), String> {
if actions == "exit" {
fail::cfg_callback(name, exit_failpoint)
} else {
fail::cfg(name, actions)
}
}
#[inline(never)]
fn exit_failpoint() {
tracing::info!("Exit requested by failpoint");
std::process::exit(1);
}

View File

@@ -517,7 +517,6 @@ async fn timeline_delete_handler(
.instrument(info_span!("timeline_delete", %tenant_id, %timeline_id))
.await?;
// FIXME: needs to be an error for console to retry it. Ideally Accepted should be used and retried until 404.
json_response(StatusCode::ACCEPTED, ())
}
@@ -980,14 +979,7 @@ async fn failpoints_handler(
// We recognize one extra "action" that's not natively recognized
// by the failpoints crate: exit, to immediately kill the process
let cfg_result = if fp.actions == "exit" {
fail::cfg_callback(fp.name, || {
info!("Exit requested by failpoint");
std::process::exit(1);
})
} else {
fail::cfg(fp.name, &fp.actions)
};
let cfg_result = crate::failpoint_support::apply_failpoint(&fp.name, &fp.actions);
if let Err(err_msg) = cfg_result {
return Err(ApiError::BadRequest(anyhow!(

View File

@@ -21,6 +21,8 @@ pub mod walingest;
pub mod walrecord;
pub mod walredo;
pub mod failpoint_support;
use std::path::Path;
use crate::task_mgr::TaskKind;
@@ -95,28 +97,6 @@ pub async fn shutdown_pageserver(exit_code: i32) {
std::process::exit(exit_code);
}
const DEFAULT_BASE_BACKOFF_SECONDS: f64 = 0.1;
const DEFAULT_MAX_BACKOFF_SECONDS: f64 = 3.0;
async fn exponential_backoff(n: u32, base_increment: f64, max_seconds: f64) {
let backoff_duration_seconds =
exponential_backoff_duration_seconds(n, base_increment, max_seconds);
if backoff_duration_seconds > 0.0 {
info!(
"Backoff: waiting {backoff_duration_seconds} seconds before processing with the task",
);
tokio::time::sleep(std::time::Duration::from_secs_f64(backoff_duration_seconds)).await;
}
}
pub fn exponential_backoff_duration_seconds(n: u32, base_increment: f64, max_seconds: f64) -> f64 {
if n == 0 {
0.0
} else {
(1.0 + base_increment).powf(f64::from(n)).min(max_seconds)
}
}
/// The name of the metadata file pageserver creates per timeline.
/// Full path: `tenants/<tenant_id>/timelines/<timeline_id>/metadata`.
pub const METADATA_FILE_NAME: &str = "metadata";
@@ -238,37 +218,6 @@ async fn timed<Fut: std::future::Future>(
}
}
#[cfg(test)]
mod backoff_defaults_tests {
use super::*;
#[test]
fn backoff_defaults_produce_growing_backoff_sequence() {
let mut current_backoff_value = None;
for i in 0..10_000 {
let new_backoff_value = exponential_backoff_duration_seconds(
i,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
);
if let Some(old_backoff_value) = current_backoff_value.replace(new_backoff_value) {
assert!(
old_backoff_value <= new_backoff_value,
"{i}th backoff value {new_backoff_value} is smaller than the previous one {old_backoff_value}"
)
}
}
assert_eq!(
current_backoff_value.expect("Should have produced backoff values to compare"),
DEFAULT_MAX_BACKOFF_SECONDS,
"Given big enough of retries, backoff should reach its allowed max value"
);
}
}
#[cfg(test)]
mod timed_tests {
use super::timed;

View File

@@ -6,7 +6,7 @@ use metrics::{
HistogramVec, IntCounter, IntCounterVec, IntGauge, IntGaugeVec, UIntGauge, UIntGaugeVec,
};
use once_cell::sync::Lazy;
use strum::VariantNames;
use strum::{EnumCount, IntoEnumIterator, VariantNames};
use strum_macros::{EnumVariantNames, IntoStaticStr};
use utils::id::{TenantId, TimelineId};
@@ -570,23 +570,160 @@ pub(crate) static STORAGE_IO_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
.expect("failed to define a metric")
});
const SMGR_QUERY_TIME_OPERATIONS: &[&str] = &[
"get_rel_exists",
"get_rel_size",
"get_page_at_lsn",
"get_db_size",
];
#[derive(Debug)]
struct GlobalAndPerTimelineHistogram {
global: Histogram,
per_tenant_timeline: Histogram,
}
pub static SMGR_QUERY_TIME: Lazy<HistogramVec> = Lazy::new(|| {
impl GlobalAndPerTimelineHistogram {
fn observe(&self, value: f64) {
self.global.observe(value);
self.per_tenant_timeline.observe(value);
}
}
struct GlobalAndPerTimelineHistogramTimer<'a> {
h: &'a GlobalAndPerTimelineHistogram,
start: std::time::Instant,
}
impl<'a> Drop for GlobalAndPerTimelineHistogramTimer<'a> {
fn drop(&mut self) {
let elapsed = self.start.elapsed();
self.h.observe(elapsed.as_secs_f64());
}
}
#[derive(
Debug,
Clone,
Copy,
IntoStaticStr,
strum_macros::EnumCount,
strum_macros::EnumIter,
strum_macros::FromRepr,
)]
#[strum(serialize_all = "snake_case")]
pub enum SmgrQueryType {
GetRelExists,
GetRelSize,
GetPageAtLsn,
GetDbSize,
}
#[derive(Debug)]
pub struct SmgrQueryTimePerTimeline {
metrics: [GlobalAndPerTimelineHistogram; SmgrQueryType::COUNT],
}
static SMGR_QUERY_TIME_PER_TENANT_TIMELINE: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_smgr_query_seconds",
"Time spent on smgr query handling",
"Time spent on smgr query handling, aggegated by query type and tenant/timeline.",
&["smgr_query_type", "tenant_id", "timeline_id"],
CRITICAL_OP_BUCKETS.into(),
)
.expect("failed to define a metric")
});
static SMGR_QUERY_TIME_GLOBAL: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_smgr_query_seconds_global",
"Time spent on smgr query handling, aggregated by query type.",
&["smgr_query_type"],
CRITICAL_OP_BUCKETS.into(),
)
.expect("failed to define a metric")
});
impl SmgrQueryTimePerTimeline {
pub(crate) fn new(tenant_id: &TenantId, timeline_id: &TimelineId) -> Self {
let tenant_id = tenant_id.to_string();
let timeline_id = timeline_id.to_string();
let metrics = std::array::from_fn(|i| {
let op = SmgrQueryType::from_repr(i).unwrap();
let global = SMGR_QUERY_TIME_GLOBAL
.get_metric_with_label_values(&[op.into()])
.unwrap();
let per_tenant_timeline = SMGR_QUERY_TIME_PER_TENANT_TIMELINE
.get_metric_with_label_values(&[op.into(), &tenant_id, &timeline_id])
.unwrap();
GlobalAndPerTimelineHistogram {
global,
per_tenant_timeline,
}
});
Self { metrics }
}
pub(crate) fn start_timer(&self, op: SmgrQueryType) -> impl Drop + '_ {
let metric = &self.metrics[op as usize];
GlobalAndPerTimelineHistogramTimer {
h: metric,
start: std::time::Instant::now(),
}
}
}
#[cfg(test)]
mod smgr_query_time_tests {
use strum::IntoEnumIterator;
use utils::id::{TenantId, TimelineId};
// Regression test, we used hard-coded string constants before using an enum.
#[test]
fn op_label_name() {
use super::SmgrQueryType::*;
let expect: [(super::SmgrQueryType, &'static str); 4] = [
(GetRelExists, "get_rel_exists"),
(GetRelSize, "get_rel_size"),
(GetPageAtLsn, "get_page_at_lsn"),
(GetDbSize, "get_db_size"),
];
for (op, expect) in expect {
let actual: &'static str = op.into();
assert_eq!(actual, expect);
}
}
#[test]
fn basic() {
let ops: Vec<_> = super::SmgrQueryType::iter().collect();
for op in &ops {
let tenant_id = TenantId::generate();
let timeline_id = TimelineId::generate();
let metrics = super::SmgrQueryTimePerTimeline::new(&tenant_id, &timeline_id);
let get_counts = || {
let global: u64 = ops
.iter()
.map(|op| metrics.metrics[*op as usize].global.get_sample_count())
.sum();
let per_tenant_timeline: u64 = ops
.iter()
.map(|op| {
metrics.metrics[*op as usize]
.per_tenant_timeline
.get_sample_count()
})
.sum();
(global, per_tenant_timeline)
};
let (pre_global, pre_per_tenant_timeline) = get_counts();
assert_eq!(pre_per_tenant_timeline, 0);
let timer = metrics.start_timer(*op);
drop(timer);
let (post_global, post_per_tenant_timeline) = get_counts();
assert_eq!(post_per_tenant_timeline, 1);
assert!(post_global > pre_global);
}
}
}
// keep in sync with control plane Go code so that we can validate
// compute's basebackup_ms metric with our perspective in the context of SLI/SLO.
static COMPUTE_STARTUP_BUCKETS: Lazy<[f64; 28]> = Lazy::new(|| {
@@ -1045,6 +1182,12 @@ impl Drop for TimelineMetrics {
.write()
.unwrap()
.remove(tenant_id, timeline_id);
// The following metrics are born outside of the TimelineMetrics lifecycle but still
// removed at the end of it. The idea is to have the metrics outlive the
// entity during which they're observed, e.g., the smgr metrics shall
// outlive an individual smgr connection, but not the timeline.
for op in StorageTimeOperation::VARIANTS {
let _ =
STORAGE_TIME_SUM_PER_TIMELINE.remove_label_values(&[op, tenant_id, timeline_id]);
@@ -1056,8 +1199,12 @@ impl Drop for TimelineMetrics {
let _ = STORAGE_IO_SIZE.remove_label_values(&[op, tenant_id, timeline_id]);
}
for op in SMGR_QUERY_TIME_OPERATIONS {
let _ = SMGR_QUERY_TIME.remove_label_values(&[op, tenant_id, timeline_id]);
for op in SmgrQueryType::iter() {
let _ = SMGR_QUERY_TIME_PER_TENANT_TIMELINE.remove_label_values(&[
op.into(),
tenant_id,
timeline_id,
]);
}
}
}

View File

@@ -10,6 +10,42 @@
//! PostgreSQL buffer size, and a Slot struct for each buffer to contain
//! information about what's stored in the buffer.
//!
//! # Types Of Pages
//!
//! [`PageCache`] only supports immutable pages.
//! Hence there is no need to worry about coherency.
//!
//! Two types of pages are supported:
//!
//! * **Materialized pages**, filled & used by page reconstruction
//! * **Immutable File pages**, filled & used by [`crate::tenant::block_io`] and [`crate::tenant::ephemeral_file`].
//!
//! Note that [`crate::tenant::ephemeral_file::EphemeralFile`] is generally mutable, but, it's append-only.
//! It uses the page cache only for the blocks that are already fully written and immutable.
//!
//! # Filling The Page Cache
//!
//! Page cache maps from a cache key to a buffer slot.
//! The cache key uniquely identifies the piece of data that is being cached.
//!
//! The cache key for **materialized pages** is [`TenantId`], [`TimelineId`], [`Key`], and [`Lsn`].
//! Use [`PageCache::memorize_materialized_page`] and [`PageCache::lookup_materialized_page`] for fill & access.
//!
//! The cache key for **immutable file** pages is [`FileId`] and a block number.
//! Users of page cache that wish to page-cache an arbitrary (immutable!) on-disk file do the following:
//! * Have a mechanism to deterministically associate the on-disk file with a [`FileId`].
//! * Get a [`FileId`] using [`next_file_id`].
//! * Use the mechanism to associate the on-disk file with the returned [`FileId`].
//! * Use [`PageCache::read_immutable_buf`] to get a [`ReadBufResult`].
//! * If the page was already cached, it'll be the [`ReadBufResult::Found`] variant that contains
//! a read guard for the page. Just use it.
//! * If the page was not cached, it'll be the [`ReadBufResult::NotFound`] variant that contains
//! a write guard for the page. Fill the page with the contents of the on-disk file.
//! Then call [`PageWriteGuard::mark_valid`] to mark the page as valid.
//! Then try again to [`PageCache::read_immutable_buf`].
//! Unless there's high cache pressure, the page should now be cached.
//! (TODO: allow downgrading the write guard to a read guard to ensure forward progress.)
//!
//! # Locking
//!
//! There are two levels of locking involved: There's one lock for the "mapping"
@@ -40,7 +76,7 @@ use std::{
collections::{hash_map::Entry, HashMap},
convert::TryInto,
sync::{
atomic::{AtomicU8, AtomicUsize, Ordering},
atomic::{AtomicU64, AtomicU8, AtomicUsize, Ordering},
RwLock, RwLockReadGuard, RwLockWriteGuard, TryLockError,
},
};
@@ -85,6 +121,17 @@ pub fn get() -> &'static PageCache {
pub const PAGE_SZ: usize = postgres_ffi::BLCKSZ as usize;
const MAX_USAGE_COUNT: u8 = 5;
/// See module-level comment.
#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
pub struct FileId(u64);
static NEXT_ID: AtomicU64 = AtomicU64::new(1);
/// See module-level comment.
pub fn next_file_id() -> FileId {
FileId(NEXT_ID.fetch_add(1, Ordering::Relaxed))
}
///
/// CacheKey uniquely identifies a "thing" to cache in the page cache.
///
@@ -96,7 +143,7 @@ enum CacheKey {
lsn: Lsn,
},
ImmutableFilePage {
file_id: u64,
file_id: FileId,
blkno: u32,
},
}
@@ -170,7 +217,7 @@ pub struct PageCache {
/// can have a separate mapping map, next to this field.
materialized_page_map: RwLock<HashMap<MaterializedPageHashKey, Vec<Version>>>,
immutable_page_map: RwLock<HashMap<(u64, u32), usize>>,
immutable_page_map: RwLock<HashMap<(FileId, u32), usize>>,
/// The actual buffers with their metadata.
slots: Box<[Slot]>,
@@ -372,14 +419,14 @@ impl PageCache {
// Section 1.2: Public interface functions for working with immutable file pages.
pub fn read_immutable_buf(&self, file_id: u64, blkno: u32) -> anyhow::Result<ReadBufResult> {
pub fn read_immutable_buf(&self, file_id: FileId, blkno: u32) -> anyhow::Result<ReadBufResult> {
let mut cache_key = CacheKey::ImmutableFilePage { file_id, blkno };
self.lock_for_read(&mut cache_key)
}
/// Immediately drop all buffers belonging to given file, without writeback
pub fn drop_buffers_for_immutable(&self, drop_file_id: u64) {
/// Immediately drop all buffers belonging to given file
pub fn drop_buffers_for_immutable(&self, drop_file_id: FileId) {
for slot_idx in 0..self.slots.len() {
let slot = &self.slots[slot_idx];

View File

@@ -50,7 +50,8 @@ use crate::basebackup;
use crate::config::PageServerConf;
use crate::context::{DownloadBehavior, RequestContext};
use crate::import_datadir::import_wal_from_tar;
use crate::metrics::{LIVE_CONNECTIONS_COUNT, SMGR_QUERY_TIME};
use crate::metrics;
use crate::metrics::LIVE_CONNECTIONS_COUNT;
use crate::task_mgr;
use crate::task_mgr::TaskKind;
use crate::tenant;
@@ -306,39 +307,6 @@ async fn page_service_conn_main(
}
}
struct PageRequestMetrics {
get_rel_exists: metrics::Histogram,
get_rel_size: metrics::Histogram,
get_page_at_lsn: metrics::Histogram,
get_db_size: metrics::Histogram,
}
impl PageRequestMetrics {
fn new(tenant_id: &TenantId, timeline_id: &TimelineId) -> Self {
let tenant_id = tenant_id.to_string();
let timeline_id = timeline_id.to_string();
let get_rel_exists =
SMGR_QUERY_TIME.with_label_values(&["get_rel_exists", &tenant_id, &timeline_id]);
let get_rel_size =
SMGR_QUERY_TIME.with_label_values(&["get_rel_size", &tenant_id, &timeline_id]);
let get_page_at_lsn =
SMGR_QUERY_TIME.with_label_values(&["get_page_at_lsn", &tenant_id, &timeline_id]);
let get_db_size =
SMGR_QUERY_TIME.with_label_values(&["get_db_size", &tenant_id, &timeline_id]);
Self {
get_rel_exists,
get_rel_size,
get_page_at_lsn,
get_db_size,
}
}
}
struct PageServerHandler {
_conf: &'static PageServerConf,
broker_client: storage_broker::BrokerClientChannel,
@@ -406,7 +374,7 @@ impl PageServerHandler {
pgb.write_message_noflush(&BeMessage::CopyBothResponse)?;
pgb.flush().await?;
let metrics = PageRequestMetrics::new(&tenant_id, &timeline_id);
let metrics = metrics::SmgrQueryTimePerTimeline::new(&tenant_id, &timeline_id);
loop {
let msg = tokio::select! {
@@ -446,21 +414,21 @@ impl PageServerHandler {
let response = match neon_fe_msg {
PagestreamFeMessage::Exists(req) => {
let _timer = metrics.get_rel_exists.start_timer();
let _timer = metrics.start_timer(metrics::SmgrQueryType::GetRelExists);
self.handle_get_rel_exists_request(&timeline, &req, &ctx)
.await
}
PagestreamFeMessage::Nblocks(req) => {
let _timer = metrics.get_rel_size.start_timer();
let _timer = metrics.start_timer(metrics::SmgrQueryType::GetRelSize);
self.handle_get_nblocks_request(&timeline, &req, &ctx).await
}
PagestreamFeMessage::GetPage(req) => {
let _timer = metrics.get_page_at_lsn.start_timer();
let _timer = metrics.start_timer(metrics::SmgrQueryType::GetPageAtLsn);
self.handle_get_page_at_lsn_request(&timeline, &req, &ctx)
.await
}
PagestreamFeMessage::DbSize(req) => {
let _timer = metrics.get_db_size.start_timer();
let _timer = metrics.start_timer(metrics::SmgrQueryType::GetDbSize);
self.handle_db_size_request(&timeline, &req, &ctx).await
}
};
@@ -984,8 +952,8 @@ where
false
};
metrics::metric_vec_duration::observe_async_block_duration_by_result(
&*crate::metrics::BASEBACKUP_QUERY_TIME,
::metrics::metric_vec_duration::observe_async_block_duration_by_result(
&*metrics::BASEBACKUP_QUERY_TIME,
async move {
self.handle_basebackup_request(
pgb,

View File

@@ -29,6 +29,7 @@ use std::collections::hash_map::Entry;
use std::collections::BTreeSet;
use std::collections::HashMap;
use std::fmt::Debug;
use std::fmt::Display;
use std::fs;
use std::fs::File;
use std::fs::OpenOptions;
@@ -55,6 +56,7 @@ use self::remote_timeline_client::RemoteTimelineClient;
use self::timeline::uninit::TimelineUninitMark;
use self::timeline::uninit::UninitializedTimeline;
use self::timeline::EvictionTaskTenantState;
use self::timeline::TimelineResources;
use crate::config::PageServerConf;
use crate::context::{DownloadBehavior, RequestContext};
use crate::import_datadir;
@@ -70,7 +72,6 @@ use crate::tenant::remote_timeline_client::index::IndexPart;
use crate::tenant::remote_timeline_client::MaybeDeletedIndexPart;
use crate::tenant::storage_layer::DeltaLayer;
use crate::tenant::storage_layer::ImageLayer;
use crate::tenant::storage_layer::Layer;
use crate::InitializationOrder;
use crate::tenant::timeline::delete::DeleteTimelineFlow;
@@ -149,6 +150,14 @@ pub const TENANT_ATTACHING_MARKER_FILENAME: &str = "attaching";
pub const TENANT_DELETED_MARKER_FILE_NAME: &str = "deleted";
/// References to shared objects that are passed into each tenant, such
/// as the shared remote storage client and process initialization state.
#[derive(Clone)]
pub struct TenantSharedResources {
pub broker_client: storage_broker::BrokerClientChannel,
pub remote_storage: Option<GenericRemoteStorage>,
}
///
/// Tenant consists of multiple timelines. Keep them in a hash table.
///
@@ -388,7 +397,7 @@ impl Tenant {
async fn timeline_init_and_sync(
&self,
timeline_id: TimelineId,
remote_client: Option<RemoteTimelineClient>,
resources: TimelineResources,
remote_startup_data: Option<RemoteStartupData>,
local_metadata: Option<TimelineMetadata>,
ancestor: Option<Arc<Timeline>>,
@@ -409,17 +418,57 @@ impl Tenant {
timeline_id,
up_to_date_metadata,
ancestor.clone(),
remote_client,
resources,
init_order,
CreateTimelineCause::Load,
)?;
let new_disk_consistent_lsn = timeline.get_disk_consistent_lsn();
let disk_consistent_lsn = timeline.get_disk_consistent_lsn();
anyhow::ensure!(
new_disk_consistent_lsn.is_valid(),
disk_consistent_lsn.is_valid(),
"Timeline {tenant_id}/{timeline_id} has invalid disk_consistent_lsn"
);
assert_eq!(
disk_consistent_lsn,
up_to_date_metadata.disk_consistent_lsn(),
"these are used interchangeably"
);
// Save the metadata file to local disk.
if !picked_local {
save_metadata(
self.conf,
&tenant_id,
&timeline_id,
up_to_date_metadata,
first_save,
)
.context("save_metadata")?;
}
let index_part = remote_startup_data.as_ref().map(|x| &x.index_part);
if let Some(index_part) = index_part {
timeline
.remote_client
.as_ref()
.unwrap()
.init_upload_queue(index_part)?;
} else if self.remote_storage.is_some() {
// No data on the remote storage, but we have local metadata file. We can end up
// here with timeline_create being interrupted before finishing index part upload.
// By doing what we do here, the index part upload is retried.
// If control plane retries timeline creation in the meantime, the mgmt API handler
// for timeline creation will coalesce on the upload we queue here.
let rtc = timeline.remote_client.as_ref().unwrap();
rtc.init_upload_queue_for_empty_remote(up_to_date_metadata)?;
rtc.schedule_index_upload_for_metadata_update(up_to_date_metadata)?;
}
timeline
.load_layer_map(new_disk_consistent_lsn)
.load_layer_map(
disk_consistent_lsn,
remote_startup_data.map(|x| x.index_part),
)
.await
.with_context(|| {
format!("Failed to load layermap for timeline {tenant_id}/{timeline_id}")
@@ -443,19 +492,6 @@ impl Tenant {
}
};
if self.remote_storage.is_some() {
// Reconcile local state with remote storage, downloading anything that's
// missing locally, and scheduling uploads for anything that's missing
// in remote storage.
timeline
.reconcile_with_remote(
up_to_date_metadata,
remote_startup_data.as_ref().map(|r| &r.index_part),
)
.await
.context("failed to reconcile with remote")?
}
// Sanity check: a timeline should have some content.
anyhow::ensure!(
ancestor.is_some()
@@ -470,18 +506,6 @@ impl Tenant {
"Timeline has no ancestor and no layer files"
);
// Save the metadata file to local disk.
if !picked_local {
save_metadata(
self.conf,
&tenant_id,
&timeline_id,
up_to_date_metadata,
first_save,
)
.context("save_metadata")?;
}
Ok(())
}
@@ -499,6 +523,7 @@ impl Tenant {
conf: &'static PageServerConf,
tenant_id: TenantId,
broker_client: storage_broker::BrokerClientChannel,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
remote_storage: GenericRemoteStorage,
ctx: &RequestContext,
) -> anyhow::Result<Arc<Tenant>> {
@@ -513,7 +538,7 @@ impl Tenant {
tenant_conf,
wal_redo_manager,
tenant_id,
Some(remote_storage),
Some(remote_storage.clone()),
));
// Do all the hard work in the background
@@ -528,17 +553,61 @@ impl Tenant {
"attach tenant",
false,
async move {
// Ideally we should use Tenant::set_broken_no_wait, but it is not supposed to be used when tenant is in loading state.
let make_broken = |t: &Tenant, err: anyhow::Error| {
error!("attach failed, setting tenant state to Broken: {err:?}");
t.state.send_modify(|state| {
assert_eq!(
*state,
TenantState::Attaching,
"the attach task owns the tenant state until activation is complete"
);
*state = TenantState::broken_from_reason(err.to_string());
});
};
let pending_deletion = {
match DeleteTenantFlow::should_resume_deletion(
conf,
Some(&remote_storage),
&tenant_clone,
)
.await
{
Ok(should_resume_deletion) => should_resume_deletion,
Err(err) => {
make_broken(&tenant_clone, anyhow::anyhow!(err));
return Ok(());
}
}
};
info!("pending_deletion {}", pending_deletion.is_some());
if let Some(deletion) = pending_deletion {
match DeleteTenantFlow::resume_from_attach(
deletion,
&tenant_clone,
tenants,
&ctx,
)
.await
{
Err(err) => {
make_broken(&tenant_clone, anyhow::anyhow!(err));
return Ok(());
}
Ok(()) => return Ok(()),
}
}
match tenant_clone.attach(&ctx).await {
Ok(()) => {
info!("attach finished, activating");
tenant_clone.activate(broker_client, None, &ctx);
}
Err(e) => {
error!("attach failed, setting tenant state to Broken: {:?}", e);
tenant_clone.state.send_modify(|state| {
assert_eq!(*state, TenantState::Attaching, "the attach task owns the tenant state until activation is complete");
*state = TenantState::broken_from_reason(e.to_string());
});
make_broken(&tenant_clone, anyhow::anyhow!(e));
}
}
Ok(())
@@ -616,6 +685,9 @@ impl Tenant {
.instrument(info_span!("download_index_part", %timeline_id)),
);
}
let mut timelines_to_resume_deletions = vec![];
// Wait for all the download tasks to complete & collect results.
let mut remote_index_and_client = HashMap::new();
let mut timeline_ancestors = HashMap::new();
@@ -632,9 +704,12 @@ impl Tenant {
);
remote_index_and_client.insert(timeline_id, (index_part, client));
}
MaybeDeletedIndexPart::Deleted(_) => {
info!("timeline {} is deleted, skipping", timeline_id);
continue;
MaybeDeletedIndexPart::Deleted(index_part) => {
info!(
"timeline {} is deleted, picking to resume deletion",
timeline_id
);
timelines_to_resume_deletions.push((timeline_id, index_part, client));
}
}
}
@@ -649,14 +724,41 @@ impl Tenant {
.expect("just put it in above");
// TODO again handle early failure
self.load_remote_timeline(timeline_id, index_part, remote_metadata, remote_client, ctx)
.await
.with_context(|| {
format!(
"failed to load remote timeline {} for tenant {}",
timeline_id, self.tenant_id
)
})?;
self.load_remote_timeline(
timeline_id,
index_part,
remote_metadata,
TimelineResources {
remote_client: Some(remote_client),
},
ctx,
)
.await
.with_context(|| {
format!(
"failed to load remote timeline {} for tenant {}",
timeline_id, self.tenant_id
)
})?;
}
// Walk through deleted timelines, resume deletion
for (timeline_id, index_part, remote_timeline_client) in timelines_to_resume_deletions {
remote_timeline_client
.init_upload_queue_stopped_to_continue_deletion(&index_part)
.context("init queue stopped")
.map_err(LoadLocalTimelineError::ResumeDeletion)?;
DeleteTimelineFlow::resume_deletion(
Arc::clone(self),
timeline_id,
&index_part.parse_metadata().context("parse_metadata")?,
Some(remote_timeline_client),
None,
)
.await
.context("resume_deletion")
.map_err(LoadLocalTimelineError::ResumeDeletion)?;
}
std::fs::remove_file(&marker_file)
@@ -664,7 +766,7 @@ impl Tenant {
crashsafe::fsync(marker_file.parent().expect("marker file has parent dir"))
.context("fsync tenant directory after unlinking attach marker file")?;
utils::failpoint_sleep_millis_async!("attach-before-activate");
crate::failpoint_support::sleep_millis_async!("attach-before-activate");
info!("Done");
@@ -692,7 +794,7 @@ impl Tenant {
timeline_id: TimelineId,
index_part: IndexPart,
remote_metadata: TimelineMetadata,
remote_client: RemoteTimelineClient,
resources: TimelineResources,
ctx: &RequestContext,
) -> anyhow::Result<()> {
span::debug_assert_current_span_has_tenant_id();
@@ -722,7 +824,7 @@ impl Tenant {
self.timeline_init_and_sync(
timeline_id,
Some(remote_client),
resources,
Some(RemoteStartupData {
index_part,
remote_metadata,
@@ -769,8 +871,7 @@ impl Tenant {
pub(crate) fn spawn_load(
conf: &'static PageServerConf,
tenant_id: TenantId,
broker_client: storage_broker::BrokerClientChannel,
remote_storage: Option<GenericRemoteStorage>,
resources: TenantSharedResources,
init_order: Option<InitializationOrder>,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
ctx: &RequestContext,
@@ -785,6 +886,9 @@ impl Tenant {
}
};
let broker_client = resources.broker_client;
let remote_storage = resources.remote_storage;
let wal_redo_manager = Arc::new(PostgresRedoManager::new(conf, tenant_id));
let tenant = Tenant::new(
TenantState::Loading,
@@ -808,6 +912,7 @@ impl Tenant {
"initial tenant load",
false,
async move {
// Ideally we should use Tenant::set_broken_no_wait, but it is not supposed to be used when tenant is in loading state.
let make_broken = |t: &Tenant, err: anyhow::Error| {
error!("load failed, setting tenant state to Broken: {err:?}");
t.state.send_modify(|state| {
@@ -855,7 +960,7 @@ impl Tenant {
.as_mut()
.and_then(|x| x.initial_logical_size_attempt.take());
match DeleteTenantFlow::resume(
match DeleteTenantFlow::resume_from_load(
deletion,
&tenant_clone,
init_order.as_ref(),
@@ -877,7 +982,7 @@ impl Tenant {
match tenant_clone.load(init_order.as_ref(), &ctx).await {
Ok(()) => {
debug!("load finished",);
debug!("load finished");
tenant_clone.activate(broker_client, background_jobs_can_start, &ctx);
}
@@ -1073,7 +1178,7 @@ impl Tenant {
debug!("loading tenant task");
utils::failpoint_sleep_millis_async!("before-loading-tenant");
crate::failpoint_support::sleep_millis_async!("before-loading-tenant");
// Load in-memory state to reflect the local files on disk
//
@@ -1101,8 +1206,9 @@ impl Tenant {
{
match e {
LoadLocalTimelineError::Load(source) => {
return Err(anyhow::anyhow!(source)
.context("Failed to load local timeline: {timeline_id}"))
return Err(anyhow::anyhow!(source)).with_context(|| {
format!("Failed to load local timeline: {timeline_id}")
})
}
LoadLocalTimelineError::ResumeDeletion(source) => {
// Make sure resumed deletion wont fail loading for entire tenant.
@@ -1168,16 +1274,9 @@ impl Tenant {
) -> Result<(), LoadLocalTimelineError> {
span::debug_assert_current_span_has_tenant_id();
let remote_client = self.remote_storage.as_ref().map(|remote_storage| {
RemoteTimelineClient::new(
remote_storage.clone(),
self.conf,
self.tenant_id,
timeline_id,
)
});
let mut resources = self.build_timeline_resources(timeline_id);
let (remote_startup_data, remote_client) = match remote_client {
let (remote_startup_data, remote_client) = match resources.remote_client {
Some(remote_client) => match remote_client.download_index_file().await {
Ok(index_part) => {
let index_part = match index_part {
@@ -1265,9 +1364,10 @@ impl Tenant {
return Ok(());
}
(None, remote_client)
(None, resources.remote_client)
}
};
resources.remote_client = remote_client;
let ancestor = if let Some(ancestor_timeline_id) = local_metadata.ancestor_timeline() {
let ancestor_timeline = self.get_timeline(ancestor_timeline_id, false)
@@ -1280,7 +1380,7 @@ impl Tenant {
self.timeline_init_and_sync(
timeline_id,
remote_client,
resources,
remote_startup_data,
Some(local_metadata),
ancestor,
@@ -1769,7 +1869,7 @@ impl Tenant {
// It's mesed up.
// we just ignore the failure to stop
match self.set_stopping(shutdown_progress, false).await {
match self.set_stopping(shutdown_progress, false, false).await {
Ok(()) => {}
Err(SetStoppingError::Broken) => {
// assume that this is acceptable
@@ -1811,15 +1911,18 @@ impl Tenant {
/// This function is not cancel-safe!
///
/// `allow_transition_from_loading` is needed for the special case of loading task deleting the tenant.
/// `allow_transition_from_attaching` is needed for the special case of attaching deleted tenant.
async fn set_stopping(
&self,
progress: completion::Barrier,
allow_transition_from_loading: bool,
allow_transition_from_attaching: bool,
) -> Result<(), SetStoppingError> {
let mut rx = self.state.subscribe();
// cannot stop before we're done activating, so wait out until we're done activating
rx.wait_for(|state| match state {
TenantState::Attaching if allow_transition_from_attaching => true,
TenantState::Activating(_) | TenantState::Attaching => {
info!(
"waiting for {} to turn Active|Broken|Stopping",
@@ -1836,12 +1939,19 @@ impl Tenant {
// we now know we're done activating, let's see whether this task is the winner to transition into Stopping
let mut err = None;
let stopping = self.state.send_if_modified(|current_state| match current_state {
TenantState::Activating(_) | TenantState::Attaching => {
unreachable!("we ensured above that we're done with activation, and, there is no re-activation")
TenantState::Activating(_) => {
unreachable!("1we ensured above that we're done with activation, and, there is no re-activation")
}
TenantState::Attaching => {
if !allow_transition_from_attaching {
unreachable!("2we ensured above that we're done with activation, and, there is no re-activation")
};
*current_state = TenantState::Stopping { progress };
true
}
TenantState::Loading => {
if !allow_transition_from_loading {
unreachable!("we ensured above that we're done with activation, and, there is no re-activation")
unreachable!("3we ensured above that we're done with activation, and, there is no re-activation")
};
*current_state = TenantState::Stopping { progress };
true
@@ -1917,7 +2027,8 @@ impl Tenant {
self.set_broken_no_wait(reason)
}
pub(crate) fn set_broken_no_wait(&self, reason: String) {
pub(crate) fn set_broken_no_wait(&self, reason: impl Display) {
let reason = reason.to_string();
self.state.send_modify(|current_state| {
match *current_state {
TenantState::Activating(_) | TenantState::Loading | TenantState::Attaching => {
@@ -2141,7 +2252,7 @@ impl Tenant {
new_timeline_id: TimelineId,
new_metadata: &TimelineMetadata,
ancestor: Option<Arc<Timeline>>,
remote_client: Option<RemoteTimelineClient>,
resources: TimelineResources,
init_order: Option<&InitializationOrder>,
cause: CreateTimelineCause,
) -> anyhow::Result<Arc<Timeline>> {
@@ -2170,7 +2281,7 @@ impl Tenant {
new_timeline_id,
self.tenant_id,
Arc::clone(&self.walredo_mgr),
remote_client,
resources,
pg_version,
initial_logical_size_can_start.cloned(),
initial_logical_size_attempt.cloned().flatten(),
@@ -2412,7 +2523,9 @@ impl Tenant {
.refresh_gc_info_internal(target_timeline_id, horizon, pitr, ctx)
.await?;
utils::failpoint_sleep_millis_async!("gc_iteration_internal_after_getting_gc_timelines");
crate::failpoint_support::sleep_millis_async!(
"gc_iteration_internal_after_getting_gc_timelines"
);
// If there is nothing to GC, we don't want any messages in the INFO log.
if !gc_timelines.is_empty() {
@@ -2816,6 +2929,23 @@ impl Tenant {
Ok(timeline)
}
/// Call this before constructing a timeline, to build its required structures
fn build_timeline_resources(&self, timeline_id: TimelineId) -> TimelineResources {
let remote_client = if let Some(remote_storage) = self.remote_storage.as_ref() {
let remote_client = RemoteTimelineClient::new(
remote_storage.clone(),
self.conf,
self.tenant_id,
timeline_id,
);
Some(remote_client)
} else {
None
};
TimelineResources { remote_client }
}
/// Creates intermediate timeline structure and its files.
///
/// An empty layer map is initialized, and new data and WAL can be imported starting
@@ -2832,25 +2962,17 @@ impl Tenant {
) -> anyhow::Result<UninitializedTimeline> {
let tenant_id = self.tenant_id;
let remote_client = if let Some(remote_storage) = self.remote_storage.as_ref() {
let remote_client = RemoteTimelineClient::new(
remote_storage.clone(),
self.conf,
tenant_id,
new_timeline_id,
);
let resources = self.build_timeline_resources(new_timeline_id);
if let Some(remote_client) = &resources.remote_client {
remote_client.init_upload_queue_for_empty_remote(new_metadata)?;
Some(remote_client)
} else {
None
};
}
let timeline_struct = self
.create_timeline_struct(
new_timeline_id,
new_metadata,
ancestor,
remote_client,
resources,
None,
CreateTimelineCause::Load,
)
@@ -3923,6 +4045,31 @@ mod tests {
Ok(())
}
#[tokio::test]
async fn delta_layer_dumping() -> anyhow::Result<()> {
let (tenant, ctx) = TenantHarness::create("test_layer_dumping")?.load().await;
let tline = tenant
.create_test_timeline(TIMELINE_ID, Lsn(0x10), DEFAULT_PG_VERSION, &ctx)
.await?;
make_some_layers(tline.as_ref(), Lsn(0x20)).await?;
let layer_map = tline.layers.read().await;
let level0_deltas = layer_map.layer_map().get_level0_deltas()?;
assert!(!level0_deltas.is_empty());
for delta in level0_deltas {
let delta = layer_map.get_from_desc(&delta);
// Ensure we are dumping a delta layer here
let delta = delta.downcast_delta_layer().unwrap();
delta.dump(false, &ctx).await.unwrap();
delta.dump(true, &ctx).await.unwrap();
}
Ok(())
}
#[tokio::test]
async fn corrupt_metadata() -> anyhow::Result<()> {
const TEST_NAME: &str = "corrupt_metadata";

View File

@@ -21,14 +21,14 @@ where
R: BlockReader,
{
/// Read a blob into a new buffer.
pub fn read_blob(&self, offset: u64) -> Result<Vec<u8>, std::io::Error> {
pub async fn read_blob(&self, offset: u64) -> Result<Vec<u8>, std::io::Error> {
let mut buf = Vec::new();
self.read_blob_into_buf(offset, &mut buf)?;
self.read_blob_into_buf(offset, &mut buf).await?;
Ok(buf)
}
/// Read blob into the given buffer. Any previous contents in the buffer
/// are overwritten.
pub fn read_blob_into_buf(
pub async fn read_blob_into_buf(
&self,
offset: u64,
dstbuf: &mut Vec<u8>,

View File

@@ -2,12 +2,10 @@
//! Low-level Block-oriented I/O functions
//!
use crate::page_cache;
use crate::page_cache::{ReadBufResult, PAGE_SZ};
use crate::page_cache::{self, PageReadGuard, ReadBufResult, PAGE_SZ};
use bytes::Bytes;
use std::ops::{Deref, DerefMut};
use std::os::unix::fs::FileExt;
use std::sync::atomic::AtomicU64;
/// This is implemented by anything that can read 8 kB (PAGE_SZ)
/// blocks, using the page cache
@@ -15,14 +13,12 @@ use std::sync::atomic::AtomicU64;
/// There are currently two implementations: EphemeralFile, and FileBlockReader
/// below.
pub trait BlockReader {
type BlockLease: Deref<Target = [u8; PAGE_SZ]> + 'static;
///
/// Read a block. Returns a "lease" object that can be used to
/// access to the contents of the page. (For the page cache, the
/// lease object represents a lock on the buffer.)
///
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error>;
fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error>;
///
/// Create a new "cursor" for reading from this reader.
@@ -41,13 +37,45 @@ impl<B> BlockReader for &B
where
B: BlockReader,
{
type BlockLease = B::BlockLease;
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> {
fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
(*self).read_blk(blknum)
}
}
/// Reference to an in-memory copy of an immutable on-disk block.
pub enum BlockLease<'a> {
PageReadGuard(PageReadGuard<'static>),
EphemeralFileMutableTail(&'a [u8; PAGE_SZ]),
#[cfg(test)]
Rc(std::rc::Rc<[u8; PAGE_SZ]>),
}
impl From<PageReadGuard<'static>> for BlockLease<'static> {
fn from(value: PageReadGuard<'static>) -> BlockLease<'static> {
BlockLease::PageReadGuard(value)
}
}
#[cfg(test)]
impl<'a> From<std::rc::Rc<[u8; PAGE_SZ]>> for BlockLease<'a> {
fn from(value: std::rc::Rc<[u8; PAGE_SZ]>) -> Self {
BlockLease::Rc(value)
}
}
impl<'a> Deref for BlockLease<'a> {
type Target = [u8; PAGE_SZ];
fn deref(&self) -> &Self::Target {
match self {
BlockLease::PageReadGuard(v) => v.deref(),
BlockLease::EphemeralFileMutableTail(v) => v,
#[cfg(test)]
BlockLease::Rc(v) => v.deref(),
}
}
}
///
/// A "cursor" for efficiently reading multiple pages from a BlockReader
///
@@ -80,11 +108,10 @@ where
BlockCursor { reader }
}
pub fn read_blk(&self, blknum: u32) -> Result<R::BlockLease, std::io::Error> {
pub fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
self.reader.read_blk(blknum)
}
}
static NEXT_ID: AtomicU64 = AtomicU64::new(1);
/// An adapter for reading a (virtual) file using the page cache.
///
@@ -94,7 +121,7 @@ pub struct FileBlockReader<F> {
pub file: F,
/// Unique ID of this file, used as key in the page cache.
file_id: u64,
file_id: page_cache::FileId,
}
impl<F> FileBlockReader<F>
@@ -102,7 +129,7 @@ where
F: FileExt,
{
pub fn new(file: F) -> Self {
let file_id = NEXT_ID.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
let file_id = page_cache::next_file_id();
FileBlockReader { file_id, file }
}
@@ -118,10 +145,7 @@ impl<F> BlockReader for FileBlockReader<F>
where
F: FileExt,
{
type BlockLease = page_cache::PageReadGuard<'static>;
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> {
// Look up the right page
fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
let cache = page_cache::get();
loop {
match cache
@@ -132,7 +156,7 @@ where
format!("Failed to read immutable buf: {e:#}"),
)
})? {
ReadBufResult::Found(guard) => break Ok(guard),
ReadBufResult::Found(guard) => break Ok(guard.into()),
ReadBufResult::NotFound(mut write_guard) => {
// Read the page from disk into the buffer
self.fill_buffer(write_guard.deref_mut(), blknum)?;

View File

@@ -7,10 +7,11 @@ use anyhow::Context;
use pageserver_api::models::TenantState;
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
use tokio::sync::OwnedMutexGuard;
use tokio_util::sync::CancellationToken;
use tracing::{error, info, instrument, warn, Instrument, Span};
use utils::{
completion, crashsafe, fs_ext,
backoff, completion, crashsafe, fs_ext,
id::{TenantId, TimelineId},
};
@@ -23,12 +24,13 @@ use crate::{
use super::{
mgr::{GetTenantError, TenantsMap},
remote_timeline_client::{FAILED_REMOTE_OP_RETRIES, FAILED_UPLOAD_WARN_THRESHOLD},
span,
timeline::delete::DeleteTimelineFlow,
tree_sort_timelines, DeleteTimelineError, Tenant,
};
const SHOULD_RESUME_DELETION_FETCH_MARK_ATTEMPTS: u8 = 3;
const SHOULD_RESUME_DELETION_FETCH_MARK_ATTEMPTS: u32 = 3;
#[derive(Debug, thiserror::Error)]
pub enum DeleteTenantError {
@@ -71,10 +73,21 @@ async fn create_remote_delete_mark(
let remote_mark_path = remote_tenant_delete_mark_path(conf, tenant_id)?;
let data: &[u8] = &[];
remote_storage
.upload(data, 0, &remote_mark_path, None)
.await
.context("mark upload")?;
backoff::retry(
|| async {
remote_storage
.upload(data, 0, &remote_mark_path, None)
.await
},
|_e| false,
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"mark_upload",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
)
.await
.context("mark_upload")?;
Ok(())
}
@@ -154,9 +167,18 @@ async fn remove_tenant_remote_delete_mark(
tenant_id: &TenantId,
) -> Result<(), DeleteTenantError> {
if let Some(remote_storage) = remote_storage {
remote_storage
.delete(&remote_tenant_delete_mark_path(conf, tenant_id)?)
.await?;
let path = remote_tenant_delete_mark_path(conf, tenant_id)?;
backoff::retry(
|| async { remote_storage.delete(&path).await },
|_e| false,
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"remove_tenant_remote_delete_mark",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
)
.await
.context("remove_tenant_remote_delete_mark")?;
}
Ok(())
}
@@ -195,6 +217,19 @@ async fn cleanup_remaining_fs_traces(
))?
});
// Make sure previous deletions are ordered before mark removal.
// Otherwise there is no guarantee that they reach the disk before mark deletion.
// So its possible for mark to reach disk first and for other deletions
// to be reordered later and thus missed if a crash occurs.
// Note that we dont need to sync after mark file is removed
// because we can tolerate the case when mark file reappears on startup.
let tenant_path = &conf.tenant_path(tenant_id);
if tenant_path.exists() {
crashsafe::fsync_async(&conf.tenant_path(tenant_id))
.await
.context("fsync_pre_mark_remove")?;
}
rm(conf.tenant_deleted_mark_file_path(tenant_id), false).await?;
fail::fail_point!("tenant-delete-before-remove-tenant-dir", |_| {
@@ -208,6 +243,32 @@ async fn cleanup_remaining_fs_traces(
Ok(())
}
pub(crate) async fn remote_delete_mark_exists(
conf: &PageServerConf,
tenant_id: &TenantId,
remote_storage: &GenericRemoteStorage,
) -> anyhow::Result<bool> {
// If remote storage is there we rely on it
let remote_mark_path = remote_tenant_delete_mark_path(conf, tenant_id).context("path")?;
let result = backoff::retry(
|| async { remote_storage.download(&remote_mark_path).await },
|e| matches!(e, DownloadError::NotFound),
SHOULD_RESUME_DELETION_FETCH_MARK_ATTEMPTS,
SHOULD_RESUME_DELETION_FETCH_MARK_ATTEMPTS,
"fetch_tenant_deletion_mark",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
)
.await;
match result {
Ok(_) => Ok(true),
Err(DownloadError::NotFound) => Ok(false),
Err(e) => Err(anyhow::anyhow!(e)).context("remote_delete_mark_exists")?,
}
}
/// Orchestrates tenant shut down of all tasks, removes its in-memory structures,
/// and deletes its data from both disk and s3.
/// The sequence of steps:
@@ -221,8 +282,9 @@ async fn cleanup_remaining_fs_traces(
/// It is resumable from any step in case a crash/restart occurs.
/// There are three entrypoints to the process:
/// 1. [`DeleteTenantFlow::run`] this is the main one called by a management api handler.
/// 2. [`DeleteTenantFlow::resume`] is called during restarts when local or remote deletion marks are still there.
/// Note the only other place that messes around timeline delete mark is the `Tenant::spawn_load` function.
/// 2. [`DeleteTenantFlow::resume_from_load`] is called during restarts when local or remote deletion marks are still there.
/// 3. [`DeleteTenantFlow::resume_from_attach`] is called when deletion is resumed tenant is found to be deleted during attach process.
/// Note the only other place that messes around timeline delete mark is the `Tenant::spawn_load` function.
#[derive(Default)]
pub enum DeleteTenantFlow {
#[default]
@@ -337,35 +399,19 @@ impl DeleteTenantFlow {
return Ok(acquire(tenant));
}
// If remote storage is there we rely on it
if let Some(remote_storage) = remote_storage {
let remote_mark_path = remote_tenant_delete_mark_path(conf, &tenant_id)?;
let remote_storage = match remote_storage {
Some(remote_storage) => remote_storage,
None => return Ok(None),
};
let attempt = 1;
loop {
match remote_storage.download(&remote_mark_path).await {
Ok(_) => return Ok(acquire(tenant)),
Err(e) => {
if matches!(e, DownloadError::NotFound) {
return Ok(None);
}
if attempt > SHOULD_RESUME_DELETION_FETCH_MARK_ATTEMPTS {
return Err(anyhow::anyhow!(e))?;
}
warn!(
"failed to fetch tenant deletion mark at {} attempt {}",
&remote_mark_path, attempt
)
}
}
}
if remote_delete_mark_exists(conf, &tenant_id, remote_storage).await? {
Ok(acquire(tenant))
} else {
Ok(None)
}
Ok(None)
}
pub(crate) async fn resume(
pub(crate) async fn resume_from_load(
guard: DeletionGuard,
tenant: &Arc<Tenant>,
init_order: Option<&InitializationOrder>,
@@ -375,7 +421,7 @@ impl DeleteTenantFlow {
let (_, progress) = completion::channel();
tenant
.set_stopping(progress, true)
.set_stopping(progress, true, false)
.await
.expect("cant be stopping or broken");
@@ -403,6 +449,31 @@ impl DeleteTenantFlow {
.await
}
pub(crate) async fn resume_from_attach(
guard: DeletionGuard,
tenant: &Arc<Tenant>,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
ctx: &RequestContext,
) -> Result<(), DeleteTenantError> {
let (_, progress) = completion::channel();
tenant
.set_stopping(progress, false, true)
.await
.expect("cant be stopping or broken");
tenant.attach(ctx).await.context("attach")?;
Self::background(
guard,
tenant.conf,
tenant.remote_storage.clone(),
tenants,
tenant,
)
.await
}
async fn prepare(
tenants: &tokio::sync::RwLock<TenantsMap>,
tenant_id: TenantId,

View File

@@ -685,6 +685,7 @@ impl<const L: usize> BuildNode<L> {
#[cfg(test)]
mod tests {
use super::*;
use crate::tenant::block_io::BlockLease;
use rand::Rng;
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicUsize, Ordering};
@@ -699,12 +700,10 @@ mod tests {
}
}
impl BlockReader for TestDisk {
type BlockLease = std::rc::Rc<[u8; PAGE_SZ]>;
fn read_blk(&self, blknum: u32) -> io::Result<Self::BlockLease> {
fn read_blk(&self, blknum: u32) -> io::Result<BlockLease> {
let mut buf = [0u8; PAGE_SZ];
buf.copy_from_slice(&self.blocks[blknum as usize]);
Ok(std::rc::Rc::new(buf))
Ok(std::rc::Rc::new(buf).into())
}
}
impl BlockWriter for &mut TestDisk {

View File

@@ -2,49 +2,31 @@
//! used to keep in-memory layers spilled on disk.
use crate::config::PageServerConf;
use crate::page_cache::PAGE_SZ;
use crate::page_cache::{self, PAGE_SZ};
use crate::tenant::blob_io::BlobWriter;
use crate::tenant::block_io::BlockReader;
use crate::tenant::block_io::{BlockLease, BlockReader};
use crate::virtual_file::VirtualFile;
use once_cell::sync::Lazy;
use std::cmp::min;
use std::collections::HashMap;
use std::fs::OpenOptions;
use std::io::{self};
use std::io::{self, ErrorKind};
use std::ops::DerefMut;
use std::os::unix::prelude::FileExt;
use std::path::PathBuf;
use std::sync::{Arc, RwLock};
use std::sync::atomic::AtomicU64;
use tracing::*;
use utils::id::{TenantId, TimelineId};
use std::os::unix::fs::FileExt;
mod buffer_pool;
mod dirty_buffer;
///
/// This is the global cache of file descriptors (File objects).
///
static EPHEMERAL_FILES: Lazy<RwLock<EphemeralFiles>> = Lazy::new(|| {
RwLock::new(EphemeralFiles {
next_file_id: 1,
files: HashMap::new(),
})
});
pub struct EphemeralFiles {
next_file_id: u64,
files: HashMap<u64, Arc<VirtualFile>>,
}
pub struct EphemeralFile {
file_id: u64,
page_cache_file_id: page_cache::FileId,
_tenant_id: TenantId,
_timeline_id: TimelineId,
file: Arc<VirtualFile>,
pub size: u64,
file: VirtualFile,
len: u64,
/// An ephemeral file is append-only.
/// We keep the last page, which can still be modified, in [`Self::mutable_tail`].
/// The other pages, which can no longer be modified, are accessed through the page cache.
mutable_tail: [u8; PAGE_SZ],
}
impl EphemeralFile {
@@ -53,57 +35,31 @@ impl EphemeralFile {
tenant_id: TenantId,
timeline_id: TimelineId,
) -> Result<EphemeralFile, io::Error> {
let mut l = EPHEMERAL_FILES.write().unwrap();
let file_id = l.next_file_id;
l.next_file_id += 1;
static NEXT_FILENAME: AtomicU64 = AtomicU64::new(1);
let filename_disambiguator =
NEXT_FILENAME.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
let filename = conf
.timeline_path(&tenant_id, &timeline_id)
.join(PathBuf::from(format!("ephemeral-{}", file_id)));
.join(PathBuf::from(format!("ephemeral-{filename_disambiguator}")));
let file = VirtualFile::open_with_options(
&filename,
OpenOptions::new().read(true).write(true).create(true),
)?;
let file_rc = Arc::new(file);
l.files.insert(file_id, file_rc.clone());
Ok(EphemeralFile {
file_id,
page_cache_file_id: page_cache::next_file_id(),
_tenant_id: tenant_id,
_timeline_id: timeline_id,
file: file_rc,
size: 0,
file,
len: 0,
mutable_tail: [0u8; PAGE_SZ],
})
}
fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), io::Error> {
let mut off = 0;
while off < PAGE_SZ {
let n = self
.file
.read_at(&mut buf[off..], blkno as u64 * PAGE_SZ as u64 + off as u64)?;
if n == 0 {
// Reached EOF. Fill the rest of the buffer with zeros.
const ZERO_BUF: [u8; PAGE_SZ] = [0u8; PAGE_SZ];
buf[off..].copy_from_slice(&ZERO_BUF[off..]);
break;
}
off += n;
}
Ok(())
}
fn get_buf_for_write(&self, blkno: u32) -> Result<dirty_buffer::Buffer, io::Error> {
let pool = buffer_pool::get();
let mut buf = pool.get_buffer();
// Read the page from disk into the buffer
// TODO: if we're overwriting the whole page, no need to read it in first
self.fill_buffer(buf.deref_mut(), blkno)?;
Ok(dirty_buffer::Buffer::new(self, buf, blkno))
pub(crate) fn len(&self) -> u64 {
self.len
}
}
@@ -118,60 +74,109 @@ pub fn is_ephemeral_file(filename: &str) -> bool {
impl BlobWriter for EphemeralFile {
fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, io::Error> {
let pos = self.size;
struct Writer<'a> {
ephemeral_file: &'a mut EphemeralFile,
/// The block to which the next [`push_bytes`] will write.
blknum: u32,
/// The offset inside the block identified by [`blknum`] to which [`push_bytes`] will write.
off: usize,
}
impl<'a> Writer<'a> {
fn new(ephemeral_file: &'a mut EphemeralFile) -> io::Result<Writer<'a>> {
Ok(Writer {
blknum: (ephemeral_file.len / PAGE_SZ as u64) as u32,
off: (ephemeral_file.len % PAGE_SZ as u64) as usize,
ephemeral_file,
})
}
#[inline(always)]
fn push_bytes(&mut self, src: &[u8]) -> Result<(), io::Error> {
let mut src_remaining = src;
while !src_remaining.is_empty() {
let dst_remaining = &mut self.ephemeral_file.mutable_tail[self.off..];
let n = min(dst_remaining.len(), src_remaining.len());
dst_remaining[..n].copy_from_slice(&src_remaining[..n]);
self.off += n;
src_remaining = &src_remaining[n..];
if self.off == PAGE_SZ {
match self.ephemeral_file.file.write_all_at(
&self.ephemeral_file.mutable_tail,
self.blknum as u64 * PAGE_SZ as u64,
) {
Ok(_) => {
// Pre-warm the page cache with what we just wrote.
// This isn't necessary for coherency/correctness, but it's how we've always done it.
let cache = page_cache::get();
match cache.read_immutable_buf(
self.ephemeral_file.page_cache_file_id,
self.blknum,
) {
Ok(page_cache::ReadBufResult::Found(_guard)) => {
// This function takes &mut self, so, it shouldn't be possible to reach this point.
unreachable!("we just wrote blknum {} and this function takes &mut self, so, no concurrent read_blk is possible", self.blknum);
}
Ok(page_cache::ReadBufResult::NotFound(mut write_guard)) => {
let buf: &mut [u8] = write_guard.deref_mut();
debug_assert_eq!(buf.len(), PAGE_SZ);
buf.copy_from_slice(&self.ephemeral_file.mutable_tail);
write_guard.mark_valid();
// pre-warm successful
}
Err(e) => {
error!("ephemeral_file write_blob failed to get immutable buf to pre-warm page cache: {e:?}");
// fail gracefully, it's not the end of the world if we can't pre-warm the cache here
}
}
// Zero the buffer for re-use.
// Zeroing is critical for correcntess because the write_blob code below
// and similarly read_blk expect zeroed pages.
self.ephemeral_file.mutable_tail.fill(0);
// This block is done, move to next one.
self.blknum += 1;
self.off = 0;
}
Err(e) => {
return Err(std::io::Error::new(
ErrorKind::Other,
// order error before path because path is long and error is short
format!(
"ephemeral_file: write_blob: write-back full tail blk #{}: {:#}: {}",
self.blknum,
e,
self.ephemeral_file.file.path.display(),
),
));
}
}
}
}
Ok(())
}
}
let mut blknum = (self.size / PAGE_SZ as u64) as u32;
let mut off = (pos % PAGE_SZ as u64) as usize;
let mut buf = self.get_buf_for_write(blknum)?;
let pos = self.len;
let mut writer = Writer::new(self)?;
// Write the length field
if srcbuf.len() < 0x80 {
buf[off] = srcbuf.len() as u8;
off += 1;
// short one-byte length header
let len_buf = [srcbuf.len() as u8];
writer.push_bytes(&len_buf)?;
} else {
let mut len_buf = u32::to_be_bytes(srcbuf.len() as u32);
len_buf[0] |= 0x80;
let thislen = PAGE_SZ - off;
if thislen < 4 {
// it needs to be split across pages
buf[off..(off + thislen)].copy_from_slice(&len_buf[..thislen]);
blknum += 1;
buf.writeback()?;
buf = self.get_buf_for_write(blknum)?;
buf[0..4 - thislen].copy_from_slice(&len_buf[thislen..]);
off = 4 - thislen;
} else {
buf[off..off + 4].copy_from_slice(&len_buf);
off += 4;
}
writer.push_bytes(&len_buf)?;
}
// Write the payload
let mut buf_remain = srcbuf;
while !buf_remain.is_empty() {
let mut page_remain = PAGE_SZ - off;
if page_remain == 0 {
blknum += 1;
buf.writeback()?;
buf = self.get_buf_for_write(blknum)?;
off = 0;
page_remain = PAGE_SZ;
}
let this_blk_len = min(page_remain, buf_remain.len());
buf[off..(off + this_blk_len)].copy_from_slice(&buf_remain[..this_blk_len]);
off += this_blk_len;
buf_remain = &buf_remain[this_blk_len..];
}
buf.writeback()?;
writer.push_bytes(srcbuf)?;
if srcbuf.len() < 0x80 {
self.size += 1;
self.len += 1;
} else {
self.size += 4;
self.len += 4;
}
self.size += srcbuf.len() as u64;
self.len += srcbuf.len() as u64;
Ok(pos)
}
@@ -179,8 +184,9 @@ impl BlobWriter for EphemeralFile {
impl Drop for EphemeralFile {
fn drop(&mut self) {
// remove entry from the hash map
EPHEMERAL_FILES.write().unwrap().files.remove(&self.file_id);
// drop all pages from page cache
let cache = page_cache::get();
cache.drop_buffers_for_immutable(self.page_cache_file_id);
// unlink the file
let res = std::fs::remove_file(&self.file.path);
@@ -201,14 +207,44 @@ impl Drop for EphemeralFile {
}
impl BlockReader for EphemeralFile {
type BlockLease = buffer_pool::Handle;
fn read_blk(&self, blknum: u32) -> Result<BlockLease, io::Error> {
let flushed_blknums = 0..self.len / PAGE_SZ as u64;
if flushed_blknums.contains(&(blknum as u64)) {
let cache = page_cache::get();
loop {
match cache
.read_immutable_buf(self.page_cache_file_id, blknum)
.map_err(|e| {
std::io::Error::new(
std::io::ErrorKind::Other,
// order path before error because error is anyhow::Error => might have many contexts
format!(
"ephemeral file: read immutable page #{}: {}: {:#}",
blknum,
self.file.path.display(),
e,
),
)
})? {
page_cache::ReadBufResult::Found(guard) => {
return Ok(BlockLease::PageReadGuard(guard))
}
page_cache::ReadBufResult::NotFound(mut write_guard) => {
let buf: &mut [u8] = write_guard.deref_mut();
debug_assert_eq!(buf.len(), PAGE_SZ);
self.file
.read_exact_at(&mut buf[..], blknum as u64 * PAGE_SZ as u64)?;
write_guard.mark_valid();
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, io::Error> {
// Read the page from disk into the buffer
let pool = buffer_pool::get();
let mut buf = pool.get_buffer();
self.fill_buffer(buf.deref_mut(), blknum)?;
Ok(buf)
// Swap for read lock
continue;
}
};
}
} else {
debug_assert_eq!(blknum as u64, self.len / PAGE_SZ as u64);
Ok(BlockLease::EphemeralFileMutableTail(&self.mutable_tail))
}
}
}
@@ -238,17 +274,26 @@ mod tests {
Ok((conf, tenant_id, timeline_id))
}
#[test]
fn test_ephemeral_blobs() -> Result<(), io::Error> {
#[tokio::test]
async fn test_ephemeral_blobs() -> Result<(), io::Error> {
let (conf, tenant_id, timeline_id) = harness("ephemeral_blobs")?;
let mut file = EphemeralFile::create(conf, tenant_id, timeline_id)?;
let pos_foo = file.write_blob(b"foo")?;
assert_eq!(b"foo", file.block_cursor().read_blob(pos_foo)?.as_slice());
assert_eq!(
b"foo",
file.block_cursor().read_blob(pos_foo).await?.as_slice()
);
let pos_bar = file.write_blob(b"bar")?;
assert_eq!(b"foo", file.block_cursor().read_blob(pos_foo)?.as_slice());
assert_eq!(b"bar", file.block_cursor().read_blob(pos_bar)?.as_slice());
assert_eq!(
b"foo",
file.block_cursor().read_blob(pos_foo).await?.as_slice()
);
assert_eq!(
b"bar",
file.block_cursor().read_blob(pos_bar).await?.as_slice()
);
let mut blobs = Vec::new();
for i in 0..10000 {
@@ -265,7 +310,7 @@ mod tests {
let cursor = BlockCursor::new(&file);
for (pos, expected) in blobs {
let actual = cursor.read_blob(pos)?;
let actual = cursor.read_blob(pos).await?;
assert_eq!(actual, expected);
}
@@ -274,7 +319,7 @@ mod tests {
large_data.resize(20000, 0);
thread_rng().fill_bytes(&mut large_data);
let pos_large = file.write_blob(&large_data)?;
let result = file.block_cursor().read_blob(pos_large)?;
let result = file.block_cursor().read_blob(pos_large).await?;
assert_eq!(result, large_data);
Ok(())

View File

@@ -1,66 +0,0 @@
//! Buffer pool for ephemeral file buffers.
//!
//! Currently this is a very simple implementation that just uses `malloc`.
//! But the interface is such that we can switch to a more sophisticated
//! implementation later, e.g., one that caps that amount of memory used.
use std::ops::{Deref, DerefMut};
use crate::page_cache::PAGE_SZ;
pub struct BufferPool;
const POOL: BufferPool = BufferPool;
pub(super) fn get() -> &'static BufferPool {
&POOL
}
impl BufferPool {
/// Get a [`Handle`] to a buffer in the pool.
///
/// The buffer is guaranteed to be zeroed out.
///
/// The implementation may block to wait for buffers to become available,
/// and a future async version of this method may `.await` internally to
/// wait for buffers to become available.
///
/// To avoid deadlocks, a thread/task must get all the buffers it needs
/// with a single call to `get_buffer`. Without this rule, a deadlock
/// can happen. Take for example a buffer pool with 2 buffers X, Y
/// and a program with two threads A and B, each requiring 2 buffers.
/// If A gets X and B gets Y, then both threads will block forever trying
/// to get their second buffer.
pub fn get_buffer(&self) -> Handle {
Handle {
data: vec![0; PAGE_SZ],
}
}
}
pub struct Handle {
data: Vec<u8>,
}
impl std::fmt::Debug for Handle {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("Handle")
.field("data", &self.data.as_ptr())
.finish()
}
}
impl Deref for Handle {
type Target = [u8; PAGE_SZ];
fn deref(&self) -> &Self::Target {
let slice: &[u8] = &self.data[..];
slice.try_into().unwrap()
}
}
impl DerefMut for Handle {
fn deref_mut(&mut self) -> &mut Self::Target {
let slice: &mut [u8] = &mut self.data[..];
slice.try_into().unwrap()
}
}

View File

@@ -1,111 +0,0 @@
//! Newtypes to ensure that dirty buffers are written back to the filesystem before they are dropped.
use std::io::ErrorKind;
use std::ops::Deref;
use std::ops::DerefMut;
use std::os::unix::prelude::FileExt;
use crate::page_cache::PAGE_SZ;
use super::buffer_pool;
use super::EphemeralFile;
pub(super) struct Buffer<'f> {
inner: Inner<'f>,
}
enum Inner<'f> {
Dirty {
ephemeral_file: &'f EphemeralFile,
buf: buffer_pool::Handle,
blkno: u32,
},
WritebackOngoing,
WrittenBack,
WritebackError,
Dropped,
}
impl<'f> Buffer<'f> {
pub(super) fn new(
ephemeral_file: &'f EphemeralFile,
buf: buffer_pool::Handle,
blkno: u32,
) -> Self {
Self {
inner: Inner::Dirty {
ephemeral_file,
buf,
blkno,
},
}
}
pub(super) fn writeback(mut self) -> Result<(), std::io::Error> {
let Inner::Dirty {
ephemeral_file,
buf,
blkno,
} = std::mem::replace(&mut self.inner, Inner::WritebackOngoing) else {
unreachable!("writeback consumes");
};
match ephemeral_file
.file
.write_all_at(buf.deref(), blkno as u64 * PAGE_SZ as u64)
{
Ok(_) => {
self.inner = Inner::WrittenBack;
Ok(())
}
Err(e) => {
self.inner = Inner::WritebackError;
Err(std::io::Error::new(
ErrorKind::Other,
format!(
"failed to write back to ephemeral file at {} error: {}",
ephemeral_file.file.path.display(),
e
),
))
}
}
}
}
impl<'f> Deref for Buffer<'f> {
type Target = [u8];
fn deref(&self) -> &[u8] {
match &self.inner {
Inner::Dirty { buf, .. } => &**buf,
Inner::WritebackOngoing => unreachable!("writeback consumes"),
Inner::WrittenBack => unreachable!("writeback consumes"),
Inner::WritebackError => unreachable!("writeback consumes"),
Inner::Dropped => unreachable!(),
}
}
}
impl<'f> DerefMut for Buffer<'f> {
fn deref_mut(&mut self) -> &mut [u8] {
match &mut self.inner {
Inner::Dirty { buf, .. } => &mut **buf,
Inner::WritebackOngoing => unreachable!("writeback consumes"),
Inner::WrittenBack => unreachable!("writeback consumes"),
Inner::WritebackError => unreachable!("writeback consumes"),
Inner::Dropped => unreachable!(),
}
}
}
impl Drop for Buffer<'_> {
fn drop(&mut self) {
let prev = std::mem::replace(&mut self.inner, Inner::Dropped);
match prev {
// TODO: check this at compile time
Inner::Dirty { .. } => panic!("dropped dirty buffer, need to writeback() first"),
Inner::WritebackOngoing => unreachable!("transitory state"),
Inner::WrittenBack | Inner::WritebackError => {}
Inner::Dropped => unreachable!("drop only happens once"),
}
}
}

View File

@@ -50,7 +50,6 @@ use crate::context::RequestContext;
use crate::keyspace::KeyPartitioning;
use crate::repository::Key;
use crate::tenant::storage_layer::InMemoryLayer;
use crate::tenant::storage_layer::Layer;
use anyhow::Result;
use std::collections::VecDeque;
use std::ops::Range;

View File

@@ -29,6 +29,7 @@ use utils::id::{TenantId, TimelineId};
use super::delete::DeleteTenantError;
use super::timeline::delete::DeleteTimelineFlow;
use super::TenantSharedResources;
/// The tenants known to the pageserver.
/// The enum variants are used to distinguish the different states that the pageserver can be in.
@@ -66,8 +67,7 @@ static TENANTS: Lazy<RwLock<TenantsMap>> = Lazy::new(|| RwLock::new(TenantsMap::
#[instrument(skip_all)]
pub async fn init_tenant_mgr(
conf: &'static PageServerConf,
broker_client: storage_broker::BrokerClientChannel,
remote_storage: Option<GenericRemoteStorage>,
resources: TenantSharedResources,
init_order: InitializationOrder,
) -> anyhow::Result<()> {
// Scan local filesystem for attached tenants
@@ -125,8 +125,7 @@ pub async fn init_tenant_mgr(
match schedule_local_tenant_processing(
conf,
&tenant_dir_path,
broker_client.clone(),
remote_storage.clone(),
resources.clone(),
Some(init_order.clone()),
&TENANTS,
&ctx,
@@ -162,8 +161,7 @@ pub async fn init_tenant_mgr(
pub(crate) fn schedule_local_tenant_processing(
conf: &'static PageServerConf,
tenant_path: &Path,
broker_client: storage_broker::BrokerClientChannel,
remote_storage: Option<GenericRemoteStorage>,
resources: TenantSharedResources,
init_order: Option<InitializationOrder>,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
ctx: &RequestContext,
@@ -200,8 +198,15 @@ pub(crate) fn schedule_local_tenant_processing(
let tenant = if conf.tenant_attaching_mark_file_path(&tenant_id).exists() {
info!("tenant {tenant_id} has attaching mark file, resuming its attach operation");
if let Some(remote_storage) = remote_storage {
match Tenant::spawn_attach(conf, tenant_id, broker_client, remote_storage, ctx) {
if let Some(remote_storage) = resources.remote_storage {
match Tenant::spawn_attach(
conf,
tenant_id,
resources.broker_client,
tenants,
remote_storage,
ctx,
) {
Ok(tenant) => tenant,
Err(e) => {
error!("Failed to spawn_attach tenant {tenant_id}, reason: {e:#}");
@@ -219,15 +224,7 @@ pub(crate) fn schedule_local_tenant_processing(
} else {
info!("tenant {tenant_id} is assumed to be loadable, starting load operation");
// Start loading the tenant into memory. It will initially be in Loading state.
Tenant::spawn_load(
conf,
tenant_id,
broker_client,
remote_storage,
init_order,
tenants,
ctx,
)
Tenant::spawn_load(conf, tenant_id, resources, init_order, tenants, ctx)
};
Ok(tenant)
}
@@ -362,8 +359,12 @@ pub async fn create_tenant(
// TODO: tenant directory remains on disk if we bail out from here on.
// See https://github.com/neondatabase/neon/issues/4233
let tenant_resources = TenantSharedResources {
broker_client,
remote_storage,
};
let created_tenant =
schedule_local_tenant_processing(conf, &tenant_directory, broker_client, remote_storage, None, &TENANTS, ctx)?;
schedule_local_tenant_processing(conf, &tenant_directory, tenant_resources, None, &TENANTS, ctx)?;
// TODO: tenant object & its background loops remain, untracked in tenant map, if we fail here.
// See https://github.com/neondatabase/neon/issues/4233
@@ -522,7 +523,11 @@ pub async fn load_tenant(
.with_context(|| format!("Failed to remove tenant ignore mark {tenant_ignore_mark:?} during tenant loading"))?;
}
let new_tenant = schedule_local_tenant_processing(conf, &tenant_path, broker_client, remote_storage, None, &TENANTS, ctx)
let resources = TenantSharedResources {
broker_client,
remote_storage,
};
let new_tenant = schedule_local_tenant_processing(conf, &tenant_path, resources, None, &TENANTS, ctx)
.with_context(|| {
format!("Failed to schedule tenant processing in path {tenant_path:?}")
})?;
@@ -603,7 +608,11 @@ pub async fn attach_tenant(
.context("check for attach marker file existence")?;
anyhow::ensure!(marker_file_exists, "create_tenant_files should have created the attach marker file");
let attached_tenant = schedule_local_tenant_processing(conf, &tenant_dir, broker_client, Some(remote_storage), None, &TENANTS, ctx)?;
let resources = TenantSharedResources {
broker_client,
remote_storage: Some(remote_storage),
};
let attached_tenant = schedule_local_tenant_processing(conf, &tenant_dir, resources, None, &TENANTS, ctx)?;
// TODO: tenant object & its background loops remain, untracked in tenant map, if we fail here.
// See https://github.com/neondatabase/neon/issues/4233

View File

@@ -135,7 +135,7 @@
//! - Initiate upload queue with that [`IndexPart`].
//! - Reschedule all lost operations by comparing the local filesystem state
//! and remote state as per [`IndexPart`]. This is done in
//! [`Tenant::timeline_init_and_sync`] and [`Timeline::reconcile_with_remote`].
//! [`Tenant::timeline_init_and_sync`].
//!
//! Note that if we crash during file deletion between the index update
//! that removes the file from the list of files, and deleting the remote file,
@@ -172,7 +172,6 @@
//! transitioning it from `TenantState::Attaching` to `TenantState::Active` state.
//! This starts the timelines' WAL-receivers and the tenant's GC & Compaction loops.
//!
//! Most of the above steps happen in [`Timeline::reconcile_with_remote`] or its callers.
//! We keep track of the fact that a client is in `Attaching` state in a marker
//! file on the local disk. This is critical because, when we restart the pageserver,
//! we do not want to do the `List timelines` step for each tenant that has already
@@ -192,14 +191,14 @@
//! not created and the uploads are skipped.
//! Theoretically, it should be ok to remove and re-add remote storage configuration to
//! the pageserver config at any time, since it doesn't make a difference to
//! `reconcile_with_remote`.
//! [`Timeline::load_layer_map`].
//! Of course, the remote timeline dir must not change while we have de-configured
//! remote storage, i.e., the pageserver must remain the owner of the given prefix
//! in remote storage.
//! But note that we don't test any of this right now.
//!
//! [`Tenant::timeline_init_and_sync`]: super::Tenant::timeline_init_and_sync
//! [`Timeline::reconcile_with_remote`]: super::Timeline::reconcile_with_remote
//! [`Timeline::load_layer_map`]: super::Timeline::load_layer_map
mod delete;
mod download;
@@ -211,6 +210,10 @@ use chrono::{NaiveDateTime, Utc};
// re-export these
pub use download::{is_temp_download_file, list_remote_timelines};
use scopeguard::ScopeGuard;
use tokio_util::sync::CancellationToken;
use utils::backoff::{
self, exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS,
};
use std::collections::{HashMap, VecDeque};
use std::path::Path;
@@ -219,7 +222,6 @@ use std::sync::{Arc, Mutex};
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
use std::ops::DerefMut;
use tokio::runtime::Runtime;
use tracing::{debug, error, info, instrument, warn};
use tracing::{info_span, Instrument};
use utils::lsn::Lsn;
@@ -229,6 +231,7 @@ use crate::metrics::{
RemoteTimelineClientMetricsCallTrackSize, REMOTE_ONDEMAND_DOWNLOADED_BYTES,
REMOTE_ONDEMAND_DOWNLOADED_LAYERS,
};
use crate::task_mgr::shutdown_token;
use crate::tenant::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::tenant::remote_timeline_client::index::LayerFileMetadata;
use crate::tenant::upload_queue::Delete;
@@ -241,7 +244,6 @@ use crate::{
tenant::upload_queue::{
UploadOp, UploadQueue, UploadQueueInitialized, UploadQueueStopped, UploadTask,
},
{exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS},
};
use utils::id::{TenantId, TimelineId};
@@ -256,12 +258,12 @@ use super::upload_queue::SetDeletedFlagProgress;
// But after FAILED_DOWNLOAD_WARN_THRESHOLD retries, we start to log it at WARN
// level instead, as repeated failures can mean a more serious problem. If it
// fails more than FAILED_DOWNLOAD_RETRIES times, we give up
const FAILED_DOWNLOAD_WARN_THRESHOLD: u32 = 3;
const FAILED_DOWNLOAD_RETRIES: u32 = 10;
pub(crate) const FAILED_DOWNLOAD_WARN_THRESHOLD: u32 = 3;
pub(crate) const FAILED_REMOTE_OP_RETRIES: u32 = 10;
// Similarly log failed uploads and deletions at WARN level, after this many
// retries. Uploads and deletions are retried forever, though.
const FAILED_UPLOAD_WARN_THRESHOLD: u32 = 3;
pub(crate) const FAILED_UPLOAD_WARN_THRESHOLD: u32 = 3;
pub enum MaybeDeletedIndexPart {
IndexPart(IndexPart),
@@ -309,7 +311,7 @@ pub enum PersistIndexPartWithDeletedFlagError {
pub struct RemoteTimelineClient {
conf: &'static PageServerConf,
runtime: &'static Runtime,
runtime: tokio::runtime::Handle,
tenant_id: TenantId,
timeline_id: TimelineId,
@@ -336,7 +338,7 @@ impl RemoteTimelineClient {
) -> RemoteTimelineClient {
RemoteTimelineClient {
conf,
runtime: &BACKGROUND_RUNTIME,
runtime: BACKGROUND_RUNTIME.handle().to_owned(),
tenant_id,
timeline_id,
storage_impl: remote_storage,
@@ -352,6 +354,10 @@ impl RemoteTimelineClient {
let mut upload_queue = self.upload_queue.lock().unwrap();
upload_queue.initialize_with_current_remote_index_part(index_part)?;
self.update_remote_physical_size_gauge(Some(index_part));
info!(
"initialized upload queue from remote index with {} layer files",
index_part.layer_metadata.len()
);
Ok(())
}
@@ -364,6 +370,7 @@ impl RemoteTimelineClient {
let mut upload_queue = self.upload_queue.lock().unwrap();
upload_queue.initialize_empty_remote(local_metadata)?;
self.update_remote_physical_size_gauge(None);
info!("initialized upload queue as empty");
Ok(())
}
@@ -752,12 +759,25 @@ impl RemoteTimelineClient {
pausable_failpoint!("persist_deleted_index_part");
upload::upload_index_part(
self.conf,
&self.storage_impl,
&self.tenant_id,
&self.timeline_id,
&index_part_with_deleted_at,
backoff::retry(
|| {
upload::upload_index_part(
self.conf,
&self.storage_impl,
&self.tenant_id,
&self.timeline_id,
&index_part_with_deleted_at,
)
},
|_e| false,
1,
// have just a couple of attempts
// when executed as part of timeline deletion this happens in context of api call
// when executed as part of tenant deletion this happens in the background
2,
"persist_index_part_with_deleted_flag",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
)
.await?;
@@ -834,10 +854,20 @@ impl RemoteTimelineClient {
let timeline_path = self.conf.timeline_path(&self.tenant_id, &self.timeline_id);
let timeline_storage_path = self.conf.remote_path(&timeline_path)?;
let remaining = self
.storage_impl
.list_prefixes(Some(&timeline_storage_path))
.await?;
let remaining = backoff::retry(
|| async {
self.storage_impl
.list_files(Some(&timeline_storage_path))
.await
},
|_e| false,
FAILED_DOWNLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"list_prefixes",
backoff::Cancel::new(shutdown_token(), || anyhow::anyhow!("Cancelled!")),
)
.await
.context("list prefixes")?;
let remaining: Vec<RemotePath> = remaining
.into_iter()
@@ -852,7 +882,16 @@ impl RemoteTimelineClient {
.collect();
if !remaining.is_empty() {
self.storage_impl.delete_objects(&remaining).await?;
backoff::retry(
|| async { self.storage_impl.delete_objects(&remaining).await },
|_e| false,
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"delete_objects",
backoff::Cancel::new(shutdown_token(), || anyhow::anyhow!("Cancelled!")),
)
.await
.context("delete_objects")?;
}
fail::fail_point!("timeline-delete-before-index-delete", |_| {
@@ -864,7 +903,17 @@ impl RemoteTimelineClient {
let index_file_path = timeline_storage_path.join(Path::new(IndexPart::FILE_NAME));
debug!("deleting index part");
self.storage_impl.delete(&index_file_path).await?;
backoff::retry(
|| async { self.storage_impl.delete(&index_file_path).await },
|_e| false,
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"delete_index",
backoff::Cancel::new(shutdown_token(), || anyhow::anyhow!("Cancelled")),
)
.await
.context("delete_index")?;
fail::fail_point!("timeline-delete-after-index-delete", |_| {
Err(anyhow::anyhow!(
@@ -954,7 +1003,7 @@ impl RemoteTimelineClient {
let tenant_id = self.tenant_id;
let timeline_id = self.timeline_id;
task_mgr::spawn(
self.runtime.handle(),
&self.runtime,
TaskKind::RemoteUploadTask,
Some(self.tenant_id),
Some(self.timeline_id),
@@ -1027,6 +1076,15 @@ impl RemoteTimelineClient {
.await
}
UploadOp::UploadMetadata(ref index_part, _lsn) => {
let mention_having_future_layers = if cfg!(feature = "testing") {
index_part
.layer_metadata
.keys()
.any(|x| x.is_in_future(*_lsn))
} else {
false
};
let res = upload::upload_index_part(
self.conf,
&self.storage_impl,
@@ -1044,6 +1102,10 @@ impl RemoteTimelineClient {
.await;
if res.is_ok() {
self.update_remote_physical_size_gauge(Some(index_part));
if mention_having_future_layers {
// find rationale near crate::tenant::timeline::init::cleanup_future_layer
tracing::info!(disk_consistent_lsn=%_lsn, "uploaded an index_part.json with future layers -- this is ok! if shutdown now, expect future layer cleanup");
}
}
res
}
@@ -1095,14 +1157,13 @@ impl RemoteTimelineClient {
}
// sleep until it's time to retry, or we're cancelled
tokio::select! {
_ = task_mgr::shutdown_watcher() => { },
_ = exponential_backoff(
retries,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
) => { },
};
exponential_backoff(
retries,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
&shutdown_token(),
)
.await;
}
}
}
@@ -1307,7 +1368,7 @@ mod tests {
context::RequestContext,
tenant::{
harness::{TenantHarness, TIMELINE_ID},
Tenant,
Tenant, Timeline,
},
DEFAULT_PG_VERSION,
};
@@ -1316,7 +1377,6 @@ mod tests {
collections::HashSet,
path::{Path, PathBuf},
};
use tokio::runtime::EnterGuard;
use utils::lsn::Lsn;
pub(super) fn dummy_contents(name: &str) -> Vec<u8> {
@@ -1366,35 +1426,25 @@ mod tests {
}
struct TestSetup {
runtime: &'static tokio::runtime::Runtime,
entered_runtime: EnterGuard<'static>,
harness: TenantHarness,
tenant: Arc<Tenant>,
timeline: Arc<Timeline>,
tenant_ctx: RequestContext,
remote_fs_dir: PathBuf,
client: Arc<RemoteTimelineClient>,
}
impl TestSetup {
fn new(test_name: &str) -> anyhow::Result<Self> {
async fn new(test_name: &str) -> anyhow::Result<Self> {
// Use a current-thread runtime in the test
let runtime = Box::leak(Box::new(
tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()?,
));
let entered_runtime = runtime.enter();
let test_name = Box::leak(Box::new(format!("remote_timeline_client__{test_name}")));
let harness = TenantHarness::create(test_name)?;
let (tenant, ctx) = runtime.block_on(harness.load());
let (tenant, ctx) = harness.load().await;
// create an empty timeline directory
let _ = runtime.block_on(tenant.create_test_timeline(
TIMELINE_ID,
Lsn(8),
DEFAULT_PG_VERSION,
&ctx,
))?;
let timeline = tenant
.create_test_timeline(TIMELINE_ID, Lsn(8), DEFAULT_PG_VERSION, &ctx)
.await?;
let remote_fs_dir = harness.conf.workdir.join("remote_fs");
std::fs::create_dir_all(remote_fs_dir)?;
@@ -1416,7 +1466,7 @@ mod tests {
let client = Arc::new(RemoteTimelineClient {
conf: harness.conf,
runtime,
runtime: tokio::runtime::Handle::current(),
tenant_id: harness.tenant_id,
timeline_id: TIMELINE_ID,
storage_impl: storage,
@@ -1428,10 +1478,9 @@ mod tests {
});
Ok(Self {
runtime,
entered_runtime,
harness,
tenant,
timeline,
tenant_ctx: ctx,
remote_fs_dir,
client,
@@ -1440,8 +1489,8 @@ mod tests {
}
// Test scheduling
#[test]
fn upload_scheduling() -> anyhow::Result<()> {
#[tokio::test]
async fn upload_scheduling() {
// Test outline:
//
// Schedule upload of a bunch of layers. Check that they are started immediately, not queued
@@ -1457,25 +1506,26 @@ mod tests {
// Schedule index upload. Check that it's queued
let TestSetup {
runtime,
entered_runtime: _entered_runtime,
harness,
tenant: _tenant,
timeline: _timeline,
tenant_ctx: _tenant_ctx,
remote_fs_dir,
client,
} = TestSetup::new("upload_scheduling").unwrap();
} = TestSetup::new("upload_scheduling").await.unwrap();
let timeline_path = harness.timeline_path(&TIMELINE_ID);
println!("workdir: {}", harness.conf.workdir.display());
let remote_timeline_dir =
remote_fs_dir.join(timeline_path.strip_prefix(&harness.conf.workdir)?);
remote_fs_dir.join(timeline_path.strip_prefix(&harness.conf.workdir).unwrap());
println!("remote_timeline_dir: {}", remote_timeline_dir.display());
let metadata = dummy_metadata(Lsn(0x10));
client.init_upload_queue_for_empty_remote(&metadata)?;
client
.init_upload_queue_for_empty_remote(&metadata)
.unwrap();
// Create a couple of dummy files, schedule upload for them
let layer_file_name_1: LayerFileName = "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap();
@@ -1484,26 +1534,32 @@ mod tests {
let content_1 = dummy_contents("foo");
let content_2 = dummy_contents("bar");
let content_3 = dummy_contents("baz");
std::fs::write(
timeline_path.join(layer_file_name_1.file_name()),
&content_1,
)?;
std::fs::write(
timeline_path.join(layer_file_name_2.file_name()),
&content_2,
)?;
std::fs::write(timeline_path.join(layer_file_name_3.file_name()), content_3)?;
client.schedule_layer_file_upload(
&layer_file_name_1,
&LayerFileMetadata::new(content_1.len() as u64),
)?;
client.schedule_layer_file_upload(
&layer_file_name_2,
&LayerFileMetadata::new(content_2.len() as u64),
)?;
for (filename, content) in [
(&layer_file_name_1, &content_1),
(&layer_file_name_2, &content_2),
(&layer_file_name_3, &content_3),
] {
std::fs::write(timeline_path.join(filename.file_name()), content).unwrap();
}
client
.schedule_layer_file_upload(
&layer_file_name_1,
&LayerFileMetadata::new(content_1.len() as u64),
)
.unwrap();
client
.schedule_layer_file_upload(
&layer_file_name_2,
&LayerFileMetadata::new(content_2.len() as u64),
)
.unwrap();
// Check that they are started immediately, not queued
//
// this works because we running within block_on, so any futures are now queued up until
// our next await point.
{
let mut guard = client.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut().unwrap();
@@ -1517,7 +1573,9 @@ mod tests {
// Schedule upload of index. Check that it is queued
let metadata = dummy_metadata(Lsn(0x20));
client.schedule_index_upload_for_metadata_update(&metadata)?;
client
.schedule_index_upload_for_metadata_update(&metadata)
.unwrap();
{
let mut guard = client.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut().unwrap();
@@ -1526,7 +1584,7 @@ mod tests {
}
// Wait for the uploads to finish
runtime.block_on(client.wait_completion())?;
client.wait_completion().await.unwrap();
{
let mut guard = client.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut().unwrap();
@@ -1536,29 +1594,35 @@ mod tests {
}
// Download back the index.json, and check that the list of files is correct
let index_part = match runtime.block_on(client.download_index_file())? {
let index_part = match client.download_index_file().await.unwrap() {
MaybeDeletedIndexPart::IndexPart(index_part) => index_part,
MaybeDeletedIndexPart::Deleted(_) => panic!("unexpectedly got deleted index part"),
};
assert_file_list(
&index_part.timeline_layers,
&index_part
.layer_metadata
.keys()
.map(|f| f.to_owned())
.collect(),
&[
&layer_file_name_1.file_name(),
&layer_file_name_2.file_name(),
],
);
let downloaded_metadata = index_part.parse_metadata()?;
let downloaded_metadata = index_part.parse_metadata().unwrap();
assert_eq!(downloaded_metadata, metadata);
// Schedule upload and then a deletion. Check that the deletion is queued
let content_baz = dummy_contents("baz");
std::fs::write(timeline_path.join("baz"), &content_baz)?;
client.schedule_layer_file_upload(
&layer_file_name_3,
&LayerFileMetadata::new(content_baz.len() as u64),
)?;
client.schedule_layer_file_deletion(&[layer_file_name_1.clone()])?;
client
.schedule_layer_file_upload(
&layer_file_name_3,
&LayerFileMetadata::new(content_3.len() as u64),
)
.unwrap();
client
.schedule_layer_file_deletion(&[layer_file_name_1.clone()])
.unwrap();
{
let mut guard = client.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut().unwrap();
@@ -1580,7 +1644,7 @@ mod tests {
);
// Finish them
runtime.block_on(client.wait_completion())?;
client.wait_completion().await.unwrap();
assert_remote_files(
&[
@@ -1590,23 +1654,24 @@ mod tests {
],
&remote_timeline_dir,
);
Ok(())
}
#[test]
fn bytes_unfinished_gauge_for_layer_file_uploads() -> anyhow::Result<()> {
#[tokio::test]
async fn bytes_unfinished_gauge_for_layer_file_uploads() {
// Setup
let TestSetup {
runtime,
harness,
tenant: _tenant,
timeline: _timeline,
client,
..
} = TestSetup::new("metrics")?;
} = TestSetup::new("metrics").await.unwrap();
let metadata = dummy_metadata(Lsn(0x10));
client.init_upload_queue_for_empty_remote(&metadata)?;
client
.init_upload_queue_for_empty_remote(&metadata)
.unwrap();
let timeline_path = harness.timeline_path(&TIMELINE_ID);
@@ -1615,7 +1680,8 @@ mod tests {
std::fs::write(
timeline_path.join(layer_file_name_1.file_name()),
&content_1,
)?;
)
.unwrap();
#[derive(Debug, PartialEq)]
struct BytesStartedFinished {
@@ -1641,14 +1707,16 @@ mod tests {
let init = get_bytes_started_stopped();
client.schedule_layer_file_upload(
&layer_file_name_1,
&LayerFileMetadata::new(content_1.len() as u64),
)?;
client
.schedule_layer_file_upload(
&layer_file_name_1,
&LayerFileMetadata::new(content_1.len() as u64),
)
.unwrap();
let pre = get_bytes_started_stopped();
runtime.block_on(client.wait_completion())?;
client.wait_completion().await.unwrap();
let post = get_bytes_started_stopped();
@@ -1676,7 +1744,5 @@ mod tests {
finished: Some(content_1.len())
}
);
Ok(())
}
}

View File

@@ -11,23 +11,18 @@ use std::time::Duration;
use anyhow::{anyhow, Context};
use tokio::fs;
use tokio::io::AsyncWriteExt;
use tracing::{info, warn};
use tokio_util::sync::CancellationToken;
use utils::{backoff, crashsafe};
use crate::config::PageServerConf;
use crate::tenant::storage_layer::LayerFileName;
use crate::tenant::timeline::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::{exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS};
use remote_storage::{DownloadError, GenericRemoteStorage};
use utils::crashsafe::path_with_suffix_extension;
use utils::id::{TenantId, TimelineId};
use super::index::{IndexPart, LayerFileMetadata};
use super::{FAILED_DOWNLOAD_RETRIES, FAILED_DOWNLOAD_WARN_THRESHOLD};
async fn fsync_path(path: impl AsRef<std::path::Path>) -> Result<(), std::io::Error> {
fs::File::open(path).await?.sync_all().await
}
use super::{FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_REMOTE_OP_RETRIES};
static MAX_DOWNLOAD_DURATION: Duration = Duration::from_secs(120);
@@ -152,7 +147,7 @@ pub async fn download_layer_file<'a>(
})
.map_err(DownloadError::Other)?;
fsync_path(&local_path)
crashsafe::fsync_async(&local_path)
.await
.with_context(|| format!("Could not fsync layer file {}", local_path.display(),))
.map_err(DownloadError::Other)?;
@@ -268,7 +263,6 @@ pub(super) async fn download_index_part(
Ok(index_part)
}
///
/// Helper function to handle retries for a download operation.
///
/// Remote operations can fail due to rate limits (IAM, S3), spurious network
@@ -276,47 +270,21 @@ pub(super) async fn download_index_part(
/// with backoff.
///
/// (See similar logic for uploads in `perform_upload_task`)
async fn download_retry<T, O, F>(mut op: O, description: &str) -> Result<T, DownloadError>
async fn download_retry<T, O, F>(op: O, description: &str) -> Result<T, DownloadError>
where
O: FnMut() -> F,
F: Future<Output = Result<T, DownloadError>>,
{
let mut attempts = 0;
loop {
let result = op().await;
match result {
Ok(_) => {
if attempts > 0 {
info!("{description} succeeded after {attempts} retries");
}
return result;
}
// These are "permanent" errors that should not be retried.
Err(DownloadError::BadInput(_)) | Err(DownloadError::NotFound) => {
return result;
}
// Assume that any other failure might be transient, and the operation might
// succeed if we just keep trying.
Err(DownloadError::Other(err)) if attempts < FAILED_DOWNLOAD_WARN_THRESHOLD => {
info!("{description} failed, will retry (attempt {attempts}): {err:#}");
}
Err(DownloadError::Other(err)) if attempts < FAILED_DOWNLOAD_RETRIES => {
warn!("{description} failed, will retry (attempt {attempts}): {err:#}");
}
Err(DownloadError::Other(ref err)) => {
// Operation failed FAILED_DOWNLOAD_RETRIES times. Time to give up.
warn!("{description} still failed after {attempts} retries, giving up: {err:?}");
return result;
}
}
// sleep and retry
exponential_backoff(
attempts,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
)
.await;
attempts += 1;
}
backoff::retry(
op,
|e| matches!(e, DownloadError::BadInput(_) | DownloadError::NotFound),
FAILED_DOWNLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
description,
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || -> DownloadError {
unreachable!()
}),
)
.await
}

View File

@@ -62,10 +62,9 @@ pub struct IndexPart {
#[serde(skip_serializing_if = "Option::is_none")]
pub deleted_at: Option<NaiveDateTime>,
/// Layer names, which are stored on the remote storage.
///
/// Additional metadata can might exist in `layer_metadata`.
pub timeline_layers: HashSet<LayerFileName>,
/// Legacy field: equal to the keys of `layer_metadata`, only written out for forward compat
#[serde(default, skip_deserializing)]
timeline_layers: HashSet<LayerFileName>,
/// Per layer file name metadata, which can be present for a present or missing layer file.
///
@@ -74,9 +73,10 @@ pub struct IndexPart {
pub layer_metadata: HashMap<LayerFileName, IndexLayerMetadata>,
// 'disk_consistent_lsn' is a copy of the 'disk_consistent_lsn' in the metadata.
// It's duplicated here for convenience.
// It's duplicated for convenience when reading the serialized structure, but is
// private because internally we would read from metadata instead.
#[serde_as(as = "DisplayFromStr")]
pub disk_consistent_lsn: Lsn,
disk_consistent_lsn: Lsn,
metadata_bytes: Vec<u8>,
}
@@ -85,7 +85,11 @@ impl IndexPart {
/// used to understand later versions.
///
/// Version is currently informative only.
const LATEST_VERSION: usize = 2;
/// Version history
/// - 2: added `deleted_at`
/// - 3: no longer deserialize `timeline_layers` (serialized format is the same, but timeline_layers
/// is always generated from the keys of `layer_metadata`)
const LATEST_VERSION: usize = 3;
pub const FILE_NAME: &'static str = "index_part.json";
pub fn new(
@@ -166,7 +170,7 @@ mod tests {
let expected = IndexPart {
// note this is not verified, could be anything, but exists for humans debugging.. could be the git version instead?
version: 1,
timeline_layers: HashSet::from(["000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap()]),
timeline_layers: HashSet::new(),
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
file_size: 25600000,
@@ -203,7 +207,7 @@ mod tests {
let expected = IndexPart {
// note this is not verified, could be anything, but exists for humans debugging.. could be the git version instead?
version: 1,
timeline_layers: HashSet::from(["000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap()]),
timeline_layers: HashSet::new(),
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
file_size: 25600000,
@@ -241,7 +245,7 @@ mod tests {
let expected = IndexPart {
// note this is not verified, could be anything, but exists for humans debugging.. could be the git version instead?
version: 2,
timeline_layers: HashSet::from(["000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap()]),
timeline_layers: HashSet::new(),
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
file_size: 25600000,

View File

@@ -8,7 +8,7 @@ mod layer_desc;
mod remote_layer;
use crate::config::PageServerConf;
use crate::context::RequestContext;
use crate::context::{AccessStatsBehavior, RequestContext};
use crate::repository::Key;
use crate::task_mgr::TaskKind;
use crate::walrecord::NeonWalRecord;
@@ -41,8 +41,6 @@ pub use inmemory_layer::InMemoryLayer;
pub use layer_desc::{PersistentLayerDesc, PersistentLayerKey};
pub use remote_layer::RemoteLayer;
use super::timeline::layer_manager::LayerManager;
pub fn range_overlaps<T>(a: &Range<T>, b: &Range<T>) -> bool
where
T: PartialOrd<T>,
@@ -175,16 +173,9 @@ impl LayerAccessStats {
///
/// [`LayerLoad`]: LayerResidenceEventReason::LayerLoad
/// [`record_residence_event`]: Self::record_residence_event
pub(crate) fn for_loading_layer(
layer_map_lock_held_witness: &LayerManager,
status: LayerResidenceStatus,
) -> Self {
pub(crate) fn for_loading_layer(status: LayerResidenceStatus) -> Self {
let new = LayerAccessStats(Mutex::new(LayerAccessStatsLocked::default()));
new.record_residence_event(
layer_map_lock_held_witness,
status,
LayerResidenceEventReason::LayerLoad,
);
new.record_residence_event(status, LayerResidenceEventReason::LayerLoad);
new
}
@@ -197,7 +188,6 @@ impl LayerAccessStats {
/// [`record_residence_event`]: Self::record_residence_event
pub(crate) fn clone_for_residence_change(
&self,
layer_map_lock_held_witness: &LayerManager,
new_status: LayerResidenceStatus,
) -> LayerAccessStats {
let clone = {
@@ -205,11 +195,7 @@ impl LayerAccessStats {
inner.clone()
};
let new = LayerAccessStats(Mutex::new(clone));
new.record_residence_event(
layer_map_lock_held_witness,
new_status,
LayerResidenceEventReason::ResidenceChange,
);
new.record_residence_event(new_status, LayerResidenceEventReason::ResidenceChange);
new
}
@@ -229,7 +215,6 @@ impl LayerAccessStats {
///
pub(crate) fn record_residence_event(
&self,
_layer_map_lock_held_witness: &LayerManager,
status: LayerResidenceStatus,
reason: LayerResidenceEventReason,
) {
@@ -241,10 +226,14 @@ impl LayerAccessStats {
});
}
fn record_access(&self, access_kind: LayerAccessKind, task_kind: TaskKind) {
fn record_access(&self, access_kind: LayerAccessKind, ctx: &RequestContext) {
if ctx.access_stats_behavior() == AccessStatsBehavior::Skip {
return;
}
let this_access = LayerAccessStatFullDetails {
when: SystemTime::now(),
task_kind,
task_kind: ctx.task_kind(),
access_kind,
};
@@ -252,7 +241,7 @@ impl LayerAccessStats {
locked.iter_mut().for_each(|inner| {
inner.first_access.get_or_insert(this_access);
inner.count_by_access_kind[access_kind] += 1;
inner.task_kind_flag |= task_kind;
inner.task_kind_flag |= ctx.task_kind();
inner.last_accesses.write(this_access);
})
}
@@ -340,23 +329,6 @@ impl LayerAccessStats {
/// are used in (timeline).
#[async_trait::async_trait]
pub trait Layer: std::fmt::Debug + std::fmt::Display + Send + Sync + 'static {
/// Range of keys that this layer covers
fn get_key_range(&self) -> Range<Key>;
/// Inclusive start bound of the LSN range that this layer holds
/// Exclusive end bound of the LSN range that this layer holds.
///
/// - For an open in-memory layer, this is MAX_LSN.
/// - For a frozen in-memory layer or a delta layer, this is a valid end bound.
/// - An image layer represents snapshot at one LSN, so end_lsn is always the snapshot LSN + 1
fn get_lsn_range(&self) -> Range<Lsn>;
/// Does this layer only contain some data for the key-range (incremental),
/// or does it contain a version of every page? This is important to know
/// for garbage collecting old layers: an incremental layer depends on
/// the previous non-incremental layer.
fn is_incremental(&self) -> bool;
///
/// Return data needed to reconstruct given page at LSN.
///
@@ -376,9 +348,6 @@ pub trait Layer: std::fmt::Debug + std::fmt::Display + Send + Sync + 'static {
reconstruct_data: &mut ValueReconstructState,
ctx: &RequestContext,
) -> Result<ValueReconstructResult>;
/// Dump summary of the contents of the layer to stdout
async fn dump(&self, verbose: bool, ctx: &RequestContext) -> Result<()>;
}
/// Get a layer descriptor from a layer.
@@ -463,7 +432,6 @@ pub mod tests {
TimelineId::from_array([0; 16]),
value.key_range,
value.lsn,
false,
233,
)
}

View File

@@ -29,10 +29,10 @@
//!
use crate::config::PageServerConf;
use crate::context::RequestContext;
use crate::page_cache::{PageReadGuard, PAGE_SZ};
use crate::page_cache::PAGE_SZ;
use crate::repository::{Key, Value, KEY_SIZE};
use crate::tenant::blob_io::{BlobWriter, WriteBlobWriter};
use crate::tenant::block_io::{BlockBuf, BlockCursor, BlockReader, FileBlockReader};
use crate::tenant::block_io::{BlockBuf, BlockCursor, BlockLease, BlockReader, FileBlockReader};
use crate::tenant::disk_btree::{DiskBtreeBuilder, DiskBtreeReader, VisitDirection};
use crate::tenant::storage_layer::{
PersistentLayer, ValueReconstructResult, ValueReconstructState,
@@ -176,10 +176,6 @@ impl DeltaKey {
Lsn(u64::from_be_bytes(self.0[KEY_SIZE..].try_into().unwrap()))
}
fn extract_key_from_buf(buf: &[u8]) -> Key {
Key::from_slice(&buf[..KEY_SIZE])
}
fn extract_lsn_from_buf(buf: &[u8]) -> Lsn {
let mut lsn_buf = [0u8; 8];
lsn_buf.copy_from_slice(&buf[KEY_SIZE..]);
@@ -243,8 +239,54 @@ impl std::fmt::Debug for DeltaLayerInner {
#[async_trait::async_trait]
impl Layer for DeltaLayer {
/// debugging function to print out the contents of the layer
async fn dump(&self, verbose: bool, ctx: &RequestContext) -> Result<()> {
async fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
reconstruct_state: &mut ValueReconstructState,
ctx: &RequestContext,
) -> anyhow::Result<ValueReconstructResult> {
self.get_value_reconstruct_data(key, lsn_range, reconstruct_state, ctx)
.await
}
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
impl std::fmt::Display for DeltaLayer {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}", self.layer_desc().short_id())
}
}
impl AsLayerDesc for DeltaLayer {
fn layer_desc(&self) -> &PersistentLayerDesc {
&self.desc
}
}
impl PersistentLayer for DeltaLayer {
fn downcast_delta_layer(self: Arc<Self>) -> Option<std::sync::Arc<DeltaLayer>> {
Some(self)
}
fn local_path(&self) -> Option<PathBuf> {
self.local_path()
}
fn delete_resident_layer_file(&self) -> Result<()> {
self.delete_resident_layer_file()
}
fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo {
self.info(reset)
}
fn access_stats(&self) -> &LayerAccessStats {
self.access_stats()
}
}
impl DeltaLayer {
pub(crate) async fn dump(&self, verbose: bool, ctx: &RequestContext) -> Result<()> {
println!(
"----- delta layer for ten {} tli {} keys {}-{} lsn {}-{} size {} ----",
self.desc.tenant_id,
@@ -276,52 +318,47 @@ impl Layer for DeltaLayer {
tree_reader.dump().await?;
let cursor = file.block_cursor();
let keys = DeltaLayerInner::load_keys(&Ref(&**inner)).await?;
// A subroutine to dump a single blob
let dump_blob = |blob_ref: BlobRef| -> anyhow::Result<String> {
let buf = cursor.read_blob(blob_ref.pos())?;
let val = Value::des(&buf)?;
let desc = match val {
Value::Image(img) => {
format!(" img {} bytes", img.len())
}
Value::WalRecord(rec) => {
let wal_desc = walrecord::describe_wal_record(&rec)?;
format!(
" rec {} bytes will_init: {} {}",
buf.len(),
rec.will_init(),
wal_desc
)
}
};
Ok(desc)
let dump_blob = |val: ValueRef<_>| -> _ {
async move {
let buf = val.reader.read_blob(val.blob_ref.pos()).await?;
let val = Value::des(&buf)?;
let desc = match val {
Value::Image(img) => {
format!(" img {} bytes", img.len())
}
Value::WalRecord(rec) => {
let wal_desc = walrecord::describe_wal_record(&rec)?;
format!(
" rec {} bytes will_init: {} {}",
buf.len(),
rec.will_init(),
wal_desc
)
}
};
Ok(desc)
}
};
tree_reader
.visit(
&[0u8; DELTA_KEY_SIZE],
VisitDirection::Forwards,
|delta_key, val| {
let blob_ref = BlobRef(val);
let key = DeltaKey::extract_key_from_buf(delta_key);
let lsn = DeltaKey::extract_lsn_from_buf(delta_key);
let desc = match dump_blob(blob_ref) {
Ok(desc) => desc,
Err(err) => format!("ERROR: {}", err),
};
println!(" key {} at {}: {}", key, lsn, desc);
true
},
)
.await?;
for entry in keys {
let DeltaEntry { key, lsn, val, .. } = entry;
let desc = match dump_blob(val).await {
Ok(desc) => desc,
Err(err) => {
let err: anyhow::Error = err;
format!("ERROR: {err}")
}
};
println!(" key {key} at {lsn}: {desc}");
}
Ok(())
}
async fn get_value_reconstruct_data(
pub(crate) async fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
@@ -335,58 +372,24 @@ impl Layer for DeltaLayer {
let inner = self
.load(LayerAccessKind::GetValueReconstructData, ctx)
.await?;
inner
.get_value_reconstruct_data(key, lsn_range, reconstruct_state)
.await
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
fn get_key_range(&self) -> Range<Key> {
self.layer_desc().key_range.clone()
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
fn get_lsn_range(&self) -> Range<Lsn> {
self.layer_desc().lsn_range.clone()
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
fn is_incremental(&self) -> bool {
self.layer_desc().is_incremental
}
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
impl std::fmt::Display for DeltaLayer {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}", self.layer_desc().short_id())
}
}
impl AsLayerDesc for DeltaLayer {
fn layer_desc(&self) -> &PersistentLayerDesc {
&self.desc
}
}
impl PersistentLayer for DeltaLayer {
fn downcast_delta_layer(self: Arc<Self>) -> Option<std::sync::Arc<DeltaLayer>> {
Some(self)
}
fn local_path(&self) -> Option<PathBuf> {
pub(crate) fn local_path(&self) -> Option<PathBuf> {
Some(self.path())
}
fn delete_resident_layer_file(&self) -> Result<()> {
pub(crate) fn delete_resident_layer_file(&self) -> Result<()> {
// delete underlying file
fs::remove_file(self.path())?;
Ok(())
}
fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo {
let layer_file_name = self.filename().file_name();
let lsn_range = self.get_lsn_range();
pub(crate) fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo {
let layer_file_name = self.layer_desc().filename().file_name();
let lsn_range = self.layer_desc().lsn_range.clone();
let access_stats = self.access_stats.as_api_model(reset);
@@ -400,12 +403,10 @@ impl PersistentLayer for DeltaLayer {
}
}
fn access_stats(&self) -> &LayerAccessStats {
pub(crate) fn access_stats(&self) -> &LayerAccessStats {
&self.access_stats
}
}
impl DeltaLayer {
fn path_for(
path_or_conf: &PathOrConf,
tenant_id: &TenantId,
@@ -452,8 +453,7 @@ impl DeltaLayer {
access_kind: LayerAccessKind,
ctx: &RequestContext,
) -> Result<&Arc<DeltaLayerInner>> {
self.access_stats
.record_access(access_kind, ctx.task_kind());
self.access_stats.record_access(access_kind, ctx);
// Quick exit if already loaded
self.inner
.get_or_try_init(|| self.load_inner())
@@ -549,30 +549,20 @@ impl DeltaLayer {
&self.layer_name(),
)
}
/// Obtains all keys and value references stored in the layer
/// Loads all keys stored in the layer. Returns key, lsn, value size and value reference.
///
/// The value can be obtained via the [`ValueRef::load`] function.
pub async fn load_val_refs(
pub(crate) async fn load_keys(
&self,
ctx: &RequestContext,
) -> Result<Vec<(Key, Lsn, ValueRef<Arc<DeltaLayerInner>>)>> {
let inner = self
.load(LayerAccessKind::Iter, ctx)
.await
.context("load delta layer")?;
DeltaLayerInner::load_val_refs(inner)
.await
.context("Layer index is corrupted")
}
/// Loads all keys stored in the layer. Returns key, lsn and value size.
pub async fn load_keys(&self, ctx: &RequestContext) -> Result<Vec<(Key, Lsn, u64)>> {
) -> Result<Vec<DeltaEntry<Ref<&'_ DeltaLayerInner>>>> {
let inner = self
.load(LayerAccessKind::KeyIter, ctx)
.await
.context("load delta layer keys")?;
DeltaLayerInner::load_keys(inner)
let inner = Ref(&**inner);
DeltaLayerInner::load_keys(&inner)
.await
.context("Layer index is corrupted")
}
@@ -711,6 +701,17 @@ impl DeltaLayerWriterInner {
.metadata()
.context("get file metadata to determine size")?;
// 5GB limit for objects without multipart upload (which we don't want to use)
// Make it a little bit below to account for differing GB units
// https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html
const S3_UPLOAD_LIMIT: u64 = 4_500_000_000;
ensure!(
metadata.len() <= S3_UPLOAD_LIMIT,
"Created delta layer file at {} of size {} above limit {S3_UPLOAD_LIMIT}!",
file.path.display(),
metadata.len()
);
// Note: Because we opened the file in write-only mode, we cannot
// reuse the same VirtualFile for reading later. That's why we don't
// set inner.file here. The first read will have to re-open it.
@@ -913,12 +914,15 @@ impl DeltaLayerInner {
let cursor = file.block_cursor();
let mut buf = Vec::new();
for (entry_lsn, pos) in offsets {
cursor.read_blob_into_buf(pos, &mut buf).with_context(|| {
format!(
"Failed to read blob from virtual file {}",
file.file.path.display()
)
})?;
cursor
.read_blob_into_buf(pos, &mut buf)
.await
.with_context(|| {
format!(
"Failed to read blob from virtual file {}",
file.file.path.display()
)
})?;
let val = Value::des(&buf).with_context(|| {
format!(
"Failed to deserialize file blob from virtual file {}",
@@ -952,15 +956,17 @@ impl DeltaLayerInner {
}
}
pub(super) async fn load_val_refs<T: AsRef<DeltaLayerInner> + Clone>(
pub(super) async fn load_keys<T: AsRef<DeltaLayerInner> + Clone>(
this: &T,
) -> Result<Vec<(Key, Lsn, ValueRef<T>)>> {
) -> Result<Vec<DeltaEntry<T>>> {
let dl = this.as_ref();
let file = &dl.file;
let tree_reader =
DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(dl.index_start_blk, dl.index_root_blk, file);
let mut all_offsets = Vec::<(Key, Lsn, ValueRef<T>)>::new();
let mut all_keys: Vec<DeltaEntry<T>> = Vec::new();
tree_reader
.visit(
&[0u8; DELTA_KEY_SIZE],
@@ -971,54 +977,63 @@ impl DeltaLayerInner {
blob_ref: BlobRef(value),
reader: BlockCursor::new(Adapter(this.clone())),
};
all_offsets.push((delta_key.key(), delta_key.lsn(), val_ref));
true
},
)
.await?;
Ok(all_offsets)
}
pub(super) async fn load_keys(&self) -> Result<Vec<(Key, Lsn, u64)>> {
let file = &self.file;
let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
self.index_start_blk,
self.index_root_blk,
file,
);
let mut all_keys: Vec<(Key, Lsn, u64)> = Vec::new();
tree_reader
.visit(
&[0u8; DELTA_KEY_SIZE],
VisitDirection::Forwards,
|key, value| {
let delta_key = DeltaKey::from_slice(key);
let pos = BlobRef(value).pos();
if let Some(last) = all_keys.last_mut() {
if last.0 == delta_key.key() {
return true;
} else {
// subtract offset of new key BLOB and first blob of this key
// to get total size if values associated with this key
let first_pos = last.2;
last.2 = pos - first_pos;
}
// subtract offset of the current and last entries to get the size
// of the value associated with this (key, lsn) tuple
let first_pos = last.size;
last.size = pos - first_pos;
}
all_keys.push((delta_key.key(), delta_key.lsn(), pos));
let entry = DeltaEntry {
key: delta_key.key(),
lsn: delta_key.lsn(),
size: pos,
val: val_ref,
};
all_keys.push(entry);
true
},
)
.await?;
if let Some(last) = all_keys.last_mut() {
// Last key occupies all space till end of layer
last.2 = std::fs::metadata(&file.file.path)?.len() - last.2;
// Last key occupies all space till end of value storage,
// which corresponds to beginning of the index
last.size = dl.index_start_blk as u64 * PAGE_SZ as u64 - last.size;
}
Ok(all_keys)
}
}
/// Cloneable borrow wrapper to make borrows behave like smart pointers.
///
/// Shared references are trivially copyable. This wrapper avoids (confusion) to otherwise attempt
/// cloning DeltaLayerInner.
pub(crate) struct Ref<T>(T);
impl<'a, T> AsRef<T> for Ref<&'a T> {
fn as_ref(&self) -> &T {
self.0
}
}
impl<'a, T> Clone for Ref<&'a T> {
fn clone(&self) -> Self {
*self
}
}
impl<'a, T> Copy for Ref<&'a T> {}
/// A set of data associated with a delta layer key and its value
pub struct DeltaEntry<T: AsRef<DeltaLayerInner>> {
pub key: Key,
pub lsn: Lsn,
/// Size of the stored value
pub size: u64,
/// Reference to the on-disk value
pub val: ValueRef<T>,
}
/// Reference to an on-disk value
pub struct ValueRef<T: AsRef<DeltaLayerInner>> {
blob_ref: BlobRef,
@@ -1027,9 +1042,9 @@ pub struct ValueRef<T: AsRef<DeltaLayerInner>> {
impl<T: AsRef<DeltaLayerInner>> ValueRef<T> {
/// Loads the value from disk
pub fn load(&self) -> Result<Value> {
pub async fn load(&self) -> Result<Value> {
// theoretically we *could* record an access time for each, but it does not really matter
let buf = self.reader.read_blob(self.blob_ref.pos())?;
let buf = self.reader.read_blob(self.blob_ref.pos()).await?;
let val = Value::des(&buf)?;
Ok(val)
}
@@ -1038,9 +1053,7 @@ impl<T: AsRef<DeltaLayerInner>> ValueRef<T> {
struct Adapter<T: AsRef<DeltaLayerInner>>(T);
impl<T: AsRef<DeltaLayerInner>> BlockReader for Adapter<T> {
type BlockLease = PageReadGuard<'static>;
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> {
fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
self.0.as_ref().file.read_blk(blknum)
}
}

View File

@@ -212,9 +212,20 @@ pub enum LayerFileName {
}
impl LayerFileName {
pub fn file_name(&self) -> String {
pub(crate) fn file_name(&self) -> String {
self.to_string()
}
/// Determines if this layer file is considered to be in future meaning we will discard these
/// layers during timeline initialization from the given disk_consistent_lsn.
pub(crate) fn is_in_future(&self, disk_consistent_lsn: Lsn) -> bool {
use LayerFileName::*;
match self {
Image(file_name) if file_name.lsn > disk_consistent_lsn => true,
Delta(file_name) if file_name.lsn_range.end > disk_consistent_lsn + 1 => true,
_ => false,
}
}
}
impl fmt::Display for LayerFileName {
@@ -263,8 +274,8 @@ impl serde::Serialize for LayerFileName {
S: serde::Serializer,
{
match self {
Self::Image(fname) => serializer.serialize_str(&fname.to_string()),
Self::Delta(fname) => serializer.serialize_str(&fname.to_string()),
Self::Image(fname) => serializer.collect_str(fname),
Self::Delta(fname) => serializer.collect_str(fname),
}
}
}

View File

@@ -169,8 +169,52 @@ impl std::fmt::Debug for ImageLayerInner {
#[async_trait::async_trait]
impl Layer for ImageLayer {
/// debugging function to print out the contents of the layer
async fn dump(&self, verbose: bool, ctx: &RequestContext) -> Result<()> {
/// Look up given page in the file
async fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
reconstruct_state: &mut ValueReconstructState,
ctx: &RequestContext,
) -> anyhow::Result<ValueReconstructResult> {
self.get_value_reconstruct_data(key, lsn_range, reconstruct_state, ctx)
.await
}
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
impl std::fmt::Display for ImageLayer {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}", self.layer_desc().short_id())
}
}
impl AsLayerDesc for ImageLayer {
fn layer_desc(&self) -> &PersistentLayerDesc {
&self.desc
}
}
impl PersistentLayer for ImageLayer {
fn local_path(&self) -> Option<PathBuf> {
self.local_path()
}
fn delete_resident_layer_file(&self) -> Result<()> {
self.delete_resident_layer_file()
}
fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo {
self.info(reset)
}
fn access_stats(&self) -> &LayerAccessStats {
self.access_stats()
}
}
impl ImageLayer {
pub(crate) async fn dump(&self, verbose: bool, ctx: &RequestContext) -> Result<()> {
println!(
"----- image layer for ten {} tli {} key {}-{} at {} is_incremental {} size {} ----",
self.desc.tenant_id,
@@ -178,7 +222,7 @@ impl Layer for ImageLayer {
self.desc.key_range.start,
self.desc.key_range.end,
self.lsn,
self.desc.is_incremental,
self.desc.is_incremental(),
self.desc.file_size
);
@@ -203,8 +247,7 @@ impl Layer for ImageLayer {
Ok(())
}
/// Look up given page in the file
async fn get_value_reconstruct_data(
pub(crate) async fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
@@ -225,65 +268,33 @@ impl Layer for ImageLayer {
.with_context(|| format!("read {}", self.path().display()))
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
fn get_key_range(&self) -> Range<Key> {
self.layer_desc().key_range.clone()
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
fn get_lsn_range(&self) -> Range<Lsn> {
self.layer_desc().lsn_range.clone()
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
fn is_incremental(&self) -> bool {
self.layer_desc().is_incremental
}
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
impl std::fmt::Display for ImageLayer {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}", self.layer_desc().short_id())
}
}
impl AsLayerDesc for ImageLayer {
fn layer_desc(&self) -> &PersistentLayerDesc {
&self.desc
}
}
impl PersistentLayer for ImageLayer {
fn local_path(&self) -> Option<PathBuf> {
pub(crate) fn local_path(&self) -> Option<PathBuf> {
Some(self.path())
}
fn delete_resident_layer_file(&self) -> Result<()> {
pub(crate) fn delete_resident_layer_file(&self) -> Result<()> {
// delete underlying file
fs::remove_file(self.path())?;
Ok(())
}
fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo {
let layer_file_name = self.filename().file_name();
let lsn_range = self.get_lsn_range();
pub(crate) fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo {
let layer_file_name = self.layer_desc().filename().file_name();
let lsn_start = self.layer_desc().image_layer_lsn();
HistoricLayerInfo::Image {
layer_file_name,
layer_file_size: self.desc.file_size,
lsn_start: lsn_range.start,
lsn_start,
remote: false,
access_stats: self.access_stats.as_api_model(reset),
}
}
fn access_stats(&self) -> &LayerAccessStats {
pub(crate) fn access_stats(&self) -> &LayerAccessStats {
&self.access_stats
}
}
impl ImageLayer {
fn path_for(
path_or_conf: &PathOrConf,
timeline_id: TimelineId,
@@ -323,8 +334,7 @@ impl ImageLayer {
access_kind: LayerAccessKind,
ctx: &RequestContext,
) -> Result<&ImageLayerInner> {
self.access_stats
.record_access(access_kind, ctx.task_kind());
self.access_stats.record_access(access_kind, ctx);
self.inner
.get_or_try_init(|| self.load_inner())
.await
@@ -372,7 +382,6 @@ impl ImageLayer {
timeline_id,
filename.key_range.clone(),
filename.lsn,
false,
file_size,
), // Now we assume image layer ALWAYS covers the full range. This may change in the future.
lsn: filename.lsn,
@@ -399,7 +408,6 @@ impl ImageLayer {
summary.timeline_id,
summary.key_range,
summary.lsn,
false,
metadata.len(),
), // Now we assume image layer ALWAYS covers the full range. This may change in the future.
lsn: summary.lsn,
@@ -471,6 +479,7 @@ impl ImageLayerInner {
let blob = file
.block_cursor()
.read_blob(offset)
.await
.with_context(|| format!("failed to read value from offset {}", offset))?;
let value = Bytes::from(blob);
@@ -500,7 +509,6 @@ struct ImageLayerWriterInner {
tenant_id: TenantId,
key_range: Range<Key>,
lsn: Lsn,
is_incremental: bool,
blob_writer: WriteBlobWriter<VirtualFile>,
tree: DiskBtreeBuilder<BlockBuf, KEY_SIZE>,
@@ -516,7 +524,6 @@ impl ImageLayerWriterInner {
tenant_id: TenantId,
key_range: &Range<Key>,
lsn: Lsn,
is_incremental: bool,
) -> anyhow::Result<Self> {
// Create the file initially with a temporary filename.
// We'll atomically rename it to the final name when we're done.
@@ -551,7 +558,6 @@ impl ImageLayerWriterInner {
lsn,
tree: tree_builder,
blob_writer,
is_incremental,
};
Ok(writer)
@@ -612,7 +618,6 @@ impl ImageLayerWriterInner {
self.timeline_id,
self.key_range.clone(),
self.lsn,
self.is_incremental, // for now, image layer ALWAYS covers the full range
metadata.len(),
);
@@ -687,7 +692,6 @@ impl ImageLayerWriter {
tenant_id: TenantId,
key_range: &Range<Key>,
lsn: Lsn,
is_incremental: bool,
) -> anyhow::Result<ImageLayerWriter> {
Ok(Self {
inner: Some(ImageLayerWriterInner::new(
@@ -696,7 +700,6 @@ impl ImageLayerWriter {
tenant_id,
key_range,
lsn,
is_incremental,
)?),
})
}

View File

@@ -28,7 +28,7 @@ use utils::{
// while being able to use std::fmt::Write's methods
use std::fmt::Write as _;
use std::ops::Range;
use std::sync::RwLock;
use tokio::sync::RwLock;
use super::{DeltaLayer, DeltaLayerWriter, Layer};
@@ -85,11 +85,11 @@ impl std::fmt::Debug for InMemoryLayerInner {
}
impl InMemoryLayer {
pub fn get_timeline_id(&self) -> TimelineId {
pub(crate) fn get_timeline_id(&self) -> TimelineId {
self.timeline_id
}
pub fn info(&self) -> InMemoryLayerInfo {
pub(crate) fn info(&self) -> InMemoryLayerInfo {
let lsn_start = self.start_lsn;
if let Some(&lsn_end) = self.end_lsn.get() {
@@ -99,33 +99,23 @@ impl InMemoryLayer {
}
}
fn assert_writable(&self) {
pub(crate) fn assert_writable(&self) {
assert!(self.end_lsn.get().is_none());
}
fn end_lsn_or_max(&self) -> Lsn {
pub(crate) fn end_lsn_or_max(&self) -> Lsn {
self.end_lsn.get().copied().unwrap_or(Lsn::MAX)
}
}
#[async_trait::async_trait]
impl Layer for InMemoryLayer {
fn get_key_range(&self) -> Range<Key> {
Key::MIN..Key::MAX
}
fn get_lsn_range(&self) -> Range<Lsn> {
pub(crate) fn get_lsn_range(&self) -> Range<Lsn> {
self.start_lsn..self.end_lsn_or_max()
}
fn is_incremental(&self) -> bool {
// in-memory layer is always considered incremental.
true
}
/// debugging function to print out the contents of the layer
async fn dump(&self, verbose: bool, _ctx: &RequestContext) -> Result<()> {
let inner = self.inner.read().unwrap();
///
/// this is likely completly unused
pub async fn dump(&self, verbose: bool, _ctx: &RequestContext) -> Result<()> {
let inner = self.inner.read().await;
let end_str = self.end_lsn_or_max();
@@ -143,7 +133,7 @@ impl Layer for InMemoryLayer {
for (key, vec_map) in inner.index.iter() {
for (lsn, pos) in vec_map.as_slice() {
let mut desc = String::new();
cursor.read_blob_into_buf(*pos, &mut buf)?;
cursor.read_blob_into_buf(*pos, &mut buf).await?;
let val = Value::des(&buf);
match val {
Ok(Value::Image(img)) => {
@@ -171,7 +161,7 @@ impl Layer for InMemoryLayer {
}
/// Look up given value in the layer.
async fn get_value_reconstruct_data(
pub(crate) async fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
@@ -181,7 +171,7 @@ impl Layer for InMemoryLayer {
ensure!(lsn_range.start >= self.start_lsn);
let mut need_image = true;
let inner = self.inner.read().unwrap();
let inner = self.inner.read().await;
let reader = inner.file.block_cursor();
@@ -189,7 +179,7 @@ impl Layer for InMemoryLayer {
if let Some(vec_map) = inner.index.get(&key) {
let slice = vec_map.slice_range(lsn_range);
for (entry_lsn, pos) in slice.iter().rev() {
let buf = reader.read_blob(*pos)?;
let buf = reader.read_blob(*pos).await?;
let value = Value::des(&buf)?;
match value {
Value::Image(img) => {
@@ -221,6 +211,20 @@ impl Layer for InMemoryLayer {
}
}
#[async_trait::async_trait]
impl Layer for InMemoryLayer {
async fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
reconstruct_data: &mut ValueReconstructState,
ctx: &RequestContext,
) -> Result<ValueReconstructResult> {
self.get_value_reconstruct_data(key, lsn_range, reconstruct_data, ctx)
.await
}
}
impl std::fmt::Display for InMemoryLayer {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
let end_lsn = self.end_lsn_or_max();
@@ -230,11 +234,11 @@ impl std::fmt::Display for InMemoryLayer {
impl InMemoryLayer {
///
/// Get layer size on the disk
/// Get layer size.
///
pub fn size(&self) -> Result<u64> {
let inner = self.inner.read().unwrap();
Ok(inner.file.size)
pub async fn size(&self) -> Result<u64> {
let inner = self.inner.read().await;
Ok(inner.file.len())
}
///
@@ -267,9 +271,9 @@ impl InMemoryLayer {
/// Common subroutine of the public put_wal_record() and put_page_image() functions.
/// Adds the page version to the in-memory tree
pub fn put_value(&self, key: Key, lsn: Lsn, val: &Value) -> Result<()> {
pub async fn put_value(&self, key: Key, lsn: Lsn, val: &Value) -> Result<()> {
trace!("put_value key {} at {}/{}", key, self.timeline_id, lsn);
let mut inner = self.inner.write().unwrap();
let mut inner = self.inner.write().await;
self.assert_writable();
let off = {
@@ -301,8 +305,8 @@ impl InMemoryLayer {
/// Make the layer non-writeable. Only call once.
/// Records the end_lsn for non-dropped layers.
/// `end_lsn` is exclusive
pub fn freeze(&self, end_lsn: Lsn) {
let inner = self.inner.write().unwrap();
pub async fn freeze(&self, end_lsn: Lsn) {
let inner = self.inner.write().await;
assert!(self.start_lsn < end_lsn);
self.end_lsn.set(end_lsn).expect("end_lsn set only once");
@@ -317,7 +321,7 @@ impl InMemoryLayer {
/// Write this frozen in-memory layer to disk.
///
/// Returns a new delta layer with all the same data as this in-memory layer
pub fn write_to_disk(&self) -> Result<DeltaLayer> {
pub(crate) async fn write_to_disk(&self) -> Result<DeltaLayer> {
// Grab the lock in read-mode. We hold it over the I/O, but because this
// layer is not writeable anymore, no one should be trying to acquire the
// write lock on it, so we shouldn't block anyone. There's one exception
@@ -327,7 +331,7 @@ impl InMemoryLayer {
// lock, it will see that it's not writeable anymore and retry, but it
// would have to wait until we release it. That race condition is very
// rare though, so we just accept the potential latency hit for now.
let inner = self.inner.read().unwrap();
let inner = self.inner.read().await;
let end_lsn = *self.end_lsn.get().unwrap();
@@ -350,7 +354,7 @@ impl InMemoryLayer {
let key = **key;
// Write all page versions
for (lsn, pos) in vec_map.as_slice() {
cursor.read_blob_into_buf(*pos, &mut buf)?;
cursor.read_blob_into_buf(*pos, &mut buf).await?;
let will_init = Value::des(&buf)?.will_init();
delta_layer_writer.put_value_bytes(key, *lsn, &buf, will_init)?;
}

View File

@@ -19,16 +19,17 @@ use serde::{Deserialize, Serialize};
pub struct PersistentLayerDesc {
pub tenant_id: TenantId,
pub timeline_id: TimelineId,
/// Range of keys that this layer covers
pub key_range: Range<Key>,
/// For image layer, this is `[lsn, lsn+1)`.
/// Inclusive start, exclusive end of the LSN range that this layer holds.
///
/// - For an open in-memory layer, the end bound is MAX_LSN
/// - For a frozen in-memory layer or a delta layer, the end bound is a valid lsn after the
/// range start
/// - An image layer represents snapshot at one LSN, so end_lsn is always the snapshot LSN + 1
pub lsn_range: Range<Lsn>,
/// Whether this is a delta layer.
/// Whether this is a delta layer, and also, is this incremental.
pub is_delta: bool,
/// Whether this layer only contains page images for part of the keys in the range. In the current implementation, this should
/// always be equal to `is_delta`. If we land the partial image layer PR someday, image layer could also be
/// incremental.
pub is_incremental: bool,
/// File size
pub file_size: u64,
}
@@ -61,7 +62,6 @@ impl PersistentLayerDesc {
key_range,
lsn_range: Lsn(0)..Lsn(1),
is_delta: false,
is_incremental: false,
file_size: 0,
}
}
@@ -71,7 +71,6 @@ impl PersistentLayerDesc {
timeline_id: TimelineId,
key_range: Range<Key>,
lsn: Lsn,
is_incremental: bool,
file_size: u64,
) -> Self {
Self {
@@ -80,7 +79,6 @@ impl PersistentLayerDesc {
key_range,
lsn_range: Self::image_layer_lsn_range(lsn),
is_delta: false,
is_incremental,
file_size,
}
}
@@ -98,7 +96,6 @@ impl PersistentLayerDesc {
key_range,
lsn_range,
is_delta: true,
is_incremental: true,
file_size,
}
}
@@ -164,8 +161,12 @@ impl PersistentLayerDesc {
self.tenant_id
}
/// Does this layer only contain some data for the key-range (incremental),
/// or does it contain a version of every page? This is important to know
/// for garbage collecting old layers: an incremental layer depends on
/// the previous non-incremental layer.
pub fn is_incremental(&self) -> bool {
self.is_incremental
self.is_delta
}
pub fn is_delta(&self) -> bool {
@@ -182,7 +183,7 @@ impl PersistentLayerDesc {
self.lsn_range.start,
self.lsn_range.end,
self.is_delta,
self.is_incremental,
self.is_incremental(),
self.file_size,
);

View File

@@ -60,7 +60,7 @@ impl std::fmt::Debug for RemoteLayer {
f.debug_struct("RemoteLayer")
.field("file_name", &self.desc.filename())
.field("layer_metadata", &self.layer_metadata)
.field("is_incremental", &self.desc.is_incremental)
.field("is_incremental", &self.desc.is_incremental())
.finish()
}
}
@@ -76,39 +76,6 @@ impl Layer for RemoteLayer {
) -> Result<ValueReconstructResult> {
bail!("layer {self} needs to be downloaded");
}
/// debugging function to print out the contents of the layer
async fn dump(&self, _verbose: bool, _ctx: &RequestContext) -> Result<()> {
println!(
"----- remote layer for ten {} tli {} keys {}-{} lsn {}-{} is_delta {} is_incremental {} size {} ----",
self.desc.tenant_id,
self.desc.timeline_id,
self.desc.key_range.start,
self.desc.key_range.end,
self.desc.lsn_range.start,
self.desc.lsn_range.end,
self.desc.is_delta,
self.desc.is_incremental,
self.desc.file_size,
);
Ok(())
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
fn get_key_range(&self) -> Range<Key> {
self.layer_desc().key_range.clone()
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
fn get_lsn_range(&self) -> Range<Lsn> {
self.layer_desc().lsn_range.clone()
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
fn is_incremental(&self) -> bool {
self.layer_desc().is_incremental
}
}
/// Boilerplate to implement the Layer trait, always use layer_desc for persistent layers.
@@ -142,8 +109,8 @@ impl PersistentLayer for RemoteLayer {
}
fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo {
let layer_file_name = self.filename().file_name();
let lsn_range = self.get_lsn_range();
let layer_file_name = self.layer_desc().filename().file_name();
let lsn_range = self.layer_desc().lsn_range.clone();
if self.desc.is_delta {
HistoricLayerInfo::Delta {
@@ -184,7 +151,6 @@ impl RemoteLayer {
timelineid,
fname.key_range.clone(),
fname.lsn,
false,
layer_metadata.file_size(),
),
layer_metadata: layer_metadata.clone(),
@@ -217,9 +183,9 @@ impl RemoteLayer {
}
/// Create a Layer struct representing this layer, after it has been downloaded.
pub fn create_downloaded_layer(
pub(crate) fn create_downloaded_layer(
&self,
layer_map_lock_held_witness: &LayerManager,
_layer_map_lock_held_witness: &LayerManager,
conf: &'static PageServerConf,
file_size: u64,
) -> Arc<dyn PersistentLayer> {
@@ -231,10 +197,8 @@ impl RemoteLayer {
self.desc.tenant_id,
&fname,
file_size,
self.access_stats.clone_for_residence_change(
layer_map_lock_held_witness,
LayerResidenceStatus::Resident,
),
self.access_stats
.clone_for_residence_change(LayerResidenceStatus::Resident),
))
} else {
let fname = self.desc.image_file_name();
@@ -244,10 +208,8 @@ impl RemoteLayer {
self.desc.tenant_id,
&fname,
file_size,
self.access_stats.clone_for_residence_change(
layer_map_lock_held_witness,
LayerResidenceStatus::Resident,
),
self.access_stats
.clone_for_residence_change(LayerResidenceStatus::Resident),
))
}
}

View File

@@ -1,5 +1,6 @@
pub mod delete;
mod eviction_task;
mod init;
pub mod layer_manager;
mod logical_size;
pub mod span;
@@ -27,7 +28,6 @@ use utils::id::TenantTimelineId;
use std::cmp::{max, min, Ordering};
use std::collections::{BinaryHeap, HashMap, HashSet};
use std::fs;
use std::ops::{Deref, Range};
use std::path::{Path, PathBuf};
use std::pin::pin;
@@ -35,15 +35,16 @@ use std::sync::atomic::Ordering as AtomicOrdering;
use std::sync::{Arc, Mutex, RwLock, Weak};
use std::time::{Duration, Instant, SystemTime};
use crate::context::{DownloadBehavior, RequestContext};
use crate::tenant::remote_timeline_client::{self, index::LayerFileMetadata};
use crate::context::{
AccessStatsBehavior, DownloadBehavior, RequestContext, RequestContextBuilder,
};
use crate::tenant::remote_timeline_client::index::LayerFileMetadata;
use crate::tenant::storage_layer::delta_layer::DeltaEntry;
use crate::tenant::storage_layer::{
DeltaFileName, DeltaLayerWriter, ImageFileName, ImageLayerWriter, InMemoryLayer,
LayerAccessStats, LayerFileName, RemoteLayer,
DeltaLayerWriter, ImageLayerWriter, InMemoryLayer, LayerAccessStats, LayerFileName, RemoteLayer,
};
use crate::tenant::timeline::logical_size::CurrentLogicalSize;
use crate::tenant::{
ephemeral_file::is_ephemeral_file,
layer_map::{LayerMap, SearchResult},
metadata::{save_metadata, TimelineMetadata},
par_fsync,
@@ -75,11 +76,10 @@ use utils::{
use crate::page_cache;
use crate::repository::GcResult;
use crate::repository::{Key, Value};
use crate::task_mgr;
use crate::task_mgr::TaskKind;
use crate::walredo::WalRedoManager;
use crate::METADATA_FILE_NAME;
use crate::ZERO_PAGE;
use crate::{is_temporary, task_mgr};
use self::delete::DeleteTimelineFlow;
pub(super) use self::eviction_task::EvictionTaskTenantState;
@@ -92,7 +92,7 @@ use super::config::TenantConf;
use super::remote_timeline_client::index::IndexPart;
use super::remote_timeline_client::RemoteTimelineClient;
use super::storage_layer::{
AsLayerDesc, DeltaLayer, ImageLayer, Layer, LayerAccessStatsReset, PersistentLayerDesc,
AsLayerDesc, DeltaLayer, ImageLayer, LayerAccessStatsReset, PersistentLayerDesc,
};
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
@@ -137,6 +137,12 @@ fn drop_rlock<T>(rlock: tokio::sync::OwnedRwLockReadGuard<T>) {
fn drop_wlock<T>(rlock: tokio::sync::RwLockWriteGuard<'_, T>) {
drop(rlock)
}
/// The outward-facing resources required to build a Timeline
pub struct TimelineResources {
pub remote_client: Option<RemoteTimelineClient>,
}
pub struct Timeline {
conf: &'static PageServerConf,
tenant_conf: Arc<RwLock<TenantConfOpt>>,
@@ -799,10 +805,15 @@ impl Timeline {
.await
{
Ok((partitioning, lsn)) => {
// Disables access_stats updates, so that the files we read remain candidates for eviction after we're done with them
let image_ctx = RequestContextBuilder::extend(ctx)
.access_stats_behavior(AccessStatsBehavior::Skip)
.build();
// 2. Create new image layers for partitions that have been modified
// "enough".
let layer_paths_to_upload = self
.create_image_layers(&partitioning, lsn, false, ctx)
.create_image_layers(&partitioning, lsn, false, &image_ctx)
.await
.map_err(anyhow::Error::from)?;
if let Some(remote_client) = &self.remote_client {
@@ -875,7 +886,7 @@ impl Timeline {
let Some(open_layer) = layers.open_layer.as_ref() else {
return Ok(());
};
open_layer.size()?
open_layer.size().await?
};
let last_freeze_at = self.last_freeze_at.load();
let last_freeze_ts = *(self.last_freeze_ts.read().unwrap());
@@ -1197,7 +1208,7 @@ impl Timeline {
&layer_metadata,
local_layer
.access_stats()
.clone_for_residence_change(layer_mgr, LayerResidenceStatus::Evicted),
.clone_for_residence_change(LayerResidenceStatus::Evicted),
),
LayerFileName::Delta(delta_name) => RemoteLayer::new_delta(
self.tenant_id,
@@ -1206,7 +1217,7 @@ impl Timeline {
&layer_metadata,
local_layer
.access_stats()
.clone_for_residence_change(layer_mgr, LayerResidenceStatus::Evicted),
.clone_for_residence_change(LayerResidenceStatus::Evicted),
),
});
@@ -1366,7 +1377,7 @@ impl Timeline {
timeline_id: TimelineId,
tenant_id: TenantId,
walredo_mgr: Arc<dyn WalRedoManager + Send + Sync>,
remote_client: Option<RemoteTimelineClient>,
resources: TimelineResources,
pg_version: u32,
initial_logical_size_can_start: Option<completion::Barrier>,
initial_logical_size_attempt: Option<completion::Completion>,
@@ -1401,7 +1412,7 @@ impl Timeline {
walredo_mgr,
walreceiver: Mutex::new(None),
remote_client: remote_client.map(Arc::new),
remote_client: resources.remote_client.map(Arc::new),
// initialize in-memory 'last_record_lsn' from 'disk_consistent_lsn'.
last_record_lsn: SeqWait::new(RecordLsn {
@@ -1504,7 +1515,7 @@ impl Timeline {
let layer_flush_start_rx = self.layer_flush_start_tx.subscribe();
let self_clone = Arc::clone(self);
info!("spawning flush loop");
debug!("spawning flush loop");
*flush_loop_state = FlushLoopState::Running {
#[cfg(test)]
expect_initdb_optimization: false,
@@ -1575,9 +1586,7 @@ impl Timeline {
));
}
///
/// Initialize with an empty layer map. Used when creating a new timeline.
///
pub(super) fn init_empty_layer_map(&self, start_lsn: Lsn) {
let mut layers = self.layers.try_write().expect(
"in the context where we call this function, no other task has access to the object",
@@ -1585,10 +1594,16 @@ impl Timeline {
layers.initialize_empty(Lsn(start_lsn.0));
}
///
/// Scan the timeline directory to populate the layer map.
///
pub(super) async fn load_layer_map(&self, disk_consistent_lsn: Lsn) -> anyhow::Result<()> {
/// Scan the timeline directory, cleanup, populate the layer map, and schedule uploads for local-only
/// files.
pub(super) async fn load_layer_map(
&self,
disk_consistent_lsn: Lsn,
index_part: Option<IndexPart>,
) -> anyhow::Result<()> {
use init::{Decision::*, Discovered, FutureLayer};
use LayerFileName::*;
let mut guard = self.layers.write().await;
let timer = self.metrics.load_layer_map_histo.start_timer();
@@ -1596,102 +1611,150 @@ impl Timeline {
// Scan timeline directory and create ImageFileName and DeltaFilename
// structs representing all files on disk
let timeline_path = self.conf.timeline_path(&self.tenant_id, &self.timeline_id);
// total size of layer files in the current timeline directory
let mut total_physical_size = 0;
let (conf, tenant_id, timeline_id) = (self.conf, self.tenant_id, self.timeline_id);
let span = tracing::Span::current();
let mut loaded_layers = Vec::<Arc<dyn PersistentLayer>>::new();
let (loaded_layers, needs_upload, total_physical_size) = tokio::task::spawn_blocking({
move || {
let _g = span.entered();
let discovered = init::scan_timeline_dir(&timeline_path)?;
let mut discovered_layers = Vec::with_capacity(discovered.len());
let mut unrecognized_files = Vec::new();
for direntry in fs::read_dir(timeline_path)? {
let direntry = direntry?;
let direntry_path = direntry.path();
let fname = direntry.file_name();
let fname = fname.to_string_lossy();
let mut path = timeline_path;
if let Some(filename) = ImageFileName::parse_str(&fname) {
// create an ImageLayer struct for each image file.
if filename.lsn > disk_consistent_lsn {
info!(
"found future image layer {} on timeline {} disk_consistent_lsn is {}",
filename, self.timeline_id, disk_consistent_lsn
);
rename_to_backup(&direntry_path)?;
continue;
for discovered in discovered {
let (name, kind) = match discovered {
Discovered::Layer(file_name, file_size) => {
discovered_layers.push((file_name, file_size));
continue;
}
Discovered::Metadata | Discovered::IgnoredBackup => {
continue;
}
Discovered::Unknown(file_name) => {
// we will later error if there are any
unrecognized_files.push(file_name);
continue;
}
Discovered::Ephemeral(name) => (name, "old ephemeral file"),
Discovered::Temporary(name) => (name, "temporary timeline file"),
Discovered::TemporaryDownload(name) => (name, "temporary download"),
};
path.push(name);
init::cleanup(&path, kind)?;
path.pop();
}
let file_size = direntry_path.metadata()?.len();
let stats =
LayerAccessStats::for_loading_layer(&guard, LayerResidenceStatus::Resident);
let layer = ImageLayer::new(
self.conf,
self.timeline_id,
self.tenant_id,
&filename,
file_size,
stats,
);
total_physical_size += file_size;
loaded_layers.push(Arc::new(layer));
} else if let Some(filename) = DeltaFileName::parse_str(&fname) {
// Create a DeltaLayer struct for each delta file.
// The end-LSN is exclusive, while disk_consistent_lsn is
// inclusive. For example, if disk_consistent_lsn is 100, it is
// OK for a delta layer to have end LSN 101, but if the end LSN
// is 102, then it might not have been fully flushed to disk
// before crash.
if filename.lsn_range.end > disk_consistent_lsn + 1 {
info!(
"found future delta layer {} on timeline {} disk_consistent_lsn is {}",
filename, self.timeline_id, disk_consistent_lsn
if !unrecognized_files.is_empty() {
// assume that if there are any there are many many.
let n = unrecognized_files.len();
let first = &unrecognized_files[..n.min(10)];
anyhow::bail!(
"unrecognized files in timeline dir (total {n}), first 10: {first:?}"
);
rename_to_backup(&direntry_path)?;
continue;
}
let file_size = direntry_path.metadata()?.len();
let stats =
LayerAccessStats::for_loading_layer(&guard, LayerResidenceStatus::Resident);
let decided =
init::reconcile(discovered_layers, index_part.as_ref(), disk_consistent_lsn);
let layer = DeltaLayer::new(
self.conf,
self.timeline_id,
self.tenant_id,
&filename,
file_size,
stats,
);
let mut loaded_layers = Vec::new();
let mut needs_upload = Vec::new();
let mut total_physical_size = 0;
total_physical_size += file_size;
loaded_layers.push(Arc::new(layer));
} else if fname == METADATA_FILE_NAME || fname.ends_with(".old") {
// ignore these
} else if remote_timeline_client::is_temp_download_file(&direntry_path) {
info!(
"skipping temp download file, reconcile_with_remote will resume / clean up: {}",
fname
);
} else if is_ephemeral_file(&fname) {
// Delete any old ephemeral files
trace!("deleting old ephemeral file in timeline dir: {}", fname);
fs::remove_file(&direntry_path)?;
} else if is_temporary(&direntry_path) {
info!("removing temp timeline file at {}", direntry_path.display());
fs::remove_file(&direntry_path).with_context(|| {
format!(
"failed to remove temp download file at {}",
direntry_path.display()
)
})?;
} else {
warn!("unrecognized filename in timeline dir: {}", fname);
for (name, decision) in decided {
let decision = match decision {
Ok(UseRemote { local, remote }) => {
path.push(name.file_name());
init::cleanup_local_file_for_remote(&path, &local, &remote)?;
path.pop();
UseRemote { local, remote }
}
Ok(decision) => decision,
Err(FutureLayer { local }) => {
if local.is_some() {
path.push(name.file_name());
init::cleanup_future_layer(&path, name, disk_consistent_lsn)?;
path.pop();
} else {
// we cannot do anything for remote layers, but not continuing to
// process it will leave it out index_part.json as well.
}
//
// we do not currently schedule deletions for these.
continue;
}
};
match &name {
Delta(d) => assert!(d.lsn_range.end <= disk_consistent_lsn + 1),
Image(i) => assert!(i.lsn <= disk_consistent_lsn),
}
let status = match &decision {
UseLocal(_) | NeedsUpload(_) => LayerResidenceStatus::Resident,
Evicted(_) | UseRemote { .. } => LayerResidenceStatus::Evicted,
};
let stats = LayerAccessStats::for_loading_layer(status);
let layer: Arc<dyn PersistentLayer> = match (name, &decision) {
(Delta(d), UseLocal(m) | NeedsUpload(m)) => {
total_physical_size += m.file_size();
Arc::new(DeltaLayer::new(
conf,
timeline_id,
tenant_id,
&d,
m.file_size(),
stats,
))
}
(Image(i), UseLocal(m) | NeedsUpload(m)) => {
total_physical_size += m.file_size();
Arc::new(ImageLayer::new(
conf,
timeline_id,
tenant_id,
&i,
m.file_size(),
stats,
))
}
(Delta(d), Evicted(remote) | UseRemote { remote, .. }) => Arc::new(
RemoteLayer::new_delta(tenant_id, timeline_id, &d, remote, stats),
),
(Image(i), Evicted(remote) | UseRemote { remote, .. }) => Arc::new(
RemoteLayer::new_img(tenant_id, timeline_id, &i, remote, stats),
),
};
if let NeedsUpload(m) = decision {
needs_upload.push((layer.clone(), m));
}
loaded_layers.push(layer);
}
Ok((loaded_layers, needs_upload, total_physical_size))
}
}
})
.await
.map_err(anyhow::Error::new)
.and_then(|x| x)?;
let num_layers = loaded_layers.len();
guard.initialize_local_layers(loaded_layers, Lsn(disk_consistent_lsn.0) + 1);
guard.initialize_local_layers(loaded_layers, disk_consistent_lsn + 1);
if let Some(rtc) = self.remote_client.as_ref() {
for (layer, m) in needs_upload {
rtc.schedule_layer_file_upload(&layer.layer_desc().filename(), &m)?;
}
rtc.schedule_index_upload_for_file_changes()?;
// Tenant::create_timeline will wait for these uploads to happen before returning, or
// on retry.
}
info!(
"loaded layer map with {} layers at {}, total physical size: {}",
@@ -1702,236 +1765,6 @@ impl Timeline {
.set(total_physical_size);
timer.stop_and_record();
Ok(())
}
async fn create_remote_layers(
&self,
index_part: &IndexPart,
local_layers: HashMap<LayerFileName, Arc<dyn PersistentLayer>>,
up_to_date_disk_consistent_lsn: Lsn,
) -> anyhow::Result<HashMap<LayerFileName, Arc<dyn PersistentLayer>>> {
// Are we missing some files that are present in remote storage?
// Create RemoteLayer instances for them.
let mut local_only_layers = local_layers;
// We're holding a layer map lock for a while but this
// method is only called during init so it's fine.
let mut guard = self.layers.write().await;
let mut corrupted_local_layers = Vec::new();
let mut added_remote_layers = Vec::new();
for remote_layer_name in &index_part.timeline_layers {
let local_layer = local_only_layers.remove(remote_layer_name);
let remote_layer_metadata = index_part
.layer_metadata
.get(remote_layer_name)
.map(LayerFileMetadata::from)
.with_context(|| {
format!(
"No remote layer metadata found for layer {}",
remote_layer_name.file_name()
)
})?;
// Is the local layer's size different from the size stored in the
// remote index file?
// If so, rename_to_backup those files & replace their local layer with
// a RemoteLayer in the layer map so that we re-download them on-demand.
if let Some(local_layer) = local_layer {
let local_layer_path = local_layer
.local_path()
.expect("caller must ensure that local_layers only contains local layers");
ensure!(
local_layer_path.exists(),
"every layer from local_layers must exist on disk: {}",
local_layer_path.display()
);
let remote_size = remote_layer_metadata.file_size();
let metadata = local_layer_path.metadata().with_context(|| {
format!(
"get file size of local layer {}",
local_layer_path.display()
)
})?;
let local_size = metadata.len();
if local_size != remote_size {
warn!("removing local file {local_layer_path:?} because it has unexpected length {local_size}; length in remote index is {remote_size}");
if let Err(err) = rename_to_backup(&local_layer_path) {
assert!(local_layer_path.exists(), "we would leave the local_layer without a file if this does not hold: {}", local_layer_path.display());
anyhow::bail!("could not rename file {local_layer_path:?}: {err:?}");
} else {
self.metrics.resident_physical_size_gauge.sub(local_size);
corrupted_local_layers.push(local_layer);
// fall-through to adding the remote layer
}
} else {
debug!(
"layer is present locally and file size matches remote, using it: {}",
local_layer_path.display()
);
continue;
}
}
info!(
"remote layer does not exist locally, creating remote layer: {}",
remote_layer_name.file_name()
);
match remote_layer_name {
LayerFileName::Image(imgfilename) => {
if imgfilename.lsn > up_to_date_disk_consistent_lsn {
info!(
"found future image layer {} on timeline {} remote_consistent_lsn is {}",
imgfilename, self.timeline_id, up_to_date_disk_consistent_lsn
);
continue;
}
let stats =
LayerAccessStats::for_loading_layer(&guard, LayerResidenceStatus::Evicted);
let remote_layer = RemoteLayer::new_img(
self.tenant_id,
self.timeline_id,
imgfilename,
&remote_layer_metadata,
stats,
);
let remote_layer = Arc::new(remote_layer);
added_remote_layers.push(remote_layer);
}
LayerFileName::Delta(deltafilename) => {
// Create a RemoteLayer for the delta file.
// The end-LSN is exclusive, while disk_consistent_lsn is
// inclusive. For example, if disk_consistent_lsn is 100, it is
// OK for a delta layer to have end LSN 101, but if the end LSN
// is 102, then it might not have been fully flushed to disk
// before crash.
if deltafilename.lsn_range.end > up_to_date_disk_consistent_lsn + 1 {
info!(
"found future delta layer {} on timeline {} remote_consistent_lsn is {}",
deltafilename, self.timeline_id, up_to_date_disk_consistent_lsn
);
continue;
}
let stats =
LayerAccessStats::for_loading_layer(&guard, LayerResidenceStatus::Evicted);
let remote_layer = RemoteLayer::new_delta(
self.tenant_id,
self.timeline_id,
deltafilename,
&remote_layer_metadata,
stats,
);
let remote_layer = Arc::new(remote_layer);
added_remote_layers.push(remote_layer);
}
}
}
guard.initialize_remote_layers(corrupted_local_layers, added_remote_layers);
Ok(local_only_layers)
}
/// This function will synchronize local state with what we have in remote storage.
///
/// Steps taken:
/// 1. Initialize upload queue based on `index_part`.
/// 2. Create `RemoteLayer` instances for layers that exist only on the remote.
/// The list of layers on the remote comes from `index_part`.
/// The list of local layers is given by the layer map's `iter_historic_layers()`.
/// So, the layer map must have been loaded already.
/// 3. Schedule upload of local-only layer files (which will then also update the remote
/// IndexPart to include the new layer files).
///
/// Refer to the [`remote_timeline_client`] module comment for more context.
///
/// # TODO
/// May be a bit cleaner to do things based on populated remote client,
/// and then do things based on its upload_queue.latest_files.
#[instrument(skip(self, index_part, up_to_date_metadata))]
pub async fn reconcile_with_remote(
&self,
up_to_date_metadata: &TimelineMetadata,
index_part: Option<&IndexPart>,
) -> anyhow::Result<()> {
info!("starting");
let remote_client = self
.remote_client
.as_ref()
.ok_or_else(|| anyhow!("cannot download without remote storage"))?;
let disk_consistent_lsn = up_to_date_metadata.disk_consistent_lsn();
let local_layers = {
let guard = self.layers.read().await;
let layers = guard.layer_map();
layers
.iter_historic_layers()
.map(|l| (l.filename(), guard.get_from_desc(&l)))
.collect::<HashMap<_, _>>()
};
// If no writes happen, new branches do not have any layers, only the metadata file.
let has_local_layers = !local_layers.is_empty();
let local_only_layers = match index_part {
Some(index_part) => {
info!(
"initializing upload queue from remote index with {} layer files",
index_part.timeline_layers.len()
);
remote_client.init_upload_queue(index_part)?;
self.create_remote_layers(index_part, local_layers, disk_consistent_lsn)
.await?
}
None => {
info!("initializing upload queue as empty");
remote_client.init_upload_queue_for_empty_remote(up_to_date_metadata)?;
local_layers
}
};
if has_local_layers {
// Are there local files that don't exist remotely? Schedule uploads for them.
// Local timeline metadata will get uploaded to remove along witht he layers.
for (layer_name, layer) in &local_only_layers {
// XXX solve this in the type system
let layer_path = layer
.local_path()
.expect("local_only_layers only contains local layers");
let layer_size = layer_path
.metadata()
.with_context(|| format!("failed to get file {layer_path:?} metadata"))?
.len();
info!("scheduling {layer_path:?} for upload");
remote_client
.schedule_layer_file_upload(layer_name, &LayerFileMetadata::new(layer_size))?;
}
remote_client.schedule_index_upload_for_file_changes()?;
} else if index_part.is_none() {
// No data on the remote storage, no local layers, local metadata file.
//
// TODO https://github.com/neondatabase/neon/issues/3865
// Currently, console does not wait for the timeline data upload to the remote storage
// and considers the timeline created, expecting other pageserver nodes to work with it.
// Branch metadata upload could get interrupted (e.g pageserver got killed),
// hence any locally existing branch metadata with no remote counterpart should be uploaded,
// otherwise any other pageserver won't see the branch on `attach`.
//
// After the issue gets implemented, pageserver should rather remove the branch,
// since absence on S3 means we did not acknowledge the branch creation and console will have to retry,
// no need to keep the old files.
remote_client.schedule_index_upload_for_metadata_update(up_to_date_metadata)?;
} else {
// Local timeline has a metadata file, remote one too, both have no layers to sync.
}
info!("Done");
Ok(())
}
@@ -2647,7 +2480,7 @@ impl Timeline {
async fn put_value(&self, key: Key, lsn: Lsn, val: &Value) -> anyhow::Result<()> {
//info!("PUT: key {} at {}", key, lsn);
let layer = self.get_layer_for_write(lsn).await?;
layer.put_value(key, lsn, val)?;
layer.put_value(key, lsn, val).await?;
Ok(())
}
@@ -2673,7 +2506,9 @@ impl Timeline {
Some(self.write_lock.lock().await)
};
let mut guard = self.layers.write().await;
guard.try_freeze_in_memory_layer(self.get_last_record_lsn(), &self.last_freeze_at);
guard
.try_freeze_in_memory_layer(self.get_last_record_lsn(), &self.last_freeze_at)
.await;
}
/// Layer flusher task's main loop.
@@ -2836,7 +2671,6 @@ impl Timeline {
if let Some(ref l) = delta_layer_to_add {
// TODO: move access stats, metrics update, etc. into layer manager.
l.access_stats().record_residence_event(
&guard,
LayerResidenceStatus::Resident,
LayerResidenceEventReason::LayerCreate,
);
@@ -2955,7 +2789,11 @@ impl Timeline {
let frozen_layer = Arc::clone(frozen_layer);
move || {
// Write it out
let new_delta = frozen_layer.write_to_disk()?;
// Keep this inside `spawn_blocking` and `Handle::current`
// as long as the write path is still sync and the read impl
// is still not fully async. Otherwise executor threads would
// be blocked.
let new_delta = Handle::current().block_on(frozen_layer.write_to_disk())?;
let new_delta_path = new_delta.path();
// Sync it to disk.
@@ -3129,7 +2967,6 @@ impl Timeline {
self.tenant_id,
&img_range,
lsn,
false, // image layer always covers the full range
)?;
fail_point!("image-layer-writer-fail-before-finish", |_| {
@@ -3222,7 +3059,6 @@ impl Timeline {
.add(metadata.len());
let l = Arc::new(l);
l.access_stats().record_residence_event(
&guard,
LayerResidenceStatus::Resident,
LayerResidenceEventReason::LayerCreate,
);
@@ -3299,10 +3135,10 @@ struct CompactLevel0Phase1StatsBuilder {
timeline_id: Option<TimelineId>,
read_lock_acquisition_micros: DurationRecorder,
read_lock_held_spawn_blocking_startup_micros: DurationRecorder,
read_lock_held_key_sort_micros: DurationRecorder,
read_lock_held_prerequisites_micros: DurationRecorder,
read_lock_held_compute_holes_micros: DurationRecorder,
read_lock_drop_micros: DurationRecorder,
prepare_iterators_micros: DurationRecorder,
write_layer_files_micros: DurationRecorder,
level0_deltas_count: Option<usize>,
new_deltas_count: Option<usize>,
@@ -3319,10 +3155,10 @@ struct CompactLevel0Phase1Stats {
timeline_id: TimelineId,
read_lock_acquisition_micros: RecordedDuration,
read_lock_held_spawn_blocking_startup_micros: RecordedDuration,
read_lock_held_key_sort_micros: RecordedDuration,
read_lock_held_prerequisites_micros: RecordedDuration,
read_lock_held_compute_holes_micros: RecordedDuration,
read_lock_drop_micros: RecordedDuration,
prepare_iterators_micros: RecordedDuration,
write_layer_files_micros: RecordedDuration,
level0_deltas_count: usize,
new_deltas_count: usize,
@@ -3349,6 +3185,10 @@ impl TryFrom<CompactLevel0Phase1StatsBuilder> for CompactLevel0Phase1Stats {
.read_lock_held_spawn_blocking_startup_micros
.into_recorded()
.ok_or_else(|| anyhow!("read_lock_held_spawn_blocking_startup_micros not set"))?,
read_lock_held_key_sort_micros: value
.read_lock_held_key_sort_micros
.into_recorded()
.ok_or_else(|| anyhow!("read_lock_held_key_sort_micros not set"))?,
read_lock_held_prerequisites_micros: value
.read_lock_held_prerequisites_micros
.into_recorded()
@@ -3361,10 +3201,6 @@ impl TryFrom<CompactLevel0Phase1StatsBuilder> for CompactLevel0Phase1Stats {
.read_lock_drop_micros
.into_recorded()
.ok_or_else(|| anyhow!("read_lock_drop_micros not set"))?,
prepare_iterators_micros: value
.prepare_iterators_micros
.into_recorded()
.ok_or_else(|| anyhow!("prepare_iterators_micros not set"))?,
write_layer_files_micros: value
.write_layer_files_micros
.into_recorded()
@@ -3390,8 +3226,8 @@ impl Timeline {
/// start of level0 files compaction, the on-demand download should be revisited as well.
///
/// [`compact_inner`]: Self::compact_inner
fn compact_level0_phase1(
self: Arc<Self>,
async fn compact_level0_phase1(
self: &Arc<Self>,
_layer_removal_cs: Arc<tokio::sync::OwnedMutexGuard<()>>,
guard: tokio::sync::OwnedRwLockReadGuard<LayerManager>,
mut stats: CompactLevel0Phase1StatsBuilder,
@@ -3534,28 +3370,24 @@ impl Timeline {
let mut heap: BinaryHeap<Hole> = BinaryHeap::with_capacity(max_holes + 1);
let mut prev: Option<Key> = None;
let mut all_value_refs = Vec::new();
let mut all_keys = Vec::new();
for l in deltas_to_compact.iter() {
let downcast_deltas: Vec<_> = deltas_to_compact
.iter()
.map(|l| l.clone().downcast_delta_layer().expect("delta layer"))
.collect();
for dl in downcast_deltas.iter() {
// TODO: replace this with an await once we fully go async
let delta = l.clone().downcast_delta_layer().expect("delta layer");
Handle::current().block_on(async {
all_value_refs.extend(delta.load_val_refs(ctx).await?);
all_keys.extend(delta.load_keys(ctx).await?);
anyhow::Ok(())
})?;
all_keys.extend(DeltaLayer::load_keys(dl, ctx).await?);
}
// The current stdlib sorting implementation is designed in a way where it is
// particularly fast where the slice is made up of sorted sub-ranges.
all_value_refs.sort_by_key(|(key, lsn, _value_ref)| (*key, *lsn));
all_keys.sort_by_key(|DeltaEntry { key, lsn, .. }| (*key, *lsn));
// The current stdlib sorting implementation is designed in a way where it is
// particularly fast where the slice is made up of sorted sub-ranges.
all_keys.sort_by_key(|(key, lsn, _size)| (*key, *lsn));
stats.read_lock_held_key_sort_micros = stats.read_lock_held_prerequisites_micros.till_now();
for (next_key, _next_lsn, _size) in all_keys.iter() {
for DeltaEntry { key: next_key, .. } in all_keys.iter() {
let next_key = *next_key;
if let Some(prev_key) = prev {
// just first fast filter
@@ -3579,8 +3411,7 @@ impl Timeline {
}
prev = Some(next_key.next());
}
stats.read_lock_held_compute_holes_micros =
stats.read_lock_held_prerequisites_micros.till_now();
stats.read_lock_held_compute_holes_micros = stats.read_lock_held_key_sort_micros.till_now();
drop_rlock(guard);
stats.read_lock_drop_micros = stats.read_lock_held_compute_holes_micros.till_now();
let mut holes = heap.into_vec();
@@ -3589,12 +3420,26 @@ impl Timeline {
// This iterator walks through all key-value pairs from all the layers
// we're compacting, in key, LSN order.
let all_values_iter = all_value_refs.into_iter();
let all_values_iter = all_keys.iter();
// This iterator walks through all keys and is needed to calculate size used by each key
let mut all_keys_iter = all_keys.into_iter();
stats.prepare_iterators_micros = stats.read_lock_drop_micros.till_now();
let mut all_keys_iter = all_keys
.iter()
.map(|DeltaEntry { key, lsn, size, .. }| (*key, *lsn, *size))
.coalesce(|mut prev, cur| {
// Coalesce keys that belong to the same key pair.
// This ensures that compaction doesn't put them
// into different layer files.
// Still limit this by the target file size,
// so that we keep the size of the files in
// check.
if prev.0 == cur.0 && prev.2 < target_file_size {
prev.2 += cur.2;
Ok(prev)
} else {
Err((prev, cur))
}
});
// Merge the contents of all the input delta layers into a new set
// of delta layers, based on the current partitioning.
@@ -3646,8 +3491,12 @@ impl Timeline {
let mut key_values_total_size = 0u64;
let mut dup_start_lsn: Lsn = Lsn::INVALID; // start LSN of layer containing values of the single key
let mut dup_end_lsn: Lsn = Lsn::INVALID; // end LSN of layer containing values of the single key
for (key, lsn, value_ref) in all_values_iter {
let value = value_ref.load()?;
for &DeltaEntry {
key, lsn, ref val, ..
} in all_values_iter
{
let value = val.load().await?;
let same_key = prev_key.map_or(false, |prev_key| prev_key == key);
// We need to check key boundaries once we reach next key or end of layer with the same key
if !same_key || lsn == dup_end_lsn {
@@ -3732,7 +3581,9 @@ impl Timeline {
}
fail_point!("delta-layer-writer-fail-before-finish", |_| {
Err(anyhow::anyhow!("failpoint delta-layer-writer-fail-before-finish").into())
Err(CompactionError::Other(anyhow::anyhow!(
"failpoint delta-layer-writer-fail-before-finish"
)))
});
writer.as_mut().unwrap().put_value(key, lsn, value)?;
@@ -3744,6 +3595,19 @@ impl Timeline {
// Sync layers
if !new_layers.is_empty() {
// Print a warning if the created layer is larger than double the target size
// Add two pages for potential overhead. This should in theory be already
// accounted for in the target calculation, but for very small targets,
// we still might easily hit the limit otherwise.
let warn_limit = target_file_size * 2 + page_cache::PAGE_SZ as u64 * 2;
for layer in new_layers.iter() {
if layer.layer_desc().file_size > warn_limit {
warn!(
%layer,
"created delta file of size {} larger than double of target of {target_file_size}", layer.layer_desc().file_size
);
}
}
let mut layer_paths: Vec<PathBuf> = new_layers.iter().map(|l| l.path()).collect();
// Fsync all the layer files and directory using multiple threads to
@@ -3756,11 +3620,9 @@ impl Timeline {
layer_paths.pop().unwrap();
}
stats.write_layer_files_micros = stats.prepare_iterators_micros.till_now();
stats.write_layer_files_micros = stats.read_lock_drop_micros.till_now();
stats.new_deltas_count = Some(new_layers.len());
stats.new_deltas_size = Some(new_layers.iter().map(|l| l.desc.file_size).sum());
drop(all_keys_iter); // So that deltas_to_compact is no longer borrowed
stats.new_deltas_size = Some(new_layers.iter().map(|l| l.layer_desc().file_size).sum());
match TryInto::<CompactLevel0Phase1Stats>::try_into(stats)
.and_then(|stats| serde_json::to_string(&stats).context("serde_json::to_string"))
@@ -3800,8 +3662,7 @@ impl Timeline {
deltas_to_compact,
} = {
let phase1_span = info_span!("compact_level0_phase1");
let myself = Arc::clone(self);
let ctx = ctx.attached_child(); // technically, the spawn_blocking can outlive this future
let ctx = ctx.attached_child();
let mut stats = CompactLevel0Phase1StatsBuilder {
version: Some(2),
tenant_id: Some(self.tenant_id),
@@ -3815,18 +3676,15 @@ impl Timeline {
stats.read_lock_acquisition_micros =
DurationRecorder::Recorded(RecordedDuration(now - begin), now);
let layer_removal_cs = layer_removal_cs.clone();
tokio::task::spawn_blocking(move || {
let _entered = phase1_span.enter();
myself.compact_level0_phase1(
layer_removal_cs,
phase1_layers_locked,
stats,
target_file_size,
&ctx,
)
})
.await
.context("spawn_blocking")??
self.compact_level0_phase1(
layer_removal_cs,
phase1_layers_locked,
stats,
target_file_size,
&ctx,
)
.instrument(phase1_span)
.await?
};
if new_layers.is_empty() && deltas_to_compact.is_empty() {
@@ -3880,7 +3738,6 @@ impl Timeline {
new_layer_paths.insert(new_delta_path, LayerFileMetadata::new(metadata.len()));
l.access_stats().record_residence_event(
&guard,
LayerResidenceStatus::Resident,
LayerResidenceEventReason::LayerCreate,
);
@@ -4799,7 +4656,8 @@ fn rename_to_backup(path: &Path) -> anyhow::Result<()> {
for i in 0u32.. {
new_path.set_file_name(format!("{filename}.{i}.old"));
if !new_path.exists() {
std::fs::rename(path, &new_path)?;
std::fs::rename(path, &new_path)
.with_context(|| format!("rename {path:?} to {new_path:?}"))?;
return Ok(());
}
}

View File

@@ -25,7 +25,7 @@ use crate::{
InitializationOrder,
};
use super::Timeline;
use super::{Timeline, TimelineResources};
/// Now that the Timeline is in Stopping state, request all the related tasks to shut down.
async fn stop_tasks(timeline: &Timeline) -> Result<(), DeleteTimelineError> {
@@ -279,6 +279,17 @@ async fn cleanup_remaining_timeline_fs_traces(
Err(anyhow::anyhow!("failpoint: timeline-delete-after-rm-dir"))?
});
// Make sure previous deletions are ordered before mark removal.
// Otherwise there is no guarantee that they reach the disk before mark deletion.
// So its possible for mark to reach disk first and for other deletions
// to be reordered later and thus missed if a crash occurs.
// Note that we dont need to sync after mark file is removed
// because we can tolerate the case when mark file reappears on startup.
let timeline_path = conf.timelines_path(&tenant_id);
crashsafe::fsync_async(timeline_path)
.await
.context("fsync_pre_mark_remove")?;
// Remove delete mark
tokio::fs::remove_file(conf.timeline_delete_mark_file_path(tenant_id, timeline_id))
.await
@@ -405,7 +416,7 @@ impl DeleteTimelineFlow {
timeline_id,
local_metadata,
None, // Ancestor is not needed for deletion.
remote_client,
TimelineResources { remote_client },
init_order,
// Important. We dont pass ancestor above because it can be missing.
// Thus we need to skip the validation here.

View File

@@ -0,0 +1,199 @@
use crate::{
is_temporary,
tenant::{
ephemeral_file::is_ephemeral_file,
remote_timeline_client::{
self,
index::{IndexPart, LayerFileMetadata},
},
storage_layer::LayerFileName,
},
METADATA_FILE_NAME,
};
use anyhow::Context;
use std::{collections::HashMap, ffi::OsString, path::Path, str::FromStr};
use utils::lsn::Lsn;
/// Identified files in the timeline directory.
pub(super) enum Discovered {
/// The only one we care about
Layer(LayerFileName, u64),
/// Old ephmeral files from previous launches, should be removed
Ephemeral(OsString),
/// Old temporary timeline files, unsure what these really are, should be removed
Temporary(OsString),
/// Temporary on-demand download files, should be removed
TemporaryDownload(OsString),
/// "metadata" file we persist locally and include in `index_part.json`
Metadata,
/// Backup file from previously future layers
IgnoredBackup,
/// Unrecognized, warn about these
Unknown(OsString),
}
/// Scans the timeline directory for interesting files.
pub(super) fn scan_timeline_dir(path: &Path) -> anyhow::Result<Vec<Discovered>> {
let mut ret = Vec::new();
for direntry in std::fs::read_dir(path)? {
let direntry = direntry?;
let direntry_path = direntry.path();
let file_name = direntry.file_name();
let fname = file_name.to_string_lossy();
let discovered = match LayerFileName::from_str(&fname) {
Ok(file_name) => {
let file_size = direntry.metadata()?.len();
Discovered::Layer(file_name, file_size)
}
Err(_) => {
if fname == METADATA_FILE_NAME {
Discovered::Metadata
} else if fname.ends_with(".old") {
// ignore these
Discovered::IgnoredBackup
} else if remote_timeline_client::is_temp_download_file(&direntry_path) {
Discovered::TemporaryDownload(file_name)
} else if is_ephemeral_file(&fname) {
Discovered::Ephemeral(file_name)
} else if is_temporary(&direntry_path) {
Discovered::Temporary(file_name)
} else {
Discovered::Unknown(file_name)
}
}
};
ret.push(discovered);
}
Ok(ret)
}
/// Decision on what to do with a layer file after considering its local and remote metadata.
#[derive(Clone)]
pub(super) enum Decision {
/// The layer is not present locally.
Evicted(LayerFileMetadata),
/// The layer is present locally, but local metadata does not match remote; we must
/// delete it and treat it as evicted.
UseRemote {
local: LayerFileMetadata,
remote: LayerFileMetadata,
},
/// The layer is present locally, and metadata matches.
UseLocal(LayerFileMetadata),
/// The layer is only known locally, it needs to be uploaded.
NeedsUpload(LayerFileMetadata),
}
/// The related layer is is in future compared to disk_consistent_lsn, it must not be loaded.
#[derive(Debug)]
pub(super) struct FutureLayer {
/// The local metadata. `None` if the layer is only known through [`IndexPart`].
pub(super) local: Option<LayerFileMetadata>,
}
/// Merges local discoveries and remote [`IndexPart`] to a collection of decisions.
///
/// This function should not gain additional reasons to fail than [`FutureLayer`], consider adding
/// the checks earlier to [`scan_timeline_dir`].
pub(super) fn reconcile(
discovered: Vec<(LayerFileName, u64)>,
index_part: Option<&IndexPart>,
disk_consistent_lsn: Lsn,
) -> Vec<(LayerFileName, Result<Decision, FutureLayer>)> {
use Decision::*;
// name => (local, remote)
type Collected = HashMap<LayerFileName, (Option<LayerFileMetadata>, Option<LayerFileMetadata>)>;
let mut discovered = discovered
.into_iter()
.map(|(name, file_size)| (name, (Some(LayerFileMetadata::new(file_size)), None)))
.collect::<Collected>();
// merge any index_part information, when available
index_part
.as_ref()
.map(|ip| ip.layer_metadata.iter())
.into_iter()
.flatten()
.map(|(name, metadata)| (name, LayerFileMetadata::from(metadata)))
.for_each(|(name, metadata)| {
if let Some(existing) = discovered.get_mut(name) {
existing.1 = Some(metadata);
} else {
discovered.insert(name.to_owned(), (None, Some(metadata)));
}
});
discovered
.into_iter()
.map(|(name, (local, remote))| {
let decision = if name.is_in_future(disk_consistent_lsn) {
Err(FutureLayer { local })
} else {
Ok(match (local, remote) {
(Some(local), Some(remote)) if local != remote => UseRemote { local, remote },
(Some(x), Some(_)) => UseLocal(x),
(None, Some(x)) => Evicted(x),
(Some(x), None) => NeedsUpload(x),
(None, None) => {
unreachable!("there must not be any non-local non-remote files")
}
})
};
(name, decision)
})
.collect::<Vec<_>>()
}
pub(super) fn cleanup(path: &Path, kind: &str) -> anyhow::Result<()> {
let file_name = path.file_name().expect("must be file path");
tracing::debug!(kind, ?file_name, "cleaning up");
std::fs::remove_file(path)
.with_context(|| format!("failed to remove {kind} at {}", path.display()))
}
pub(super) fn cleanup_local_file_for_remote(
path: &Path,
local: &LayerFileMetadata,
remote: &LayerFileMetadata,
) -> anyhow::Result<()> {
let local_size = local.file_size();
let remote_size = remote.file_size();
let file_name = path.file_name().expect("must be file path");
tracing::warn!("removing local file {file_name:?} because it has unexpected length {local_size}; length in remote index is {remote_size}");
if let Err(err) = crate::tenant::timeline::rename_to_backup(path) {
assert!(
path.exists(),
"we would leave the local_layer without a file if this does not hold: {}",
path.display()
);
Err(err)
} else {
Ok(())
}
}
pub(super) fn cleanup_future_layer(
path: &Path,
name: LayerFileName,
disk_consistent_lsn: Lsn,
) -> anyhow::Result<()> {
use LayerFileName::*;
let kind = match name {
Delta(_) => "delta",
Image(_) => "image",
};
// future image layers are allowed to be produced always for not yet flushed to disk
// lsns stored in InMemoryLayer.
tracing::info!("found future {kind} layer {name} disk_consistent_lsn is {disk_consistent_lsn}");
crate::tenant::timeline::rename_to_backup(path)?;
Ok(())
}

View File

@@ -12,38 +12,38 @@ use crate::{
tenant::{
layer_map::{BatchedUpdates, LayerMap},
storage_layer::{
AsLayerDesc, DeltaLayer, ImageLayer, InMemoryLayer, Layer, PersistentLayer,
PersistentLayerDesc, PersistentLayerKey, RemoteLayer,
AsLayerDesc, DeltaLayer, ImageLayer, InMemoryLayer, PersistentLayer,
PersistentLayerDesc, PersistentLayerKey,
},
timeline::compare_arced_layers,
},
};
/// Provides semantic APIs to manipulate the layer map.
pub struct LayerManager {
pub(crate) struct LayerManager {
layer_map: LayerMap,
layer_fmgr: LayerFileManager,
}
/// After GC, the layer map changes will not be applied immediately. Users should manually apply the changes after
/// scheduling deletes in remote client.
pub struct ApplyGcResultGuard<'a>(BatchedUpdates<'a>);
pub(crate) struct ApplyGcResultGuard<'a>(BatchedUpdates<'a>);
impl ApplyGcResultGuard<'_> {
pub fn flush(self) {
pub(crate) fn flush(self) {
self.0.flush();
}
}
impl LayerManager {
pub fn create() -> Self {
pub(crate) fn create() -> Self {
Self {
layer_map: LayerMap::default(),
layer_fmgr: LayerFileManager::new(),
}
}
pub fn get_from_desc(&self, desc: &PersistentLayerDesc) -> Arc<dyn PersistentLayer> {
pub(crate) fn get_from_desc(&self, desc: &PersistentLayerDesc) -> Arc<dyn PersistentLayer> {
self.layer_fmgr.get_from_desc(desc)
}
@@ -51,18 +51,12 @@ impl LayerManager {
///
/// We expect users only to be able to get an immutable layer map. If users want to make modifications,
/// they should use the below semantic APIs. This design makes us step closer to immutable storage state.
pub fn layer_map(&self) -> &LayerMap {
pub(crate) fn layer_map(&self) -> &LayerMap {
&self.layer_map
}
/// Get a mutable reference to the layer map. This function will be removed once `flush_frozen_layer`
/// gets a refactor.
pub fn layer_map_mut(&mut self) -> &mut LayerMap {
&mut self.layer_map
}
/// Replace layers in the layer file manager, used in evictions and layer downloads.
pub fn replace_and_verify(
pub(crate) fn replace_and_verify(
&mut self,
expected: Arc<dyn PersistentLayer>,
new: Arc<dyn PersistentLayer>,
@@ -73,7 +67,7 @@ impl LayerManager {
/// Called from `load_layer_map`. Initialize the layer manager with:
/// 1. all on-disk layers
/// 2. next open layer (with disk disk_consistent_lsn LSN)
pub fn initialize_local_layers(
pub(crate) fn initialize_local_layers(
&mut self,
on_disk_layers: Vec<Arc<dyn PersistentLayer>>,
next_open_layer_at: Lsn,
@@ -87,28 +81,13 @@ impl LayerManager {
}
/// Initialize when creating a new timeline, called in `init_empty_layer_map`.
pub fn initialize_empty(&mut self, next_open_layer_at: Lsn) {
pub(crate) fn initialize_empty(&mut self, next_open_layer_at: Lsn) {
self.layer_map.next_open_layer_at = Some(next_open_layer_at);
}
pub fn initialize_remote_layers(
&mut self,
corrupted_local_layers: Vec<Arc<dyn PersistentLayer>>,
remote_layers: Vec<Arc<RemoteLayer>>,
) {
let mut updates = self.layer_map.batch_update();
for layer in corrupted_local_layers {
Self::remove_historic_layer(layer, &mut updates, &mut self.layer_fmgr);
}
for layer in remote_layers {
Self::insert_historic_layer(layer, &mut updates, &mut self.layer_fmgr);
}
updates.flush();
}
/// Open a new writable layer to append data if there is no open layer, otherwise return the current open layer,
/// called within `get_layer_for_write`.
pub fn get_layer_for_write(
pub(crate) fn get_layer_for_write(
&mut self,
lsn: Lsn,
last_record_lsn: Lsn,
@@ -163,7 +142,7 @@ impl LayerManager {
}
/// Called from `freeze_inmem_layer`, returns true if successfully frozen.
pub fn try_freeze_in_memory_layer(
pub(crate) async fn try_freeze_in_memory_layer(
&mut self,
Lsn(last_record_lsn): Lsn,
last_freeze_at: &AtomicLsn,
@@ -173,7 +152,7 @@ impl LayerManager {
if let Some(open_layer) = &self.layer_map.open_layer {
let open_layer_rc = Arc::clone(open_layer);
// Does this layer need freezing?
open_layer.freeze(end_lsn);
open_layer.freeze(end_lsn).await;
// The layer is no longer open, update the layer map to reflect this.
// We will replace it with on-disk historics below.
@@ -185,7 +164,7 @@ impl LayerManager {
}
/// Add image layers to the layer map, called from `create_image_layers`.
pub fn track_new_image_layers(&mut self, image_layers: Vec<ImageLayer>) {
pub(crate) fn track_new_image_layers(&mut self, image_layers: Vec<ImageLayer>) {
let mut updates = self.layer_map.batch_update();
for layer in image_layers {
Self::insert_historic_layer(Arc::new(layer), &mut updates, &mut self.layer_fmgr);
@@ -194,7 +173,7 @@ impl LayerManager {
}
/// Flush a frozen layer and add the written delta layer to the layer map.
pub fn finish_flush_l0_layer(
pub(crate) fn finish_flush_l0_layer(
&mut self,
delta_layer: Option<DeltaLayer>,
frozen_layer_for_check: &Arc<InMemoryLayer>,
@@ -214,7 +193,7 @@ impl LayerManager {
}
/// Called when compaction is completed.
pub fn finish_compact_l0(
pub(crate) fn finish_compact_l0(
&mut self,
layer_removal_cs: Arc<tokio::sync::OwnedMutexGuard<()>>,
compact_from: Vec<Arc<dyn PersistentLayer>>,
@@ -242,7 +221,7 @@ impl LayerManager {
}
/// Called when garbage collect the timeline. Returns a guard that will apply the updates to the layer map.
pub fn finish_gc_timeline(
pub(crate) fn finish_gc_timeline(
&mut self,
layer_removal_cs: Arc<tokio::sync::OwnedMutexGuard<()>>,
gc_layers: Vec<Arc<dyn PersistentLayer>>,
@@ -271,16 +250,6 @@ impl LayerManager {
mapping.insert(layer);
}
/// Helper function to remove a layer into the layer map and file manager
fn remove_historic_layer(
layer: Arc<dyn PersistentLayer>,
updates: &mut BatchedUpdates<'_>,
mapping: &mut LayerFileManager,
) {
updates.remove_historic(layer.layer_desc());
mapping.remove(layer);
}
/// Removes the layer from local FS (if present) and from memory.
/// Remote storage is not affected by this operation.
fn delete_historic_layer(
@@ -313,7 +282,7 @@ impl LayerManager {
}
}
pub struct LayerFileManager<T: AsLayerDesc + ?Sized = dyn PersistentLayer>(
pub(crate) struct LayerFileManager<T: AsLayerDesc + ?Sized = dyn PersistentLayer>(
HashMap<PersistentLayerKey, Arc<T>>,
);

View File

@@ -17,7 +17,7 @@ use crate::metrics::{
WALRECEIVER_ACTIVE_MANAGERS, WALRECEIVER_BROKER_UPDATES, WALRECEIVER_CANDIDATES_ADDED,
WALRECEIVER_CANDIDATES_REMOVED, WALRECEIVER_SWITCHES,
};
use crate::task_mgr::TaskKind;
use crate::task_mgr::{shutdown_token, TaskKind};
use crate::tenant::{debug_assert_current_span_has_tenant_and_timeline_id, Timeline};
use anyhow::Context;
use chrono::{NaiveDateTime, Utc};
@@ -31,8 +31,10 @@ use storage_broker::Streaming;
use tokio::select;
use tracing::*;
use crate::{exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS};
use postgres_connection::{parse_host_port, PgConnectionConfig};
use utils::backoff::{
exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS,
};
use utils::{
id::{NodeId, TenantTimelineId},
lsn::Lsn,
@@ -209,11 +211,14 @@ async fn subscribe_for_timeline_updates(
id: TenantTimelineId,
) -> Streaming<SafekeeperTimelineInfo> {
let mut attempt = 0;
let cancel = shutdown_token();
loop {
exponential_backoff(
attempt,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
&cancel,
)
.await;
attempt += 1;

View File

@@ -140,23 +140,12 @@ impl UploadQueue {
}
}
let mut files = HashMap::with_capacity(index_part.timeline_layers.len());
for layer_name in &index_part.timeline_layers {
match index_part
.layer_metadata
.get(layer_name)
.map(LayerFileMetadata::from)
{
Some(layer_metadata) => {
files.insert(layer_name.to_owned(), layer_metadata);
}
None => {
anyhow::bail!(
"No remote layer metadata found for layer {}",
layer_name.file_name()
);
}
}
let mut files = HashMap::with_capacity(index_part.layer_metadata.len());
for (layer_name, layer_metadata) in &index_part.layer_metadata {
files.insert(
layer_name.to_owned(),
LayerFileMetadata::from(layer_metadata),
);
}
let index_part_metadata = index_part.parse_metadata()?;

View File

@@ -312,7 +312,7 @@ impl<'a> WalIngest<'a> {
// particular point in the WAL. For more fine-grained control,
// we could peek into the message and only pause if it contains
// a particular string, for example, but this is enough for now.
utils::failpoint_sleep_millis_async!("wal-ingest-logical-message-sleep");
crate::failpoint_support::sleep_millis_async!("wal-ingest-logical-message-sleep");
}
}

View File

@@ -14,6 +14,7 @@
*/
#include <sys/file.h>
#include <sys/mman.h>
#include <sys/statvfs.h>
#include <unistd.h>
#include <fcntl.h>
@@ -59,12 +60,17 @@
* 2. Improve access locality, subsequent pages will be allocated together improving seqscan speed
*/
#define BLOCKS_PER_CHUNK 128 /* 1Mb chunk */
#define CHUNK_SIZE (BLOCKS_PER_CHUNK * BLCKSZ)
#define MB ((uint64)1024*1024)
#ifndef MADV_REMOVE
#define MADV_REMOVE MADV_FREE /* MacOS doesn't have MADV_REMOVE and at Linux MADV_FREE works only for MAP_PRIVATE */
#endif
#define SIZE_MB_TO_CHUNKS(size) ((uint32)((size) * MB / BLCKSZ / BLOCKS_PER_CHUNK))
#define MAX_MONITOR_INTERVAL_USEC 1000000 /* 1 second */
#define MAX_DISK_WRITE_RATE 1000 /* MB/sec */
#define MAX_MEM_WRITE_RATE 10000 /* MB/sec */
typedef struct FileCacheEntry
{
@@ -83,21 +89,56 @@ typedef struct FileCacheControl
} FileCacheControl;
static HTAB* lfc_hash;
static int lfc_desc;
static LWLockId lfc_lock;
static int lfc_max_size;
static int lfc_max_mem;
static int lfc_size_limit;
static int lfc_free_space_watermark;
static int lfc_free_memory_watermark;
static char* lfc_base_addr;
static char* lfc_path;
static FileCacheControl* lfc_ctl;
static shmem_startup_hook_type prev_shmem_startup_hook;
#if PG_VERSION_NUM>=150000
static shmem_request_hook_type prev_shmem_request_hook;
#endif
static int lfc_shrinking_factor; /* power of two by which local cache size will be shrinked when lfc_free_space_watermark is reached */
static int lfc_shrinking_factor; /* power of two by which local cache size will be shrinked when lfc_free_space_watermark or lfc_free_memory_watermak are reached */
void FileCacheMonitorMain(Datum main_arg);
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
static size_t
get_available_memory(void)
{
size_t total;
size_t sizeof_total = sizeof(total);
if (sysctlbyname("hw.memsize", &total, &sizeof_total, NULL, 0) < 0)
elog(ERROR, "Failed to get amount of RAM: %m");
return total;
}
#else
#include <sys/sysinfo.h>
static size_t
get_available_memory(void)
{
struct sysinfo si;
if (sysinfo(&si) < 0)
elog(ERROR, "Failed to get amount of RAM: %m");
return si.totalram*si.mem_unit;
}
#endif
static void
lfc_shmem_startup(void)
{
@@ -111,7 +152,7 @@ lfc_shmem_startup(void)
LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
lfc_ctl = (FileCacheControl*)ShmemInitStruct("lfc", sizeof(FileCacheControl), &found);
lfc_ctl = (FileCacheControl*)ShmemInitStruct("lfc", sizeof(FileCacheControl) + lfc_max_size*MB + CHUNK_SIZE, &found);
if (!found)
{
uint32 lfc_size = SIZE_MB_TO_CHUNKS(lfc_max_size);
@@ -126,10 +167,10 @@ lfc_shmem_startup(void)
lfc_ctl->size = 0;
lfc_ctl->used = 0;
dlist_init(&lfc_ctl->lru);
/* Remove file cache on restart */
(void)unlink(lfc_path);
}
lfc_base_addr = (char*)TYPEALIGN(CHUNK_SIZE, lfc_ctl+1);
if (!found)
madvise(lfc_base_addr, lfc_max_size*MB, MADV_REMOVE);
LWLockRelease(AddinShmemInitLock);
}
@@ -141,7 +182,9 @@ lfc_shmem_request(void)
prev_shmem_request_hook();
#endif
RequestAddinShmemSpace(sizeof(FileCacheControl) + hash_estimate_size(SIZE_MB_TO_CHUNKS(lfc_max_size)+1, sizeof(FileCacheEntry)));
lfc_max_size = Min(lfc_max_size, lfc_max_mem);
lfc_size_limit = Min(lfc_size_limit, lfc_max_mem);
RequestAddinShmemSpace(sizeof(FileCacheControl) + lfc_max_size*MB + CHUNK_SIZE + hash_estimate_size(SIZE_MB_TO_CHUNKS(lfc_max_size)+1, sizeof(FileCacheEntry)));
RequestNamedLWLockTranche("lfc_lock", 1);
}
@@ -167,26 +210,14 @@ lfc_change_limit_hook(int newval, void *extra)
if (!lfc_ctl || !UsedShmemSegAddr || IsParallelWorker())
return;
/* Open cache file if not done yet */
if (lfc_desc == 0)
{
lfc_desc = BasicOpenFile(lfc_path, O_RDWR|O_CREAT);
if (lfc_desc < 0) {
elog(WARNING, "Failed to open file cache %s: %m, disabling file cache", lfc_path);
lfc_size_limit = 0; /* disable file cache */
return;
}
}
LWLockAcquire(lfc_lock, LW_EXCLUSIVE);
while (new_size < lfc_ctl->used && !dlist_is_empty(&lfc_ctl->lru))
{
/* Shrink cache by throwing away least recently accessed chunks and returning their space to file system */
FileCacheEntry* victim = dlist_container(FileCacheEntry, lru_node, dlist_pop_head_node(&lfc_ctl->lru));
Assert(victim->access_count == 0);
#ifdef FALLOC_FL_PUNCH_HOLE
if (fallocate(lfc_desc, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, (off_t)victim->offset*BLOCKS_PER_CHUNK*BLCKSZ, BLOCKS_PER_CHUNK*BLCKSZ) < 0)
elog(LOG, "Failed to punch hole in file: %m");
#endif
if (madvise(lfc_base_addr + victim->offset*CHUNK_SIZE, CHUNK_SIZE, MADV_REMOVE) < 0)
elog(LOG, "Failed to punch hole in memory: %m");
hash_search(lfc_hash, &victim->key, HASH_REMOVE, NULL);
lfc_ctl->used -= 1;
}
@@ -195,10 +226,11 @@ lfc_change_limit_hook(int newval, void *extra)
}
/*
* Local file system state monitor check available free space.
* If it is lower than lfc_free_space_watermark then we shrink size of local cache
* Local file system state monitor check available free space and memory.
* If available disk space is lower than lfc_free_space_watermark or
* available memory is lower than lfc_free_memory_watermark then we shrink size of local cache
* but throwing away least recently accessed chunks.
* First time low space watermark is reached cache size is divided by two,
* First time the watermark is reached cache size is divided by two,
* second time by four,... Finally we remove all chunks from local cache.
*
* Please notice that we are not changing lfc_cache_size: it is used to be adjusted by autoscaler.
@@ -213,9 +245,9 @@ FileCacheMonitorMain(Datum main_arg)
{
/*
* Choose file system state monitor interval so that space can not be exosted
* during this period but not longer than MAX_MONITOR_INTERVAL (10 sec)
* during this period but not longer than MAX_MONITOR_INTERVAL (1 sec)
*/
uint64 monitor_interval = Min(MAX_MONITOR_INTERVAL_USEC, lfc_free_space_watermark*MB/MAX_DISK_WRITE_RATE);
uint64 monitor_interval = Min(MAX_MONITOR_INTERVAL_USEC, lfc_free_space_watermark*MB/MAX_MEM_WRITE_RATE);
/* Establish signal handlers. */
pqsignal(SIGUSR1, procsignal_sigusr1_handler);
@@ -226,25 +258,17 @@ FileCacheMonitorMain(Datum main_arg)
/* Periodically dump buffers until terminated. */
while (!ShutdownRequestPending)
{
if (lfc_size_limit != 0)
if (lfc_size_limit != 0 && lfc_free_memory_watermark != 0 )
{
struct statvfs sfs;
if (statvfs(lfc_path, &sfs) < 0)
if (get_available_memory() < lfc_free_memory_watermark*MB)
{
elog(WARNING, "Failed to obtain status of %s: %m", lfc_path);
if (lfc_shrinking_factor < 31) {
lfc_shrinking_factor += 1;
}
lfc_change_limit_hook(lfc_size_limit >> lfc_shrinking_factor, NULL);
}
else
{
if (sfs.f_bavail*sfs.f_bsize < lfc_free_space_watermark*MB)
{
if (lfc_shrinking_factor < 31) {
lfc_shrinking_factor += 1;
}
lfc_change_limit_hook(lfc_size_limit >> lfc_shrinking_factor, NULL);
}
else
lfc_shrinking_factor = 0; /* reset to initial value */
}
lfc_shrinking_factor = 0; /* reset to initial value */
}
pg_usleep(monitor_interval);
}
@@ -278,6 +302,18 @@ lfc_init(void)
if (!process_shared_preload_libraries_in_progress)
elog(ERROR, "Neon module should be loaded via shared_preload_libraries");
/* TODO: left only for compatibility with on-disk cache */
DefineCustomStringVariable("neon.file_cache_path",
"Path to local file cache (can be raw device)",
NULL,
&lfc_path,
"file.cache",
PGC_POSTMASTER,
0,
NULL,
NULL,
NULL);
DefineCustomIntVariable("neon.max_file_cache_size",
"Maximal size of Neon local file cache",
NULL,
@@ -291,6 +327,19 @@ lfc_init(void)
NULL,
NULL);
DefineCustomIntVariable("neon.max_inmem_cache_size",
"Maximal size used by Neon local file cache in memory",
NULL,
&lfc_max_mem,
128, /* 128Mb */
0,
INT_MAX,
PGC_POSTMASTER,
GUC_UNIT_MB,
NULL,
NULL,
NULL);
DefineCustomIntVariable("neon.file_cache_size_limit",
"Current limit for size of Neon local file cache",
NULL,
@@ -304,11 +353,11 @@ lfc_init(void)
lfc_change_limit_hook,
NULL);
DefineCustomIntVariable("neon.free_space_watermark",
"Minimal free space in local file system after reaching which local file cache will be truncated",
DefineCustomIntVariable("neon.free_memory_watermark",
"Minimal free memory in system after reaching which local file cache will be truncated",
NULL,
&lfc_free_space_watermark,
1024, /* 1GB */
&lfc_free_memory_watermark,
0, /* disabled by default, because iurt makes sense only when local file cache is located i tmpfs */
0,
INT_MAX,
PGC_SIGHUP,
@@ -317,17 +366,6 @@ lfc_init(void)
NULL,
NULL);
DefineCustomStringVariable("neon.file_cache_path",
"Path to local file cache (can be raw device)",
NULL,
&lfc_path,
"file.cache",
PGC_POSTMASTER,
0,
NULL,
NULL,
NULL);
if (lfc_max_size == 0)
return;
@@ -476,27 +514,7 @@ lfc_read(RelFileNode rnode, ForkNumber forkNum, BlockNumber blkno,
dlist_delete(&entry->lru_node);
LWLockRelease(lfc_lock);
/* Open cache file if not done yet */
if (lfc_desc == 0)
{
lfc_desc = BasicOpenFile(lfc_path, O_RDWR|O_CREAT);
if (lfc_desc < 0) {
elog(LOG, "Failed to open file cache %s: %m", lfc_path);
lfc_size_limit = 0; /* disable file cache */
result = false;
}
}
if (lfc_desc > 0)
{
rc = pread(lfc_desc, buffer, BLCKSZ, ((off_t)entry->offset*BLOCKS_PER_CHUNK + chunk_offs)*BLCKSZ);
if (rc != BLCKSZ)
{
elog(INFO, "Failed to read file cache: %m");
lfc_size_limit = 0; /* disable file cache */
result = false;
}
}
memcpy(buffer, lfc_base_addr + ((size_t)entry->offset*BLOCKS_PER_CHUNK + chunk_offs)*BLCKSZ, BLCKSZ);
/* Place entry to the head of LRU list */
LWLockAcquire(lfc_lock, LW_EXCLUSIVE);
@@ -569,24 +587,8 @@ lfc_write(RelFileNode rnode, ForkNumber forkNum, BlockNumber blkno,
}
LWLockRelease(lfc_lock);
/* Open cache file if not done yet */
if (lfc_desc == 0)
{
lfc_desc = BasicOpenFile(lfc_path, O_RDWR|O_CREAT);
if (lfc_desc < 0) {
elog(WARNING, "Failed to open file cache %s: %m, disabling file cache", lfc_path);
lfc_size_limit = 0; /* disable file cache */
}
}
if (lfc_desc > 0)
{
rc = pwrite(lfc_desc, buffer, BLCKSZ, ((off_t)entry->offset*BLOCKS_PER_CHUNK + chunk_offs)*BLCKSZ);
if (rc != BLCKSZ)
{
elog(WARNING, "Failed to write file cache: %m, disabling file cache");
lfc_size_limit = 0; /* disable file cache */
}
}
memcpy(lfc_base_addr + ((size_t)entry->offset*BLOCKS_PER_CHUNK + chunk_offs)*BLCKSZ, buffer, BLCKSZ);
/* Place entry to the head of LRU list */
LWLockAcquire(lfc_lock, LW_EXCLUSIVE);
Assert(entry->access_count > 0);

View File

@@ -74,7 +74,7 @@ walprop_connect_start(char *conninfo, char *password)
if (password)
{
keywords[n] = "password";
values[n] = neon_auth_token;
values[n] = password;
n++;
}
keywords[n] = "dbname";

View File

@@ -1393,8 +1393,22 @@ WalProposerRecovery(int donor, TimeLineID timeline, XLogRecPtr startpos, XLogRec
char *err;
WalReceiverConn *wrconn;
WalRcvStreamOptions options;
char conninfo[MAXCONNINFO];
wrconn = walrcv_connect(safekeeper[donor].conninfo, false, "wal_proposer_recovery", &err);
if (!neon_auth_token)
{
memcpy(conninfo, safekeeper[donor].conninfo, MAXCONNINFO);
}
else
{
int written = 0;
written = snprintf((char *) conninfo, MAXCONNINFO, "password=%s %s", neon_auth_token, safekeeper[donor].conninfo);
if (written > MAXCONNINFO || written < 0)
elog(FATAL, "could not append password to the safekeeper connection string");
}
wrconn = walrcv_connect(conninfo, false, "wal_proposer_recovery", &err);
if (!wrconn)
{
ereport(WARNING,

View File

@@ -13,6 +13,7 @@ bytes = { workspace = true, features = ["serde"] }
chrono.workspace = true
clap.workspace = true
consumption_metrics.workspace = true
dashmap.workspace = true
futures.workspace = true
git-version.workspace = true
hashbrown.workspace = true
@@ -29,7 +30,7 @@ metrics.workspace = true
once_cell.workspace = true
opentelemetry.workspace = true
parking_lot.workspace = true
pbkdf2.workspace = true
pbkdf2 = { workspace = true, features = ["simple", "std"] }
pin-project-lite.workspace = true
postgres_backend.workspace = true
pq_proto.workspace = true

View File

@@ -36,7 +36,18 @@ pub(super) async fn authenticate(
AuthInfo::Scram(secret) => {
info!("auth endpoint chooses SCRAM");
let scram = auth::Scram(&secret);
let client_key = match flow.begin(scram).await?.authenticate().await? {
let auth_flow = flow.begin(scram).await.map_err(|error| {
warn!(?error, "error sending scram acknowledgement");
error
})?;
let auth_outcome = auth_flow.authenticate().await.map_err(|error| {
warn!(?error, "error processing scram messages");
error
})?;
let client_key = match auth_outcome {
sasl::Outcome::Success(key) => key,
sasl::Outcome::Failure(reason) => {
info!("auth backend failed with an error: {reason}");
@@ -51,7 +62,6 @@ pub(super) async fn authenticate(
}
};
info!("compute node's state has likely changed; requesting a wake-up");
let mut num_retries = 0;
let mut node = loop {
let wake_res = api.wake_compute(extra, creds).await;

View File

@@ -1,10 +1,21 @@
use anyhow::Context;
use async_trait::async_trait;
use parking_lot::Mutex;
use dashmap::DashMap;
use futures::future::poll_fn;
use parking_lot::RwLock;
use pbkdf2::{
password_hash::{PasswordHashString, PasswordHasher, PasswordVerifier, SaltString},
Params, Pbkdf2,
};
use pq_proto::StartupMessageParams;
use std::fmt;
use std::sync::atomic::{self, AtomicUsize};
use std::{collections::HashMap, sync::Arc};
use std::{
fmt,
task::{ready, Poll},
};
use tokio::time;
use tokio_postgres::AsyncMessage;
use crate::{auth, console};
use crate::{compute, config};
@@ -13,8 +24,8 @@ use super::sql_over_http::MAX_RESPONSE_SIZE;
use crate::proxy::ConnectMechanism;
use tracing::error;
use tracing::info;
use tracing::{error, warn};
use tracing::{info, info_span, Instrument};
pub const APP_NAME: &str = "sql_over_http";
const MAX_CONNS_PER_ENDPOINT: usize = 20;
@@ -42,23 +53,44 @@ impl fmt::Display for ConnInfo {
}
struct ConnPoolEntry {
conn: tokio_postgres::Client,
conn: Client,
_last_access: std::time::Instant,
}
// Per-endpoint connection pool, (dbname, username) -> Vec<ConnPoolEntry>
// Per-endpoint connection pool, (dbname, username) -> DbUserConnPool
// Number of open connections is limited by the `max_conns_per_endpoint`.
pub struct EndpointConnPool {
pools: HashMap<(String, String), Vec<ConnPoolEntry>>,
pools: HashMap<(String, String), DbUserConnPool>,
total_conns: usize,
}
/// 4096 is the number of rounds that SCRAM-SHA-256 recommends.
/// It's not the 600,000 that OWASP recommends... but our passwords are high entropy anyway.
///
/// Still takes 1.4ms to hash on my hardware.
/// We don't want to ruin the latency improvements of using the pool by making password verification take too long
const PARAMS: Params = Params {
rounds: 4096,
output_length: 32,
};
#[derive(Default)]
pub struct DbUserConnPool {
conns: Vec<ConnPoolEntry>,
password_hash: Option<PasswordHashString>,
}
pub struct GlobalConnPool {
// endpoint -> per-endpoint connection pool
//
// That should be a fairly conteded map, so return reference to the per-endpoint
// pool as early as possible and release the lock.
global_pool: Mutex<HashMap<String, Arc<Mutex<EndpointConnPool>>>>,
global_pool: DashMap<String, Arc<RwLock<EndpointConnPool>>>,
/// [`DashMap::len`] iterates over all inner pools and acquires a read lock on each.
/// That seems like far too much effort, so we're using a relaxed increment counter instead.
/// It's only used for diagnostics.
global_pool_size: AtomicUsize,
// Maximum number of connections per one endpoint.
// Can mix different (dbname, username) connections.
@@ -67,85 +99,173 @@ pub struct GlobalConnPool {
max_conns_per_endpoint: usize,
proxy_config: &'static crate::config::ProxyConfig,
// Using a lock to remove any race conditions.
// Eg cleaning up connections while a new connection is returned
closed: RwLock<bool>,
}
impl GlobalConnPool {
pub fn new(config: &'static crate::config::ProxyConfig) -> Arc<Self> {
Arc::new(Self {
global_pool: Mutex::new(HashMap::new()),
global_pool: DashMap::new(),
global_pool_size: AtomicUsize::new(0),
max_conns_per_endpoint: MAX_CONNS_PER_ENDPOINT,
proxy_config: config,
closed: RwLock::new(false),
})
}
pub fn shutdown(&self) {
*self.closed.write() = true;
self.global_pool.retain(|_, endpoint_pool| {
let mut pool = endpoint_pool.write();
// by clearing this hashmap, we remove the slots that a connection can be returned to.
// when returning, it drops the connection if the slot doesn't exist
pool.pools.clear();
pool.total_conns = 0;
false
});
}
pub async fn get(
&self,
conn_info: &ConnInfo,
force_new: bool,
) -> anyhow::Result<tokio_postgres::Client> {
let mut client: Option<tokio_postgres::Client> = None;
session_id: uuid::Uuid,
) -> anyhow::Result<Client> {
let mut client: Option<Client> = None;
let mut hash_valid = false;
if !force_new {
let pool = self.get_endpoint_pool(&conn_info.hostname).await;
let pool = self.get_or_create_endpoint_pool(&conn_info.hostname);
let mut hash = None;
// find a pool entry by (dbname, username) if exists
let mut pool = pool.lock();
let pool_entries = pool.pools.get_mut(&conn_info.db_and_user());
if let Some(pool_entries) = pool_entries {
if let Some(entry) = pool_entries.pop() {
client = Some(entry.conn);
pool.total_conns -= 1;
{
let pool = pool.read();
if let Some(pool_entries) = pool.pools.get(&conn_info.db_and_user()) {
if !pool_entries.conns.is_empty() {
hash = pool_entries.password_hash.clone();
}
}
}
// a connection exists in the pool, verify the password hash
if let Some(hash) = hash {
let pw = conn_info.password.clone();
let validate = tokio::task::spawn_blocking(move || {
Pbkdf2.verify_password(pw.as_bytes(), &hash.password_hash())
})
.await?;
// if the hash is invalid, don't error
// we will continue with the regular connection flow
if validate.is_ok() {
hash_valid = true;
let mut pool = pool.write();
if let Some(pool_entries) = pool.pools.get_mut(&conn_info.db_and_user()) {
if let Some(entry) = pool_entries.conns.pop() {
client = Some(entry.conn);
pool.total_conns -= 1;
}
}
}
}
}
// ok return cached connection if found and establish a new one otherwise
if let Some(client) = client {
if client.is_closed() {
let new_client = if let Some(client) = client {
if client.inner.is_closed() {
info!("pool: cached connection '{conn_info}' is closed, opening a new one");
connect_to_compute(self.proxy_config, conn_info).await
connect_to_compute(self.proxy_config, conn_info, session_id).await
} else {
info!("pool: reusing connection '{conn_info}'");
Ok(client)
client.session.send(session_id)?;
return Ok(client);
}
} else {
info!("pool: opening a new connection '{conn_info}'");
connect_to_compute(self.proxy_config, conn_info).await
connect_to_compute(self.proxy_config, conn_info, session_id).await
};
match &new_client {
// clear the hash. it's no longer valid
// TODO: update tokio-postgres fork to allow access to this error kind directly
Err(err)
if hash_valid && err.to_string().contains("password authentication failed") =>
{
let pool = self.get_or_create_endpoint_pool(&conn_info.hostname);
let mut pool = pool.write();
if let Some(entry) = pool.pools.get_mut(&conn_info.db_and_user()) {
entry.password_hash = None;
}
}
// new password is valid and we should insert/update it
Ok(_) if !force_new && !hash_valid => {
let pw = conn_info.password.clone();
let new_hash = tokio::task::spawn_blocking(move || {
let salt = SaltString::generate(rand::rngs::OsRng);
Pbkdf2
.hash_password_customized(pw.as_bytes(), None, None, PARAMS, &salt)
.map(|s| s.serialize())
})
.await??;
let pool = self.get_or_create_endpoint_pool(&conn_info.hostname);
let mut pool = pool.write();
pool.pools
.entry(conn_info.db_and_user())
.or_default()
.password_hash = Some(new_hash);
}
_ => {}
}
new_client
}
pub async fn put(
&self,
conn_info: &ConnInfo,
client: tokio_postgres::Client,
) -> anyhow::Result<()> {
let pool = self.get_endpoint_pool(&conn_info.hostname).await;
pub fn put(&self, conn_info: &ConnInfo, client: Client) -> anyhow::Result<()> {
// We want to hold this open while we return. This ensures that the pool can't close
// while we are in the middle of returning the connection.
let closed = self.closed.read();
if *closed {
info!("pool: throwing away connection '{conn_info}' because pool is closed");
return Ok(());
}
if client.inner.is_closed() {
info!("pool: throwing away connection '{conn_info}' because connection is closed");
return Ok(());
}
let pool = self.get_or_create_endpoint_pool(&conn_info.hostname);
// return connection to the pool
let mut total_conns;
let mut returned = false;
let mut per_db_size = 0;
{
let mut pool = pool.lock();
total_conns = pool.total_conns;
let total_conns = {
let mut pool = pool.write();
let pool_entries: &mut Vec<ConnPoolEntry> = pool
.pools
.entry(conn_info.db_and_user())
.or_insert_with(|| Vec::with_capacity(1));
if total_conns < self.max_conns_per_endpoint {
pool_entries.push(ConnPoolEntry {
conn: client,
_last_access: std::time::Instant::now(),
});
if pool.total_conns < self.max_conns_per_endpoint {
// we create this db-user entry in get, so it should not be None
if let Some(pool_entries) = pool.pools.get_mut(&conn_info.db_and_user()) {
pool_entries.conns.push(ConnPoolEntry {
conn: client,
_last_access: std::time::Instant::now(),
});
total_conns += 1;
returned = true;
per_db_size = pool_entries.len();
returned = true;
per_db_size = pool_entries.conns.len();
pool.total_conns += 1;
pool.total_conns += 1;
}
}
}
pool.total_conns
};
// do logging outside of the mutex
if returned {
@@ -157,25 +277,35 @@ impl GlobalConnPool {
Ok(())
}
async fn get_endpoint_pool(&self, endpoint: &String) -> Arc<Mutex<EndpointConnPool>> {
fn get_or_create_endpoint_pool(&self, endpoint: &String) -> Arc<RwLock<EndpointConnPool>> {
// fast path
if let Some(pool) = self.global_pool.get(endpoint) {
return pool.clone();
}
// slow path
let new_pool = Arc::new(RwLock::new(EndpointConnPool {
pools: HashMap::new(),
total_conns: 0,
}));
// find or create a pool for this endpoint
let mut created = false;
let mut global_pool = self.global_pool.lock();
let pool = global_pool
let pool = self
.global_pool
.entry(endpoint.clone())
.or_insert_with(|| {
created = true;
Arc::new(Mutex::new(EndpointConnPool {
pools: HashMap::new(),
total_conns: 0,
}))
new_pool
})
.clone();
let global_pool_size = global_pool.len();
drop(global_pool);
// log new global pool size
if created {
let global_pool_size = self
.global_pool_size
.fetch_add(1, atomic::Ordering::Relaxed)
+ 1;
info!(
"pool: created new pool for '{endpoint}', global pool size now {global_pool_size}"
);
@@ -187,11 +317,12 @@ impl GlobalConnPool {
struct TokioMechanism<'a> {
conn_info: &'a ConnInfo,
session_id: uuid::Uuid,
}
#[async_trait]
impl ConnectMechanism for TokioMechanism<'_> {
type Connection = tokio_postgres::Client;
type Connection = Client;
type ConnectError = tokio_postgres::Error;
type Error = anyhow::Error;
@@ -200,7 +331,7 @@ impl ConnectMechanism for TokioMechanism<'_> {
node_info: &console::CachedNodeInfo,
timeout: time::Duration,
) -> Result<Self::Connection, Self::ConnectError> {
connect_to_compute_once(node_info, self.conn_info, timeout).await
connect_to_compute_once(node_info, self.conn_info, timeout, self.session_id).await
}
fn update_connect_config(&self, _config: &mut compute::ConnCfg) {}
@@ -213,7 +344,8 @@ impl ConnectMechanism for TokioMechanism<'_> {
async fn connect_to_compute(
config: &config::ProxyConfig,
conn_info: &ConnInfo,
) -> anyhow::Result<tokio_postgres::Client> {
session_id: uuid::Uuid,
) -> anyhow::Result<Client> {
let tls = config.tls_config.as_ref();
let common_names = tls.and_then(|tls| tls.common_names.clone());
@@ -244,17 +376,27 @@ async fn connect_to_compute(
.await?
.context("missing cache entry from wake_compute")?;
crate::proxy::connect_to_compute(&TokioMechanism { conn_info }, node_info, &extra, &creds).await
crate::proxy::connect_to_compute(
&TokioMechanism {
conn_info,
session_id,
},
node_info,
&extra,
&creds,
)
.await
}
async fn connect_to_compute_once(
node_info: &console::CachedNodeInfo,
conn_info: &ConnInfo,
timeout: time::Duration,
) -> Result<tokio_postgres::Client, tokio_postgres::Error> {
mut session: uuid::Uuid,
) -> Result<Client, tokio_postgres::Error> {
let mut config = (*node_info.config).clone();
let (client, connection) = config
let (client, mut connection) = config
.user(&conn_info.username)
.password(&conn_info.password)
.dbname(&conn_info.dbname)
@@ -263,11 +405,53 @@ async fn connect_to_compute_once(
.connect(tokio_postgres::NoTls)
.await?;
tokio::spawn(async move {
if let Err(e) = connection.await {
error!("connection error: {}", e);
}
let (tx, mut rx) = tokio::sync::watch::channel(session);
let conn_id = uuid::Uuid::new_v4();
let span = info_span!(parent: None, "connection", %conn_info, %conn_id);
span.in_scope(|| {
info!(%session, "new connection");
});
Ok(client)
tokio::spawn(
poll_fn(move |cx| {
if matches!(rx.has_changed(), Ok(true)) {
session = *rx.borrow_and_update();
info!(%session, "changed session");
}
let message = ready!(connection.poll_message(cx));
match message {
Some(Ok(AsyncMessage::Notice(notice))) => {
info!(%session, "notice: {}", notice);
Poll::Pending
}
Some(Ok(AsyncMessage::Notification(notif))) => {
warn!(%session, pid = notif.process_id(), channel = notif.channel(), "notification received");
Poll::Pending
}
Some(Ok(_)) => {
warn!(%session, "unknown message");
Poll::Pending
}
Some(Err(e)) => {
error!(%session, "connection error: {}", e);
Poll::Ready(())
}
None => Poll::Ready(()),
}
})
.instrument(span)
);
Ok(Client {
inner: client,
session: tx,
})
}
pub struct Client {
pub inner: tokio_postgres::Client,
session: tokio::sync::watch::Sender<uuid::Uuid>,
}

View File

@@ -27,11 +27,16 @@ struct QueryData {
params: Vec<serde_json::Value>,
}
#[derive(serde::Deserialize)]
struct BatchQueryData {
queries: Vec<QueryData>,
}
#[derive(serde::Deserialize)]
#[serde(untagged)]
enum Payload {
Single(QueryData),
Batch(Vec<QueryData>),
Batch(BatchQueryData),
}
pub const MAX_RESPONSE_SIZE: usize = 10 * 1024 * 1024; // 10 MB
@@ -42,6 +47,7 @@ static ARRAY_MODE: HeaderName = HeaderName::from_static("neon-array-mode");
static ALLOW_POOL: HeaderName = HeaderName::from_static("neon-pool-opt-in");
static TXN_ISOLATION_LEVEL: HeaderName = HeaderName::from_static("neon-batch-isolation-level");
static TXN_READ_ONLY: HeaderName = HeaderName::from_static("neon-batch-read-only");
static TXN_DEFERRABLE: HeaderName = HeaderName::from_static("neon-batch-deferrable");
static HEADER_VALUE_TRUE: HeaderValue = HeaderValue::from_static("true");
@@ -175,6 +181,7 @@ pub async fn handle(
request: Request<Body>,
sni_hostname: Option<String>,
conn_pool: Arc<GlobalConnPool>,
session_id: uuid::Uuid,
) -> anyhow::Result<(Value, HashMap<HeaderName, HeaderValue>)> {
//
// Determine the destination and connection params
@@ -190,7 +197,7 @@ pub async fn handle(
// Allow connection pooling only if explicitly requested
let allow_pool = headers.get(&ALLOW_POOL) == Some(&HEADER_VALUE_TRUE);
// isolation level and read only
// isolation level, read only and deferrable
let txn_isolation_level_raw = headers.get(&TXN_ISOLATION_LEVEL).cloned();
let txn_isolation_level = match txn_isolation_level_raw {
@@ -204,8 +211,8 @@ pub async fn handle(
None => None,
};
let txn_read_only_raw = headers.get(&TXN_READ_ONLY).cloned();
let txn_read_only = txn_read_only_raw.as_ref() == Some(&HEADER_VALUE_TRUE);
let txn_read_only = headers.get(&TXN_READ_ONLY) == Some(&HEADER_VALUE_TRUE);
let txn_deferrable = headers.get(&TXN_DEFERRABLE) == Some(&HEADER_VALUE_TRUE);
let request_content_length = match request.body().size_hint().upper() {
Some(v) => v,
@@ -224,26 +231,29 @@ pub async fn handle(
let body = hyper::body::to_bytes(request.into_body()).await?;
let payload: Payload = serde_json::from_slice(&body)?;
let mut client = conn_pool.get(&conn_info, !allow_pool).await?;
let mut client = conn_pool.get(&conn_info, !allow_pool, session_id).await?;
//
// Now execute the query and return the result
//
let result = match payload {
Payload::Single(query) => query_to_json(&client, query, raw_output, array_mode)
Payload::Single(query) => query_to_json(&client.inner, query, raw_output, array_mode)
.await
.map(|x| (x, HashMap::default())),
Payload::Batch(queries) => {
Payload::Batch(batch_query) => {
let mut results = Vec::new();
let mut builder = client.build_transaction();
let mut builder = client.inner.build_transaction();
if let Some(isolation_level) = txn_isolation_level {
builder = builder.isolation_level(isolation_level);
}
if txn_read_only {
builder = builder.read_only(true);
}
if txn_deferrable {
builder = builder.deferrable(true);
}
let transaction = builder.start().await?;
for query in queries {
for query in batch_query.queries {
let result = query_to_json(&transaction, query, raw_output, array_mode).await;
match result {
Ok(r) => results.push(r),
@@ -255,21 +265,31 @@ pub async fn handle(
}
transaction.commit().await?;
let mut headers = HashMap::default();
headers.insert(
TXN_READ_ONLY.clone(),
HeaderValue::try_from(txn_read_only.to_string())?,
);
if let Some(txn_isolation_level_raw) = txn_isolation_level_raw {
headers.insert(TXN_ISOLATION_LEVEL.clone(), txn_isolation_level_raw);
if txn_read_only {
headers.insert(
TXN_READ_ONLY.clone(),
HeaderValue::try_from(txn_read_only.to_string())?,
);
}
if txn_deferrable {
headers.insert(
TXN_DEFERRABLE.clone(),
HeaderValue::try_from(txn_deferrable.to_string())?,
);
}
if let Some(txn_isolation_level) = txn_isolation_level_raw {
headers.insert(TXN_ISOLATION_LEVEL.clone(), txn_isolation_level);
}
Ok((json!({ "results": results }), headers))
}
};
if allow_pool {
let current_span = tracing::Span::current();
// return connection to the pool
tokio::task::spawn(async move {
let _ = conn_pool.put(&conn_info, client).await;
tokio::task::spawn_blocking(move || {
let _span = current_span.enter();
let _ = conn_pool.put(&conn_info, client);
});
}

View File

@@ -203,7 +203,7 @@ async fn ws_handler(
// TODO: that deserves a refactor as now this function also handles http json client besides websockets.
// Right now I don't want to blow up sql-over-http patch with file renames and do that as a follow up instead.
} else if request.uri().path() == "/sql" && request.method() == Method::POST {
let result = sql_over_http::handle(request, sni_hostname, conn_pool)
let result = sql_over_http::handle(request, sni_hostname, conn_pool, session_id)
.instrument(info_span!("sql-over-http"))
.await;
let status_code = match result {
@@ -269,6 +269,18 @@ pub async fn task_main(
let conn_pool: Arc<GlobalConnPool> = GlobalConnPool::new(config);
// shutdown the connection pool
tokio::spawn({
let cancellation_token = cancellation_token.clone();
let conn_pool = conn_pool.clone();
async move {
cancellation_token.cancelled().await;
tokio::task::spawn_blocking(move || conn_pool.shutdown())
.await
.unwrap();
}
});
let tls_config = config.tls_config.as_ref().map(|cfg| cfg.to_server_config());
let tls_acceptor: tokio_rustls::TlsAcceptor = match tls_config {
Some(config) => config.into(),
@@ -307,7 +319,7 @@ pub async fn task_main(
ws_handler(req, config, conn_pool, cancel_map, session_id, sni_name)
.instrument(info_span!(
"ws-client",
session = format_args!("{session_id}")
session = %session_id
))
.await
}

View File

@@ -4,6 +4,7 @@ use super::{messages::ServerMessage, Mechanism};
use crate::stream::PqStream;
use std::io;
use tokio::io::{AsyncRead, AsyncWrite};
use tracing::info;
/// Abstracts away all peculiarities of the libpq's protocol.
pub struct SaslStream<'a, S> {
@@ -68,7 +69,10 @@ impl<S: AsyncRead + AsyncWrite + Unpin> SaslStream<'_, S> {
) -> super::Result<Outcome<M::Output>> {
loop {
let input = self.recv().await?;
let step = mechanism.exchange(input)?;
let step = mechanism.exchange(input).map_err(|error| {
info!(?error, "error during SASL exchange");
error
})?;
use super::Step;
return Ok(match step {

View File

@@ -15,6 +15,7 @@ use toml_edit::Document;
use std::fs::{self, File};
use std::io::{ErrorKind, Write};
use std::path::{Path, PathBuf};
use std::str::FromStr;
use std::sync::Arc;
use std::time::Duration;
use storage_broker::Uri;
@@ -122,9 +123,24 @@ struct Args {
/// WAL backup horizon.
#[arg(long)]
disable_wal_backup: bool,
/// Path to a .pem public key which is used to check JWT tokens.
#[arg(long)]
auth_validation_public_key_path: Option<PathBuf>,
/// If given, enables auth on incoming connections to WAL service endpoint
/// (--listen-pg). Value specifies path to a .pem public key used for
/// validations of JWT tokens. Empty string is allowed and means disabling
/// auth.
#[arg(long, verbatim_doc_comment, value_parser = opt_pathbuf_parser)]
pg_auth_public_key_path: Option<PathBuf>,
/// If given, enables auth on incoming connections to tenant only WAL
/// service endpoint (--listen-pg-tenant-only). Value specifies path to a
/// .pem public key used for validations of JWT tokens. Empty string is
/// allowed and means disabling auth.
#[arg(long, verbatim_doc_comment, value_parser = opt_pathbuf_parser)]
pg_tenant_only_auth_public_key_path: Option<PathBuf>,
/// If given, enables auth on incoming connections to http management
/// service endpoint (--listen-http). Value specifies path to a .pem public
/// key used for validations of JWT tokens. Empty string is allowed and
/// means disabling auth.
#[arg(long, verbatim_doc_comment, value_parser = opt_pathbuf_parser)]
http_auth_public_key_path: Option<PathBuf>,
/// Format for logging, either 'plain' or 'json'.
#[arg(long, default_value = "plain")]
log_format: String,
@@ -134,9 +150,39 @@ struct Args {
current_thread_runtime: bool,
}
// Like PathBufValueParser, but allows empty string.
fn opt_pathbuf_parser(s: &str) -> Result<PathBuf, String> {
Ok(PathBuf::from_str(s).unwrap())
}
#[tokio::main(flavor = "current_thread")]
async fn main() -> anyhow::Result<()> {
let args = Args::parse();
// We want to allow multiple occurences of the same arg (taking the last) so
// that neon_local could generate command with defaults + overrides without
// getting 'argument cannot be used multiple times' error. This seems to be
// impossible with pure Derive API, so convert struct to Command, modify it,
// parse arguments, and then fill the struct back.
let cmd = <Args as clap::CommandFactory>::command().args_override_self(true);
let mut matches = cmd.get_matches();
let mut args = <Args as clap::FromArgMatches>::from_arg_matches_mut(&mut matches)?;
// I failed to modify opt_pathbuf_parser to return Option<PathBuf> in
// reasonable time, so turn empty string into option post factum.
if let Some(pb) = &args.pg_auth_public_key_path {
if pb.as_os_str().is_empty() {
args.pg_auth_public_key_path = None;
}
}
if let Some(pb) = &args.pg_tenant_only_auth_public_key_path {
if pb.as_os_str().is_empty() {
args.pg_tenant_only_auth_public_key_path = None;
}
}
if let Some(pb) = &args.http_auth_public_key_path {
if pb.as_os_str().is_empty() {
args.http_auth_public_key_path = None;
}
}
if let Some(addr) = args.dump_control_file {
let state = control_file::FileStorage::load_control_file(addr)?;
@@ -170,13 +216,40 @@ async fn main() -> anyhow::Result<()> {
return Ok(());
}
let auth = match args.auth_validation_public_key_path.as_ref() {
let pg_auth = match args.pg_auth_public_key_path.as_ref() {
None => {
info!("auth is disabled");
info!("pg auth is disabled");
None
}
Some(path) => {
info!("loading JWT auth key from {}", path.display());
info!("loading pg auth JWT key from {}", path.display());
Some(Arc::new(
JwtAuth::from_key_path(path).context("failed to load the auth key")?,
))
}
};
let pg_tenant_only_auth = match args.pg_tenant_only_auth_public_key_path.as_ref() {
None => {
info!("pg tenant only auth is disabled");
None
}
Some(path) => {
info!(
"loading pg tenant only auth JWT key from {}",
path.display()
);
Some(Arc::new(
JwtAuth::from_key_path(path).context("failed to load the auth key")?,
))
}
};
let http_auth = match args.http_auth_public_key_path.as_ref() {
None => {
info!("http auth is disabled");
None
}
Some(path) => {
info!("loading http auth JWT key from {}", path.display());
Some(Arc::new(
JwtAuth::from_key_path(path).context("failed to load the auth key")?,
))
@@ -199,7 +272,9 @@ async fn main() -> anyhow::Result<()> {
max_offloader_lag_bytes: args.max_offloader_lag,
wal_backup_enabled: !args.disable_wal_backup,
backup_parallel_jobs: args.wal_backup_parallel_jobs,
auth,
pg_auth,
pg_tenant_only_auth,
http_auth,
current_thread_runtime: args.current_thread_runtime,
};
@@ -288,7 +363,7 @@ async fn start_safekeeper(conf: SafeKeeperConf) -> Result<()> {
.spawn(wal_service::task_main(
conf_,
pg_listener,
Some(Scope::SafekeeperData),
Scope::SafekeeperData,
))
// wrap with task name for error reporting
.map(|res| ("WAL service main".to_owned(), res));
@@ -302,7 +377,7 @@ async fn start_safekeeper(conf: SafeKeeperConf) -> Result<()> {
.spawn(wal_service::task_main(
conf_,
pg_listener_tenant_only,
Some(Scope::Tenant),
Scope::Tenant,
))
// wrap with task name for error reporting
.map(|res| ("WAL service tenant only main".to_owned(), res));

View File

@@ -4,6 +4,7 @@
use anyhow::Context;
use std::str::FromStr;
use std::str::{self};
use std::sync::Arc;
use tokio::io::{AsyncRead, AsyncWrite};
use tracing::{info, info_span, Instrument};
@@ -20,7 +21,7 @@ use postgres_backend::{self, PostgresBackend};
use postgres_ffi::PG_TLI;
use pq_proto::{BeMessage, FeStartupPacket, RowDescriptor, INT4_OID, TEXT_OID};
use regex::Regex;
use utils::auth::{Claims, Scope};
use utils::auth::{Claims, JwtAuth, Scope};
use utils::{
id::{TenantId, TenantTimelineId, TimelineId},
lsn::Lsn,
@@ -36,8 +37,8 @@ pub struct SafekeeperPostgresHandler {
pub ttid: TenantTimelineId,
/// Unique connection id is logged in spans for observability.
pub conn_id: ConnectionId,
/// Auth scope allowed on the connections. None if auth is not configured.
allowed_auth_scope: Option<Scope>,
/// Auth scope allowed on the connections and public key used to check auth tokens. None if auth is not configured.
auth: Option<(Scope, Arc<JwtAuth>)>,
claims: Option<Claims>,
io_metrics: Option<TrafficMetrics>,
}
@@ -154,18 +155,17 @@ impl<IO: AsyncRead + AsyncWrite + Unpin + Send> postgres_backend::Handler<IO>
) -> Result<(), QueryError> {
// this unwrap is never triggered, because check_auth_jwt only called when auth_type is NeonJWT
// which requires auth to be present
let data = self
.conf
let (allowed_auth_scope, auth) = self
.auth
.as_ref()
.unwrap()
.decode(str::from_utf8(jwt_response).context("jwt response is not UTF-8")?)?;
.expect("auth_type is configured but .auth of handler is missing");
let data =
auth.decode(str::from_utf8(jwt_response).context("jwt response is not UTF-8")?)?;
let scope = self
.allowed_auth_scope
.expect("auth is enabled but scope is not configured");
// The handler might be configured to allow only tenant scope tokens.
if matches!(scope, Scope::Tenant) && !matches!(data.claims.scope, Scope::Tenant) {
if matches!(allowed_auth_scope, Scope::Tenant)
&& !matches!(data.claims.scope, Scope::Tenant)
{
return Err(QueryError::Other(anyhow::anyhow!(
"passed JWT token is for full access, but only tenant scope is allowed"
)));
@@ -244,7 +244,7 @@ impl SafekeeperPostgresHandler {
conf: SafeKeeperConf,
conn_id: u32,
io_metrics: Option<TrafficMetrics>,
allowed_auth_scope: Option<Scope>,
auth: Option<(Scope, Arc<JwtAuth>)>,
) -> Self {
SafekeeperPostgresHandler {
conf,
@@ -254,7 +254,7 @@ impl SafekeeperPostgresHandler {
ttid: TenantTimelineId::empty(),
conn_id,
claims: None,
allowed_auth_scope,
auth,
io_metrics,
}
}
@@ -262,7 +262,7 @@ impl SafekeeperPostgresHandler {
// when accessing management api supply None as an argument
// when using to authorize tenant pass corresponding tenant id
fn check_permission(&self, tenant_id: Option<TenantId>) -> anyhow::Result<()> {
if self.conf.auth.is_none() {
if self.auth.is_none() {
// auth is set to Trust, nothing to check so just return ok
return Ok(());
}

View File

@@ -15,8 +15,10 @@ use tokio::fs::File;
use tokio::io::AsyncReadExt;
use utils::http::endpoint::request_span;
use crate::receive_wal::WalReceiverState;
use crate::safekeeper::ServerInfo;
use crate::safekeeper::Term;
use crate::send_wal::WalSenderState;
use crate::{debug_dump, pull_timeline};
use crate::timelines_global_map::TimelineDeleteForceResult;
@@ -99,6 +101,8 @@ pub struct TimelineStatus {
pub peer_horizon_lsn: Lsn,
#[serde_as(as = "DisplayFromStr")]
pub remote_consistent_lsn: Lsn,
pub walsenders: Vec<WalSenderState>,
pub walreceivers: Vec<WalReceiverState>,
}
fn check_permission(request: &Request<Body>, tenant_id: Option<TenantId>) -> Result<(), ApiError> {
@@ -149,6 +153,8 @@ async fn timeline_status_handler(request: Request<Body>) -> Result<Response<Body
backup_lsn: inmem.backup_lsn,
peer_horizon_lsn: inmem.peer_horizon_lsn,
remote_consistent_lsn: tli.get_walsenders().get_remote_consistent_lsn(),
walsenders: tli.get_walsenders().get_all(),
walreceivers: tli.get_walreceivers().get_all(),
};
json_response(StatusCode::OK, status)
}
@@ -359,7 +365,7 @@ async fn dump_debug_handler(mut request: Request<Body>) -> Result<Response<Body>
/// Safekeeper http router.
pub fn make_router(conf: SafeKeeperConf) -> RouterBuilder<hyper::Body, ApiError> {
let mut router = endpoint::make_router();
if conf.auth.is_some() {
if conf.http_auth.is_some() {
router = router.middleware(auth_middleware(|request| {
#[allow(clippy::mutable_key_type)]
static ALLOWLIST_ROUTES: Lazy<HashSet<Uri>> =
@@ -375,7 +381,7 @@ pub fn make_router(conf: SafeKeeperConf) -> RouterBuilder<hyper::Body, ApiError>
// NB: on any changes do not forget to update the OpenAPI spec
// located nearby (/safekeeper/src/http/openapi_spec.yaml).
let auth = conf.auth.clone();
let auth = conf.http_auth.clone();
router
.data(Arc::new(conf))
.data(auth)

View File

@@ -65,7 +65,9 @@ pub struct SafeKeeperConf {
pub max_offloader_lag_bytes: u64,
pub backup_parallel_jobs: usize,
pub wal_backup_enabled: bool,
pub auth: Option<Arc<JwtAuth>>,
pub pg_auth: Option<Arc<JwtAuth>>,
pub pg_tenant_only_auth: Option<Arc<JwtAuth>>,
pub http_auth: Option<Arc<JwtAuth>>,
pub current_thread_runtime: bool,
}
@@ -99,7 +101,9 @@ impl SafeKeeperConf {
broker_keepalive_interval: Duration::from_secs(5),
wal_backup_enabled: true,
backup_parallel_jobs: 1,
auth: None,
pg_auth: None,
pg_tenant_only_auth: None,
http_auth: None,
heartbeat_timeout: Duration::new(5, 0),
max_offloader_lag_bytes: defaults::DEFAULT_MAX_OFFLOADER_LAG_BYTES,
current_thread_runtime: false,

View File

@@ -11,11 +11,16 @@ use crate::wal_service::ConnectionId;
use crate::GlobalTimelines;
use anyhow::{anyhow, Context};
use bytes::BytesMut;
use parking_lot::MappedMutexGuard;
use parking_lot::Mutex;
use parking_lot::MutexGuard;
use postgres_backend::CopyStreamHandlerEnd;
use postgres_backend::PostgresBackend;
use postgres_backend::PostgresBackendReader;
use postgres_backend::QueryError;
use pq_proto::BeMessage;
use serde::Deserialize;
use serde::Serialize;
use std::net::SocketAddr;
use std::sync::Arc;
use tokio::io::AsyncRead;
@@ -32,6 +37,105 @@ use tracing::*;
use utils::id::TenantTimelineId;
use utils::lsn::Lsn;
/// Registry of WalReceivers (compute connections). Timeline holds it (wrapped
/// in Arc).
pub struct WalReceivers {
mutex: Mutex<WalReceiversShared>,
}
/// Id under which walreceiver is registered in shmem.
type WalReceiverId = usize;
impl WalReceivers {
pub fn new() -> Arc<WalReceivers> {
Arc::new(WalReceivers {
mutex: Mutex::new(WalReceiversShared { slots: Vec::new() }),
})
}
/// Register new walreceiver. Returned guard provides access to the slot and
/// automatically deregisters in Drop.
pub fn register(self: &Arc<WalReceivers>) -> WalReceiverGuard {
let slots = &mut self.mutex.lock().slots;
let walreceiver = WalReceiverState::Voting;
// find empty slot or create new one
let pos = if let Some(pos) = slots.iter().position(|s| s.is_none()) {
slots[pos] = Some(walreceiver);
pos
} else {
let pos = slots.len();
slots.push(Some(walreceiver));
pos
};
WalReceiverGuard {
id: pos,
walreceivers: self.clone(),
}
}
/// Get reference to locked slot contents. Slot must exist (registered
/// earlier).
fn get_slot<'a>(
self: &'a Arc<WalReceivers>,
id: WalReceiverId,
) -> MappedMutexGuard<'a, WalReceiverState> {
MutexGuard::map(self.mutex.lock(), |locked| {
locked.slots[id]
.as_mut()
.expect("walreceiver doesn't exist")
})
}
/// Get number of walreceivers (compute connections).
pub fn get_num(self: &Arc<WalReceivers>) -> usize {
self.mutex.lock().slots.iter().flatten().count()
}
/// Get state of all walreceivers.
pub fn get_all(self: &Arc<WalReceivers>) -> Vec<WalReceiverState> {
self.mutex.lock().slots.iter().flatten().cloned().collect()
}
/// Unregister walsender.
fn unregister(self: &Arc<WalReceivers>, id: WalReceiverId) {
let mut shared = self.mutex.lock();
shared.slots[id] = None;
}
}
/// Only a few connections are expected (normally one), so store in Vec.
struct WalReceiversShared {
slots: Vec<Option<WalReceiverState>>,
}
/// Walreceiver status. Currently only whether it passed voting stage and
/// started receiving the stream, but it is easy to add more if needed.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum WalReceiverState {
Voting,
Streaming,
}
/// Scope guard to access slot in WalSenders registry and unregister from it in
/// Drop.
pub struct WalReceiverGuard {
id: WalReceiverId,
walreceivers: Arc<WalReceivers>,
}
impl WalReceiverGuard {
/// Get reference to locked shared state contents.
fn get(&self) -> MappedMutexGuard<WalReceiverState> {
self.walreceivers.get_slot(self.id)
}
}
impl Drop for WalReceiverGuard {
fn drop(&mut self) {
self.walreceivers.unregister(self.id);
}
}
const MSG_QUEUE_SIZE: usize = 256;
const REPLY_QUEUE_SIZE: usize = 16;
@@ -246,10 +350,13 @@ impl WalAcceptor {
/// it must mean that network thread terminated.
async fn run(&mut self) -> anyhow::Result<()> {
// Register the connection and defer unregister.
self.tli.on_compute_connect().await?;
let _guard = ComputeConnectionGuard {
// Order of the next two lines is important: we want first to remove our entry and then
// update status which depends on registered connections.
let _compute_conn_guard = ComputeConnectionGuard {
timeline: Arc::clone(&self.tli),
};
let walreceiver_guard = self.tli.get_walreceivers().register();
self.tli.update_status_notify().await?;
// After this timestamp we will stop processing AppendRequests and send a response
// to the walproposer. walproposer sends at least one AppendRequest per second,
@@ -263,6 +370,11 @@ impl WalAcceptor {
}
let mut next_msg = opt_msg.unwrap();
// Update walreceiver state in shmem for reporting.
if let ProposerAcceptorMessage::Elected(_) = &next_msg {
*walreceiver_guard.get() = WalReceiverState::Streaming;
}
let reply_msg = if matches!(next_msg, ProposerAcceptorMessage::AppendRequest(_)) {
// loop through AppendRequest's while it's readily available to
// write as many WAL as possible without fsyncing
@@ -311,6 +423,7 @@ impl WalAcceptor {
}
}
/// Calls update_status_notify in drop to update timeline status.
struct ComputeConnectionGuard {
timeline: Arc<Timeline>,
}
@@ -318,11 +431,9 @@ struct ComputeConnectionGuard {
impl Drop for ComputeConnectionGuard {
fn drop(&mut self) {
let tli = self.timeline.clone();
// tokio forbids to call blocking_send inside the runtime, and see
// comments in on_compute_disconnect why we call blocking_send.
tokio::spawn(async move {
if let Err(e) = tli.on_compute_disconnect().await {
error!("failed to unregister compute connection: {}", e);
if let Err(e) = tli.update_status_notify().await {
error!("failed to update timeline status: {}", e);
}
});
}

View File

@@ -23,6 +23,7 @@ use utils::{
use storage_broker::proto::SafekeeperTimelineInfo;
use storage_broker::proto::TenantTimelineId as ProtoTenantTimelineId;
use crate::receive_wal::WalReceivers;
use crate::safekeeper::{
AcceptorProposerMessage, ProposerAcceptorMessage, SafeKeeper, SafeKeeperState,
SafekeeperMemState, ServerInfo, Term,
@@ -164,8 +165,8 @@ impl SharedState {
})
}
fn is_active(&self, remote_consistent_lsn: Lsn) -> bool {
self.is_wal_backup_required()
fn is_active(&self, num_computes: usize, remote_consistent_lsn: Lsn) -> bool {
self.is_wal_backup_required(num_computes)
// FIXME: add tracking of relevant pageservers and check them here individually,
// otherwise migration won't work (we suspend too early).
|| remote_consistent_lsn < self.sk.inmem.commit_lsn
@@ -173,29 +174,34 @@ impl SharedState {
/// Mark timeline active/inactive and return whether s3 offloading requires
/// start/stop action.
fn update_status(&mut self, remote_consistent_lsn: Lsn, ttid: TenantTimelineId) -> bool {
let is_active = self.is_active(remote_consistent_lsn);
fn update_status(
&mut self,
num_computes: usize,
remote_consistent_lsn: Lsn,
ttid: TenantTimelineId,
) -> bool {
let is_active = self.is_active(num_computes, remote_consistent_lsn);
if self.active != is_active {
info!("timeline {} active={} now", ttid, is_active);
}
self.active = is_active;
self.is_wal_backup_action_pending()
self.is_wal_backup_action_pending(num_computes)
}
/// Should we run s3 offloading in current state?
fn is_wal_backup_required(&self) -> bool {
fn is_wal_backup_required(&self, num_computes: usize) -> bool {
let seg_size = self.get_wal_seg_size();
self.num_computes > 0 ||
num_computes > 0 ||
// Currently only the whole segment is offloaded, so compare segment numbers.
(self.sk.inmem.commit_lsn.segment_number(seg_size) >
self.sk.inmem.backup_lsn.segment_number(seg_size))
(self.sk.inmem.commit_lsn.segment_number(seg_size) >
self.sk.inmem.backup_lsn.segment_number(seg_size))
}
/// Is current state of s3 offloading is not what it ought to be?
fn is_wal_backup_action_pending(&self) -> bool {
let res = self.wal_backup_active != self.is_wal_backup_required();
fn is_wal_backup_action_pending(&self, num_computes: usize) -> bool {
let res = self.wal_backup_active != self.is_wal_backup_required(num_computes);
if res {
let action_pending = if self.is_wal_backup_required() {
let action_pending = if self.is_wal_backup_required(num_computes) {
"start"
} else {
"stop"
@@ -210,8 +216,8 @@ impl SharedState {
/// Returns whether s3 offloading is required and sets current status as
/// matching.
fn wal_backup_attend(&mut self) -> bool {
self.wal_backup_active = self.is_wal_backup_required();
fn wal_backup_attend(&mut self, num_computes: usize) -> bool {
self.wal_backup_active = self.is_wal_backup_required(num_computes);
self.wal_backup_active
}
@@ -295,6 +301,7 @@ pub struct Timeline {
/// while holding it, ensuring that consensus checks are in order.
mutex: Mutex<SharedState>,
walsenders: Arc<WalSenders>,
walreceivers: Arc<WalReceivers>,
/// Cancellation channel. Delete/cancel will send `true` here as a cancellation signal.
cancellation_tx: watch::Sender<bool>,
@@ -329,6 +336,7 @@ impl Timeline {
commit_lsn_watch_rx,
mutex: Mutex::new(shared_state),
walsenders: WalSenders::new(rcl),
walreceivers: WalReceivers::new(),
cancellation_rx,
cancellation_tx,
timeline_dir: conf.timeline_dir(&ttid),
@@ -355,6 +363,7 @@ impl Timeline {
commit_lsn_watch_rx,
mutex: Mutex::new(SharedState::create_new(&conf, &ttid, state)?),
walsenders: WalSenders::new(Lsn(0)),
walreceivers: WalReceivers::new(),
cancellation_rx,
cancellation_tx,
timeline_dir: conf.timeline_dir(&ttid),
@@ -441,40 +450,22 @@ impl Timeline {
}
fn update_status(&self, shared_state: &mut SharedState) -> bool {
shared_state.update_status(self.get_walsenders().get_remote_consistent_lsn(), self.ttid)
shared_state.update_status(
self.walreceivers.get_num(),
self.get_walsenders().get_remote_consistent_lsn(),
self.ttid,
)
}
/// Register compute connection, starting timeline-related activity if it is
/// not running yet.
pub async fn on_compute_connect(&self) -> Result<()> {
/// Update timeline status and kick wal backup launcher to stop/start offloading if needed.
pub async fn update_status_notify(&self) -> Result<()> {
if self.is_cancelled() {
bail!(TimelineError::Cancelled(self.ttid));
}
let is_wal_backup_action_pending: bool;
{
let is_wal_backup_action_pending: bool = {
let mut shared_state = self.write_shared_state().await;
shared_state.num_computes += 1;
is_wal_backup_action_pending = self.update_status(&mut shared_state);
}
// Wake up wal backup launcher, if offloading not started yet.
if is_wal_backup_action_pending {
// Can fail only if channel to a static thread got closed, which is not normal at all.
self.wal_backup_launcher_tx.send(self.ttid).await?;
}
Ok(())
}
/// De-register compute connection, shutting down timeline activity if
/// pageserver doesn't need catchup.
pub async fn on_compute_disconnect(&self) -> Result<()> {
let is_wal_backup_action_pending: bool;
{
let mut shared_state = self.write_shared_state().await;
shared_state.num_computes -= 1;
is_wal_backup_action_pending = self.update_status(&mut shared_state);
}
// Wake up wal backup launcher, if it is time to stop the offloading.
self.update_status(&mut shared_state)
};
if is_wal_backup_action_pending {
// Can fail only if channel to a static thread got closed, which is not normal at all.
self.wal_backup_launcher_tx.send(self.ttid).await?;
@@ -519,7 +510,9 @@ impl Timeline {
return false;
}
self.write_shared_state().await.wal_backup_attend()
self.write_shared_state()
.await
.wal_backup_attend(self.walreceivers.get_num())
}
/// Returns commit_lsn watch channel.
@@ -650,6 +643,10 @@ impl Timeline {
&self.walsenders
}
pub fn get_walreceivers(&self) -> &Arc<WalReceivers> {
&self.walreceivers
}
/// Returns flush_lsn.
pub async fn get_flush_lsn(&self) -> Lsn {
self.write_shared_state().await.sk.wal_store.flush_lsn()

View File

@@ -16,10 +16,13 @@ use crate::SafeKeeperConf;
use postgres_backend::{AuthType, PostgresBackend};
/// Accept incoming TCP connections and spawn them into a background thread.
/// allowed_auth_scope is either SafekeeperData (wide JWT tokens giving access
/// to any tenant are allowed) or Tenant (only tokens giving access to specific
/// tenant are allowed). Doesn't matter if auth is disabled in conf.
pub async fn task_main(
conf: SafeKeeperConf,
pg_listener: std::net::TcpListener,
allowed_auth_scope: Option<Scope>,
allowed_auth_scope: Scope,
) -> anyhow::Result<()> {
// Tokio's from_std won't do this for us, per its comment.
pg_listener.set_nonblocking(true)?;
@@ -50,7 +53,7 @@ async fn handle_socket(
socket: TcpStream,
conf: SafeKeeperConf,
conn_id: ConnectionId,
allowed_auth_scope: Option<Scope>,
allowed_auth_scope: Scope,
) -> Result<(), QueryError> {
socket.set_nodelay(true)?;
let peer_addr = socket.peer_addr()?;
@@ -82,16 +85,17 @@ async fn handle_socket(
},
);
let auth_type = match conf.auth {
let auth_key = match allowed_auth_scope {
Scope::Tenant => conf.pg_tenant_only_auth.clone(),
_ => conf.pg_auth.clone(),
};
let auth_type = match auth_key {
None => AuthType::Trust,
Some(_) => AuthType::NeonJWT,
};
let mut conn_handler = SafekeeperPostgresHandler::new(
conf,
conn_id,
Some(traffic_metrics.clone()),
allowed_auth_scope,
);
let auth_pair = auth_key.map(|key| (allowed_auth_scope, key));
let mut conn_handler =
SafekeeperPostgresHandler::new(conf, conn_id, Some(traffic_metrics.clone()), auth_pair);
let pgbackend = PostgresBackend::new_from_io(socket, peer_addr, auth_type, None)?;
// libpq protocol between safekeeper and walproposer / pageserver
// We don't use shutdown.

View File

@@ -1,76 +0,0 @@
#! /usr/bin/env python3
# Script to generate ext_index.json metadata file
# that stores content of the control files and location of extension archives
# for all extensions in extensions subdir.
import argparse
import json
import subprocess
from pathlib import Path
"""
# ext_index.json example:
{
"public_extensions": [
"anon"
],
"library_index": {
"anon": "anon",
// for more complex extensions like postgis
// we might have something like:
// address_standardizer: postgis
// postgis_tiger: postgis
},
"extension_data": {
"anon": {
"control_data": {
"anon.control": "# PostgreSQL Anonymizer (anon) extension \ncomment = 'Data anonymization tools' \ndefault_version = '1.1.0' \ndirectory='extension/anon' \nrelocatable = false \nrequires = 'pgcrypto' \nsuperuser = false \nmodule_pathname = '$libdir/anon' \ntrusted = true \n"
},
"archive_path": "5648391853/v15/extensions/anon.tar.zst"
}
}
}
"""
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="generate ext_index.json")
parser.add_argument("pg_version", type=str, choices=["v14", "v15"], help="pg_version")
parser.add_argument("BUILD_TAG", type=str, help="BUILD_TAG for this compute image")
parser.add_argument("--public_extensions", type=str, help="list of public extensions")
args = parser.parse_args()
pg_version = args.pg_version
BUILD_TAG = args.BUILD_TAG
public_ext_list = args.public_extensions.split(",")
ext_index = {}
library_index = {}
EXT_PATH = Path("extensions")
for extension in EXT_PATH.iterdir():
if extension.is_dir():
control_data = {}
for control_file in extension.glob("*.control"):
if control_file.suffix != ".control":
continue
with open(control_file, "r") as f:
control_data[control_file.name] = f.read()
ext_index[extension.name] = {
"control_data": control_data,
"archive_path": f"{BUILD_TAG}/{pg_version}/extensions/{extension.name}.tar.zst",
}
elif extension.suffix == ".zst":
file_list = (
str(subprocess.check_output(["tar", "tf", str(extension)]), "utf-8")
.strip()
.split("\n")
)
for file in file_list:
if file.endswith(".so") and file.startswith("lib/"):
lib_name = file[4:-3]
library_index[lib_name] = extension.name.replace(".tar.zst", "")
all_data = {
"public_extensions": public_ext_list,
"library_index": library_index,
"extension_data": ext_index,
}
with open("ext_index.json", "w") as f:
json.dump(all_data, f)

View File

@@ -70,6 +70,7 @@ PAGESERVER_GLOBAL_METRICS: Tuple[str, ...] = (
"pageserver_getpage_reconstruct_seconds_count",
"pageserver_getpage_reconstruct_seconds_sum",
*[f"pageserver_basebackup_query_seconds_{x}" for x in ["bucket", "count", "sum"]],
*histogram("pageserver_smgr_query_seconds_global"),
*histogram("pageserver_read_num_fs_layers"),
*histogram("pageserver_getpage_get_reconstruct_data_seconds"),
*histogram("pageserver_wait_lsn_seconds"),

View File

@@ -427,6 +427,7 @@ class NeonEnvBuilder:
default_branch_name: str = DEFAULT_BRANCH_NAME,
preserve_database_files: bool = False,
initial_tenant: Optional[TenantId] = None,
initial_timeline: Optional[TimelineId] = None,
):
self.repo_dir = repo_dir
self.rust_log_override = rust_log_override
@@ -452,6 +453,7 @@ class NeonEnvBuilder:
self.pg_version = pg_version
self.preserve_database_files = preserve_database_files
self.initial_tenant = initial_tenant or TenantId.generate()
self.initial_timeline = initial_timeline or TimelineId.generate()
def init_configs(self) -> NeonEnv:
# Cannot create more than one environment from one builder
@@ -473,9 +475,10 @@ class NeonEnvBuilder:
f"Services started, creating initial tenant {env.initial_tenant} and its initial timeline"
)
initial_tenant, initial_timeline = env.neon_cli.create_tenant(
tenant_id=env.initial_tenant, conf=initial_tenant_conf
tenant_id=env.initial_tenant, conf=initial_tenant_conf, timeline_id=env.initial_timeline
)
env.initial_timeline = initial_timeline
assert env.initial_tenant == initial_tenant
assert env.initial_timeline == initial_timeline
log.info(f"Initial timeline {initial_tenant}/{initial_timeline} created successfully")
return env
@@ -784,7 +787,7 @@ class NeonEnv:
# generate initial tenant ID here instead of letting 'neon init' generate it,
# so that we don't need to dig it out of the config file afterwards.
self.initial_tenant = config.initial_tenant
self.initial_timeline: Optional[TimelineId] = None
self.initial_timeline = config.initial_timeline
# Create a config file corresponding to the options
toml = textwrap.dedent(
@@ -1313,12 +1316,20 @@ class NeonCli(AbstractNeonCli):
log.info(f"Stopping pageserver with {cmd}")
return self.raw_cli(cmd)
def safekeeper_start(self, id: int) -> "subprocess.CompletedProcess[str]":
def safekeeper_start(
self, id: int, extra_opts: Optional[List[str]] = None
) -> "subprocess.CompletedProcess[str]":
s3_env_vars = None
if self.env.remote_storage is not None and isinstance(self.env.remote_storage, S3Storage):
s3_env_vars = self.env.remote_storage.access_env_vars()
return self.raw_cli(["safekeeper", "start", str(id)], extra_env_vars=s3_env_vars)
if extra_opts is not None:
extra_opts = [f"-e={opt}" for opt in extra_opts]
else:
extra_opts = []
return self.raw_cli(
["safekeeper", "start", str(id), *extra_opts], extra_env_vars=s3_env_vars
)
def safekeeper_stop(
self, id: Optional[int] = None, immediate=False
@@ -1761,6 +1772,15 @@ class VanillaPostgres(PgProtocol):
with open(os.path.join(self.pgdatadir, "postgresql.conf"), "a") as conf_file:
conf_file.write("\n".join(options))
def edit_hba(self, hba: List[str]):
"""Prepend hba lines into pg_hba.conf file."""
assert not self.running
with open(os.path.join(self.pgdatadir, "pg_hba.conf"), "r+") as conf_file:
data = conf_file.read()
conf_file.seek(0)
conf_file.write("\n".join(hba) + "\n")
conf_file.write(data)
def start(self, log_path: Optional[str] = None):
assert not self.running
self.running = True
@@ -2158,15 +2178,18 @@ def static_proxy(
) -> Iterator[NeonProxy]:
"""Neon proxy that routes directly to vanilla postgres."""
# For simplicity, we use the same user for both `--auth-endpoint` and `safe_psql`
vanilla_pg.start()
vanilla_pg.safe_psql("create user proxy with login superuser password 'password'")
port = vanilla_pg.default_options["port"]
host = vanilla_pg.default_options["host"]
dbname = vanilla_pg.default_options["dbname"]
auth_endpoint = f"postgres://proxy:password@{host}:{port}/{dbname}"
# require password for 'http_auth' user
vanilla_pg.edit_hba([f"host {dbname} http_auth {host} password"])
# For simplicity, we use the same user for both `--auth-endpoint` and `safe_psql`
vanilla_pg.start()
vanilla_pg.safe_psql("create user proxy with login superuser password 'password'")
proxy_port = port_distributor.get_port()
mgmt_port = port_distributor.get_port()
http_port = port_distributor.get_port()
@@ -2507,9 +2530,9 @@ class Safekeeper:
id: int
running: bool = False
def start(self) -> "Safekeeper":
def start(self, extra_opts: Optional[List[str]] = None) -> "Safekeeper":
assert self.running is False
self.env.neon_cli.safekeeper_start(self.id)
self.env.neon_cli.safekeeper_start(self.id, extra_opts=extra_opts)
self.running = True
# wait for wal acceptor start by checking its status
started_at = time.time()

View File

@@ -1,6 +1,8 @@
import time
from typing import TYPE_CHECKING, Any, Dict, Optional
from mypy_boto3_s3.type_defs import ListObjectsV2OutputTypeDef
from fixtures.log_helper import log
from fixtures.pageserver.http import PageserverApiException, PageserverHttpClient
from fixtures.remote_storage import RemoteStorageKind, S3Storage
@@ -191,7 +193,11 @@ def wait_timeline_detail_404(
tenant_id: TenantId,
timeline_id: TimelineId,
iterations: int,
interval: Optional[float] = None,
):
if interval is None:
interval = 0.25
def timeline_is_missing():
data = {}
try:
@@ -204,7 +210,7 @@ def wait_timeline_detail_404(
raise RuntimeError(f"Timeline exists state {data.get('state')}")
wait_until(iterations, interval=0.250, func=timeline_is_missing)
wait_until(iterations, interval, func=timeline_is_missing)
def timeline_delete_wait_completed(
@@ -212,10 +218,11 @@ def timeline_delete_wait_completed(
tenant_id: TenantId,
timeline_id: TimelineId,
iterations: int = 20,
interval: Optional[float] = None,
**delete_args,
):
pageserver_http.timeline_delete(tenant_id=tenant_id, timeline_id=timeline_id, **delete_args)
wait_timeline_detail_404(pageserver_http, tenant_id, timeline_id, iterations)
wait_timeline_detail_404(pageserver_http, tenant_id, timeline_id, iterations, interval)
if TYPE_CHECKING:
@@ -225,6 +232,24 @@ if TYPE_CHECKING:
def assert_prefix_empty(neon_env_builder: "NeonEnvBuilder", prefix: Optional[str] = None):
response = list_prefix(neon_env_builder, prefix)
objects = response.get("Contents")
assert (
response["KeyCount"] == 0
), f"remote dir with prefix {prefix} is not empty after deletion: {objects}"
def assert_prefix_not_empty(neon_env_builder: "NeonEnvBuilder", prefix: Optional[str] = None):
response = list_prefix(neon_env_builder, prefix)
assert response["KeyCount"] != 0, f"remote dir with prefix {prefix} is empty: {response}"
def list_prefix(
neon_env_builder: "NeonEnvBuilder", prefix: Optional[str] = None
) -> ListObjectsV2OutputTypeDef:
"""
Note that this function takes into account prefix_in_bucket.
"""
# For local_fs we need to properly handle empty directories, which we currently dont, so for simplicity stick to s3 api.
assert neon_env_builder.remote_storage_kind in (
RemoteStorageKind.MOCK_S3,
@@ -234,15 +259,21 @@ def assert_prefix_empty(neon_env_builder: "NeonEnvBuilder", prefix: Optional[str
assert isinstance(neon_env_builder.remote_storage, S3Storage)
assert neon_env_builder.remote_storage_client is not None
prefix_in_bucket = neon_env_builder.remote_storage.prefix_in_bucket or ""
if not prefix:
prefix = prefix_in_bucket
else:
# real s3 tests have uniqie per test prefix
# mock_s3 tests use special pageserver prefix for pageserver stuff
prefix = "/".join((prefix_in_bucket, prefix))
# Note that this doesnt use pagination, so list is not guaranteed to be exhaustive.
response = neon_env_builder.remote_storage_client.list_objects_v2(
Delimiter="/",
Bucket=neon_env_builder.remote_storage.bucket_name,
Prefix=prefix or neon_env_builder.remote_storage.prefix_in_bucket or "",
Prefix=prefix,
)
objects = response.get("Contents")
assert (
response["KeyCount"] == 0
), f"remote dir with prefix {prefix} is not empty after deletion: {objects}"
return response
def wait_tenant_status_404(
@@ -284,4 +315,4 @@ MANY_SMALL_LAYERS_TENANT_CONFIG = {
def poll_for_remote_storage_iterations(remote_storage_kind: RemoteStorageKind) -> int:
return 20 if remote_storage_kind is RemoteStorageKind.REAL_S3 else 6
return 40 if remote_storage_kind is RemoteStorageKind.REAL_S3 else 15

View File

@@ -7,6 +7,9 @@ from pathlib import Path
from typing import Dict, List, Optional, Union
from fixtures.log_helper import log
from fixtures.types import TenantId, TimelineId
TIMELINE_INDEX_PART_FILE_NAME = "index_part.json"
class MockS3Server:
@@ -89,6 +92,19 @@ def available_s3_storages() -> List[RemoteStorageKind]:
class LocalFsStorage:
root: Path
def tenant_path(self, tenant_id: TenantId) -> Path:
return self.root / "tenants" / str(tenant_id)
def timeline_path(self, tenant_id: TenantId, timeline_id: TimelineId) -> Path:
return self.tenant_path(tenant_id) / "timelines" / str(timeline_id)
def index_path(self, tenant_id: TenantId, timeline_id: TimelineId) -> Path:
return self.timeline_path(tenant_id, timeline_id) / TIMELINE_INDEX_PART_FILE_NAME
def index_content(self, tenant_id: TenantId, timeline_id: TimelineId):
with self.index_path(tenant_id, timeline_id).open("r") as f:
return json.load(f)
@dataclass
class S3Storage:

View File

@@ -394,13 +394,7 @@ def check_neon_works(
test_output_dir / "dump-from-wal.filediff",
)
# TODO: Run pg_amcheck unconditionally after the next release
try:
pg_bin.run(["psql", connstr, "--command", "CREATE EXTENSION IF NOT EXISTS amcheck"])
except subprocess.CalledProcessError:
log.info("Extension amcheck is not available, skipping pg_amcheck")
else:
pg_bin.run_capture(["pg_amcheck", connstr, "--install-missing", "--verbose"])
pg_bin.run_capture(["pg_amcheck", connstr, "--install-missing", "--verbose"])
# Check that we can interract with the data
pg_bin.run_capture(["pgbench", "--time=10", "--progress=2", connstr])

View File

@@ -1,4 +1,3 @@
import shutil
import time
from dataclasses import dataclass
from typing import Dict, Tuple
@@ -14,7 +13,7 @@ from fixtures.neon_fixtures import (
)
from fixtures.pageserver.http import PageserverHttpClient
from fixtures.pageserver.utils import wait_for_upload_queue_empty
from fixtures.remote_storage import LocalFsStorage, RemoteStorageKind
from fixtures.remote_storage import RemoteStorageKind
from fixtures.types import Lsn, TenantId, TimelineId
from fixtures.utils import wait_until
@@ -138,22 +137,14 @@ def eviction_env(request, neon_env_builder: NeonEnvBuilder, pg_bin: PgBin) -> Ev
neon_env_builder.enable_remote_storage(RemoteStorageKind.LOCAL_FS, f"{request.node.name}")
env = neon_env_builder.init_start()
# initial tenant will not be present on this pageserver
env = neon_env_builder.init_configs()
env.start()
pageserver_http = env.pageserver.http_client()
# allow because we are invoking this manually; we always warn on executing disk based eviction
env.pageserver.allowed_errors.append(r".* running disk usage based eviction due to pressure.*")
# remove the initial tenant
assert env.initial_timeline
pageserver_http.tenant_detach(env.initial_tenant)
assert isinstance(env.remote_storage, LocalFsStorage)
tenant_remote_storage = env.remote_storage.root / "tenants" / str(env.initial_tenant)
assert tenant_remote_storage.is_dir()
shutil.rmtree(tenant_remote_storage)
env.initial_tenant = TenantId("0" * 32)
env.initial_timeline = None
# Choose small layer_size so that we can use low pgbench_scales and still get a large count of layers.
# Large count of layers and small layer size is good for testing because it makes evictions predictable.
# Predictable in the sense that many layer evictions will be required to reach the eviction target, because

View File

@@ -11,8 +11,7 @@ from fixtures.neon_fixtures import (
wait_for_last_flush_lsn,
)
from fixtures.remote_storage import RemoteStorageKind
from fixtures.types import TenantId, TimelineId
from fixtures.utils import query_scalar
from fixtures.types import TimelineId
# Test configuration
#
@@ -71,13 +70,11 @@ def test_gc_aggressive(neon_env_builder: NeonEnvBuilder):
# Disable pitr, because here we want to test branch creation after GC
neon_env_builder.pageserver_config_override = "tenant_config={pitr_interval = '0 sec'}"
env = neon_env_builder.init_start()
env.neon_cli.create_branch("test_gc_aggressive", "main")
timeline = env.neon_cli.create_branch("test_gc_aggressive", "main")
endpoint = env.endpoints.create_start("test_gc_aggressive")
log.info("postgres is running on test_gc_aggressive branch")
with endpoint.cursor() as cur:
timeline = TimelineId(query_scalar(cur, "SHOW neon.timeline_id"))
# Create table, and insert the first 100 rows
cur.execute("CREATE TABLE foo (id int, counter int, t text)")
cur.execute(
@@ -109,7 +106,8 @@ def test_gc_index_upload(neon_env_builder: NeonEnvBuilder, remote_storage_kind:
)
env = neon_env_builder.init_start()
env.neon_cli.create_branch("test_gc_index_upload", "main")
tenant_id = env.initial_tenant
timeline_id = env.neon_cli.create_branch("test_gc_index_upload", "main")
endpoint = env.endpoints.create_start("test_gc_index_upload")
pageserver_http = env.pageserver.http_client()
@@ -117,9 +115,6 @@ def test_gc_index_upload(neon_env_builder: NeonEnvBuilder, remote_storage_kind:
pg_conn = endpoint.connect()
cur = pg_conn.cursor()
tenant_id = TenantId(query_scalar(cur, "SHOW neon.tenant_id"))
timeline_id = TimelineId(query_scalar(cur, "SHOW neon.timeline_id"))
cur.execute("CREATE TABLE foo (id int, counter int, t text)")
cur.execute(
"""

View File

@@ -12,13 +12,8 @@ from fixtures.neon_fixtures import NeonEnvBuilder, PgBin
# test anyway, so it doesn't need any special attention here.
@pytest.mark.timeout(600)
def test_gc_cutoff(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin):
env = neon_env_builder.init_start()
pageserver_http = env.pageserver.http_client()
# Use aggressive GC and checkpoint settings, so that we also exercise GC during the test
tenant_id, _ = env.neon_cli.create_tenant(
conf={
env = neon_env_builder.init_start(
initial_tenant_conf={
"gc_period": "10 s",
"gc_horizon": f"{1024 ** 2}",
"checkpoint_distance": f"{1024 ** 2}",
@@ -29,6 +24,11 @@ def test_gc_cutoff(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin):
"image_creation_threshold": "2",
}
)
pageserver_http = env.pageserver.http_client()
# Use aggressive GC and checkpoint settings, so that we also exercise GC during the test
tenant_id = env.initial_tenant
endpoint = env.endpoints.create_start("main", tenant_id=tenant_id)
connstr = endpoint.connstr(options="-csynchronous_commit=off")
pg_bin.run_capture(["pgbench", "-i", "-s10", connstr])
@@ -39,5 +39,4 @@ def test_gc_cutoff(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin):
with pytest.raises(subprocess.SubprocessError):
pg_bin.run_capture(["pgbench", "-P1", "-N", "-c5", "-T500", "-Mprepared", connstr])
env.pageserver.stop()
env.pageserver.start()
pageserver_http.configure_failpoints(("after-timeline-gc-removed-layers", "exit"))
env.pageserver.start(extra_env_vars={"FAILPOINTS": "after-timeline-gc-removed-layers=exit"})

View File

@@ -74,9 +74,9 @@ def test_large_schema(neon_env_builder: NeonEnvBuilder):
cur.execute("select * from pg_depend order by refclassid, refobjid, refobjsubid")
# Check layer file sizes
tenant_id = endpoint.safe_psql("show neon.tenant_id")[0][0]
timeline_id = endpoint.safe_psql("show neon.timeline_id")[0][0]
timeline_path = "{}/tenants/{}/timelines/{}/".format(env.repo_dir, tenant_id, timeline_id)
timeline_path = "{}/tenants/{}/timelines/{}/".format(
env.repo_dir, env.initial_tenant, env.initial_timeline
)
for filename in os.listdir(timeline_path):
if filename.startswith("00000"):
log.info(f"layer {filename} size is {os.path.getsize(timeline_path + filename)}")

View File

@@ -8,7 +8,7 @@ from fixtures.neon_fixtures import (
)
from fixtures.pageserver.utils import wait_for_last_record_lsn, wait_for_upload
from fixtures.remote_storage import RemoteStorageKind
from fixtures.types import Lsn, TenantId, TimelineId
from fixtures.types import Lsn
from fixtures.utils import query_scalar
@@ -34,8 +34,8 @@ def test_basic_eviction(
client = env.pageserver.http_client()
endpoint = env.endpoints.create_start("main")
tenant_id = TenantId(endpoint.safe_psql("show neon.tenant_id")[0][0])
timeline_id = TimelineId(endpoint.safe_psql("show neon.timeline_id")[0][0])
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
# Create a number of layers in the tenant
with endpoint.cursor() as cur:

View File

@@ -18,8 +18,7 @@ from fixtures.neon_fixtures import (
)
from fixtures.port_distributor import PortDistributor
from fixtures.remote_storage import RemoteStorageKind
from fixtures.types import TenantId, TimelineId
from fixtures.utils import query_scalar
from fixtures.types import TenantId
from pytest_httpserver import HTTPServer
from werkzeug.wrappers.request import Request
from werkzeug.wrappers.response import Response
@@ -115,15 +114,13 @@ def test_metric_collection(
# Order of fixtures shutdown is not specified, and if http server gets down
# before pageserver, pageserver log might contain such errors in the end.
env.pageserver.allowed_errors.append(".*metrics endpoint refused the sent metrics*")
env.neon_cli.create_branch("test_metric_collection")
tenant_id = env.initial_tenant
timeline_id = env.neon_cli.create_branch("test_metric_collection")
endpoint = env.endpoints.create_start("test_metric_collection")
pg_conn = endpoint.connect()
cur = pg_conn.cursor()
tenant_id = TenantId(query_scalar(cur, "SHOW neon.tenant_id"))
timeline_id = TimelineId(query_scalar(cur, "SHOW neon.timeline_id"))
cur.execute("CREATE TABLE foo (id int, counter int, t text)")
cur.execute(
"""

View File

@@ -78,8 +78,8 @@ def test_ondemand_download_large_rel(
client = env.pageserver.http_client()
tenant_id = endpoint.safe_psql("show neon.tenant_id")[0][0]
timeline_id = endpoint.safe_psql("show neon.timeline_id")[0][0]
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
# We want to make sure that the data is large enough that the keyspace is partitioned.
num_rows = 1000000
@@ -183,8 +183,8 @@ def test_ondemand_download_timetravel(
client = env.pageserver.http_client()
tenant_id = endpoint.safe_psql("show neon.tenant_id")[0][0]
timeline_id = endpoint.safe_psql("show neon.timeline_id")[0][0]
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
lsns = []
@@ -342,8 +342,8 @@ def test_download_remote_layers_api(
client = env.pageserver.http_client()
tenant_id = endpoint.safe_psql("show neon.tenant_id")[0][0]
timeline_id = endpoint.safe_psql("show neon.timeline_id")[0][0]
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
table_len = 10000
with endpoint.cursor() as cur:
@@ -369,7 +369,7 @@ def test_download_remote_layers_api(
filled_current_physical = get_api_current_physical_size()
log.info(filled_current_physical)
filled_size = get_resident_physical_size()
log.info(filled_size)
log.info(f"filled_size: {filled_size}")
assert filled_current_physical == filled_size, "we don't yet do layer eviction"
env.pageserver.stop()
@@ -377,7 +377,7 @@ def test_download_remote_layers_api(
# remove all the layer files
# XXX only delete some of the layer files, to show that it really just downloads all the layers
for layer in (Path(env.repo_dir) / "tenants").glob("*/timelines/*/*-*_*"):
log.info(f"unlinking layer {layer}")
log.info(f"unlinking layer {layer.name}")
layer.unlink()
# Shut down safekeepers before starting the pageserver.
@@ -403,7 +403,7 @@ def test_download_remote_layers_api(
filled_current_physical == get_api_current_physical_size()
), "current_physical_size is sum of loaded layer sizes, independent of whether local or remote"
post_unlink_size = get_resident_physical_size()
log.info(post_unlink_size)
log.info(f"post_unlink_size: {post_unlink_size}")
assert (
post_unlink_size < filled_size
), "we just deleted layers and didn't cause anything to re-download them yet"
@@ -516,7 +516,6 @@ def test_compaction_downloads_on_demand_without_image_creation(
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
assert timeline_id is not None
with env.endpoints.create_start("main") as endpoint:
# no particular reason to create the layers like this, but we are sure
@@ -590,7 +589,6 @@ def test_compaction_downloads_on_demand_with_image_creation(
env = neon_env_builder.init_start(initial_tenant_conf=stringify(conf))
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
assert timeline_id is not None
pageserver_http = env.pageserver.http_client()

View File

@@ -265,16 +265,23 @@ def test_sql_over_http_output_options(static_proxy: NeonProxy):
def test_sql_over_http_batch(static_proxy: NeonProxy):
static_proxy.safe_psql("create role http with login password 'http' superuser")
def qq(queries: List[Tuple[str, Optional[List[Any]]]], read_only: bool = False) -> Any:
def qq(
queries: List[Tuple[str, Optional[List[Any]]]],
read_only: bool = False,
deferrable: bool = False,
) -> Any:
connstr = f"postgresql://http:http@{static_proxy.domain}:{static_proxy.proxy_port}/postgres"
response = requests.post(
f"https://{static_proxy.domain}:{static_proxy.external_http_port}/sql",
data=json.dumps(list(map(lambda x: {"query": x[0], "params": x[1] or []}, queries))),
data=json.dumps(
{"queries": list(map(lambda x: {"query": x[0], "params": x[1] or []}, queries))}
),
headers={
"Content-Type": "application/sql",
"Neon-Connection-String": connstr,
"Neon-Batch-Isolation-Level": "Serializable",
"Neon-Batch-Read-Only": "true" if read_only else "false",
"Neon-Batch-Deferrable": "true" if deferrable else "false",
},
verify=str(static_proxy.test_output_dir / "proxy.crt"),
)
@@ -297,7 +304,8 @@ def test_sql_over_http_batch(static_proxy: NeonProxy):
)
assert headers["Neon-Batch-Isolation-Level"] == "Serializable"
assert headers["Neon-Batch-Read-Only"] == "false"
assert "Neon-Batch-Read-Only" not in headers
assert "Neon-Batch-Deferrable" not in headers
assert result[0]["rows"] == [{"answer": 42}]
assert result[1]["rows"] == [{"answer": "42"}]
@@ -325,8 +333,57 @@ def test_sql_over_http_batch(static_proxy: NeonProxy):
("select 42 as answer", None),
],
True,
True,
)
assert headers["Neon-Batch-Isolation-Level"] == "Serializable"
assert headers["Neon-Batch-Read-Only"] == "true"
assert headers["Neon-Batch-Deferrable"] == "true"
assert result[0]["rows"] == [{"answer": 42}]
def test_sql_over_http_pool(static_proxy: NeonProxy):
static_proxy.safe_psql("create user http_auth with password 'http' superuser")
def get_pid(status: int, pw: str) -> Any:
connstr = (
f"postgresql://http_auth:{pw}@{static_proxy.domain}:{static_proxy.proxy_port}/postgres"
)
response = requests.post(
f"https://{static_proxy.domain}:{static_proxy.external_http_port}/sql",
data=json.dumps(
{"query": "SELECT pid FROM pg_stat_activity WHERE state = 'active'", "params": []}
),
headers={
"Content-Type": "application/sql",
"Neon-Connection-String": connstr,
"Neon-Pool-Opt-In": "true",
},
verify=str(static_proxy.test_output_dir / "proxy.crt"),
)
assert response.status_code == status
return response.json()
pid1 = get_pid(200, "http")["rows"][0]["pid"]
# query should be on the same connection
rows = get_pid(200, "http")["rows"]
assert rows == [{"pid": pid1}]
# incorrect password should not work
res = get_pid(400, "foobar")
assert "password authentication failed for user" in res["message"]
static_proxy.safe_psql("alter user http_auth with password 'http2'")
# after password change, should open a new connection to verify it
pid2 = get_pid(200, "http2")["rows"][0]["pid"]
assert pid1 != pid2
# query should be on an existing connection
pid = get_pid(200, "http2")["rows"][0]["pid"]
assert pid in [pid1, pid2]
# old password should not work
res = get_pid(400, "http")
assert "password authentication failed for user" in res["message"]

View File

@@ -2,7 +2,7 @@ from contextlib import closing
from fixtures.neon_fixtures import NeonEnvBuilder
from fixtures.pageserver.utils import wait_for_last_record_lsn
from fixtures.types import Lsn, TenantId, TimelineId
from fixtures.types import Lsn
from fixtures.utils import query_scalar
@@ -12,24 +12,21 @@ from fixtures.utils import query_scalar
# Additionally, tests that pageserver is able to create tenants with custom configs.
def test_read_request_tracing(neon_env_builder: NeonEnvBuilder):
neon_env_builder.num_safekeepers = 1
env = neon_env_builder.init_start()
tenant, _ = env.neon_cli.create_tenant(
conf={
env = neon_env_builder.init_start(
initial_tenant_conf={
"trace_read_requests": "true",
}
)
timeline = env.neon_cli.create_timeline("test_trace_replay", tenant_id=tenant)
endpoint = env.endpoints.create_start("test_trace_replay", "main", tenant)
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
endpoint = env.endpoints.create_start("main")
with closing(endpoint.connect()) as conn:
with conn.cursor() as cur:
cur.execute("create table t (i integer);")
cur.execute(f"insert into t values (generate_series(1,{10000}));")
cur.execute("select count(*) from t;")
tenant_id = TenantId(endpoint.safe_psql("show neon.tenant_id")[0][0])
timeline_id = TimelineId(endpoint.safe_psql("show neon.timeline_id")[0][0])
current_lsn = Lsn(query_scalar(cur, "SELECT pg_current_wal_flush_lsn()"))
# wait until pageserver receives that data
pageserver_http = env.pageserver.http_client()
@@ -38,5 +35,5 @@ def test_read_request_tracing(neon_env_builder: NeonEnvBuilder):
# Stop postgres so we drop the connection and flush the traces
endpoint.stop()
trace_path = env.repo_dir / "traces" / str(tenant) / str(timeline)
trace_path = env.repo_dir / "traces" / str(tenant_id) / str(timeline_id)
assert trace_path.exists()

View File

@@ -24,6 +24,7 @@ from fixtures.pageserver.utils import (
wait_until_tenant_state,
)
from fixtures.remote_storage import (
TIMELINE_INDEX_PART_FILE_NAME,
LocalFsStorage,
RemoteStorageKind,
available_remote_storages,
@@ -94,12 +95,12 @@ def test_remote_storage_backup_and_restore(
client = env.pageserver.http_client()
tenant_id = TenantId(endpoint.safe_psql("show neon.tenant_id")[0][0])
timeline_id = TimelineId(endpoint.safe_psql("show neon.timeline_id")[0][0])
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
# Thats because of UnreliableWrapper's injected failures
env.pageserver.allowed_errors.append(
f".*failed to fetch tenant deletion mark at tenants/({tenant_id}|{env.initial_tenant})/deleted attempt 1.*"
f".*failed to fetch tenant deletion mark at tenants/{tenant_id}/deleted attempt 1.*"
)
checkpoint_numbers = range(1, 3)
@@ -173,9 +174,7 @@ def test_remote_storage_backup_and_restore(
#
# The initiated attach operation should survive the restart, and continue from where it was.
env.pageserver.stop()
layer_download_failed_regex = (
r"download.*[0-9A-F]+-[0-9A-F]+.*open a download stream for layer.*simulated failure"
)
layer_download_failed_regex = r"Failed to download a remote file: simulated failure of remote operation Download.*[0-9A-F]+-[0-9A-F]+"
assert not env.pageserver.log_contains(
layer_download_failed_regex
), "we shouldn't have tried any layer downloads yet since list remote timelines has a failpoint"
@@ -208,7 +207,7 @@ def test_remote_storage_backup_and_restore(
== f"{data}|{checkpoint_number}"
)
log.info("ensure that we neede to retry downloads due to test_remote_failures=1")
log.info("ensure that we needed to retry downloads due to test_remote_failures=1")
assert env.pageserver.log_contains(layer_download_failed_regex)
@@ -271,7 +270,7 @@ def test_remote_storage_upload_queue_retries(
f"""
INSERT INTO foo (id, val)
SELECT g, '{data}'
FROM generate_series(1, 10000) g
FROM generate_series(1, 20000) g
ON CONFLICT (id) DO UPDATE
SET val = EXCLUDED.val
""",
@@ -372,7 +371,7 @@ def test_remote_storage_upload_queue_retries(
log.info("restarting postgres to validate")
endpoint = env.endpoints.create_start("main", tenant_id=tenant_id)
with endpoint.cursor() as cur:
assert query_scalar(cur, "SELECT COUNT(*) FROM foo WHERE val = 'd'") == 10000
assert query_scalar(cur, "SELECT COUNT(*) FROM foo WHERE val = 'd'") == 20000
@pytest.mark.parametrize("remote_storage_kind", [RemoteStorageKind.LOCAL_FS])
@@ -404,8 +403,7 @@ def test_remote_timeline_client_calls_started_metric(
)
tenant_id = env.initial_tenant
assert env.initial_timeline is not None
timeline_id: TimelineId = env.initial_timeline
timeline_id = env.initial_timeline
client = env.pageserver.http_client()
@@ -420,7 +418,7 @@ def test_remote_timeline_client_calls_started_metric(
f"""
INSERT INTO foo (id, val)
SELECT g, '{data}'
FROM generate_series(1, 10000) g
FROM generate_series(1, 20000) g
ON CONFLICT (id) DO UPDATE
SET val = EXCLUDED.val
""",
@@ -511,7 +509,7 @@ def test_remote_timeline_client_calls_started_metric(
log.info("restarting postgres to validate")
endpoint = env.endpoints.create_start("main", tenant_id=tenant_id)
with endpoint.cursor() as cur:
assert query_scalar(cur, "SELECT COUNT(*) FROM foo WHERE val = 'd'") == 10000
assert query_scalar(cur, "SELECT COUNT(*) FROM foo WHERE val = 'd'") == 20000
# ensure that we updated the calls_started download metric
fetch_calls_started()
@@ -543,8 +541,7 @@ def test_timeline_deletion_with_files_stuck_in_upload_queue(
}
)
tenant_id = env.initial_tenant
assert env.initial_timeline is not None
timeline_id: TimelineId = env.initial_timeline
timeline_id = env.initial_timeline
timeline_path = env.timeline_dir(tenant_id, timeline_id)
@@ -609,15 +606,15 @@ def test_timeline_deletion_with_files_stuck_in_upload_queue(
".* ERROR .*Error processing HTTP request: InternalServerError\\(timeline is Stopping"
)
timeline_delete_wait_completed(client, tenant_id, timeline_id)
# Generous timeout, because currently deletions can get blocked waiting for compaction
# This can be reduced when https://github.com/neondatabase/neon/issues/4998 is fixed.
timeline_delete_wait_completed(client, tenant_id, timeline_id, iterations=30, interval=1)
assert not timeline_path.exists()
# to please mypy
assert isinstance(env.remote_storage, LocalFsStorage)
remote_timeline_path = (
env.remote_storage.root / "tenants" / str(tenant_id) / "timelines" / str(timeline_id)
)
remote_timeline_path = env.remote_storage.timeline_path(tenant_id, timeline_id)
assert not list(remote_timeline_path.iterdir())
@@ -722,15 +719,14 @@ def test_empty_branch_remote_storage_upload_on_restart(
# index upload is now hitting the failpoint, it should block the shutdown
env.pageserver.stop(immediate=True)
timeline_path = (
Path("tenants") / str(env.initial_tenant) / "timelines" / str(new_branch_timeline_id)
)
local_metadata = env.repo_dir / timeline_path / "metadata"
local_metadata = env.timeline_dir(env.initial_tenant, new_branch_timeline_id) / "metadata"
assert local_metadata.is_file()
assert isinstance(env.remote_storage, LocalFsStorage)
new_branch_on_remote_storage = env.remote_storage.root / timeline_path
new_branch_on_remote_storage = env.remote_storage.timeline_path(
env.initial_tenant, new_branch_timeline_id
)
assert (
not new_branch_on_remote_storage.exists()
), "failpoint should had prohibited index_part.json upload"
@@ -779,7 +775,7 @@ def test_empty_branch_remote_storage_upload_on_restart(
assert_nothing_to_upload(client, env.initial_tenant, new_branch_timeline_id)
assert (
new_branch_on_remote_storage / "index_part.json"
new_branch_on_remote_storage / TIMELINE_INDEX_PART_FILE_NAME
).is_file(), "uploads scheduled during initial load should had been awaited for"
finally:
create_thread.join()
@@ -810,8 +806,7 @@ def test_compaction_delete_before_upload(
)
tenant_id = env.initial_tenant
assert env.initial_timeline is not None
timeline_id: TimelineId = env.initial_timeline
timeline_id = env.initial_timeline
client = env.pageserver.http_client()

View File

@@ -1,5 +1,7 @@
import enum
import os
import shutil
from pathlib import Path
import pytest
from fixtures.log_helper import log
@@ -13,13 +15,18 @@ from fixtures.pageserver.http import PageserverApiException
from fixtures.pageserver.utils import (
MANY_SMALL_LAYERS_TENANT_CONFIG,
assert_prefix_empty,
assert_prefix_not_empty,
poll_for_remote_storage_iterations,
tenant_delete_wait_completed,
wait_tenant_status_404,
wait_until_tenant_active,
wait_until_tenant_state,
)
from fixtures.remote_storage import RemoteStorageKind, available_remote_storages
from fixtures.remote_storage import (
RemoteStorageKind,
available_remote_storages,
available_s3_storages,
)
from fixtures.types import TenantId
from fixtures.utils import run_pg_bench_small
@@ -32,6 +39,8 @@ def test_tenant_delete_smoke(
remote_storage_kind: RemoteStorageKind,
pg_bin: PgBin,
):
neon_env_builder.pageserver_config_override = "test_remote_failures=1"
neon_env_builder.enable_remote_storage(
remote_storage_kind=remote_storage_kind,
test_name="test_tenant_delete_smoke",
@@ -39,6 +48,11 @@ def test_tenant_delete_smoke(
env = neon_env_builder.init_start()
# lucky race with stopping from flushing a layer we fail to schedule any uploads
env.pageserver.allowed_errors.append(
".*layer flush task.+: could not flush frozen layer: update_metadata_file"
)
ps_http = env.pageserver.http_client()
# first try to delete non existing tenant
@@ -62,6 +76,17 @@ def test_tenant_delete_smoke(
run_pg_bench_small(pg_bin, endpoint.connstr())
wait_for_last_flush_lsn(env, endpoint, tenant=tenant_id, timeline=timeline_id)
if remote_storage_kind in available_s3_storages():
assert_prefix_not_empty(
neon_env_builder,
prefix="/".join(
(
"tenants",
str(tenant_id),
)
),
)
parent = timeline
iterations = poll_for_remote_storage_iterations(remote_storage_kind)
@@ -71,7 +96,7 @@ def test_tenant_delete_smoke(
tenant_path = env.tenant_dir(tenant_id=tenant_id)
assert not tenant_path.exists()
if remote_storage_kind in [RemoteStorageKind.MOCK_S3, RemoteStorageKind.REAL_S3]:
if remote_storage_kind in available_s3_storages():
assert_prefix_empty(
neon_env_builder,
prefix="/".join(
@@ -123,25 +148,35 @@ def combinations():
for remote_storage_kind in remotes:
for delete_failpoint in FAILPOINTS:
if remote_storage_kind == RemoteStorageKind.NOOP and delete_failpoint in (
if remote_storage_kind is RemoteStorageKind.NOOP and delete_failpoint in (
"timeline-delete-before-index-delete",
):
# the above failpoint are not relevant for config without remote storage
continue
result.append((remote_storage_kind, delete_failpoint))
# Simulate failures for only one type of remote storage
# to avoid log pollution and make tests run faster
if remote_storage_kind is RemoteStorageKind.MOCK_S3:
simulate_failures = True
else:
simulate_failures = False
result.append((remote_storage_kind, delete_failpoint, simulate_failures))
return result
@pytest.mark.parametrize("remote_storage_kind, failpoint", combinations())
@pytest.mark.parametrize("remote_storage_kind, failpoint, simulate_failures", combinations())
@pytest.mark.parametrize("check", list(Check))
def test_delete_tenant_exercise_crash_safety_failpoints(
neon_env_builder: NeonEnvBuilder,
remote_storage_kind: RemoteStorageKind,
failpoint: str,
simulate_failures: bool,
check: Check,
pg_bin: PgBin,
):
if simulate_failures:
neon_env_builder.pageserver_config_override = "test_remote_failures=1"
neon_env_builder.enable_remote_storage(
remote_storage_kind, "test_delete_tenant_exercise_crash_safety_failpoints"
)
@@ -177,6 +212,17 @@ def test_delete_tenant_exercise_crash_safety_failpoints(
else:
last_flush_lsn_upload(env, endpoint, tenant_id, timeline_id)
if remote_storage_kind in available_s3_storages():
assert_prefix_not_empty(
neon_env_builder,
prefix="/".join(
(
"tenants",
str(tenant_id),
)
),
)
ps_http.configure_failpoints((failpoint, "return"))
iterations = poll_for_remote_storage_iterations(remote_storage_kind)
@@ -229,8 +275,12 @@ def test_delete_tenant_exercise_crash_safety_failpoints(
tenant_delete_wait_completed(ps_http, tenant_id, iterations=iterations)
# Check remote is impty
if remote_storage_kind is RemoteStorageKind.MOCK_S3:
tenant_dir = env.tenant_dir(tenant_id)
# Check local is empty
assert not tenant_dir.exists()
# Check remote is empty
if remote_storage_kind in available_s3_storages():
assert_prefix_empty(
neon_env_builder,
prefix="/".join(
@@ -241,10 +291,116 @@ def test_delete_tenant_exercise_crash_safety_failpoints(
),
)
tenant_dir = env.tenant_dir(tenant_id)
# Check local is empty
assert not tenant_dir.exists()
@pytest.mark.parametrize("remote_storage_kind", available_remote_storages())
def test_tenant_delete_is_resumed_on_attach(
neon_env_builder: NeonEnvBuilder,
remote_storage_kind: RemoteStorageKind,
pg_bin: PgBin,
):
neon_env_builder.enable_remote_storage(
remote_storage_kind=remote_storage_kind,
test_name="test_deleted_tenant_ignored_on_attach",
)
env = neon_env_builder.init_start(initial_tenant_conf=MANY_SMALL_LAYERS_TENANT_CONFIG)
tenant_id = env.initial_tenant
ps_http = env.pageserver.http_client()
# create two timelines
for timeline in ["first", "second"]:
timeline_id = env.neon_cli.create_timeline(timeline, tenant_id=tenant_id)
with env.endpoints.create_start(timeline, tenant_id=tenant_id) as endpoint:
run_pg_bench_small(pg_bin, endpoint.connstr())
wait_for_last_flush_lsn(env, endpoint, tenant=tenant_id, timeline=timeline_id)
# sanity check, data should be there
if remote_storage_kind in available_s3_storages():
assert_prefix_not_empty(
neon_env_builder,
prefix="/".join(
(
"tenants",
str(tenant_id),
)
),
)
# failpoint before we remove index_part from s3
failpoint = "timeline-delete-before-index-delete"
ps_http.configure_failpoints((failpoint, "return"))
env.pageserver.allowed_errors.extend(
(
# allow errors caused by failpoints
f".*failpoint: {failpoint}",
# From deletion polling
f".*NotFound: tenant {env.initial_tenant}.*",
# It appears when we stopped flush loop during deletion (attempt) and then pageserver is stopped
".*freeze_and_flush_on_shutdown.*failed to freeze and flush: cannot flush frozen layers when flush_loop is not running, state is Exited",
# error from http response is also logged
".*InternalServerError\\(Tenant is marked as deleted on remote storage.*",
'.*shutdown_pageserver{exit_code=0}: stopping left-over name="remote upload".*',
)
)
iterations = poll_for_remote_storage_iterations(remote_storage_kind)
ps_http.tenant_delete(tenant_id)
tenant_info = wait_until_tenant_state(
pageserver_http=ps_http,
tenant_id=tenant_id,
expected_state="Broken",
iterations=iterations,
)
if remote_storage_kind in available_s3_storages():
assert_prefix_not_empty(
neon_env_builder,
prefix="/".join(
(
"tenants",
str(tenant_id),
)
),
)
reason = tenant_info["state"]["data"]["reason"]
# failpoint may not be the only error in the stack
assert reason.endswith(f"failpoint: {failpoint}"), reason
# now we stop pageserver and remove local tenant state
env.endpoints.stop_all()
env.pageserver.stop()
dir_to_clear = Path(env.repo_dir) / "tenants"
shutil.rmtree(dir_to_clear)
os.mkdir(dir_to_clear)
env.pageserver.start()
# now we call attach
ps_http.tenant_attach(tenant_id=tenant_id)
# delete should be resumed
wait_tenant_status_404(ps_http, tenant_id, iterations)
# we shouldn've created tenant dir on disk
tenant_path = env.tenant_dir(tenant_id=tenant_id)
assert not tenant_path.exists()
if remote_storage_kind in available_s3_storages():
assert_prefix_empty(
neon_env_builder,
prefix="/".join(
(
"tenants",
str(tenant_id),
)
),
)
# TODO test concurrent deletions with "hang" failpoint
# TODO test tenant delete continues after attach

View File

@@ -463,8 +463,8 @@ def test_detach_while_attaching(
client = env.pageserver.http_client()
tenant_id = TenantId(endpoint.safe_psql("show neon.tenant_id")[0][0])
timeline_id = TimelineId(endpoint.safe_psql("show neon.timeline_id")[0][0])
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
# Attempts to connect from compute to pageserver while the tenant is
# temporarily detached produces these errors in the pageserver log.
@@ -615,8 +615,8 @@ def test_ignored_tenant_download_missing_layers(
pageserver_http = env.pageserver.http_client()
endpoint = env.endpoints.create_start("main")
tenant_id = TenantId(endpoint.safe_psql("show neon.tenant_id")[0][0])
timeline_id = TimelineId(endpoint.safe_psql("show neon.timeline_id")[0][0])
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
# Attempts to connect from compute to pageserver while the tenant is
# temporarily detached produces these errors in the pageserver log.
@@ -679,10 +679,10 @@ def test_ignored_tenant_stays_broken_without_metadata(
)
env = neon_env_builder.init_start()
pageserver_http = env.pageserver.http_client()
endpoint = env.endpoints.create_start("main")
env.endpoints.create_start("main")
tenant_id = TenantId(endpoint.safe_psql("show neon.tenant_id")[0][0])
timeline_id = TimelineId(endpoint.safe_psql("show neon.timeline_id")[0][0])
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
# Attempts to connect from compute to pageserver while the tenant is
# temporarily detached produces these errors in the pageserver log.
@@ -723,9 +723,9 @@ def test_load_attach_negatives(
)
env = neon_env_builder.init_start()
pageserver_http = env.pageserver.http_client()
endpoint = env.endpoints.create_start("main")
env.endpoints.create_start("main")
tenant_id = TenantId(endpoint.safe_psql("show neon.tenant_id")[0][0])
tenant_id = env.initial_tenant
# Attempts to connect from compute to pageserver while the tenant is
# temporarily detached produces these errors in the pageserver log.
@@ -773,8 +773,8 @@ def test_ignore_while_attaching(
pageserver_http = env.pageserver.http_client()
tenant_id = TenantId(endpoint.safe_psql("show neon.tenant_id")[0][0])
timeline_id = TimelineId(endpoint.safe_psql("show neon.timeline_id")[0][0])
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
# Attempts to connect from compute to pageserver while the tenant is
# temporarily detached produces these errors in the pageserver log.

View File

@@ -7,7 +7,6 @@
#
import asyncio
import json
import os
from pathlib import Path
from typing import List, Tuple
@@ -143,8 +142,8 @@ def test_tenants_attached_after_download(
client = env.pageserver.http_client()
tenant_id = TenantId(endpoint.safe_psql("show neon.tenant_id")[0][0])
timeline_id = TimelineId(endpoint.safe_psql("show neon.timeline_id")[0][0])
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
# Thats because of UnreliableWrapper's injected failures
env.pageserver.allowed_errors.append(
@@ -225,10 +224,11 @@ def test_tenants_attached_after_download(
# FIXME: test index_part.json getting downgraded from imaginary new version
@pytest.mark.parametrize("remote_storage_kind", [RemoteStorageKind.LOCAL_FS])
def test_tenant_redownloads_truncated_file_on_startup(
neon_env_builder: NeonEnvBuilder, remote_storage_kind: RemoteStorageKind
neon_env_builder: NeonEnvBuilder,
):
remote_storage_kind = RemoteStorageKind.LOCAL_FS
# since we now store the layer file length metadata, we notice on startup that a layer file is of wrong size, and proceed to redownload it.
neon_env_builder.enable_remote_storage(
remote_storage_kind=remote_storage_kind,
@@ -237,6 +237,8 @@ def test_tenant_redownloads_truncated_file_on_startup(
env = neon_env_builder.init_start()
assert isinstance(env.remote_storage, LocalFsStorage)
env.pageserver.allowed_errors.append(
".*removing local file .* because it has unexpected length.*"
)
@@ -250,8 +252,8 @@ def test_tenant_redownloads_truncated_file_on_startup(
pageserver_http = env.pageserver.http_client()
endpoint = env.endpoints.create_start("main")
tenant_id = TenantId(endpoint.safe_psql("show neon.tenant_id")[0][0])
timeline_id = TimelineId(endpoint.safe_psql("show neon.timeline_id")[0][0])
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
with endpoint.cursor() as cur:
cur.execute("CREATE TABLE t1 AS VALUES (123, 'foobar');")
@@ -279,7 +281,7 @@ def test_tenant_redownloads_truncated_file_on_startup(
(path, expected_size) = local_layer_truncated
# ensure the same size is found from the index_part.json
index_part = local_fs_index_part(env, tenant_id, timeline_id)
index_part = env.remote_storage.index_content(tenant_id, timeline_id)
assert index_part["layer_metadata"][path.name]["file_size"] == expected_size
## Start the pageserver. It will notice that the file size doesn't match, and
@@ -309,7 +311,7 @@ def test_tenant_redownloads_truncated_file_on_startup(
assert os.stat(path).st_size == expected_size, "truncated layer should had been re-downloaded"
# the remote side of local_layer_truncated
remote_layer_path = local_fs_index_part_path(env, tenant_id, timeline_id).parent / path.name
remote_layer_path = env.remote_storage.timeline_path(tenant_id, timeline_id) / path.name
# if the upload ever was ongoing, this check would be racy, but at least one
# extra http request has been made in between so assume it's enough delay
@@ -334,27 +336,3 @@ def test_tenant_redownloads_truncated_file_on_startup(
assert (
os.stat(remote_layer_path).st_size == expected_size
), "truncated file should not had been uploaded after next checkpoint"
def local_fs_index_part(env, tenant_id, timeline_id):
"""
Return json.load parsed index_part.json of tenant and timeline from LOCAL_FS
"""
timeline_path = local_fs_index_part_path(env, tenant_id, timeline_id)
with open(timeline_path, "r") as timeline_file:
return json.load(timeline_file)
def local_fs_index_part_path(env, tenant_id, timeline_id):
"""
Return path to the LOCAL_FS index_part.json of the tenant and timeline.
"""
assert isinstance(env.remote_storage, LocalFsStorage)
return (
env.remote_storage.root
/ "tenants"
/ str(tenant_id)
/ "timelines"
/ str(timeline_id)
/ "index_part.json"
)

View File

@@ -10,7 +10,6 @@ from fixtures.neon_fixtures import (
)
from fixtures.pageserver.http import LayerMapInfo
from fixtures.remote_storage import RemoteStorageKind
from fixtures.types import TimelineId
from pytest_httpserver import HTTPServer
# NB: basic config change tests are in test_tenant_conf.py
@@ -45,7 +44,6 @@ def test_threshold_based_eviction(
)
tenant_id, timeline_id = env.initial_tenant, env.initial_timeline
assert isinstance(timeline_id, TimelineId)
ps_http = env.pageserver.http_client()
assert ps_http.tenant_config(tenant_id).effective_config["eviction_policy"] == {

View File

@@ -17,7 +17,9 @@ from fixtures.neon_fixtures import (
)
from fixtures.pageserver.http import PageserverApiException
from fixtures.pageserver.utils import (
MANY_SMALL_LAYERS_TENANT_CONFIG,
assert_prefix_empty,
assert_prefix_not_empty,
poll_for_remote_storage_iterations,
timeline_delete_wait_completed,
wait_for_last_record_lsn,
@@ -27,11 +29,13 @@ from fixtures.pageserver.utils import (
wait_until_timeline_state,
)
from fixtures.remote_storage import (
LocalFsStorage,
RemoteStorageKind,
available_remote_storages,
available_s3_storages,
)
from fixtures.types import Lsn, TenantId, TimelineId
from fixtures.utils import query_scalar, wait_until
from fixtures.utils import query_scalar, run_pg_bench_small, wait_until
def test_timeline_delete(neon_simple_env: NeonEnv):
@@ -205,12 +209,25 @@ def test_delete_timeline_exercise_crash_safety_failpoints(
timeline_id = env.neon_cli.create_timeline("delete")
with env.endpoints.create_start("delete") as endpoint:
# generate enough layers
pg_bin.run(["pgbench", "-i", "-I dtGvp", "-s1", endpoint.connstr()])
run_pg_bench_small(pg_bin, endpoint.connstr())
if remote_storage_kind is RemoteStorageKind.NOOP:
wait_for_last_flush_lsn(env, endpoint, env.initial_tenant, timeline_id)
else:
last_flush_lsn_upload(env, endpoint, env.initial_tenant, timeline_id)
if remote_storage_kind in available_s3_storages():
assert_prefix_not_empty(
neon_env_builder,
prefix="/".join(
(
"tenants",
str(env.initial_tenant),
"timelines",
str(timeline_id),
)
),
)
env.pageserver.allowed_errors.append(f".*{timeline_id}.*failpoint: {failpoint}")
# It appears when we stopped flush loop during deletion and then pageserver is stopped
env.pageserver.allowed_errors.append(
@@ -297,7 +314,7 @@ def test_delete_timeline_exercise_crash_safety_failpoints(
ps_http, env.initial_tenant, timeline_id, iterations=iterations
)
# Check remote is impty
# Check remote is empty
if remote_storage_kind is RemoteStorageKind.MOCK_S3:
assert_prefix_empty(
neon_env_builder,
@@ -342,8 +359,8 @@ def test_timeline_resurrection_on_attach(
ps_http = env.pageserver.http_client()
pg = env.endpoints.create_start("main")
tenant_id = TenantId(pg.safe_psql("show neon.tenant_id")[0][0])
main_timeline_id = TimelineId(pg.safe_psql("show neon.timeline_id")[0][0])
tenant_id = env.initial_tenant
main_timeline_id = env.initial_timeline
with pg.cursor() as cur:
cur.execute("CREATE TABLE f (i integer);")
@@ -471,15 +488,7 @@ def test_timeline_delete_fail_before_local_delete(neon_env_builder: NeonEnvBuild
# Wait for tenant to finish loading.
wait_until_tenant_active(ps_http, tenant_id=env.initial_tenant, iterations=10, period=1)
try:
data = ps_http.timeline_detail(env.initial_tenant, leaf_timeline_id)
log.debug(f"detail {data}")
except PageserverApiException as e:
log.debug(e)
if e.status_code != 404:
raise
else:
raise Exception("detail succeeded (it should return 404)")
wait_timeline_detail_404(ps_http, env.initial_tenant, leaf_timeline_id, iterations=4)
assert (
not leaf_timeline_path.exists()
@@ -503,8 +512,6 @@ def test_timeline_delete_fail_before_local_delete(neon_env_builder: NeonEnvBuild
),
)
assert env.initial_timeline is not None
for timeline_id in (intermediate_timeline_id, env.initial_timeline):
timeline_delete_wait_completed(
ps_http, tenant_id=env.initial_tenant, timeline_id=timeline_id
@@ -707,13 +714,9 @@ def test_timeline_delete_works_for_remote_smoke(
ps_http = env.pageserver.http_client()
pg = env.endpoints.create_start("main")
tenant_id = TenantId(pg.safe_psql("show neon.tenant_id")[0][0])
main_timeline_id = TimelineId(pg.safe_psql("show neon.timeline_id")[0][0])
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
assert tenant_id == env.initial_tenant
assert main_timeline_id == env.initial_timeline
assert env.initial_timeline is not None
timeline_ids = [env.initial_timeline]
for i in range(2):
branch_timeline_id = env.neon_cli.create_branch(f"new{i}", "main")
@@ -734,9 +737,21 @@ def test_timeline_delete_works_for_remote_smoke(
log.info("waiting for checkpoint upload")
wait_for_upload(ps_http, tenant_id, branch_timeline_id, current_lsn)
log.info("upload of checkpoint is done")
timeline_id = TimelineId(pg.safe_psql("show neon.timeline_id")[0][0])
timeline_ids.append(timeline_id)
timeline_ids.append(branch_timeline_id)
for timeline_id in timeline_ids:
assert_prefix_not_empty(
neon_env_builder,
prefix="/".join(
(
"tenants",
str(env.initial_tenant),
"timelines",
str(timeline_id),
)
),
)
for timeline_id in reversed(timeline_ids):
# note that we need to finish previous deletion before scheduling next one
@@ -757,8 +772,183 @@ def test_timeline_delete_works_for_remote_smoke(
# for some reason the check above doesnt immediately take effect for the below.
# Assume it is mock server inconsistency and check twice.
wait_until(
2,
0.5,
lambda: assert_prefix_empty(neon_env_builder),
wait_until(2, 0.5, lambda: assert_prefix_empty(neon_env_builder))
def test_delete_orphaned_objects(
neon_env_builder: NeonEnvBuilder,
pg_bin: PgBin,
):
remote_storage_kind = RemoteStorageKind.LOCAL_FS
neon_env_builder.enable_remote_storage(remote_storage_kind, "test_delete_orphaned_objects")
env = neon_env_builder.init_start(
initial_tenant_conf={
"gc_period": "0s",
"compaction_period": "0s",
"checkpoint_distance": f"{1024 ** 2}",
"image_creation_threshold": "100",
}
)
assert isinstance(env.remote_storage, LocalFsStorage)
ps_http = env.pageserver.http_client()
timeline_id = env.neon_cli.create_timeline("delete")
with env.endpoints.create_start("delete") as endpoint:
# generate enough layers
run_pg_bench_small(pg_bin, endpoint.connstr())
last_flush_lsn_upload(env, endpoint, env.initial_tenant, timeline_id)
# write orphaned file that is missing from the index
remote_timeline_path = env.remote_storage.timeline_path(env.initial_tenant, timeline_id)
orphans = [remote_timeline_path / f"orphan_{i}" for i in range(3)]
for orphan in orphans:
orphan.write_text("I shouldnt be there")
# trigger failpoint after orphaned file deletion to check that index_part is not deleted as well.
failpoint = "timeline-delete-before-index-delete"
ps_http.configure_failpoints((failpoint, "return"))
env.pageserver.allowed_errors.append(f".*failpoint: {failpoint}")
iterations = poll_for_remote_storage_iterations(remote_storage_kind)
ps_http.timeline_delete(env.initial_tenant, timeline_id)
timeline_info = wait_until_timeline_state(
pageserver_http=ps_http,
tenant_id=env.initial_tenant,
timeline_id=timeline_id,
expected_state="Broken",
iterations=iterations,
)
reason = timeline_info["state"]["Broken"]["reason"]
assert reason.endswith(f"failpoint: {failpoint}"), reason
for orphan in orphans:
assert not orphan.exists()
assert env.pageserver.log_contains(
f"deleting a file not referenced from index_part.json name={orphan.stem}"
)
assert env.remote_storage.index_path(env.initial_tenant, timeline_id).exists()
@pytest.mark.parametrize("remote_storage_kind", available_remote_storages())
def test_timeline_delete_resumed_on_attach(
neon_env_builder: NeonEnvBuilder,
remote_storage_kind: RemoteStorageKind,
pg_bin: PgBin,
):
neon_env_builder.enable_remote_storage(
remote_storage_kind=remote_storage_kind,
test_name="test_deleted_tenant_ignored_on_attach",
)
env = neon_env_builder.init_start(initial_tenant_conf=MANY_SMALL_LAYERS_TENANT_CONFIG)
tenant_id = env.initial_tenant
ps_http = env.pageserver.http_client()
timeline_id = env.neon_cli.create_timeline("delete")
with env.endpoints.create_start("delete") as endpoint:
# generate enough layers
run_pg_bench_small(pg_bin, endpoint.connstr())
last_flush_lsn_upload(env, endpoint, env.initial_tenant, timeline_id)
if remote_storage_kind in available_s3_storages():
assert_prefix_not_empty(
neon_env_builder,
prefix="/".join(
(
"tenants",
str(env.initial_tenant),
"timelines",
str(timeline_id),
)
),
)
# failpoint before we remove index_part from s3
failpoint = "timeline-delete-during-rm"
ps_http.configure_failpoints((failpoint, "return"))
env.pageserver.allowed_errors.extend(
(
# allow errors caused by failpoints
f".*failpoint: {failpoint}",
# It appears when we stopped flush loop during deletion (attempt) and then pageserver is stopped
".*freeze_and_flush_on_shutdown.*failed to freeze and flush: cannot flush frozen layers when flush_loop is not running, state is Exited",
# error from http response is also logged
".*InternalServerError\\(Tenant is marked as deleted on remote storage.*",
# Polling after attach may fail with this
f".*InternalServerError\\(Tenant {tenant_id} is not active.*",
'.*shutdown_pageserver{exit_code=0}: stopping left-over name="remote upload".*',
)
)
iterations = poll_for_remote_storage_iterations(remote_storage_kind)
ps_http.timeline_delete(tenant_id, timeline_id)
timeline_info = wait_until_timeline_state(
pageserver_http=ps_http,
tenant_id=env.initial_tenant,
timeline_id=timeline_id,
expected_state="Broken",
iterations=iterations,
)
reason = timeline_info["state"]["Broken"]["reason"]
log.info(f"timeline broken: {reason}")
# failpoint may not be the only error in the stack
assert reason.endswith(f"failpoint: {failpoint}"), reason
if remote_storage_kind in available_s3_storages():
assert_prefix_not_empty(
neon_env_builder,
prefix="/".join(
(
"tenants",
str(tenant_id),
"timelines",
str(timeline_id),
)
),
)
# now we stop pageserver and remove local tenant state
env.endpoints.stop_all()
env.pageserver.stop()
dir_to_clear = Path(env.repo_dir) / "tenants"
shutil.rmtree(dir_to_clear)
os.mkdir(dir_to_clear)
env.pageserver.start()
# now we call attach
ps_http.tenant_attach(tenant_id=tenant_id)
# delete should be resumed
wait_timeline_detail_404(ps_http, env.initial_tenant, timeline_id, iterations=iterations)
tenant_path = env.timeline_dir(tenant_id=tenant_id, timeline_id=timeline_id)
assert not tenant_path.exists()
if remote_storage_kind in available_s3_storages():
assert_prefix_empty(
neon_env_builder,
prefix="/".join(
(
"tenants",
str(timeline_id),
"timelines",
str(timeline_id),
)
),
)

Some files were not shown because too many files have changed in this diff Show More