Compare commits

...

32 Commits

Author SHA1 Message Date
John Spray
c4b0ecec33 pageserver: reduce usage of TENANTS 2024-05-15 20:56:17 +01:00
John Spray
a5565cc641 pageserver: pull get_active_tenant_with_timeout into TenantManager 2024-05-15 20:49:16 +01:00
John Spray
1804abd23d pageserver: pass a TenantManager into page service 2024-05-15 20:45:31 +01:00
John Spray
0d4247d2d8 fix a weird variable name 2024-05-15 18:14:10 +01:00
John Spray
b968ae0aa8 make a comment clearer 2024-05-15 18:13:37 +01:00
John Spray
f901cd459b pageserver: refine tenant_id->shard lookup 2024-05-14 22:20:57 +01:00
Christian Schwarz
4eedb3b6f1 test suite: allow overriding default compaction algorithm via env var (#7747)
This PR allows setting the
`PAGESERVER_DEFAULT_TENANT_CONFIG_COMPACTION_ALGORITHM` env var to
override the `tenant_config.compaction_algorithm` field in the initial
`pageserver.toml` for all tests.

I tested manually that this works by halting a test using pdb and
inspecting the `effective_config` in the tenant status managment API.

If the env var is set, the tests are parametrized by the `kind` tag
field, allowing to do a matrix build in CI and let Allure summarize
everything in a nice report.

If the env var is not set, the tests are not parametrized. So, merging
this PR doesn't cause problems for flaky test detection. In fact, it
doesn't cause any runtime change if the env var is not set.

There are some tests in the test suite that set used to override
the entire tenant_config using
`NeonEnvBuilder.pageserver_config_override`.
Since config overrides are merged non-recursively, such overrides
that don't specify `kind = ` cause a fallback to pageserver's built-in
`DEFAULT_COMPACTION_ALGORITHM`.

Such cases can be found using

```
["']tenant_config\s*[='"]
```

We'll deal with these tests in a future PR.

closes https://github.com/neondatabase/neon/issues/7555
2024-05-14 18:03:08 +02:00
Arpad Müller
e67fcf9563 Update mold to 2.31 (#7757)
The [2.31.0 release](https://github.com/rui314/mold/releases/tag/v2.31.0) of mold
includes a 10% speed improvement for binaries with a lot of debug info.
As we have such, it might be useful to update mold to the latest
release. The jump is from 2.4.0 to 2.31.0, but it's not been many
releases in between as the version number was raised by the mold
maintainers to 2.30.0 after 2.4.1 [to avoid confusion for some
tools](https://github.com/rui314/mold/releases/tag/v2.30.0).
2024-05-14 17:49:19 +02:00
John Spray
82960b2175 pageserver: skip waiting for logical size on shard >0 (#7744)
## Problem

Shards with number >0 could hang waiting for
`await_initial_logical_size`, as we don't calculate logical size on
these shards. This causes them to hold onto semaphore units and starve
other tenants out from proceeding with warmup activation.

That doesn't hurt availability (we still have on-demand activation), but
it does mean that some background tasks like consumption metrics would
omit some tenants.

## Summary of changes

- Skip waiting for logical size calculation on shards >0
- Upgrade unexpected code paths to use debug_assert!(), which acts as an
implicit regression test for this issue, and make the info() one into a
warn()
2024-05-14 16:39:17 +01:00
Alex Chi Z
30d15ad403 chore(test): add version check for forward compat test (#7685)
A test for https://github.com/neondatabase/neon/pull/7684.

This pull request checks if the pageserver version we specified is the
one actually running by comparing the git hash in forward compatibility
tests.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-14 10:36:48 -04:00
Alexander Bayandin
b6ee91835b CI(report-benchmarks-failures): fix condition (#7745)
## Problem

`report-benchmarks-failures` job is triggered for any failure in the CI
pipeline, but we need it to be triggered only for failed `benchmarks`
job

## Summary of changes
- replace `failure()` with `needs.benchmarks.result == 'failure'` in the
condition
2024-05-14 13:39:59 +03:00
John Spray
df0f1e359b pageserver: switch on new-style local layer paths (#7660)
We recently added support for local layer paths that contain a
generation number:
- https://github.com/neondatabase/neon/pull/7609
- https://github.com/neondatabase/neon/pull/7640

Now that we've cut a
[release](https://github.com/neondatabase/neon/pull/7735) that includes
those changes, we can proceed to enable writing the new format without
breaking forward compatibility.
2024-05-14 09:37:48 +01:00
John Spray
cd0e344938 pageserver: do fewer heatmap uploads for tiny tenants (#7731)
## Problem

Currently we do a large number of heatmap uploads for tiny tenants.
"tiny" in this context is defined as being less than a single layer in
size. These uploads are triggered by atime changes rather than changes
in the set of layers.

Uploading heatmaps for atime changes on small tenants isn't useful,
because even without bumping these atimes, disk usage eviction still
avoids evicting the largest resident layer of a tenant, which in
practice keeps tiny/empty tenants mostly resident irrespective of
atimes.

## Summary of changes

- For tenants smaller than one checkpoint interval, only upload heatmap
if the set of layers has changed, not if only the atimes have changed.
- Include the heatmap period in the uploaded heatmap, as a precursor to
implementing https://github.com/neondatabase/neon/issues/6200
(auto-adjusting download intervals to match upload intervals)
2024-05-14 09:31:26 +01:00
Heikki Linnakangas
22afaea6e1 Always use Lsn::MAX as the request LSN in the primary (#7708)
The new protocol version supports sending two LSNs to the pageserver:
request LSN and a "not_modified_since" hint. A primary always wants to
read the latest version of each page, so having two values was not
strictly necessary, and the old protocol worked fine with just the
"not_modified_since" LSN and a flag to request the latest page
version. Nevertheless, it seemed like a good idea to set the request
LSN to the current insert/flush LSN, because that's logically the page
version that the primary wants to read.

However, that made the test_gc_aggressive test case flaky. When the
primary requests a page with the last inserted or flushed LSN, it's
possible that by the time that the pageserver processes the request,
more WAL has been generated by other processes in the compute and
already digested by the pageserver. Furthermore, if the PITR horizon
in the pageserver is set to 0, and GC runs during that window, it's
possible that the GC horizon has advances past the request LSN, before
the pageserver processes the request. It is still correct to send the
latest page version in that case, because the compute either has the
page locked so the it cannot have been modified in the primary, or if
it's a prefetch request, and we will validate the LSNs when the
prefetch response is processed and discard it if the page has been
modified. But the pageserver doesn't know that and rightly complains.

To fix, modify the compute so that the primary always uses Lsn::MAX in
the requests. This reverts the primary's behavior to how the protocol
version 1 worked. In protocol version 1, there was only one LSN, the
"not_modified_since" hint, and a flag was set to read the latest page
version, whatever that might be. Requests from computes that are still
using protocol version 1 were already mapped to Lsn::MAX in the
pageserver, now we do the same with protocol version 2 for primary's
requests. (I'm a bit sad about losing the information in the
pageserver, what the last LSN was at the time that the request wa
made. We never had it with protocol version 1, but I wanted to make it
available for debugging purposes.)

Add another field, 'effective_request_lsn', to track what the flush
LSN was when the request was made. It's not sent to the pageserver,
Lsn::MAX is now used as the request LSN, but it's still needed
internally in the compute to track the validity of prefetch requests.

Fixes issue https://github.com/neondatabase/neon/issues/7692
2024-05-14 09:32:43 +03:00
Heikki Linnakangas
ba20752b76 Refactor the request LSNs to a separate struct (#7708)
We had a lot of code that passed around the two LSNs that are
associated with each GetPage request. Introduce a new struct to
encapsulate them. I'm about to add a third LSN to the struct in the
next commit, this is a mechanical refactoring in preparation for that.
2024-05-14 09:32:43 +03:00
Arpad Müller
3a6fa76828 Tiered compaction: cut deltas along lsn as well if needed (#7671)
In general, tiered compaction is splitting delta layers along the key
dimension, but this can only continue until a single key is reached: if
the changes from a single key don't fit into one layer file, we used to
create layer files of unbounded sizes.

This patch implements the method listed as TODO/FIXME in the source
code. It does the following things:

* Make `accum_key_values` take the target size and if one key's
modifications exceed it, make it fill `partition_lsns`, a vector of lsns
to use for partitioning.
* Have `retile_deltas` use that `partition_lsns` to create delta layers
separated by lsn.
* Adjust the `test_many_updates_for_single_key` to allow layer files
below 0.5 the target size. This situation can create arbitarily small
layer files: The amount of data is arbitrary that sits between having
just cut a new delta, and then stumbling upon the key that needs to be
split along lsn. This data will end up in a dedicated layer and it can
be arbitrarily small.
* Ignore single-key delta layers for depth calculation: in theory we
might have only single-key delta layers in a tier, and this might
confuse depth calculation as well, but this should be unlikely.

Fixes #7243

Part of #7554

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-05-14 01:13:25 +02:00
Alex Chi Z
9ffb852359 fix(test): ensure compatibility test uses the correct compute node (#7741)
Use the old compute node for compat tests.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-13 17:14:08 -04:00
John Spray
972470b174 pageserver: use adaptive concurrency in secondary layer downloads (#7675)
## Problem

Secondary downloads are a low priority task, and intentionally do not
try to max out download speeds. This is almost always fine when they are
used through the life of a tenant shard as a continuous "trickle" of
background downloads.

However, there are sometimes circumstances where we would like to
populate a secondary location as fast as we can, within the constraint
that we don't want to impact the activity of attached tenants:
- During node removal, where we will need to create replacements for
secondary locations on the node being removed
- After a shard split, we need new secondary locations for the new
shards to populate before the shards can be migrated to their final
location.

## Summary of changes

- Add an activity() function to the remote storage interface, enabling
callers to query how busy the remote storage backend is
- In the secondary download code, use a very modest amount of
concurrency, driven by the remote storage's state: we only use
concurrency if the remote storage semaphore is 75% free, and scale the
amount of concurrency used within that range.

This is not a super clever form of prioritization, but it should
accomplish the key goals:
- Enable secondary downloads to happen faster when the system is idle
- Make secondary downloads a much lower priority than attached tenants
when the remote storage is busy.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-05-13 17:38:30 +00:00
Vlad Lazar
1412e9b3e8 pagectl: fix diagrams generation for paths containing generations (#7739)
## Problem
When layer paths include generations, the lsn parsing does not work and
`pagectl` errors out.

## Summary of changes
If the last "word" of the layer path contains 8 characters, discard it
for the purpose of lsn parsing.
2024-05-13 18:24:12 +01:00
John Spray
be0c73f8e7 pageserver: improve API for invoking GC (#7655)
## Problem

In https://github.com/neondatabase/neon/pull/7531, I had a test flaky
because the GC API endpoint fails if the tenant happens not to be active
yet.

## Summary of changes

While adding that wait for the tenant to be active, I noticed that this
endpoint is kind of strange (spawns a TaskManager task) and has a
comment `// TODO: spawning is redundant now, need to hold the gate`, so
this PR cleans it up to just run the GC inline while holding a gate.

The GC code is updated to avoid assuming it runs inside a task manager
task. Avoiding checking the task_mgr cancellation token is safe, because
our timeline shutdown always cancels Timeline::cancel.
2024-05-13 17:59:59 +01:00
Alex Chi Z
7f51764001 feat(pageserver): add metrics for aux file size (#7623)
ref https://github.com/neondatabase/neon/issues/7443

## Summary of changes

This pull request adds a size estimator for aux files. Each timeline
stores a cached `isize` for the estimated total size of aux files. It
gets reset on basebackup, and gets updated for each aux file
modification. TODO: print a warning when it exceeds the size.

The size metrics is not accurate. Race between `on_basebackup` and other
functions could create a negative basebackup size, but the chance is
rare. Anyways, this does not impose any extra I/Os to the storage as
everything is computed in-memory.

The aux files are only stored on shard 0. As basebackups are only
generated on shard 0, only shard 0 will report this metrics.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-13 15:33:41 +00:00
Joonas Koivunen
4d8a10af1c fix: do not create metrics contention from background task permit (#7730)
The background task loop permit metrics do two of `with_label_values`
very often. Change the codepath to cache the counters on first access
into a `Lazy` with `enum_map::EnumMap`. The expectation is that this
should not fix for metric collection failures under load, but it doesn't
hurt.

Cc: #7161
2024-05-13 17:49:50 +03:00
Alexander Bayandin
55ba885f6b CI(report-benchmarks-failures): report benchmarks failures to slack (#7678)
## Problem

`benchmarks` job that we run on the main doesn't block anything, so it's
easy to miss its failure.

Ref https://github.com/neondatabase/cloud/issues/13087

## Summary of changes
- Add `report-benchmarks-failures` job that report failures of
`benchmarks` job to a Slack channel
2024-05-13 14:16:03 +01:00
Christian Schwarz
6ff74295b5 chore(pageserver): plumb through RequestContext to VirtualFile open methods (#7725)
This PR introduces no functional changes.

The `open()` path will be done separately.

refs https://github.com/neondatabase/neon/issues/6107
refs https://github.com/neondatabase/neon/issues/7386

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-05-13 14:52:06 +02:00
Vlad Lazar
bbe730d7ca Revert protocol version upgrade (#7727)
## Problem

"John pointed out that the switch to protocol version 2 made
test_gc_aggressive test flaky:
https://github.com/neondatabase/neon/issues/7692.
I tracked it down, and that is indeed an issue. Conditions for hitting
the issue:
The problem occurs in the primary
GC horizon is set to a very low value, e.g. 0.
If the primary is actively writing WAL, and GC runs in the pageserver at
the same time that the primary sends a GetPage request, it's possible
that the GC advances the GC horizon past the GetPage request's LSN. I'm
working on a fix here: https://github.com/neondatabase/neon/pull/7708."
- Heikki

## Summary of changes
Use protocol version 1 as default.
2024-05-13 13:41:14 +01:00
Jure Bajic
5a0da93c53 Fix test_lock_time_tracing flakiness (#7712)
## Problem

Closes
[test_lock_time_tracing](https://github.com/neondatabase/neon/issues/7691)

## Summary of changes

Taking a look at the execution of the same test in logs, it can be
concluded that the time we are holding the lock is sometimes not
enough(must be above 30s) to cause the second log to be shown by the
thread that is creating a timeline.

In the [successful
execution](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7663/9021247520/index.html#testresult/a21bce8c702b37f0)
it can be seen that the log `Operation TimelineCreate on key
5e088fc2dd14945020d0fa6d9efd1e36 has waited 30.000887709s for shared
lock` was on the edge of being logged, if it was below 30s it would not
be shown.

```
2024-05-09T18:02:32.552093Z  WARN request{method=PUT path=/control/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/policy request_id=af7e4a04-d181-4acb-952f-9597c8eba5a8}: Lock on UpdatePolicy was held for 31.001892592s
2024-05-09T18:02:32.552109Z  INFO request{method=PUT path=/control/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/policy request_id=af7e4a04-d181-4acb-952f-9597c8eba5a8}: Request handled, status: 200 OK
2024-05-09T18:02:32.552271Z  WARN request{method=POST path=/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/timeline request_id=d3af756e-dbb3-476b-89bd-3594f19bbb67}: Operation TimelineCreate on key 5e088fc2dd14945020d0fa6d9efd1e36 has waited 30.000887709s for shared lock
```

In the [failed
execution](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7663/9022743601/index.html#/testresult/deb90136aeae4fce):
```
2024-05-09T20:14:33.526311Z  INFO request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Creating timeline 68194ffadb61ca11adcbb11cbeb4ec6e/f72185990ed13f0b0533383f81d877af
2024-05-09T20:14:36.441165Z  INFO Heartbeat round complete for 1 nodes, 0 offline
2024-05-09T20:14:41.441657Z  INFO Heartbeat round complete for 1 nodes, 0 offline
2024-05-09T20:14:41.535227Z  INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: Handling request
2024-05-09T20:14:41.535269Z  INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: handle_validate: 68194ffadb61ca11adcbb11cbeb4ec6e(gen 1): valid=true (latest Some(00000001))
2024-05-09T20:14:41.535284Z  INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: Request handled, status: 200 OK
2024-05-09T20:14:46.441854Z  INFO Heartbeat round complete for 1 nodes, 0 offline
2024-05-09T20:14:51.441151Z  INFO Heartbeat round complete for 1 nodes, 0 offline
2024-05-09T20:14:56.441199Z  INFO Heartbeat round complete for 1 nodes, 0 offline
2024-05-09T20:15:01.440971Z  INFO Heartbeat round complete for 1 nodes, 0 offline
2024-05-09T20:15:03.516320Z  INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: failpoint "tenant-update-policy-exclusive-lock": sleep done
2024-05-09T20:15:03.518474Z  INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Updated scheduling policy to Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000
2024-05-09T20:15:03.518512Z  WARN request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Scheduling is disabled by policy Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000
2024-05-09T20:15:03.518540Z  WARN request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Lock on UpdatePolicy was held for 31.003712703s
2024-05-09T20:15:03.518570Z  INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Request handled, status: 200 OK
2024-05-09T20:15:03.518804Z  WARN request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Scheduling is disabled by policy Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000
2024-05-09T20:15:03.518815Z  INFO request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Creating timeline on shard 68194ffadb61ca11adcbb11cbeb4ec6e/f72185990ed13f0b0533383f81d877af, attached to node 1 (localhost)
```
we can see that the difference between starting to create timeline
`2024-05-09T20:14:33.526311Z` and creating timeline
`2024-05-09T20:15:03.518815Z` is not above 30s and will not cause any
logs to appear.

The proposed solution is to prolong how long we will pause to ensure
that the thread that creates the timeline waits above 30s.
2024-05-13 13:18:14 +01:00
Joonas Koivunen
d9dcbffac3 python: allow using allowed_errors.py (#7719)
See #7718. Fix it by renaming all `types.py` to `common_types.py`.

Additionally, add an advert for using `allowed_errors.py` to test any
added regex.
2024-05-13 15:16:23 +03:00
John Spray
f50ff14560 pageserver: refuse to run without remote storage (#7722)
## Problem

Since https://github.com/neondatabase/neon/pull/6769, the pageserver is
intentionally not usable without remote storage: it's purpose is to act
as a cache to an object store, rather than as a source of truth in its
own right.

## Summary of changes

- Make remote storage configuration mandatory: the pageserver will
refuse to start if it is not provided.

This is a precursor that will make it safe to subsequently remove all
the internal Option<>s
2024-05-13 13:05:46 +01:00
Christian Schwarz
b58a615197 chore(pageserver): plumb through RequestContext to VirtualFile read methods (#7720)
This PR introduces no functional changes.

The `open()` path will be done separately.

refs https://github.com/neondatabase/neon/issues/6107
refs https://github.com/neondatabase/neon/issues/7386
2024-05-13 09:22:10 +00:00
Joonas Koivunen
1a1d527875 test: allow vectored get validation failure during shutdown (#7716)
Per [evidence] the timeline ancestor detach tests can panic while
shutting down on vectored get validation. Allow the error because tenant
is restarted twice in the test.

[evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7708/9058185709/index.html#suites/a1c2be32556270764423c495fad75d47/d444f7e5c0a18ce9
2024-05-13 09:21:49 +00:00
Joonas Koivunen
216fc5ba7b test: fix confusing limit and logging (#7589)
The test has been flaky since 2024-04-11 for unknown reason, and the
logging was off. Fix the logging and raise the limit a bit. The
problematic ratio reproduces with pg14 and added sleep (not included)
but not on pg15. The new ratio abs diff limit works for all inspected
examples.

Cc: #7536
2024-05-13 11:56:07 +03:00
Joonas Koivunen
4270e86eb2 test(ancestor detach): verify with fullbackup (#7706)
In timeline detach ancestor tests there is no way to really be sure that
there were no subtle off-by one bugs. One such bug is demoed and
reverted. Add verifying fullbackup is equal before and after detaching
ancestor.

Fullbackup is expected to be equal apart from `zenith.signal`, which is
known to be good because endpoint can be started without the detached
branch receiving writes.
2024-05-13 10:58:03 +03:00
122 changed files with 1540 additions and 743 deletions

View File

@@ -546,9 +546,27 @@ jobs:
# XXX: no coverage data handling here, since benchmarks are run on release builds,
# while coverage is currently collected for the debug ones
report-benchmarks-failures:
needs: [ benchmarks, create-test-report ]
if: github.ref_name == 'main' && needs.benchmarks.result == 'failure'
runs-on: ubuntu-latest
steps:
- uses: slackapi/slack-github-action@v1
with:
channel-id: C060CNA47S9 # on-call-staging-storage-stream
slack-message: |
Benchmarks failed on main: ${{ github.event.head_commit.url }}
Allure report: ${{ needs.create-test-report.outputs.report-url }}
env:
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
create-test-report:
needs: [ check-permissions, regress-tests, coverage-report, benchmarks, build-build-tools-image ]
if: ${{ !cancelled() && contains(fromJSON('["skipped", "success"]'), needs.check-permissions.result) }}
outputs:
report-url: ${{ steps.create-allure-report.outputs.report-url }}
runs-on: [ self-hosted, gen3, small ]
container:

View File

@@ -87,7 +87,7 @@ RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "aws
&& rm awscliv2.zip
# Mold: A Modern Linker
ENV MOLD_VERSION v2.4.0
ENV MOLD_VERSION v2.31.0
RUN set -e \
&& git clone https://github.com/rui314/mold.git \
&& mkdir mold/build \

View File

@@ -29,6 +29,7 @@ use http_types::{StatusCode, Url};
use tokio_util::sync::CancellationToken;
use tracing::debug;
use crate::RemoteStorageActivity;
use crate::{
error::Cancelled, s3_bucket::RequestKind, AzureConfig, ConcurrencyLimiter, Download,
DownloadError, Listing, ListingMode, RemotePath, RemoteStorage, StorageMetadata,
@@ -525,6 +526,10 @@ impl RemoteStorage for AzureBlobStorage {
// https://learn.microsoft.com/en-us/azure/storage/blobs/point-in-time-restore-overview
Err(TimeTravelError::Unimplemented)
}
fn activity(&self) -> RemoteStorageActivity {
self.concurrency_limiter.activity()
}
}
pin_project_lite::pin_project! {

View File

@@ -263,6 +263,17 @@ pub trait RemoteStorage: Send + Sync + 'static {
done_if_after: SystemTime,
cancel: &CancellationToken,
) -> Result<(), TimeTravelError>;
/// Query how busy we currently are: may be used by callers which wish to politely
/// back off if there are already a lot of operations underway.
fn activity(&self) -> RemoteStorageActivity;
}
pub struct RemoteStorageActivity {
pub read_available: usize,
pub read_total: usize,
pub write_available: usize,
pub write_total: usize,
}
/// DownloadStream is sensitive to the timeout and cancellation used with the original
@@ -444,6 +455,15 @@ impl<Other: RemoteStorage> GenericRemoteStorage<Arc<Other>> {
}
}
}
pub fn activity(&self) -> RemoteStorageActivity {
match self {
Self::LocalFs(s) => s.activity(),
Self::AwsS3(s) => s.activity(),
Self::AzureBlob(s) => s.activity(),
Self::Unreliable(s) => s.activity(),
}
}
}
impl GenericRemoteStorage {
@@ -774,6 +794,9 @@ struct ConcurrencyLimiter {
// The helps to ensure we don't exceed the thresholds.
write: Arc<Semaphore>,
read: Arc<Semaphore>,
write_total: usize,
read_total: usize,
}
impl ConcurrencyLimiter {
@@ -802,10 +825,21 @@ impl ConcurrencyLimiter {
Arc::clone(self.for_kind(kind)).acquire_owned().await
}
fn activity(&self) -> RemoteStorageActivity {
RemoteStorageActivity {
read_available: self.read.available_permits(),
read_total: self.read_total,
write_available: self.write.available_permits(),
write_total: self.write_total,
}
}
fn new(limit: usize) -> ConcurrencyLimiter {
Self {
read: Arc::new(Semaphore::new(limit)),
write: Arc::new(Semaphore::new(limit)),
read_total: limit,
write_total: limit,
}
}
}

View File

@@ -23,8 +23,8 @@ use tokio_util::{io::ReaderStream, sync::CancellationToken};
use utils::crashsafe::path_with_suffix_extension;
use crate::{
Download, DownloadError, Listing, ListingMode, RemotePath, TimeTravelError, TimeoutOrCancel,
REMOTE_STORAGE_PREFIX_SEPARATOR,
Download, DownloadError, Listing, ListingMode, RemotePath, RemoteStorageActivity,
TimeTravelError, TimeoutOrCancel, REMOTE_STORAGE_PREFIX_SEPARATOR,
};
use super::{RemoteStorage, StorageMetadata};
@@ -605,6 +605,16 @@ impl RemoteStorage for LocalFs {
) -> Result<(), TimeTravelError> {
Err(TimeTravelError::Unimplemented)
}
fn activity(&self) -> RemoteStorageActivity {
// LocalFS has no concurrency limiting: give callers the impression that plenty of units are available
RemoteStorageActivity {
read_available: 16,
read_total: 16,
write_available: 16,
write_total: 16,
}
}
}
fn storage_metadata_path(original_path: &Utf8Path) -> Utf8PathBuf {

View File

@@ -47,8 +47,8 @@ use utils::backoff;
use super::StorageMetadata;
use crate::{
error::Cancelled, support::PermitCarrying, ConcurrencyLimiter, Download, DownloadError,
Listing, ListingMode, RemotePath, RemoteStorage, S3Config, TimeTravelError, TimeoutOrCancel,
MAX_KEYS_PER_DELETE, REMOTE_STORAGE_PREFIX_SEPARATOR,
Listing, ListingMode, RemotePath, RemoteStorage, RemoteStorageActivity, S3Config,
TimeTravelError, TimeoutOrCancel, MAX_KEYS_PER_DELETE, REMOTE_STORAGE_PREFIX_SEPARATOR,
};
pub(super) mod metrics;
@@ -975,6 +975,10 @@ impl RemoteStorage for S3Bucket {
}
Ok(())
}
fn activity(&self) -> RemoteStorageActivity {
self.concurrency_limiter.activity()
}
}
/// On drop (cancellation) count towards [`metrics::BucketMetrics::cancelled_waits`].

View File

@@ -12,7 +12,7 @@ use tokio_util::sync::CancellationToken;
use crate::{
Download, DownloadError, GenericRemoteStorage, Listing, ListingMode, RemotePath, RemoteStorage,
StorageMetadata, TimeTravelError,
RemoteStorageActivity, StorageMetadata, TimeTravelError,
};
pub struct UnreliableWrapper {
@@ -213,4 +213,8 @@ impl RemoteStorage for UnreliableWrapper {
.time_travel_recover(prefix, timestamp, done_if_after, cancel)
.await
}
fn activity(&self) -> RemoteStorageActivity {
self.inner.activity()
}
}

View File

@@ -530,8 +530,6 @@ where
// If we have accumulated only a narrow band of keyspace, create an
// image layer. Otherwise write a delta layer.
// FIXME: deal with the case of lots of values for same key
// FIXME: we are ignoring images here. Did we already divide the work
// so that we won't encounter them here?
@@ -550,39 +548,94 @@ where
let mut new_jobs = Vec::new();
// Slide a window through the keyspace
let mut key_accum = std::pin::pin!(accum_key_values(key_value_stream));
let mut key_accum =
std::pin::pin!(accum_key_values(key_value_stream, self.target_file_size));
let mut all_in_window: bool = false;
let mut window = Window::new();
// Helper function to create a job for a new delta layer with given key-lsn
// rectangle.
let create_delta_job = |key_range, lsn_range: &Range<Lsn>, new_jobs: &mut Vec<_>| {
// The inputs for the job are all the input layers of the original job that
// overlap with the rectangle.
let batch_layers: Vec<LayerId> = job
.input_layers
.iter()
.filter(|layer_id| {
overlaps_with(self.layers[layer_id.0].layer.key_range(), &key_range)
})
.cloned()
.collect();
assert!(!batch_layers.is_empty());
new_jobs.push(CompactionJob {
key_range,
lsn_range: lsn_range.clone(),
strategy: CompactionStrategy::CreateDelta,
input_layers: batch_layers,
completed: false,
});
};
loop {
if all_in_window && window.elems.is_empty() {
if all_in_window && window.is_empty() {
// All done!
break;
}
// If we now have enough keyspace for next delta layer in the window, create a
// new delta layer
if let Some(key_range) = window.choose_next_delta(self.target_file_size, !all_in_window)
{
let batch_layers: Vec<LayerId> = job
.input_layers
.iter()
.filter(|layer_id| {
overlaps_with(self.layers[layer_id.0].layer.key_range(), &key_range)
})
.cloned()
.collect();
assert!(!batch_layers.is_empty());
new_jobs.push(CompactionJob {
key_range,
lsn_range: job.lsn_range.clone(),
strategy: CompactionStrategy::CreateDelta,
input_layers: batch_layers,
completed: false,
});
} else {
assert!(!all_in_window);
if let Some(next_key) = key_accum.next().await.transpose()? {
window.feed(next_key.key, next_key.size);
} else {
create_delta_job(key_range, &job.lsn_range, &mut new_jobs);
continue;
}
assert!(!all_in_window);
// Process next key in the key space
match key_accum.next().await.transpose()? {
None => {
all_in_window = true;
}
Some(next_key) if next_key.partition_lsns.is_empty() => {
// Normal case: extend the window by the key
window.feed(next_key.key, next_key.size);
}
Some(next_key) => {
// A key with too large size impact for a single delta layer. This
// case occurs if you make a huge number of updates for a single key.
//
// Drain the window with has_more = false to make a clean cut before
// the key, and then make dedicated delta layers for the single key.
//
// We cannot cluster the key with the others, because we don't want
// layer files to overlap with each other in the lsn,key space (no
// overlaps for the rectangles).
let key = next_key.key;
debug!("key {key} with size impact larger than the layer size");
while !window.is_empty() {
let has_more = false;
let key_range = window.choose_next_delta(self.target_file_size, has_more)
.expect("with has_more==false, choose_next_delta always returns something for a non-empty Window");
create_delta_job(key_range, &job.lsn_range, &mut new_jobs);
}
// Not really required: but here for future resilience:
// We make a "gap" here, so any structure the window holds should
// probably be reset.
window = Window::new();
let mut prior_lsn = job.lsn_range.start;
let mut lsn_ranges = Vec::new();
for (lsn, _size) in next_key.partition_lsns.iter() {
lsn_ranges.push(prior_lsn..*lsn);
prior_lsn = *lsn;
}
lsn_ranges.push(prior_lsn..job.lsn_range.end);
for lsn_range in lsn_ranges {
let key_range = key..key.next();
create_delta_job(key_range, &lsn_range, &mut new_jobs);
}
}
}
}
@@ -803,6 +856,10 @@ where
self.elems.front().unwrap().accum_size - self.splitoff_size
}
fn is_empty(&self) -> bool {
self.elems.is_empty()
}
fn commit_upto(&mut self, mut upto: usize) {
while upto > 1 {
let popped = self.elems.pop_front().unwrap();

View File

@@ -235,9 +235,14 @@ pub struct KeySize<K> {
pub key: K,
pub num_values: u64,
pub size: u64,
/// The lsns to partition at (if empty then no per-lsn partitioning)
pub partition_lsns: Vec<(Lsn, u64)>,
}
pub fn accum_key_values<'a, I, K, D, E>(input: I) -> impl Stream<Item = Result<KeySize<K>, E>>
pub fn accum_key_values<'a, I, K, D, E>(
input: I,
target_size: u64,
) -> impl Stream<Item = Result<KeySize<K>, E>>
where
K: Eq + PartialOrd + Display + Copy,
I: Stream<Item = Result<D, E>>,
@@ -249,25 +254,35 @@ where
if let Some(first) = input.next().await {
let first = first?;
let mut part_size = first.size();
let mut accum: KeySize<K> = KeySize {
key: first.key(),
num_values: 1,
size: first.size(),
size: part_size,
partition_lsns: Vec::new(),
};
let mut last_key = accum.key;
while let Some(this) = input.next().await {
let this = this?;
if this.key() == accum.key {
accum.size += this.size();
let add_size = this.size();
if part_size + add_size > target_size {
accum.partition_lsns.push((this.lsn(), part_size));
part_size = 0;
}
part_size += add_size;
accum.size += add_size;
accum.num_values += 1;
} else {
assert!(last_key <= accum.key, "last_key={last_key} <= accum.key={}", accum.key);
last_key = accum.key;
yield accum;
part_size = this.size();
accum = KeySize {
key: this.key(),
num_values: 1,
size: this.size(),
size: part_size,
partition_lsns: Vec::new(),
};
}
}

View File

@@ -184,6 +184,12 @@ impl<L> Level<L> {
}
let mut events: Vec<Event<K>> = Vec::new();
for (idx, l) in self.layers.iter().enumerate() {
let key_range = l.key_range();
if key_range.end == key_range.start.next() && l.is_delta() {
// Ignore single-key delta layers as they can be stacked on top of each other
// as that is the only way to cut further.
continue;
}
events.push(Event {
key: l.key_range().start,
layer_idx: idx,

View File

@@ -20,10 +20,6 @@ pub(crate) fn setup_logging() {
/// even if we produce an extremely narrow delta layer, spanning just that one
/// key, we still too many records to fit in the target file size. We need to
/// split in the LSN dimension too in that case.
///
/// TODO: The code to avoid this problem has not been implemented yet! So the
/// assertion currently fails, but we need to make it not fail.
#[ignore]
#[tokio::test]
async fn test_many_updates_for_single_key() {
setup_logging();
@@ -43,9 +39,9 @@ async fn test_many_updates_for_single_key() {
}
for l in executor.live_layers.iter() {
assert!(l.file_size() < executor.target_file_size * 2);
// sanity check that none of the delta layers are stupidly small either
// Sanity check that none of the delta layers are empty either.
if l.is_delta() {
assert!(l.file_size() > executor.target_file_size / 2);
assert!(l.file_size() > 0);
}
}
}

View File

@@ -83,6 +83,11 @@ fn parse_filename(name: &str) -> (Range<Key>, Range<Lsn>) {
let split: Vec<&str> = name.split("__").collect();
let keys: Vec<&str> = split[0].split('-').collect();
let mut lsns: Vec<&str> = split[1].split('-').collect();
if lsns.last().expect("should").len() == 8 {
lsns.pop();
}
if lsns.len() == 1 {
lsns.push(lsns[0]);
}

View File

@@ -100,7 +100,7 @@ pub(crate) fn parse_filename(name: &str) -> Option<LayerFile> {
// Finds the max_holes largest holes, ignoring any that are smaller than MIN_HOLE_LENGTH"
async fn get_holes(path: &Utf8Path, max_holes: usize, ctx: &RequestContext) -> Result<Vec<Hole>> {
let file = VirtualFile::open(path).await?;
let file = VirtualFile::open(path, ctx).await?;
let file_id = page_cache::next_file_id();
let block_reader = FileBlockReader::new(&file, file_id);
let summary_blk = block_reader.read_blk(0, ctx).await?;

View File

@@ -61,7 +61,7 @@ async fn read_delta_file(path: impl AsRef<Path>, ctx: &RequestContext) -> Result
let path = Utf8Path::from_path(path.as_ref()).expect("non-Unicode path");
virtual_file::init(10, virtual_file::api::IoEngineKind::StdFs);
page_cache::init(100);
let file = VirtualFile::open(path).await?;
let file = VirtualFile::open(path, ctx).await?;
let file_id = page_cache::next_file_id();
let block_reader = FileBlockReader::new(&file, file_id);
let summary_blk = block_reader.read_blk(0, ctx).await?;

View File

@@ -1,3 +1,6 @@
use std::sync::Arc;
use ::metrics::IntGauge;
use bytes::{Buf, BufMut, Bytes};
use pageserver_api::key::{Key, AUX_KEY_PREFIX, METADATA_KEY_SIZE};
use tracing::warn;
@@ -140,6 +143,55 @@ pub fn encode_file_value(files: &[(&str, &[u8])]) -> anyhow::Result<Vec<u8>> {
Ok(encoded)
}
/// An estimation of the size of aux files.
pub struct AuxFileSizeEstimator {
aux_file_size_gauge: IntGauge,
size: Arc<std::sync::Mutex<Option<isize>>>,
}
impl AuxFileSizeEstimator {
pub fn new(aux_file_size_gauge: IntGauge) -> Self {
Self {
aux_file_size_gauge,
size: Arc::new(std::sync::Mutex::new(None)),
}
}
pub fn on_base_backup(&self, new_size: usize) {
let mut guard = self.size.lock().unwrap();
*guard = Some(new_size as isize);
self.report(new_size as isize);
}
pub fn on_add(&self, file_size: usize) {
let mut guard = self.size.lock().unwrap();
if let Some(size) = &mut *guard {
*size += file_size as isize;
self.report(*size);
}
}
pub fn on_remove(&self, file_size: usize) {
let mut guard = self.size.lock().unwrap();
if let Some(size) = &mut *guard {
*size -= file_size as isize;
self.report(*size);
}
}
pub fn on_update(&self, old_size: usize, new_size: usize) {
let mut guard = self.size.lock().unwrap();
if let Some(size) = &mut *guard {
*size += new_size as isize - old_size as isize;
self.report(*size);
}
}
pub fn report(&self, size: isize) {
self.aux_file_size_gauge.set(size as i64);
}
}
#[cfg(test)]
mod tests {
use super::*;

View File

@@ -383,7 +383,7 @@ fn start_pageserver(
let shutdown_pageserver = tokio_util::sync::CancellationToken::new();
// Set up remote storage client
let remote_storage = create_remote_storage_client(conf)?;
let remote_storage = Some(create_remote_storage_client(conf)?);
// Set up deletion queue
let (deletion_queue, deletion_workers) = DeletionQueue::new(
@@ -654,17 +654,20 @@ fn start_pageserver(
None,
"libpq endpoint listener",
true,
async move {
page_service::libpq_listener_main(
conf,
broker_client,
pg_auth,
pageserver_listener,
conf.pg_auth_type,
libpq_ctx,
task_mgr::shutdown_token(),
)
.await
{
let tenant_manager = tenant_manager.clone();
async move {
page_service::libpq_listener_main(
tenant_manager,
broker_client,
pg_auth,
pageserver_listener,
conf.pg_auth_type,
libpq_ctx,
task_mgr::shutdown_token(),
)
.await
}
},
);
}
@@ -708,12 +711,11 @@ fn start_pageserver(
fn create_remote_storage_client(
conf: &'static PageServerConf,
) -> anyhow::Result<Option<GenericRemoteStorage>> {
) -> anyhow::Result<GenericRemoteStorage> {
let config = if let Some(config) = &conf.remote_storage_config {
config
} else {
tracing::warn!("no remote storage configured, this is a deprecated configuration");
return Ok(None);
anyhow::bail!("no remote storage configured, this is a deprecated configuration");
};
// Create the client
@@ -733,7 +735,7 @@ fn create_remote_storage_client(
GenericRemoteStorage::unreliable_wrapper(remote_storage, conf.test_remote_failures);
}
Ok(Some(remote_storage))
Ok(remote_storage)
}
fn cli() -> Command {

View File

@@ -1715,12 +1715,7 @@ async fn timeline_gc_handler(
let gc_req: TimelineGcRequest = json_request(&mut request).await?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let wait_task_done = mgr::immediate_gc(tenant_shard_id, timeline_id, gc_req, cancel, &ctx)?;
let gc_result = wait_task_done
.await
.context("wait for gc task")
.map_err(ApiError::InternalServerError)?
.map_err(ApiError::InternalServerError)?;
let gc_result = mgr::immediate_gc(tenant_shard_id, timeline_id, gc_req, cancel, &ctx).await?;
json_response(StatusCode::OK, gc_result)
}

View File

@@ -585,6 +585,15 @@ static CURRENT_LOGICAL_SIZE: Lazy<UIntGaugeVec> = Lazy::new(|| {
.expect("failed to define current logical size metric")
});
static AUX_FILE_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
register_int_gauge_vec!(
"pageserver_aux_file_estimated_size",
"The size of all aux files for a timeline in aux file v2 store.",
&["tenant_id", "shard_id", "timeline_id"]
)
.expect("failed to define a metric")
});
pub(crate) mod initial_logical_size {
use metrics::{register_int_counter, register_int_counter_vec, IntCounter, IntCounterVec};
use once_cell::sync::Lazy;
@@ -2115,6 +2124,7 @@ pub(crate) struct TimelineMetrics {
resident_physical_size_gauge: UIntGauge,
/// copy of LayeredTimeline.current_logical_size
pub current_logical_size_gauge: UIntGauge,
pub aux_file_size_gauge: IntGauge,
pub directory_entries_count_gauge: Lazy<UIntGauge, Box<dyn Send + Fn() -> UIntGauge>>,
pub evictions: IntCounter,
pub evictions_with_low_residence_duration: std::sync::RwLock<EvictionsWithLowResidenceDuration>,
@@ -2187,6 +2197,9 @@ impl TimelineMetrics {
let current_logical_size_gauge = CURRENT_LOGICAL_SIZE
.get_metric_with_label_values(&[&tenant_id, &shard_id, &timeline_id])
.unwrap();
let aux_file_size_gauge = AUX_FILE_SIZE
.get_metric_with_label_values(&[&tenant_id, &shard_id, &timeline_id])
.unwrap();
// TODO use impl Trait syntax here once we have ability to use it: https://github.com/rust-lang/rust/issues/63065
let directory_entries_count_gauge_closure = {
let tenant_shard_id = *tenant_shard_id;
@@ -2224,6 +2237,7 @@ impl TimelineMetrics {
last_record_gauge,
resident_physical_size_gauge,
current_logical_size_gauge,
aux_file_size_gauge,
directory_entries_count_gauge,
evictions,
evictions_with_low_residence_duration: std::sync::RwLock::new(
@@ -2264,6 +2278,7 @@ impl TimelineMetrics {
let _ = metric.remove_label_values(&[tenant_id, shard_id, timeline_id]);
}
let _ = EVICTIONS.remove_label_values(&[tenant_id, shard_id, timeline_id]);
let _ = AUX_FILE_SIZE.remove_label_values(&[tenant_id, shard_id, timeline_id]);
self.evictions_with_low_residence_duration
.write()

View File

@@ -49,7 +49,6 @@ use utils::{
use crate::auth::check_permission;
use crate::basebackup;
use crate::basebackup::BasebackupError;
use crate::config::PageServerConf;
use crate::context::{DownloadBehavior, RequestContext};
use crate::import_datadir::import_wal_from_tar;
use crate::metrics;
@@ -59,10 +58,9 @@ use crate::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::span::debug_assert_current_span_has_tenant_and_timeline_id_no_shard_id;
use crate::task_mgr;
use crate::task_mgr::TaskKind;
use crate::tenant::mgr;
use crate::tenant::mgr::get_active_tenant_with_timeout;
use crate::tenant::mgr::GetActiveTenantError;
use crate::tenant::mgr::ShardSelector;
use crate::tenant::mgr::TenantManager;
use crate::tenant::timeline::WaitLsnError;
use crate::tenant::GetTimelineError;
use crate::tenant::PageReconstructError;
@@ -135,7 +133,7 @@ async fn read_tar_eof(mut reader: (impl AsyncRead + Unpin)) -> anyhow::Result<()
/// Listens for connections, and launches a new handler task for each.
///
pub async fn libpq_listener_main(
conf: &'static PageServerConf,
tenant_manager: Arc<TenantManager>,
broker_client: storage_broker::BrokerClientChannel,
auth: Option<Arc<SwappableJwtAuth>>,
listener: TcpListener,
@@ -180,7 +178,7 @@ pub async fn libpq_listener_main(
"serving compute connection task",
false,
page_service_conn_main(
conf,
tenant_manager.clone(),
broker_client.clone(),
local_auth,
socket,
@@ -203,7 +201,7 @@ pub async fn libpq_listener_main(
#[instrument(skip_all, fields(peer_addr))]
async fn page_service_conn_main(
conf: &'static PageServerConf,
tenant_manager: Arc<TenantManager>,
broker_client: storage_broker::BrokerClientChannel,
auth: Option<Arc<SwappableJwtAuth>>,
socket: tokio::net::TcpStream,
@@ -260,7 +258,8 @@ async fn page_service_conn_main(
// and create a child per-query context when it invokes process_query.
// But it's in a shared crate, so, we store connection_ctx inside PageServerHandler
// and create the per-query context in process_query ourselves.
let mut conn_handler = PageServerHandler::new(conf, broker_client, auth, connection_ctx);
let mut conn_handler =
PageServerHandler::new(tenant_manager, broker_client, auth, connection_ctx);
let pgbackend = PostgresBackend::new_from_io(socket, peer_addr, auth_type, None)?;
match pgbackend
@@ -291,11 +290,12 @@ struct HandlerTimeline {
}
struct PageServerHandler {
_conf: &'static PageServerConf,
broker_client: storage_broker::BrokerClientChannel,
auth: Option<Arc<SwappableJwtAuth>>,
claims: Option<Claims>,
tenant_manager: Arc<TenantManager>,
/// The context created for the lifetime of the connection
/// services by this PageServerHandler.
/// For each query received over the connection,
@@ -381,13 +381,13 @@ impl From<WaitLsnError> for QueryError {
impl PageServerHandler {
pub fn new(
conf: &'static PageServerConf,
tenant_manager: Arc<TenantManager>,
broker_client: storage_broker::BrokerClientChannel,
auth: Option<Arc<SwappableJwtAuth>>,
connection_ctx: RequestContext,
) -> Self {
PageServerHandler {
_conf: conf,
tenant_manager,
broker_client,
auth,
claims: None,
@@ -552,13 +552,15 @@ impl PageServerHandler {
{
debug_assert_current_span_has_tenant_and_timeline_id_no_shard_id();
let tenant = mgr::get_active_tenant_with_timeout(
tenant_id,
ShardSelector::First,
ACTIVE_TENANT_TIMEOUT,
&task_mgr::shutdown_token(),
)
.await?;
let tenant = self
.tenant_manager
.get_active_tenant_with_timeout(
tenant_id,
ShardSelector::First,
ACTIVE_TENANT_TIMEOUT,
&task_mgr::shutdown_token(),
)
.await?;
// Make request tracer if needed
let mut tracer = if tenant.get_trace_read_requests() {
@@ -726,13 +728,15 @@ impl PageServerHandler {
// Create empty timeline
info!("creating new timeline");
let tenant = get_active_tenant_with_timeout(
tenant_id,
ShardSelector::Zero,
ACTIVE_TENANT_TIMEOUT,
&task_mgr::shutdown_token(),
)
.await?;
let tenant = self
.tenant_manager
.get_active_tenant_with_timeout(
tenant_id,
ShardSelector::Zero,
ACTIVE_TENANT_TIMEOUT,
&task_mgr::shutdown_token(),
)
.await?;
let timeline = tenant
.create_empty_timeline(timeline_id, base_lsn, pg_version, &ctx)
.await?;
@@ -1370,14 +1374,16 @@ impl PageServerHandler {
timeline_id: TimelineId,
selector: ShardSelector,
) -> Result<Arc<Timeline>, GetActiveTimelineError> {
let tenant = get_active_tenant_with_timeout(
tenant_id,
selector,
ACTIVE_TENANT_TIMEOUT,
&task_mgr::shutdown_token(),
)
.await
.map_err(GetActiveTimelineError::Tenant)?;
let tenant = self
.tenant_manager
.get_active_tenant_with_timeout(
tenant_id,
selector,
ACTIVE_TENANT_TIMEOUT,
&task_mgr::shutdown_token(),
)
.await
.map_err(GetActiveTimelineError::Tenant)?;
let timeline = tenant.get_timeline(timeline_id, true)?;
set_tracing_field_shard_id(&timeline);
Ok(timeline)
@@ -1771,13 +1777,15 @@ where
self.check_permission(Some(tenant_id))?;
let tenant = get_active_tenant_with_timeout(
tenant_id,
ShardSelector::Zero,
ACTIVE_TENANT_TIMEOUT,
&task_mgr::shutdown_token(),
)
.await?;
let tenant = self
.tenant_manager
.get_active_tenant_with_timeout(
tenant_id,
ShardSelector::Zero,
ACTIVE_TENANT_TIMEOUT,
&task_mgr::shutdown_token(),
)
.await?;
pgb.write_message_noflush(&BeMessage::RowDescription(&[
RowDescriptor::int8_col(b"checkpoint_distance"),
RowDescriptor::int8_col(b"checkpoint_timeout"),

View File

@@ -699,13 +699,17 @@ impl Timeline {
.await
.context("scan")?;
let mut result = HashMap::new();
let mut sz = 0;
for (_, v) in kv {
let v = v.context("get value")?;
let v = aux_file::decode_file_value_bytes(&v).context("value decode")?;
for (fname, content) in v {
sz += fname.len();
sz += content.len();
result.insert(fname, content);
}
}
self.aux_file_size_estimator.on_base_backup(sz);
Ok(result)
}
@@ -1474,23 +1478,45 @@ impl<'a> DatadirModification<'a> {
Err(PageReconstructError::MissingKey(_)) => None,
Err(e) => return Err(e.into()),
};
let files = if let Some(ref old_val) = old_val {
let files: Vec<(&str, &[u8])> = if let Some(ref old_val) = old_val {
aux_file::decode_file_value(old_val)?
} else {
Vec::new()
};
let new_files = if content.is_empty() {
files
.into_iter()
.filter(|(p, _)| &path != p)
.collect::<Vec<_>>()
} else {
files
.into_iter()
.filter(|(p, _)| &path != p)
.chain(std::iter::once((path, content)))
.collect::<Vec<_>>()
};
let mut other_files = Vec::with_capacity(files.len());
let mut modifying_file = None;
for file @ (p, content) in files {
if path == p {
assert!(
modifying_file.is_none(),
"duplicated entries found for {}",
path
);
modifying_file = Some(content);
} else {
other_files.push(file);
}
}
let mut new_files = other_files;
match (modifying_file, content.is_empty()) {
(Some(old_content), false) => {
self.tline
.aux_file_size_estimator
.on_update(old_content.len(), content.len());
new_files.push((path, content));
}
(Some(old_content), true) => {
self.tline
.aux_file_size_estimator
.on_remove(old_content.len());
// not adding the file key to the final `new_files` vec.
}
(None, false) => {
self.tline.aux_file_size_estimator.on_add(content.len());
new_files.push((path, content));
}
(None, true) => anyhow::bail!("removing non-existing aux file: {}", path),
}
let new_val = aux_file::encode_file_value(&new_files)?;
self.put(key, Value::Image(new_val.into()));
}
@@ -1671,7 +1697,7 @@ impl<'a> DatadirModification<'a> {
}
if !self.pending_deletions.is_empty() {
writer.delete_batch(&self.pending_deletions).await?;
writer.delete_batch(&self.pending_deletions, ctx).await?;
self.pending_deletions.clear();
}

View File

@@ -2800,7 +2800,7 @@ impl Tenant {
// See comments in [`Tenant::branch_timeline`] for more information about why branch
// creation task can run concurrently with timeline's GC iteration.
for timeline in gc_timelines {
if task_mgr::is_shutdown_requested() || cancel.is_cancelled() {
if cancel.is_cancelled() {
// We were requested to shut down. Stop and return with the progress we
// made.
break;

View File

@@ -299,7 +299,7 @@ mod tests {
// Write part (in block to drop the file)
let mut offsets = Vec::new();
{
let file = VirtualFile::create(pathbuf.as_path()).await?;
let file = VirtualFile::create(pathbuf.as_path(), &ctx).await?;
let mut wtr = BlobWriter::<BUFFERED>::new(file, 0);
for blob in blobs.iter() {
let (_, res) = wtr.write_blob(blob.clone(), &ctx).await;
@@ -314,7 +314,7 @@ mod tests {
wtr.flush_buffer(&ctx).await?;
}
let file = VirtualFile::open(pathbuf.as_path()).await?;
let file = VirtualFile::open(pathbuf.as_path(), &ctx).await?;
let rdr = BlockReaderRef::VirtualFile(&file);
let rdr = BlockCursor::new(rdr);
for (idx, (blob, offset)) in blobs.iter().zip(offsets.iter()).enumerate() {

View File

@@ -102,7 +102,7 @@ impl<'a> BlockReaderRef<'a> {
#[cfg(test)]
TestDisk(r) => r.read_blk(blknum),
#[cfg(test)]
VirtualFile(r) => r.read_blk(blknum).await,
VirtualFile(r) => r.read_blk(blknum, ctx).await,
}
}
}
@@ -177,10 +177,11 @@ impl<'a> FileBlockReader<'a> {
&self,
buf: PageWriteGuard<'static>,
blkno: u32,
ctx: &RequestContext,
) -> Result<PageWriteGuard<'static>, std::io::Error> {
assert!(buf.len() == PAGE_SZ);
self.file
.read_exact_at_page(buf, blkno as u64 * PAGE_SZ as u64)
.read_exact_at_page(buf, blkno as u64 * PAGE_SZ as u64, ctx)
.await
}
/// Read a block.
@@ -206,7 +207,7 @@ impl<'a> FileBlockReader<'a> {
ReadBufResult::Found(guard) => Ok(guard.into()),
ReadBufResult::NotFound(write_guard) => {
// Read the page from disk into the buffer
let write_guard = self.fill_buffer(write_guard, blknum).await?;
let write_guard = self.fill_buffer(write_guard, blknum, ctx).await?;
Ok(write_guard.mark_valid().into())
}
}

View File

@@ -28,6 +28,7 @@ impl EphemeralFile {
conf: &PageServerConf,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
ctx: &RequestContext,
) -> Result<EphemeralFile, io::Error> {
static NEXT_FILENAME: AtomicU64 = AtomicU64::new(1);
let filename_disambiguator =
@@ -45,6 +46,7 @@ impl EphemeralFile {
.read(true)
.write(true)
.create(true),
ctx,
)
.await?;
@@ -153,7 +155,7 @@ mod tests {
async fn test_ephemeral_blobs() -> Result<(), io::Error> {
let (conf, tenant_id, timeline_id, ctx) = harness("ephemeral_blobs")?;
let mut file = EphemeralFile::create(conf, tenant_id, timeline_id).await?;
let mut file = EphemeralFile::create(conf, tenant_id, timeline_id, &ctx).await?;
let pos_foo = file.write_blob(b"foo", &ctx).await?;
assert_eq!(

View File

@@ -78,7 +78,7 @@ impl RW {
page_cache::ReadBufResult::NotFound(write_guard) => {
let write_guard = writer
.file
.read_exact_at_page(write_guard, blknum as u64 * PAGE_SZ as u64)
.read_exact_at_page(write_guard, blknum as u64 * PAGE_SZ as u64, ctx)
.await?;
let read_guard = write_guard.mark_valid();
return Ok(BlockLease::PageReadGuard(read_guard));

View File

@@ -119,6 +119,7 @@ pub(crate) enum TenantsMapRemoveResult {
/// When resolving a TenantId to a shard, we may be looking for the 0th
/// shard, or we might be looking for whichever shard holds a particular page.
#[derive(Copy, Clone)]
pub(crate) enum ShardSelector {
/// Only return the 0th shard, if it is present. If a non-0th shard is present,
/// ignore it.
@@ -169,6 +170,14 @@ impl TenantStartupMode {
}
}
/// Result type for looking up a TenantId to a specific shard
enum ShardResolveResult {
NotFound,
Found(Arc<Tenant>),
// Wait for this barrrier, then query again
InProgress(utils::completion::Barrier),
}
impl TenantsMap {
/// Convenience function for typical usage, where we want to get a `Tenant` object, for
/// working with attached tenants. If the TenantId is in the map but in Secondary state,
@@ -184,26 +193,40 @@ impl TenantsMap {
/// A page service client sends a TenantId, and to look up the correct Tenant we must
/// resolve this to a fully qualified TenantShardId.
///
/// During shard splits: we shall see parent shards in InProgress state and skip them, and
/// instead match on child shards which should appear in Attached state. Very early in a shard
/// split, or in other cases where a shard is InProgress, we will return our own InProgress result
/// to instruct the caller to wait for that to finish before querying again.
fn resolve_attached_shard(
&self,
tenant_id: &TenantId,
selector: ShardSelector,
) -> Option<TenantShardId> {
) -> ShardResolveResult {
let mut want_shard = None;
let mut any_in_progress = None;
match self {
TenantsMap::Initializing => None,
TenantsMap::Initializing => ShardResolveResult::NotFound,
TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => {
for slot in m.range(TenantShardId::tenant_range(*tenant_id)) {
// Ignore all slots that don't contain an attached tenant
let tenant = match &slot.1 {
TenantSlot::Attached(t) => t,
TenantSlot::InProgress(barrier) => {
// We might still find a usable shard, but in case we don't, remember that
// we saw at least one InProgress slot, so that we can distinguish this case
// from a simple NotFound in our return value.
any_in_progress = Some(barrier.clone());
continue;
}
_ => continue,
};
match selector {
ShardSelector::First => return Some(*slot.0),
ShardSelector::First => return ShardResolveResult::Found(tenant.clone()),
ShardSelector::Zero if slot.0.shard_number == ShardNumber(0) => {
return Some(*slot.0)
return ShardResolveResult::Found(tenant.clone())
}
ShardSelector::Page(key) => {
// First slot we see for this tenant, calculate the expected shard number
@@ -214,15 +237,21 @@ impl TenantsMap {
}
if Some(tenant.shard_identity.number) == want_shard {
return Some(*slot.0);
return ShardResolveResult::Found(tenant.clone());
}
}
_ => continue,
}
}
// Fall through: we didn't find an acceptable shard
None
// Fall through: we didn't find a slot that was in Attached state & matched our selector. If
// we found one or more InProgress slot, indicate to caller that they should retry later. Otherwise
// this requested shard simply isn't found.
if let Some(barrier) = any_in_progress {
ShardResolveResult::InProgress(barrier)
} else {
ShardResolveResult::NotFound
}
}
}
}
@@ -1693,7 +1722,7 @@ impl TenantManager {
for child_shard_id in &child_shards {
let child_shard_id = *child_shard_id;
let child_shard = {
let locked = TENANTS.read().unwrap();
let locked = self.tenants.read().unwrap();
let peek_slot =
tenant_map_peek_slot(&locked, &child_shard_id, TenantSlotPeekMode::Read)?;
peek_slot.and_then(|s| s.get_attached()).cloned()
@@ -1926,13 +1955,7 @@ impl TenantManager {
deletion_queue_client: &DeletionQueueClient,
) -> Result<(), TenantStateError> {
let tmp_path = self
.detach_tenant0(
conf,
&TENANTS,
tenant_shard_id,
detach_ignored,
deletion_queue_client,
)
.detach_tenant0(conf, tenant_shard_id, detach_ignored, deletion_queue_client)
.await?;
spawn_background_purge(tmp_path);
@@ -1942,7 +1965,6 @@ impl TenantManager {
async fn detach_tenant0(
&self,
conf: &'static PageServerConf,
tenants: &std::sync::RwLock<TenantsMap>,
tenant_shard_id: TenantShardId,
detach_ignored: bool,
deletion_queue_client: &DeletionQueueClient,
@@ -1957,7 +1979,7 @@ impl TenantManager {
};
let removal_result = remove_tenant_from_memory(
tenants,
self.tenants,
tenant_shard_id,
tenant_dir_rename_operation(tenant_shard_id),
)
@@ -1993,7 +2015,7 @@ impl TenantManager {
pub(crate) fn list_tenants(
&self,
) -> Result<Vec<(TenantShardId, TenantState, Generation)>, TenantMapListError> {
let tenants = TENANTS.read().unwrap();
let tenants = self.tenants.read().unwrap();
let m = match &*tenants {
TenantsMap::Initializing => return Err(TenantMapListError::Initializing),
TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => m,
@@ -2103,6 +2125,68 @@ impl TenantManager {
Ok(reparented)
}
/// Get a [`Tenant`] in its active state. If the tenant_id is currently in [`TenantSlot::InProgress`]
/// state, then wait for up to `timeout`. If the [`Tenant`] is not currently in [`TenantState::Active`],
/// then wait for up to `timeout` (minus however long we waited for the slot).
pub(crate) async fn get_active_tenant_with_timeout(
&self,
tenant_id: TenantId,
shard_selector: ShardSelector,
timeout: Duration,
cancel: &CancellationToken,
) -> Result<Arc<Tenant>, GetActiveTenantError> {
let wait_start = Instant::now();
let deadline = wait_start + timeout;
// Resolve TenantId to TenantShardId. This is usually a quick one-shot thing, the loop is
// for handling the rare case that the slot we're accessing is InProgress.
let tenant_shard = loop {
let resolved = {
let locked = self.tenants.read().unwrap();
locked.resolve_attached_shard(&tenant_id, shard_selector)
};
match resolved {
ShardResolveResult::Found(tenant_shard) => break tenant_shard,
ShardResolveResult::NotFound => {
return Err(GetActiveTenantError::NotFound(GetTenantError::NotFound(
tenant_id,
)));
}
ShardResolveResult::InProgress(barrier) => {
// We can't authoritatively answer right now: wait for InProgress state
// to end, then try again
match timeout_cancellable(
deadline.duration_since(Instant::now()),
cancel,
barrier.wait(),
)
.await
{
Ok(_) => {
// The barrier completed: proceed around the loop to try looking up again
continue;
}
Err(TimeoutCancellableError::Timeout) => {
return Err(GetActiveTenantError::WaitForActiveTimeout {
latest_state: None,
wait_time: timeout,
});
}
Err(TimeoutCancellableError::Cancelled) => {
return Err(GetActiveTenantError::Cancelled);
}
}
}
};
};
tracing::debug!("Waiting for tenant to enter active state...");
tenant_shard
.wait_to_become_active(deadline.duration_since(Instant::now()))
.await?;
Ok(tenant_shard)
}
}
#[derive(Debug, thiserror::Error)]
@@ -2151,105 +2235,6 @@ pub(crate) enum GetActiveTenantError {
Broken(String),
}
/// Get a [`Tenant`] in its active state. If the tenant_id is currently in [`TenantSlot::InProgress`]
/// state, then wait for up to `timeout`. If the [`Tenant`] is not currently in [`TenantState::Active`],
/// then wait for up to `timeout` (minus however long we waited for the slot).
pub(crate) async fn get_active_tenant_with_timeout(
tenant_id: TenantId,
shard_selector: ShardSelector,
timeout: Duration,
cancel: &CancellationToken,
) -> Result<Arc<Tenant>, GetActiveTenantError> {
enum WaitFor {
Barrier(utils::completion::Barrier),
Tenant(Arc<Tenant>),
}
let wait_start = Instant::now();
let deadline = wait_start + timeout;
let (wait_for, tenant_shard_id) = {
let locked = TENANTS.read().unwrap();
// Resolve TenantId to TenantShardId
let tenant_shard_id = locked
.resolve_attached_shard(&tenant_id, shard_selector)
.ok_or(GetActiveTenantError::NotFound(GetTenantError::NotFound(
tenant_id,
)))?;
let peek_slot = tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Read)
.map_err(GetTenantError::MapState)?;
match peek_slot {
Some(TenantSlot::Attached(tenant)) => {
match tenant.current_state() {
TenantState::Active => {
// Fast path: we don't need to do any async waiting.
return Ok(tenant.clone());
}
_ => {
tenant.activate_now();
(WaitFor::Tenant(tenant.clone()), tenant_shard_id)
}
}
}
Some(TenantSlot::Secondary(_)) => {
return Err(GetActiveTenantError::NotFound(GetTenantError::NotActive(
tenant_shard_id,
)))
}
Some(TenantSlot::InProgress(barrier)) => {
(WaitFor::Barrier(barrier.clone()), tenant_shard_id)
}
None => {
return Err(GetActiveTenantError::NotFound(GetTenantError::NotFound(
tenant_id,
)))
}
}
};
let tenant = match wait_for {
WaitFor::Barrier(barrier) => {
tracing::debug!("Waiting for tenant InProgress state to pass...");
timeout_cancellable(
deadline.duration_since(Instant::now()),
cancel,
barrier.wait(),
)
.await
.map_err(|e| match e {
TimeoutCancellableError::Timeout => GetActiveTenantError::WaitForActiveTimeout {
latest_state: None,
wait_time: wait_start.elapsed(),
},
TimeoutCancellableError::Cancelled => GetActiveTenantError::Cancelled,
})?;
{
let locked = TENANTS.read().unwrap();
let peek_slot =
tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Read)
.map_err(GetTenantError::MapState)?;
match peek_slot {
Some(TenantSlot::Attached(tenant)) => tenant.clone(),
_ => {
return Err(GetActiveTenantError::NotFound(GetTenantError::NotActive(
tenant_shard_id,
)))
}
}
}
}
WaitFor::Tenant(tenant) => tenant,
};
tracing::debug!("Waiting for tenant to enter active state...");
tenant
.wait_to_become_active(deadline.duration_since(Instant::now()))
.await?;
Ok(tenant)
}
#[derive(Debug, thiserror::Error)]
pub(crate) enum DeleteTimelineError {
#[error("Tenant {0}")]
@@ -2880,86 +2865,73 @@ use {
utils::http::error::ApiError,
};
pub(crate) fn immediate_gc(
#[instrument(skip_all, fields(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), %timeline_id))]
pub(crate) async fn immediate_gc(
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
gc_req: TimelineGcRequest,
cancel: CancellationToken,
ctx: &RequestContext,
) -> Result<tokio::sync::oneshot::Receiver<Result<GcResult, anyhow::Error>>, ApiError> {
let guard = TENANTS.read().unwrap();
let tenant = guard
.get(&tenant_shard_id)
.cloned()
.with_context(|| format!("tenant {tenant_shard_id}"))
.map_err(|e| ApiError::NotFound(e.into()))?;
) -> Result<GcResult, ApiError> {
let tenant = {
let guard = TENANTS.read().unwrap();
guard
.get(&tenant_shard_id)
.cloned()
.with_context(|| format!("tenant {tenant_shard_id}"))
.map_err(|e| ApiError::NotFound(e.into()))?
};
let gc_horizon = gc_req.gc_horizon.unwrap_or_else(|| tenant.get_gc_horizon());
// Use tenant's pitr setting
let pitr = tenant.get_pitr_interval();
tenant.wait_to_become_active(ACTIVE_TENANT_TIMEOUT).await?;
// Run in task_mgr to avoid race with tenant_detach operation
let ctx = ctx.detached_child(TaskKind::GarbageCollector, DownloadBehavior::Download);
let (task_done, wait_task_done) = tokio::sync::oneshot::channel();
let span = info_span!("manual_gc", tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), %timeline_id);
let ctx: RequestContext =
ctx.detached_child(TaskKind::GarbageCollector, DownloadBehavior::Download);
// TODO: spawning is redundant now, need to hold the gate
task_mgr::spawn(
&tokio::runtime::Handle::current(),
TaskKind::GarbageCollector,
Some(tenant_shard_id),
Some(timeline_id),
&format!("timeline_gc_handler garbage collection run for tenant {tenant_shard_id} timeline {timeline_id}"),
false,
async move {
fail::fail_point!("immediate_gc_task_pre");
let _gate_guard = tenant.gate.enter().map_err(|_| ApiError::ShuttingDown)?;
#[allow(unused_mut)]
let mut result = tenant
.gc_iteration(Some(timeline_id), gc_horizon, pitr, &cancel, &ctx)
.await;
// FIXME: `gc_iteration` can return an error for multiple reasons; we should handle it
// better once the types support it.
fail::fail_point!("immediate_gc_task_pre");
#[cfg(feature = "testing")]
{
// we need to synchronize with drop completion for python tests without polling for
// log messages
if let Ok(result) = result.as_mut() {
let mut js = tokio::task::JoinSet::new();
for layer in std::mem::take(&mut result.doomed_layers) {
js.spawn(layer.wait_drop());
}
tracing::info!(total = js.len(), "starting to wait for the gc'd layers to be dropped");
while let Some(res) = js.join_next().await {
res.expect("wait_drop should not panic");
}
}
#[allow(unused_mut)]
let mut result = tenant
.gc_iteration(Some(timeline_id), gc_horizon, pitr, &cancel, &ctx)
.await;
// FIXME: `gc_iteration` can return an error for multiple reasons; we should handle it
// better once the types support it.
let timeline = tenant.get_timeline(timeline_id, false).ok();
let rtc = timeline.as_ref().and_then(|x| x.remote_client.as_ref());
if let Some(rtc) = rtc {
// layer drops schedule actions on remote timeline client to actually do the
// deletions; don't care about the shutdown error, just exit fast
drop(rtc.wait_completion().await);
}
#[cfg(feature = "testing")]
{
// we need to synchronize with drop completion for python tests without polling for
// log messages
if let Ok(result) = result.as_mut() {
let mut js = tokio::task::JoinSet::new();
for layer in std::mem::take(&mut result.doomed_layers) {
js.spawn(layer.wait_drop());
}
match task_done.send(result) {
Ok(_) => (),
Err(result) => error!("failed to send gc result: {result:?}"),
tracing::info!(
total = js.len(),
"starting to wait for the gc'd layers to be dropped"
);
while let Some(res) = js.join_next().await {
res.expect("wait_drop should not panic");
}
Ok(())
}
.instrument(span)
);
// drop the guard until after we've spawned the task so that timeline shutdown will wait for the task
drop(guard);
let timeline = tenant.get_timeline(timeline_id, false).ok();
let rtc = timeline.as_ref().and_then(|x| x.remote_client.as_ref());
Ok(wait_task_done)
if let Some(rtc) = rtc {
// layer drops schedule actions on remote timeline client to actually do the
// deletions; don't care about the shutdown error, just exit fast
drop(rtc.wait_completion().await);
}
}
result.map_err(ApiError::InternalServerError)
}
#[cfg(test)]

View File

@@ -1127,6 +1127,11 @@ impl RemoteTimelineClient {
Ok(())
}
pub(crate) fn is_deleting(&self) -> bool {
let mut locked = self.upload_queue.lock().unwrap();
locked.stopped_mut().is_ok()
}
pub(crate) async fn preserve_initdb_archive(
self: &Arc<Self>,
tenant_id: &TenantId,

View File

@@ -112,14 +112,17 @@ pub async fn download_layer_file<'a>(
// We use fatal_err() below because the after the rename above,
// the in-memory state of the filesystem already has the layer file in its final place,
// and subsequent pageserver code could think it's durable while it really isn't.
let work = async move {
let timeline_dir = VirtualFile::open(&timeline_path)
.await
.fatal_err("VirtualFile::open for timeline dir fsync");
timeline_dir
.sync_all()
.await
.fatal_err("VirtualFile::sync_all timeline dir");
let work = {
let ctx = ctx.detached_child(ctx.task_kind(), ctx.download_behavior());
async move {
let timeline_dir = VirtualFile::open(&timeline_path, &ctx)
.await
.fatal_err("VirtualFile::open for timeline dir fsync");
timeline_dir
.sync_all()
.await
.fatal_err("VirtualFile::sync_all timeline dir");
}
};
crate::virtual_file::io_engine::get()
.spawn_blocking_and_block_on_if_std(work)
@@ -196,7 +199,7 @@ async fn download_object<'a>(
use crate::virtual_file::owned_buffers_io::{self, util::size_tracking_writer};
use bytes::BytesMut;
async {
let destination_file = VirtualFile::create(dst_path)
let destination_file = VirtualFile::create(dst_path, ctx)
.await
.with_context(|| format!("create a destination file for layer '{dst_path}'"))
.map_err(DownloadError::Other)?;

View File

@@ -45,10 +45,10 @@ use crate::tenant::{
use camino::Utf8PathBuf;
use chrono::format::{DelayedFormat, StrftimeItems};
use futures::Future;
use futures::{Future, StreamExt};
use pageserver_api::models::SecondaryProgress;
use pageserver_api::shard::TenantShardId;
use remote_storage::{DownloadError, Etag, GenericRemoteStorage};
use remote_storage::{DownloadError, Etag, GenericRemoteStorage, RemoteStorageActivity};
use tokio_util::sync::CancellationToken;
use tracing::{info_span, instrument, warn, Instrument};
@@ -71,6 +71,12 @@ use super::{
/// `<ttps://github.com/neondatabase/neon/issues/6200>`
const DOWNLOAD_FRESHEN_INTERVAL: Duration = Duration::from_millis(60000);
/// Range of concurrency we may use when downloading layers within a timeline. This is independent
/// for each tenant we're downloading: the concurrency of _tenants_ is defined separately in
/// `PageServerConf::secondary_download_concurrency`
const MAX_LAYER_CONCURRENCY: usize = 16;
const MIN_LAYER_CONCURRENCY: usize = 1;
pub(super) async fn downloader_task(
tenant_manager: Arc<TenantManager>,
remote_storage: GenericRemoteStorage,
@@ -79,14 +85,15 @@ pub(super) async fn downloader_task(
cancel: CancellationToken,
root_ctx: RequestContext,
) {
let concurrency = tenant_manager.get_conf().secondary_download_concurrency;
// How many tenants' secondary download operations we will run concurrently
let tenant_concurrency = tenant_manager.get_conf().secondary_download_concurrency;
let generator = SecondaryDownloader {
tenant_manager,
remote_storage,
root_ctx,
};
let mut scheduler = Scheduler::new(generator, concurrency);
let mut scheduler = Scheduler::new(generator, tenant_concurrency);
scheduler
.run(command_queue, background_jobs_can_start, cancel)
@@ -792,6 +799,8 @@ impl<'a> TenantDownloader<'a> {
tracing::debug!(timeline_id=%timeline.timeline_id, "Downloading layers, {} in heatmap", timeline.layers.len());
let mut download_futs = Vec::new();
// Download heatmap layers that are not present on local disk, or update their
// access time if they are already present.
for layer in timeline.layers {
@@ -874,67 +883,33 @@ impl<'a> TenantDownloader<'a> {
}
}
// Failpoint for simulating slow remote storage
failpoint_support::sleep_millis_async!(
"secondary-layer-download-sleep",
&self.secondary_state.cancel
);
// Note: no backoff::retry wrapper here because download_layer_file does its own retries internally
let downloaded_bytes = match download_layer_file(
self.conf,
self.remote_storage,
*tenant_shard_id,
timeline.timeline_id,
&layer.name,
&LayerFileMetadata::from(&layer.metadata),
&self.secondary_state.cancel,
download_futs.push(self.download_layer(
tenant_shard_id,
&timeline.timeline_id,
layer,
ctx,
)
.await
{
Ok(bytes) => bytes,
Err(DownloadError::NotFound) => {
// A heatmap might be out of date and refer to a layer that doesn't exist any more.
// This is harmless: continue to download the next layer. It is expected during compaction
// GC.
tracing::debug!(
"Skipped downloading missing layer {}, raced with compaction/gc?",
layer.name
);
continue;
));
}
// Break up layer downloads into chunks, so that for each chunk we can re-check how much
// concurrency to use based on activity level of remote storage.
while !download_futs.is_empty() {
let chunk =
download_futs.split_off(download_futs.len().saturating_sub(MAX_LAYER_CONCURRENCY));
let concurrency = Self::layer_concurrency(self.remote_storage.activity());
let mut result_stream = futures::stream::iter(chunk).buffered(concurrency);
let mut result_stream = std::pin::pin!(result_stream);
while let Some(result) = result_stream.next().await {
match result {
Err(e) => return Err(e),
Ok(None) => {
// No error, but we didn't download the layer. Don't mark it touched
}
Ok(Some(layer)) => touched.push(layer),
}
Err(e) => return Err(e.into()),
};
if downloaded_bytes != layer.metadata.file_size {
let local_path = local_layer_path(
self.conf,
tenant_shard_id,
&timeline.timeline_id,
&layer.name,
&layer.metadata.generation,
);
tracing::warn!(
"Downloaded layer {} with unexpected size {} != {}. Removing download.",
layer.name,
downloaded_bytes,
layer.metadata.file_size
);
tokio::fs::remove_file(&local_path)
.await
.or_else(fs_ext::ignore_not_found)?;
} else {
tracing::info!("Downloaded layer {}, size {}", layer.name, downloaded_bytes);
let mut progress = self.secondary_state.progress.lock().unwrap();
progress.bytes_downloaded += downloaded_bytes;
progress.layers_downloaded += 1;
}
SECONDARY_MODE.download_layer.inc();
touched.push(layer)
}
// Write updates to state to record layers we just downloaded or touched.
@@ -966,6 +941,90 @@ impl<'a> TenantDownloader<'a> {
Ok(())
}
async fn download_layer(
&self,
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
layer: HeatMapLayer,
ctx: &RequestContext,
) -> Result<Option<HeatMapLayer>, UpdateError> {
// Failpoint for simulating slow remote storage
failpoint_support::sleep_millis_async!(
"secondary-layer-download-sleep",
&self.secondary_state.cancel
);
// Note: no backoff::retry wrapper here because download_layer_file does its own retries internally
let downloaded_bytes = match download_layer_file(
self.conf,
self.remote_storage,
*tenant_shard_id,
*timeline_id,
&layer.name,
&LayerFileMetadata::from(&layer.metadata),
&self.secondary_state.cancel,
ctx,
)
.await
{
Ok(bytes) => bytes,
Err(DownloadError::NotFound) => {
// A heatmap might be out of date and refer to a layer that doesn't exist any more.
// This is harmless: continue to download the next layer. It is expected during compaction
// GC.
tracing::debug!(
"Skipped downloading missing layer {}, raced with compaction/gc?",
layer.name
);
return Ok(None);
}
Err(e) => return Err(e.into()),
};
if downloaded_bytes != layer.metadata.file_size {
let local_path = local_layer_path(
self.conf,
tenant_shard_id,
timeline_id,
&layer.name,
&layer.metadata.generation,
);
tracing::warn!(
"Downloaded layer {} with unexpected size {} != {}. Removing download.",
layer.name,
downloaded_bytes,
layer.metadata.file_size
);
tokio::fs::remove_file(&local_path)
.await
.or_else(fs_ext::ignore_not_found)?;
} else {
tracing::info!("Downloaded layer {}, size {}", layer.name, downloaded_bytes);
let mut progress = self.secondary_state.progress.lock().unwrap();
progress.bytes_downloaded += downloaded_bytes;
progress.layers_downloaded += 1;
}
SECONDARY_MODE.download_layer.inc();
Ok(Some(layer))
}
/// Calculate the currently allowed parallelism of layer download tasks, based on activity level of the remote storage
fn layer_concurrency(activity: RemoteStorageActivity) -> usize {
// When less than 75% of units are available, use minimum concurrency. Else, do a linear mapping
// of our concurrency range to the units available within the remaining 25%.
let clamp_at = (activity.read_total * 3) / 4;
if activity.read_available > clamp_at {
(MAX_LAYER_CONCURRENCY * (activity.read_available - clamp_at))
/ (activity.read_total - clamp_at)
} else {
MIN_LAYER_CONCURRENCY
}
}
}
/// Scan local storage and build up Layer objects based on the metadata in a HeatMapTimeline
@@ -1092,3 +1151,58 @@ async fn init_timeline_state(
detail
}
#[cfg(test)]
mod test {
use super::*;
#[test]
fn layer_concurrency() {
// Totally idle
assert_eq!(
TenantDownloader::layer_concurrency(RemoteStorageActivity {
read_available: 16,
read_total: 16,
write_available: 16,
write_total: 16
}),
MAX_LAYER_CONCURRENCY
);
// Totally busy
assert_eq!(
TenantDownloader::layer_concurrency(RemoteStorageActivity {
read_available: 0,
read_total: 16,
write_available: 16,
write_total: 16
}),
MIN_LAYER_CONCURRENCY
);
// Edge of the range at which we interpolate
assert_eq!(
TenantDownloader::layer_concurrency(RemoteStorageActivity {
read_available: 12,
read_total: 16,
write_available: 16,
write_total: 16
}),
MIN_LAYER_CONCURRENCY
);
// Midpoint of the range in which we interpolate
assert_eq!(
TenantDownloader::layer_concurrency(RemoteStorageActivity {
read_available: 14,
read_total: 16,
write_available: 16,
write_total: 16
}),
MAX_LAYER_CONCURRENCY / 2
);
}
}

View File

@@ -15,6 +15,14 @@ pub(super) struct HeatMapTenant {
pub(super) generation: Generation,
pub(super) timelines: Vec<HeatMapTimeline>,
/// Uploaders provide their own upload period in the heatmap, as a hint to downloaders
/// of how frequently it is worthwhile to check for updates.
///
/// This is optional for backward compat, and because we sometimes might upload
/// a heatmap explicitly via API for a tenant that has no periodic upload configured.
#[serde(default)]
pub(super) upload_period_ms: Option<u128>,
}
#[serde_as]
@@ -81,4 +89,21 @@ impl HeatMapTenant {
stats
}
pub(crate) fn strip_atimes(self) -> Self {
Self {
timelines: self
.timelines
.into_iter()
.map(|mut tl| {
for layer in &mut tl.layers {
layer.access_time = SystemTime::UNIX_EPOCH;
}
tl
})
.collect(),
generation: self.generation,
upload_period_ms: self.upload_period_ms,
}
}
}

View File

@@ -80,7 +80,7 @@ impl RunningJob for WriteInProgress {
struct UploadPending {
tenant: Arc<Tenant>,
last_digest: Option<md5::Digest>,
last_upload: Option<LastUploadState>,
target_time: Option<Instant>,
period: Option<Duration>,
}
@@ -94,7 +94,7 @@ impl scheduler::PendingJob for UploadPending {
struct WriteComplete {
tenant_shard_id: TenantShardId,
completed_at: Instant,
digest: Option<md5::Digest>,
uploaded: Option<LastUploadState>,
next_upload: Option<Instant>,
}
@@ -115,10 +115,7 @@ struct UploaderTenantState {
tenant: Weak<Tenant>,
/// Digest of the serialized heatmap that we last successfully uploaded
///
/// md5 is generally a bad hash. We use it because it's convenient for interop with AWS S3's ETag,
/// which is also an md5sum.
last_digest: Option<md5::Digest>,
last_upload_state: Option<LastUploadState>,
/// When the last upload attempt completed (may have been successful or failed)
last_upload: Option<Instant>,
@@ -187,7 +184,7 @@ impl JobGenerator<UploadPending, WriteInProgress, WriteComplete, UploadCommand>
tenant: Arc::downgrade(&tenant),
last_upload: None,
next_upload: Some(now.checked_add(period_warmup(period)).unwrap_or(now)),
last_digest: None,
last_upload_state: None,
});
// Decline to do the upload if insufficient time has passed
@@ -195,10 +192,10 @@ impl JobGenerator<UploadPending, WriteInProgress, WriteComplete, UploadCommand>
return;
}
let last_digest = state.last_digest;
let last_upload = state.last_upload_state.clone();
result.jobs.push(UploadPending {
tenant,
last_digest,
last_upload,
target_time: state.next_upload,
period: Some(period),
});
@@ -218,7 +215,7 @@ impl JobGenerator<UploadPending, WriteInProgress, WriteComplete, UploadCommand>
) {
let UploadPending {
tenant,
last_digest,
last_upload,
target_time,
period,
} = job;
@@ -231,16 +228,16 @@ impl JobGenerator<UploadPending, WriteInProgress, WriteComplete, UploadCommand>
let _completion = completion;
let started_at = Instant::now();
let digest = match upload_tenant_heatmap(remote_storage, &tenant, last_digest).await {
Ok(UploadHeatmapOutcome::Uploaded(digest)) => {
let uploaded = match upload_tenant_heatmap(remote_storage, &tenant, last_upload.clone()).await {
Ok(UploadHeatmapOutcome::Uploaded(uploaded)) => {
let duration = Instant::now().duration_since(started_at);
SECONDARY_MODE
.upload_heatmap_duration
.observe(duration.as_secs_f64());
SECONDARY_MODE.upload_heatmap.inc();
Some(digest)
Some(uploaded)
}
Ok(UploadHeatmapOutcome::NoChange | UploadHeatmapOutcome::Skipped) => last_digest,
Ok(UploadHeatmapOutcome::NoChange | UploadHeatmapOutcome::Skipped) => last_upload,
Err(UploadHeatmapError::Upload(e)) => {
tracing::warn!(
"Failed to upload heatmap for tenant {}: {e:#}",
@@ -251,11 +248,11 @@ impl JobGenerator<UploadPending, WriteInProgress, WriteComplete, UploadCommand>
.upload_heatmap_duration
.observe(duration.as_secs_f64());
SECONDARY_MODE.upload_heatmap_errors.inc();
last_digest
last_upload
}
Err(UploadHeatmapError::Cancelled) => {
tracing::info!("Cancelled heatmap upload, shutting down");
last_digest
last_upload
}
};
@@ -277,7 +274,7 @@ impl JobGenerator<UploadPending, WriteInProgress, WriteComplete, UploadCommand>
WriteComplete {
tenant_shard_id: *tenant.get_tenant_shard_id(),
completed_at: now,
digest,
uploaded,
next_upload,
}
}.instrument(info_span!(parent: None, "heatmap_upload", tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug()))))
@@ -299,7 +296,7 @@ impl JobGenerator<UploadPending, WriteInProgress, WriteComplete, UploadCommand>
Ok(UploadPending {
// Ignore our state for last digest: this forces an upload even if nothing has changed
last_digest: None,
last_upload: None,
tenant,
target_time: None,
period: None,
@@ -312,7 +309,7 @@ impl JobGenerator<UploadPending, WriteInProgress, WriteComplete, UploadCommand>
let WriteComplete {
tenant_shard_id,
completed_at,
digest,
uploaded,
next_upload,
} = completion;
use std::collections::hash_map::Entry;
@@ -322,7 +319,7 @@ impl JobGenerator<UploadPending, WriteInProgress, WriteComplete, UploadCommand>
}
Entry::Occupied(mut entry) => {
entry.get_mut().last_upload = Some(completed_at);
entry.get_mut().last_digest = digest;
entry.get_mut().last_upload_state = uploaded;
entry.get_mut().next_upload = next_upload
}
}
@@ -331,7 +328,7 @@ impl JobGenerator<UploadPending, WriteInProgress, WriteComplete, UploadCommand>
enum UploadHeatmapOutcome {
/// We successfully wrote to remote storage, with this digest.
Uploaded(md5::Digest),
Uploaded(LastUploadState),
/// We did not upload because the heatmap digest was unchanged since the last upload
NoChange,
/// We skipped the upload for some reason, such as tenant/timeline not ready
@@ -347,12 +344,25 @@ enum UploadHeatmapError {
Upload(#[from] anyhow::Error),
}
/// Digests describing the heatmap we most recently uploaded successfully.
///
/// md5 is generally a bad hash. We use it because it's convenient for interop with AWS S3's ETag,
/// which is also an md5sum.
#[derive(Clone)]
struct LastUploadState {
// Digest of json-encoded HeatMapTenant
uploaded_digest: md5::Digest,
// Digest without atimes set.
layers_only_digest: md5::Digest,
}
/// The inner upload operation. This will skip if `last_digest` is Some and matches the digest
/// of the object we would have uploaded.
async fn upload_tenant_heatmap(
remote_storage: GenericRemoteStorage,
tenant: &Arc<Tenant>,
last_digest: Option<md5::Digest>,
last_upload: Option<LastUploadState>,
) -> Result<UploadHeatmapOutcome, UploadHeatmapError> {
debug_assert_current_span_has_tenant_id();
@@ -368,6 +378,7 @@ async fn upload_tenant_heatmap(
let mut heatmap = HeatMapTenant {
timelines: Vec::new(),
generation,
upload_period_ms: tenant.get_heatmap_period().map(|p| p.as_millis()),
};
let timelines = tenant.timelines.lock().unwrap().clone();
@@ -396,15 +407,31 @@ async fn upload_tenant_heatmap(
// Serialize the heatmap
let bytes = serde_json::to_vec(&heatmap).map_err(|e| anyhow::anyhow!(e))?;
let bytes = bytes::Bytes::from(bytes);
let size = bytes.len();
// Drop out early if nothing changed since our last upload
let digest = md5::compute(&bytes);
if Some(digest) == last_digest {
if Some(&digest) == last_upload.as_ref().map(|d| &d.uploaded_digest) {
return Ok(UploadHeatmapOutcome::NoChange);
}
// Calculate a digest that omits atimes, so that we can distinguish actual changes in
// layers from changes only in atimes.
let heatmap_size_bytes = heatmap.get_stats().bytes;
let layers_only_bytes =
serde_json::to_vec(&heatmap.strip_atimes()).map_err(|e| anyhow::anyhow!(e))?;
let layers_only_digest = md5::compute(&layers_only_bytes);
if heatmap_size_bytes < tenant.get_checkpoint_distance() {
// For small tenants, skip upload if only atimes changed. This avoids doing frequent
// uploads from long-idle tenants whose atimes are just incremented by periodic
// size calculations.
if Some(&layers_only_digest) == last_upload.as_ref().map(|d| &d.layers_only_digest) {
return Ok(UploadHeatmapOutcome::NoChange);
}
}
let bytes = bytes::Bytes::from(bytes);
let size = bytes.len();
let path = remote_heatmap_path(tenant.get_tenant_shard_id());
let cancel = &tenant.cancel;
@@ -436,5 +463,8 @@ async fn upload_tenant_heatmap(
tracing::info!("Successfully uploaded {size} byte heatmap to {path}");
Ok(UploadHeatmapOutcome::Uploaded(digest))
Ok(UploadHeatmapOutcome::Uploaded(LastUploadState {
uploaded_digest: digest,
layers_only_digest,
}))
}

View File

@@ -394,6 +394,7 @@ impl DeltaLayerWriterInner {
tenant_shard_id: TenantShardId,
key_start: Key,
lsn_range: Range<Lsn>,
ctx: &RequestContext,
) -> anyhow::Result<Self> {
// Create the file initially with a temporary filename. We don't know
// the end key yet, so we cannot form the final filename yet. We will
@@ -404,7 +405,7 @@ impl DeltaLayerWriterInner {
let path =
DeltaLayer::temp_path_for(conf, &tenant_shard_id, &timeline_id, key_start, &lsn_range);
let mut file = VirtualFile::create(&path).await?;
let mut file = VirtualFile::create(&path, ctx).await?;
// make room for the header block
file.seek(SeekFrom::Start(PAGE_SZ as u64)).await?;
let blob_writer = BlobWriter::new(file, PAGE_SZ as u64);
@@ -586,6 +587,7 @@ impl DeltaLayerWriter {
tenant_shard_id: TenantShardId,
key_start: Key,
lsn_range: Range<Lsn>,
ctx: &RequestContext,
) -> anyhow::Result<Self> {
Ok(Self {
inner: Some(
@@ -595,6 +597,7 @@ impl DeltaLayerWriter {
tenant_shard_id,
key_start,
lsn_range,
ctx,
)
.await?,
),
@@ -701,6 +704,7 @@ impl DeltaLayer {
let mut file = VirtualFile::open_with_options(
path,
virtual_file::OpenOptions::new().read(true).write(true),
ctx,
)
.await
.with_context(|| format!("Failed to open file '{}'", path))?;
@@ -734,7 +738,7 @@ impl DeltaLayerInner {
max_vectored_read_bytes: Option<MaxVectoredReadBytes>,
ctx: &RequestContext,
) -> Result<Result<Self, anyhow::Error>, anyhow::Error> {
let file = match VirtualFile::open(path).await {
let file = match VirtualFile::open(path, ctx).await {
Ok(file) => file,
Err(e) => return Ok(Err(anyhow::Error::new(e).context("open layer file"))),
};
@@ -908,7 +912,7 @@ impl DeltaLayerInner {
.await
.map_err(GetVectoredError::Other)?;
self.do_reads_and_update_state(reads, reconstruct_state)
self.do_reads_and_update_state(reads, reconstruct_state, ctx)
.await;
reconstruct_state.on_lsn_advanced(&keyspace, self.lsn_range.start);
@@ -1012,6 +1016,7 @@ impl DeltaLayerInner {
&self,
reads: Vec<VectoredRead>,
reconstruct_state: &mut ValuesReconstructState,
ctx: &RequestContext,
) {
let vectored_blob_reader = VectoredBlobReader::new(&self.file);
let mut ignore_key_with_err = None;
@@ -1029,7 +1034,7 @@ impl DeltaLayerInner {
// track when a key is done.
for read in reads.into_iter().rev() {
let res = vectored_blob_reader
.read_blobs(&read, buf.take().expect("Should have a buffer"))
.read_blobs(&read, buf.take().expect("Should have a buffer"), ctx)
.await;
let blobs_buf = match res {
@@ -1274,7 +1279,7 @@ impl DeltaLayerInner {
buf.clear();
buf.reserve(read.size());
let res = reader.read_blobs(&read, buf).await?;
let res = reader.read_blobs(&read, buf, ctx).await?;
for blob in res.blobs {
let key = blob.meta.key;
@@ -1791,6 +1796,7 @@ mod test {
harness.tenant_shard_id,
entries_meta.key_range.start,
entries_meta.lsn_range.clone(),
&ctx,
)
.await?;
@@ -1848,7 +1854,7 @@ mod test {
for read in vectored_reads {
let blobs_buf = vectored_blob_reader
.read_blobs(&read, buf.take().expect("Should have a buffer"))
.read_blobs(&read, buf.take().expect("Should have a buffer"), &ctx)
.await?;
for meta in blobs_buf.blobs.iter() {
let value = &blobs_buf.buf[meta.start..meta.end];
@@ -1978,6 +1984,7 @@ mod test {
tenant.tenant_shard_id,
Key::MIN,
Lsn(0x11)..truncate_at,
ctx,
)
.await
.unwrap();

View File

@@ -343,6 +343,7 @@ impl ImageLayer {
let mut file = VirtualFile::open_with_options(
path,
virtual_file::OpenOptions::new().read(true).write(true),
ctx,
)
.await
.with_context(|| format!("Failed to open file '{}'", path))?;
@@ -377,7 +378,7 @@ impl ImageLayerInner {
max_vectored_read_bytes: Option<MaxVectoredReadBytes>,
ctx: &RequestContext,
) -> Result<Result<Self, anyhow::Error>, anyhow::Error> {
let file = match VirtualFile::open(path).await {
let file = match VirtualFile::open(path, ctx).await {
Ok(file) => file,
Err(e) => return Ok(Err(anyhow::Error::new(e).context("open layer file"))),
};
@@ -474,7 +475,7 @@ impl ImageLayerInner {
.await
.map_err(GetVectoredError::Other)?;
self.do_reads_and_update_state(reads, reconstruct_state)
self.do_reads_and_update_state(reads, reconstruct_state, ctx)
.await;
Ok(())
@@ -537,6 +538,7 @@ impl ImageLayerInner {
&self,
reads: Vec<VectoredRead>,
reconstruct_state: &mut ValuesReconstructState,
ctx: &RequestContext,
) {
let max_vectored_read_bytes = self
.max_vectored_read_bytes
@@ -565,7 +567,7 @@ impl ImageLayerInner {
}
let buf = BytesMut::with_capacity(buf_size);
let res = vectored_blob_reader.read_blobs(&read, buf).await;
let res = vectored_blob_reader.read_blobs(&read, buf, ctx).await;
match res {
Ok(blobs_buf) => {
@@ -631,6 +633,7 @@ impl ImageLayerWriterInner {
tenant_shard_id: TenantShardId,
key_range: &Range<Key>,
lsn: Lsn,
ctx: &RequestContext,
) -> anyhow::Result<Self> {
// Create the file initially with a temporary filename.
// We'll atomically rename it to the final name when we're done.
@@ -650,6 +653,7 @@ impl ImageLayerWriterInner {
virtual_file::OpenOptions::new()
.write(true)
.create_new(true),
ctx,
)
.await?
};
@@ -804,10 +808,11 @@ impl ImageLayerWriter {
tenant_shard_id: TenantShardId,
key_range: &Range<Key>,
lsn: Lsn,
ctx: &RequestContext,
) -> anyhow::Result<ImageLayerWriter> {
Ok(Self {
inner: Some(
ImageLayerWriterInner::new(conf, timeline_id, tenant_shard_id, key_range, lsn)
ImageLayerWriterInner::new(conf, timeline_id, tenant_shard_id, key_range, lsn, ctx)
.await?,
),
})

View File

@@ -473,10 +473,11 @@ impl InMemoryLayer {
timeline_id: TimelineId,
tenant_shard_id: TenantShardId,
start_lsn: Lsn,
ctx: &RequestContext,
) -> Result<InMemoryLayer> {
trace!("initializing new empty InMemoryLayer for writing on timeline {timeline_id} at {start_lsn}");
let file = EphemeralFile::create(conf, tenant_shard_id, timeline_id).await?;
let file = EphemeralFile::create(conf, tenant_shard_id, timeline_id, ctx).await?;
let key = InMemoryLayerFileId(file.page_cache_file_id());
Ok(InMemoryLayer {
@@ -642,6 +643,7 @@ impl InMemoryLayer {
self.tenant_shard_id,
Key::MIN,
self.start_lsn..end_lsn,
ctx,
)
.await?;

View File

@@ -129,19 +129,16 @@ pub(crate) fn local_layer_path(
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
layer_file_name: &LayerName,
_generation: &Generation,
generation: &Generation,
) -> Utf8PathBuf {
let timeline_path = conf.timeline_path(tenant_shard_id, timeline_id);
timeline_path.join(layer_file_name.to_string())
// TODO: switch to enabling new-style layer paths after next release
// if generation.is_none() {
// // Without a generation, we may only use legacy path style
// timeline_path.join(layer_file_name.to_string())
// } else {
// timeline_path.join(format!("{}-v1{}", layer_file_name, generation.get_suffix()))
// }
if generation.is_none() {
// Without a generation, we may only use legacy path style
timeline_path.join(layer_file_name.to_string())
} else {
timeline_path.join(format!("{}-v1{}", layer_file_name, generation.get_suffix()))
}
}
impl Layer {

View File

@@ -41,7 +41,7 @@ static CONCURRENT_BACKGROUND_TASKS: once_cell::sync::Lazy<tokio::sync::Semaphore
tokio::sync::Semaphore::new(permits)
});
#[derive(Debug, PartialEq, Eq, Clone, Copy, strum_macros::IntoStaticStr)]
#[derive(Debug, PartialEq, Eq, Clone, Copy, strum_macros::IntoStaticStr, enum_map::Enum)]
#[strum(serialize_all = "snake_case")]
pub(crate) enum BackgroundLoopKind {
Compaction,
@@ -57,19 +57,25 @@ pub(crate) enum BackgroundLoopKind {
impl BackgroundLoopKind {
fn as_static_str(&self) -> &'static str {
let s: &'static str = self.into();
s
self.into()
}
}
static PERMIT_GAUGES: once_cell::sync::Lazy<
enum_map::EnumMap<BackgroundLoopKind, metrics::IntCounterPair>,
> = once_cell::sync::Lazy::new(|| {
enum_map::EnumMap::from_array(std::array::from_fn(|i| {
let kind = <BackgroundLoopKind as enum_map::Enum>::from_usize(i);
crate::metrics::BACKGROUND_LOOP_SEMAPHORE_WAIT_GAUGE.with_label_values(&[kind.into()])
}))
});
/// Cancellation safe.
pub(crate) async fn concurrent_background_tasks_rate_limit_permit(
loop_kind: BackgroundLoopKind,
_ctx: &RequestContext,
) -> tokio::sync::SemaphorePermit<'static> {
let _guard = crate::metrics::BACKGROUND_LOOP_SEMAPHORE_WAIT_GAUGE
.with_label_values(&[loop_kind.as_static_str()])
.guard();
let _guard = PERMIT_GAUGES[loop_kind].guard();
pausable_failpoint!(
"initial-size-calculation-permit-pause",

View File

@@ -61,9 +61,12 @@ use std::{
};
use crate::tenant::timeline::init::LocalLayerFileMetadata;
use crate::tenant::{
layer_map::{LayerMap, SearchResult},
metadata::TimelineMetadata,
use crate::{
aux_file::AuxFileSizeEstimator,
tenant::{
layer_map::{LayerMap, SearchResult},
metadata::TimelineMetadata,
},
};
use crate::{
context::{DownloadBehavior, RequestContext},
@@ -409,6 +412,8 @@ pub struct Timeline {
/// Keep aux directory cache to avoid it's reconstruction on each update
pub(crate) aux_files: tokio::sync::Mutex<AuxFilesState>,
pub(crate) aux_file_size_estimator: AuxFileSizeEstimator,
}
pub struct WalReceiverInfo {
@@ -2161,6 +2166,16 @@ impl Timeline {
};
Arc::new_cyclic(|myself| {
let metrics = TimelineMetrics::new(
&tenant_shard_id,
&timeline_id,
crate::metrics::EvictionsWithLowResidenceDurationBuilder::new(
"mtime",
evictions_low_residence_duration_metric_threshold,
),
);
let aux_file_metrics = metrics.aux_file_size_gauge.clone();
let mut result = Timeline {
conf,
tenant_conf,
@@ -2192,14 +2207,7 @@ impl Timeline {
ancestor_timeline: ancestor,
ancestor_lsn: metadata.ancestor_lsn(),
metrics: TimelineMetrics::new(
&tenant_shard_id,
&timeline_id,
crate::metrics::EvictionsWithLowResidenceDurationBuilder::new(
"mtime",
evictions_low_residence_duration_metric_threshold,
),
),
metrics,
query_metrics: crate::metrics::SmgrQueryTimePerTimeline::new(
&tenant_shard_id,
@@ -2263,6 +2271,8 @@ impl Timeline {
dir: None,
n_deltas: 0,
}),
aux_file_size_estimator: AuxFileSizeEstimator::new(aux_file_metrics),
};
result.repartition_threshold =
result.get_checkpoint_distance() / REPARTITION_FREQ_IN_CHECKPOINT_DISTANCE;
@@ -2621,6 +2631,7 @@ impl Timeline {
// Don't make noise.
} else {
warn!("unexpected: cancel_wait_for_background_loop_concurrency_limit_semaphore not set, priority-boosting of logical size calculation will not work");
debug_assert!(false);
}
}
};
@@ -3560,7 +3571,11 @@ impl Timeline {
///
/// Get a handle to the latest layer for appending.
///
async fn get_layer_for_write(&self, lsn: Lsn) -> anyhow::Result<Arc<InMemoryLayer>> {
async fn get_layer_for_write(
&self,
lsn: Lsn,
ctx: &RequestContext,
) -> anyhow::Result<Arc<InMemoryLayer>> {
let mut guard = self.layers.write().await;
let layer = guard
.get_layer_for_write(
@@ -3569,6 +3584,7 @@ impl Timeline {
self.conf,
self.timeline_id,
self.tenant_shard_id,
ctx,
)
.await?;
Ok(layer)
@@ -3833,8 +3849,8 @@ impl Timeline {
);
self.create_delta_layer(
&frozen_layer,
ctx,
Some(metadata_keyspace.0.ranges[0].clone()),
ctx,
)
.await?
} else {
@@ -3863,7 +3879,7 @@ impl Timeline {
// Normal case, write out a L0 delta layer file.
// `create_delta_layer` will not modify the layer map.
// We will remove frozen layer and add delta layer in one atomic operation later.
let Some(layer) = self.create_delta_layer(&frozen_layer, ctx, None).await? else {
let Some(layer) = self.create_delta_layer(&frozen_layer, None, ctx).await? else {
panic!("delta layer cannot be empty if no filter is applied");
};
(
@@ -3992,8 +4008,8 @@ impl Timeline {
async fn create_delta_layer(
self: &Arc<Self>,
frozen_layer: &Arc<InMemoryLayer>,
ctx: &RequestContext,
key_range: Option<Range<Key>>,
ctx: &RequestContext,
) -> anyhow::Result<Option<ResidentLayer>> {
let self_clone = Arc::clone(self);
let frozen_layer = Arc::clone(frozen_layer);
@@ -4016,6 +4032,7 @@ impl Timeline {
&self_clone
.conf
.timeline_path(&self_clone.tenant_shard_id, &self_clone.timeline_id),
&ctx,
)
.await
.fatal_err("VirtualFile::open for timeline dir fsync");
@@ -4209,6 +4226,7 @@ impl Timeline {
self.tenant_shard_id,
&img_range,
lsn,
ctx,
)
.await?;
@@ -4313,6 +4331,7 @@ impl Timeline {
&self
.conf
.timeline_path(&self.tenant_shard_id, &self.timeline_id),
ctx,
)
.await
.fatal_err("VirtualFile::open for timeline dir fsync");
@@ -4337,6 +4356,21 @@ impl Timeline {
/// this Timeline is shut down. Calling this function will cause the initial
/// logical size calculation to skip waiting for the background jobs barrier.
pub(crate) async fn await_initial_logical_size(self: Arc<Self>) {
if !self.shard_identity.is_shard_zero() {
// We don't populate logical size on shard >0: skip waiting for it.
return;
}
if self
.remote_client
.as_ref()
.map(|c| c.is_deleting())
.unwrap_or(false)
{
// The timeline was created in a deletion-resume state, we don't expect logical size to be populated
return;
}
if let Some(await_bg_cancel) = self
.current_logical_size
.cancel_wait_for_background_loop_concurrency_limit_semaphore
@@ -4348,9 +4382,10 @@ impl Timeline {
// the logical size cancellation to skip the concurrency limit semaphore.
// TODO: this is an unexpected case. We should restructure so that it
// can't happen.
tracing::info!(
tracing::warn!(
"await_initial_logical_size: can't get semaphore cancel token, skipping"
);
debug_assert!(false);
}
tokio::select!(
@@ -4638,11 +4673,9 @@ impl Timeline {
pub(super) async fn gc(&self) -> anyhow::Result<GcResult> {
// this is most likely the background tasks, but it might be the spawned task from
// immediate_gc
let cancel = crate::task_mgr::shutdown_token();
let _g = tokio::select! {
guard = self.gc_lock.lock() => guard,
_ = self.cancel.cancelled() => return Ok(GcResult::default()),
_ = cancel.cancelled() => return Ok(GcResult::default()),
};
let timer = self.metrics.garbage_collect_histo.start_timer();
@@ -5214,7 +5247,7 @@ impl<'a> TimelineWriter<'a> {
let buf_size: u64 = buf.len().try_into().expect("oversized value buf");
let action = self.get_open_layer_action(lsn, buf_size);
let layer = self.handle_open_layer_action(lsn, action).await?;
let layer = self.handle_open_layer_action(lsn, action, ctx).await?;
let res = layer.put_value(key, lsn, &buf, ctx).await;
if res.is_ok() {
@@ -5237,14 +5270,15 @@ impl<'a> TimelineWriter<'a> {
&mut self,
at: Lsn,
action: OpenLayerAction,
ctx: &RequestContext,
) -> anyhow::Result<&Arc<InMemoryLayer>> {
match action {
OpenLayerAction::Roll => {
let freeze_at = self.write_guard.as_ref().unwrap().max_lsn.unwrap();
self.roll_layer(freeze_at).await?;
self.open_layer(at).await?;
self.open_layer(at, ctx).await?;
}
OpenLayerAction::Open => self.open_layer(at).await?,
OpenLayerAction::Open => self.open_layer(at, ctx).await?,
OpenLayerAction::None => {
assert!(self.write_guard.is_some());
}
@@ -5253,8 +5287,8 @@ impl<'a> TimelineWriter<'a> {
Ok(&self.write_guard.as_ref().unwrap().open_layer)
}
async fn open_layer(&mut self, at: Lsn) -> anyhow::Result<()> {
let layer = self.tl.get_layer_for_write(at).await?;
async fn open_layer(&mut self, at: Lsn, ctx: &RequestContext) -> anyhow::Result<()> {
let layer = self.tl.get_layer_for_write(at, ctx).await?;
let initial_size = layer.size().await?;
let last_freeze_at = self.last_freeze_at.load();
@@ -5331,10 +5365,14 @@ impl<'a> TimelineWriter<'a> {
Ok(())
}
pub(crate) async fn delete_batch(&mut self, batch: &[(Range<Key>, Lsn)]) -> anyhow::Result<()> {
pub(crate) async fn delete_batch(
&mut self,
batch: &[(Range<Key>, Lsn)],
ctx: &RequestContext,
) -> anyhow::Result<()> {
if let Some((_, lsn)) = batch.first() {
let action = self.get_open_layer_action(*lsn, 0);
let layer = self.handle_open_layer_action(*lsn, action).await?;
let layer = self.handle_open_layer_action(*lsn, action, ctx).await?;
layer.put_tombstones(batch).await?;
}

View File

@@ -700,6 +700,7 @@ impl Timeline {
debug!("Create new layer {}..{}", lsn_range.start, lsn_range.end);
lsn_range.clone()
},
ctx,
)
.await?,
);
@@ -755,6 +756,7 @@ impl Timeline {
&self
.conf
.timeline_path(&self.tenant_shard_id, &self.timeline_id),
ctx,
)
.await
.fatal_err("VirtualFile::open for timeline dir fsync");
@@ -1093,6 +1095,7 @@ impl CompactionJobExecutor for TimelineAdaptor {
self.timeline.tenant_shard_id,
key_range.start,
lsn_range.clone(),
ctx,
)
.await?;
@@ -1167,6 +1170,7 @@ impl TimelineAdaptor {
self.timeline.tenant_shard_id,
key_range,
lsn,
ctx,
)
.await?;

View File

@@ -215,6 +215,7 @@ pub(super) async fn prepare(
&detached
.conf
.timeline_path(&detached.tenant_shard_id, &detached.timeline_id),
ctx,
)
.await
.fatal_err("VirtualFile::open for timeline dir fsync");
@@ -339,6 +340,7 @@ async fn copy_lsn_prefix(
target_timeline.tenant_shard_id,
layer.layer_desc().key_range.start,
layer.layer_desc().lsn_range.start..end_lsn,
ctx,
)
.await
.map_err(CopyDeltaPrefix)?;

View File

@@ -9,6 +9,7 @@ use utils::{
use crate::{
config::PageServerConf,
context::RequestContext,
metrics::TimelineMetrics,
tenant::{
layer_map::{BatchedUpdates, LayerMap},
@@ -69,6 +70,7 @@ impl LayerManager {
conf: &'static PageServerConf,
timeline_id: TimelineId,
tenant_shard_id: TenantShardId,
ctx: &RequestContext,
) -> Result<Arc<InMemoryLayer>> {
ensure!(lsn.is_aligned());
@@ -105,7 +107,7 @@ impl LayerManager {
);
let new_layer =
InMemoryLayer::create(conf, timeline_id, tenant_shard_id, start_lsn).await?;
InMemoryLayer::create(conf, timeline_id, tenant_shard_id, start_lsn, ctx).await?;
let layer = Arc::new(new_layer);
self.layer_map.open_layer = Some(layer.clone());

View File

@@ -23,6 +23,7 @@ use pageserver_api::key::Key;
use utils::lsn::Lsn;
use utils::vec_map::VecMap;
use crate::context::RequestContext;
use crate::virtual_file::VirtualFile;
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
@@ -285,6 +286,7 @@ impl<'a> VectoredBlobReader<'a> {
&self,
read: &VectoredRead,
buf: BytesMut,
ctx: &RequestContext,
) -> Result<VectoredBlobsBuf, std::io::Error> {
assert!(read.size() > 0);
assert!(
@@ -295,7 +297,7 @@ impl<'a> VectoredBlobReader<'a> {
);
let buf = self
.file
.read_exact_at_n(buf, read.start, read.size())
.read_exact_at_n(buf, read.start, read.size(), ctx)
.await?;
let blobs_at = read.blobs_at.as_slice();

View File

@@ -344,16 +344,23 @@ macro_rules! with_file {
impl VirtualFile {
/// Open a file in read-only mode. Like File::open.
pub async fn open(path: &Utf8Path) -> Result<VirtualFile, std::io::Error> {
Self::open_with_options(path, OpenOptions::new().read(true)).await
pub async fn open(
path: &Utf8Path,
ctx: &RequestContext,
) -> Result<VirtualFile, std::io::Error> {
Self::open_with_options(path, OpenOptions::new().read(true), ctx).await
}
/// Create a new file for writing. If the file exists, it will be truncated.
/// Like File::create.
pub async fn create(path: &Utf8Path) -> Result<VirtualFile, std::io::Error> {
pub async fn create(
path: &Utf8Path,
ctx: &RequestContext,
) -> Result<VirtualFile, std::io::Error> {
Self::open_with_options(
path,
OpenOptions::new().write(true).create(true).truncate(true),
ctx,
)
.await
}
@@ -366,6 +373,7 @@ impl VirtualFile {
pub async fn open_with_options(
path: &Utf8Path,
open_options: &OpenOptions,
_ctx: &RequestContext, /* TODO: carry a pointer to the metrics in the RequestContext instead of the parsing https://github.com/neondatabase/neon/issues/6107 */
) -> Result<VirtualFile, std::io::Error> {
let path_str = path.to_string();
let parts = path_str.split('/').collect::<Vec<&str>>();
@@ -576,21 +584,34 @@ impl VirtualFile {
Ok(self.pos)
}
pub async fn read_exact_at<B>(&self, buf: B, offset: u64) -> Result<B, Error>
pub async fn read_exact_at<B>(
&self,
buf: B,
offset: u64,
ctx: &RequestContext,
) -> Result<B, Error>
where
B: IoBufMut + Send,
{
let (buf, res) =
read_exact_at_impl(buf, offset, None, |buf, offset| self.read_at(buf, offset)).await;
let (buf, res) = read_exact_at_impl(buf, offset, None, |buf, offset| {
self.read_at(buf, offset, ctx)
})
.await;
res.map(|()| buf)
}
pub async fn read_exact_at_n<B>(&self, buf: B, offset: u64, count: usize) -> Result<B, Error>
pub async fn read_exact_at_n<B>(
&self,
buf: B,
offset: u64,
count: usize,
ctx: &RequestContext,
) -> Result<B, Error>
where
B: IoBufMut + Send,
{
let (buf, res) = read_exact_at_impl(buf, offset, Some(count), |buf, offset| {
self.read_at(buf, offset)
self.read_at(buf, offset, ctx)
})
.await;
res.map(|()| buf)
@@ -601,12 +622,13 @@ impl VirtualFile {
&self,
page: PageWriteGuard<'static>,
offset: u64,
ctx: &RequestContext,
) -> Result<PageWriteGuard<'static>, Error> {
let buf = PageWriteGuardBuf {
page,
init_up_to: 0,
};
let res = self.read_exact_at(buf, offset).await;
let res = self.read_exact_at(buf, offset, ctx).await;
res.map(|PageWriteGuardBuf { page, .. }| page)
.map_err(|e| Error::new(ErrorKind::Other, e))
}
@@ -699,7 +721,12 @@ impl VirtualFile {
(buf, Ok(n))
}
pub(crate) async fn read_at<B>(&self, buf: B, offset: u64) -> (B, Result<usize, Error>)
pub(crate) async fn read_at<B>(
&self,
buf: B,
offset: u64,
_ctx: &RequestContext, /* TODO: use for metrics: https://github.com/neondatabase/neon/issues/6107 */
) -> (B, Result<usize, Error>)
where
B: tokio_epoll_uring::BoundedBufMut + Send,
{
@@ -1020,20 +1047,21 @@ impl VirtualFile {
pub(crate) async fn read_blk(
&self,
blknum: u32,
ctx: &RequestContext,
) -> Result<crate::tenant::block_io::BlockLease<'_>, std::io::Error> {
use crate::page_cache::PAGE_SZ;
let buf = vec![0; PAGE_SZ];
let buf = self
.read_exact_at(buf, blknum as u64 * (PAGE_SZ as u64))
.read_exact_at(buf, blknum as u64 * (PAGE_SZ as u64), ctx)
.await?;
Ok(crate::tenant::block_io::BlockLease::Vec(buf))
}
async fn read_to_end(&mut self, buf: &mut Vec<u8>) -> Result<(), Error> {
async fn read_to_end(&mut self, buf: &mut Vec<u8>, ctx: &RequestContext) -> Result<(), Error> {
let mut tmp = vec![0; 128];
loop {
let res;
(tmp, res) = self.read_at(tmp, self.pos).await;
(tmp, res) = self.read_at(tmp, self.pos, ctx).await;
match res {
Ok(0) => return Ok(()),
Ok(n) => {
@@ -1159,7 +1187,6 @@ mod tests {
use rand::seq::SliceRandom;
use rand::thread_rng;
use rand::Rng;
use std::future::Future;
use std::io::Write;
use std::os::unix::fs::FileExt;
use std::sync::Arc;
@@ -1176,9 +1203,14 @@ mod tests {
}
impl MaybeVirtualFile {
async fn read_exact_at(&self, mut buf: Vec<u8>, offset: u64) -> Result<Vec<u8>, Error> {
async fn read_exact_at(
&self,
mut buf: Vec<u8>,
offset: u64,
ctx: &RequestContext,
) -> Result<Vec<u8>, Error> {
match self {
MaybeVirtualFile::VirtualFile(file) => file.read_exact_at(buf, offset).await,
MaybeVirtualFile::VirtualFile(file) => file.read_exact_at(buf, offset, ctx).await,
MaybeVirtualFile::File(file) => file.read_exact_at(&mut buf, offset).map(|()| buf),
}
}
@@ -1230,13 +1262,13 @@ mod tests {
// Helper function to slurp contents of a file, starting at the current position,
// into a string
async fn read_string(&mut self) -> Result<String, Error> {
async fn read_string(&mut self, ctx: &RequestContext) -> Result<String, Error> {
use std::io::Read;
let mut buf = String::new();
match self {
MaybeVirtualFile::VirtualFile(file) => {
let mut buf = Vec::new();
file.read_to_end(&mut buf).await?;
file.read_to_end(&mut buf, ctx).await?;
return Ok(String::from_utf8(buf).unwrap());
}
MaybeVirtualFile::File(file) => {
@@ -1247,9 +1279,14 @@ mod tests {
}
// Helper function to slurp a portion of a file into a string
async fn read_string_at(&mut self, pos: u64, len: usize) -> Result<String, Error> {
async fn read_string_at(
&mut self,
pos: u64,
len: usize,
ctx: &RequestContext,
) -> Result<String, Error> {
let buf = vec![0; len];
let buf = self.read_exact_at(buf, pos).await?;
let buf = self.read_exact_at(buf, pos, ctx).await?;
Ok(String::from_utf8(buf).unwrap())
}
}
@@ -1263,73 +1300,101 @@ mod tests {
// results with VirtualFiles as with native Files. (Except that with
// native files, you will run out of file descriptors if the ulimit
// is low enough.)
test_files("virtual_files", |path, open_options| async move {
let vf = VirtualFile::open_with_options(&path, &open_options).await?;
Ok(MaybeVirtualFile::VirtualFile(vf))
})
.await
struct A;
impl Adapter for A {
async fn open(
path: Utf8PathBuf,
opts: OpenOptions,
ctx: &RequestContext,
) -> Result<MaybeVirtualFile, anyhow::Error> {
let vf = VirtualFile::open_with_options(&path, &opts, ctx).await?;
Ok(MaybeVirtualFile::VirtualFile(vf))
}
}
test_files::<A>("virtual_files").await
}
#[tokio::test]
async fn test_physical_files() -> anyhow::Result<()> {
test_files("physical_files", |path, open_options| async move {
Ok(MaybeVirtualFile::File({
let owned_fd = open_options.open(path.as_std_path()).await?;
File::from(owned_fd)
}))
})
.await
struct B;
impl Adapter for B {
async fn open(
path: Utf8PathBuf,
opts: OpenOptions,
_ctx: &RequestContext,
) -> Result<MaybeVirtualFile, anyhow::Error> {
Ok(MaybeVirtualFile::File({
let owned_fd = opts.open(path.as_std_path()).await?;
File::from(owned_fd)
}))
}
}
test_files::<B>("physical_files").await
}
async fn test_files<OF, FT>(testname: &str, openfunc: OF) -> anyhow::Result<()>
/// This is essentially a closure which returns a MaybeVirtualFile, but because rust edition
/// 2024 is not yet out with new lifetime capture or outlives rules, this is a async function
/// in trait which benefits from the new lifetime capture rules already.
trait Adapter {
async fn open(
path: Utf8PathBuf,
opts: OpenOptions,
ctx: &RequestContext,
) -> Result<MaybeVirtualFile, anyhow::Error>;
}
async fn test_files<A>(testname: &str) -> anyhow::Result<()>
where
OF: Fn(Utf8PathBuf, OpenOptions) -> FT,
FT: Future<Output = Result<MaybeVirtualFile, std::io::Error>>,
A: Adapter,
{
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let testdir = crate::config::PageServerConf::test_repo_dir(testname);
std::fs::create_dir_all(&testdir)?;
let path_a = testdir.join("file_a");
let mut file_a = openfunc(
let mut file_a = A::open(
path_a.clone(),
OpenOptions::new()
.write(true)
.create(true)
.truncate(true)
.to_owned(),
&ctx,
)
.await?;
file_a.write_all(b"foobar".to_vec(), &ctx).await?;
// cannot read from a file opened in write-only mode
let _ = file_a.read_string().await.unwrap_err();
let _ = file_a.read_string(&ctx).await.unwrap_err();
// Close the file and re-open for reading
let mut file_a = openfunc(path_a, OpenOptions::new().read(true).to_owned()).await?;
let mut file_a = A::open(path_a, OpenOptions::new().read(true).to_owned(), &ctx).await?;
// cannot write to a file opened in read-only mode
let _ = file_a.write_all(b"bar".to_vec(), &ctx).await.unwrap_err();
// Try simple read
assert_eq!("foobar", file_a.read_string().await?);
assert_eq!("foobar", file_a.read_string(&ctx).await?);
// It's positioned at the EOF now.
assert_eq!("", file_a.read_string().await?);
assert_eq!("", file_a.read_string(&ctx).await?);
// Test seeks.
assert_eq!(file_a.seek(SeekFrom::Start(1)).await?, 1);
assert_eq!("oobar", file_a.read_string().await?);
assert_eq!("oobar", file_a.read_string(&ctx).await?);
assert_eq!(file_a.seek(SeekFrom::End(-2)).await?, 4);
assert_eq!("ar", file_a.read_string().await?);
assert_eq!("ar", file_a.read_string(&ctx).await?);
assert_eq!(file_a.seek(SeekFrom::Start(1)).await?, 1);
assert_eq!(file_a.seek(SeekFrom::Current(2)).await?, 3);
assert_eq!("bar", file_a.read_string().await?);
assert_eq!("bar", file_a.read_string(&ctx).await?);
assert_eq!(file_a.seek(SeekFrom::Current(-5)).await?, 1);
assert_eq!("oobar", file_a.read_string().await?);
assert_eq!("oobar", file_a.read_string(&ctx).await?);
// Test erroneous seeks to before byte 0
file_a.seek(SeekFrom::End(-7)).await.unwrap_err();
@@ -1337,11 +1402,11 @@ mod tests {
file_a.seek(SeekFrom::Current(-2)).await.unwrap_err();
// the erroneous seek should have left the position unchanged
assert_eq!("oobar", file_a.read_string().await?);
assert_eq!("oobar", file_a.read_string(&ctx).await?);
// Create another test file, and try FileExt functions on it.
let path_b = testdir.join("file_b");
let mut file_b = openfunc(
let mut file_b = A::open(
path_b.clone(),
OpenOptions::new()
.read(true)
@@ -1349,12 +1414,13 @@ mod tests {
.create(true)
.truncate(true)
.to_owned(),
&ctx,
)
.await?;
file_b.write_all_at(b"BAR".to_vec(), 3, &ctx).await?;
file_b.write_all_at(b"FOO".to_vec(), 0, &ctx).await?;
assert_eq!(file_b.read_string_at(2, 3).await?, "OBA");
assert_eq!(file_b.read_string_at(2, 3, &ctx).await?, "OBA");
// Open a lot of files, enough to cause some evictions. (Or to be precise,
// open the same file many times. The effect is the same.)
@@ -1364,9 +1430,13 @@ mod tests {
let mut vfiles = Vec::new();
for _ in 0..100 {
let mut vfile =
openfunc(path_b.clone(), OpenOptions::new().read(true).to_owned()).await?;
assert_eq!("FOOBAR", vfile.read_string().await?);
let mut vfile = A::open(
path_b.clone(),
OpenOptions::new().read(true).to_owned(),
&ctx,
)
.await?;
assert_eq!("FOOBAR", vfile.read_string(&ctx).await?);
vfiles.push(vfile);
}
@@ -1375,13 +1445,13 @@ mod tests {
// The underlying file descriptor for 'file_a' should be closed now. Try to read
// from it again. We left the file positioned at offset 1 above.
assert_eq!("oobar", file_a.read_string().await?);
assert_eq!("oobar", file_a.read_string(&ctx).await?);
// Check that all the other FDs still work too. Use them in random order for
// good measure.
vfiles.as_mut_slice().shuffle(&mut thread_rng());
for vfile in vfiles.iter_mut() {
assert_eq!("OOBAR", vfile.read_string_at(1, 5).await?);
assert_eq!("OOBAR", vfile.read_string_at(1, 5, &ctx).await?);
}
Ok(())
@@ -1397,6 +1467,7 @@ mod tests {
const THREADS: usize = 100;
const SAMPLE: [u8; SIZE] = [0xADu8; SIZE];
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let testdir = crate::config::PageServerConf::test_repo_dir("vfile_concurrency");
std::fs::create_dir_all(&testdir)?;
@@ -1410,8 +1481,12 @@ mod tests {
// Open the file many times.
let mut files = Vec::new();
for _ in 0..VIRTUAL_FILES {
let f = VirtualFile::open_with_options(&test_file_path, OpenOptions::new().read(true))
.await?;
let f = VirtualFile::open_with_options(
&test_file_path,
OpenOptions::new().read(true),
&ctx,
)
.await?;
files.push(f);
}
let files = Arc::new(files);
@@ -1425,12 +1500,13 @@ mod tests {
let mut hdls = Vec::new();
for _threadno in 0..THREADS {
let files = files.clone();
let ctx = ctx.detached_child(TaskKind::UnitTest, DownloadBehavior::Error);
let hdl = rt.spawn(async move {
let mut buf = vec![0u8; SIZE];
let mut rng = rand::rngs::OsRng;
for _ in 1..1000 {
let f = &files[rng.gen_range(0..files.len())];
buf = f.read_exact_at(buf, 0).await.unwrap();
buf = f.read_exact_at(buf, 0, &ctx).await.unwrap();
assert!(buf == SAMPLE);
}
});
@@ -1446,6 +1522,7 @@ mod tests {
#[tokio::test]
async fn test_atomic_overwrite_basic() {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let testdir = crate::config::PageServerConf::test_repo_dir("test_atomic_overwrite_basic");
std::fs::create_dir_all(&testdir).unwrap();
@@ -1455,8 +1532,8 @@ mod tests {
VirtualFile::crashsafe_overwrite(path.clone(), tmp_path.clone(), b"foo".to_vec())
.await
.unwrap();
let mut file = MaybeVirtualFile::from(VirtualFile::open(&path).await.unwrap());
let post = file.read_string().await.unwrap();
let mut file = MaybeVirtualFile::from(VirtualFile::open(&path, &ctx).await.unwrap());
let post = file.read_string(&ctx).await.unwrap();
assert_eq!(post, "foo");
assert!(!tmp_path.exists());
drop(file);
@@ -1464,8 +1541,8 @@ mod tests {
VirtualFile::crashsafe_overwrite(path.clone(), tmp_path.clone(), b"bar".to_vec())
.await
.unwrap();
let mut file = MaybeVirtualFile::from(VirtualFile::open(&path).await.unwrap());
let post = file.read_string().await.unwrap();
let mut file = MaybeVirtualFile::from(VirtualFile::open(&path, &ctx).await.unwrap());
let post = file.read_string(&ctx).await.unwrap();
assert_eq!(post, "bar");
assert!(!tmp_path.exists());
drop(file);
@@ -1473,6 +1550,7 @@ mod tests {
#[tokio::test]
async fn test_atomic_overwrite_preexisting_tmp() {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let testdir =
crate::config::PageServerConf::test_repo_dir("test_atomic_overwrite_preexisting_tmp");
std::fs::create_dir_all(&testdir).unwrap();
@@ -1487,8 +1565,8 @@ mod tests {
.await
.unwrap();
let mut file = MaybeVirtualFile::from(VirtualFile::open(&path).await.unwrap());
let post = file.read_string().await.unwrap();
let mut file = MaybeVirtualFile::from(VirtualFile::open(&path, &ctx).await.unwrap());
let post = file.read_string(&ctx).await.unwrap();
assert_eq!(post, "foo");
assert!(!tmp_path.exists());
drop(file);

View File

@@ -49,7 +49,7 @@ char *neon_auth_token;
int readahead_buffer_size = 128;
int flush_every_n_requests = 8;
int neon_protocol_version = 2;
int neon_protocol_version = 1;
static int n_reconnect_attempts = 0;
static int max_reconnect_attempts = 60;
@@ -860,7 +860,7 @@ pg_init_libpagestore(void)
"Version of compute<->page server protocol",
NULL,
&neon_protocol_version,
2, /* use protocol version 2 */
1, /* default to old protocol for now */
1, /* min */
2, /* max */
PGC_SU_BACKEND,

View File

@@ -237,18 +237,50 @@ extern void neon_zeroextend(SMgrRelation reln, ForkNumber forknum,
extern bool neon_prefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
/*
* LSN values associated with each request to the pageserver
*/
typedef struct
{
/*
* 'request_lsn' is the main value that determines which page version to
* fetch.
*/
XLogRecPtr request_lsn;
/*
* A hint to the pageserver that the requested page hasn't been modified
* between this LSN and 'request_lsn'. That allows the pageserver to
* return the page faster, without waiting for 'request_lsn' to arrive in
* the pageserver, as long as 'not_modified_since' has arrived.
*/
XLogRecPtr not_modified_since;
/*
* 'effective_request_lsn' is not included in the request that's sent to
* the pageserver, but is used to keep track of the latest LSN of when the
* request was made. In a standby server, this is always the same as the
* 'request_lsn', but in the primary we use UINT64_MAX as the
* 'request_lsn' to request the latest page version, so we need this
* separate field to remember that latest LSN was when the request was
* made. It's needed to manage prefetch request, to verify if the response
* to a prefetched request is still valid.
*/
XLogRecPtr effective_request_lsn;
} neon_request_lsns;
#if PG_MAJORVERSION_NUM < 16
extern void neon_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
extern PGDLLEXPORT void neon_read_at_lsn(NRelFileInfo rnode, ForkNumber forkNum, BlockNumber blkno,
XLogRecPtr request_lsn, XLogRecPtr not_modified_since, char *buffer);
neon_request_lsns request_lsns, char *buffer);
extern void neon_write(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
#else
extern void neon_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void *buffer);
extern PGDLLEXPORT void neon_read_at_lsn(NRelFileInfo rnode, ForkNumber forkNum, BlockNumber blkno,
XLogRecPtr request_lsn, XLogRecPtr not_modified_since, void *buffer);
neon_request_lsns request_lsns, void *buffer);
extern void neon_write(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
#endif

View File

@@ -168,8 +168,7 @@ typedef enum PrefetchStatus
typedef struct PrefetchRequest
{
BufferTag buftag; /* must be first entry in the struct */
XLogRecPtr request_lsn;
XLogRecPtr not_modified_since;
neon_request_lsns request_lsns;
NeonResponse *response; /* may be null */
PrefetchStatus status;
shardno_t shard_no;
@@ -271,16 +270,15 @@ static PrefetchState *MyPState;
static bool compact_prefetch_buffers(void);
static void consume_prefetch_responses(void);
static uint64 prefetch_register_buffer(BufferTag tag, XLogRecPtr *force_request_lsn, XLogRecPtr *force_not_modified_since);
static uint64 prefetch_register_buffer(BufferTag tag, neon_request_lsns *force_request_lsns);
static bool prefetch_read(PrefetchRequest *slot);
static void prefetch_do_request(PrefetchRequest *slot, XLogRecPtr *force_request_lsn, XLogRecPtr *force_not_modified_since);
static void prefetch_do_request(PrefetchRequest *slot, neon_request_lsns *force_request_lsns);
static bool prefetch_wait_for(uint64 ring_index);
static void prefetch_cleanup_trailing_unused(void);
static inline void prefetch_set_unused(uint64 ring_index);
static void neon_get_request_lsn(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno,
XLogRecPtr *request_lsn, XLogRecPtr *not_modified_since);
static bool neon_prefetch_response_usable(XLogRecPtr request_lsn, XLogRecPtr not_modified_since,
static neon_request_lsns neon_get_request_lsns(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno);
static bool neon_prefetch_response_usable(neon_request_lsns request_lsns,
PrefetchRequest *slot);
static bool
@@ -338,8 +336,7 @@ compact_prefetch_buffers(void)
target_slot->shard_no = source_slot->shard_no;
target_slot->status = source_slot->status;
target_slot->response = source_slot->response;
target_slot->request_lsn = source_slot->request_lsn;
target_slot->not_modified_since = source_slot->not_modified_since;
target_slot->request_lsns = source_slot->request_lsns;
target_slot->my_ring_index = empty_ring_index;
prfh_delete(MyPState->prf_hash, source_slot);
@@ -358,8 +355,9 @@ compact_prefetch_buffers(void)
};
source_slot->response = NULL;
source_slot->my_ring_index = 0;
source_slot->request_lsn = InvalidXLogRecPtr;
source_slot->not_modified_since = InvalidXLogRecPtr;
source_slot->request_lsns = (neon_request_lsns) {
InvalidXLogRecPtr, InvalidXLogRecPtr, InvalidXLogRecPtr
};
/* update bookkeeping */
n_moved++;
@@ -689,7 +687,7 @@ prefetch_set_unused(uint64 ring_index)
* prefetch_wait_for().
*/
static void
prefetch_do_request(PrefetchRequest *slot, XLogRecPtr *force_request_lsn, XLogRecPtr *force_not_modified_since)
prefetch_do_request(PrefetchRequest *slot, neon_request_lsns *force_request_lsns)
{
bool found;
NeonGetPageRequest request = {
@@ -700,23 +698,14 @@ prefetch_do_request(PrefetchRequest *slot, XLogRecPtr *force_request_lsn, XLogRe
.blkno = slot->buftag.blockNum,
};
Assert(((force_request_lsn != NULL) == (force_not_modified_since != NULL)));
if (force_request_lsn)
{
request.req.lsn = *force_request_lsn;
request.req.not_modified_since = *force_not_modified_since;
}
if (force_request_lsns)
slot->request_lsns = *force_request_lsns;
else
{
neon_get_request_lsn(BufTagGetNRelFileInfo(slot->buftag),
slot->buftag.forkNum,
slot->buftag.blockNum,
&request.req.lsn,
&request.req.not_modified_since);
}
slot->request_lsn = request.req.lsn;
slot->not_modified_since = request.req.not_modified_since;
slot->request_lsns = neon_get_request_lsns(BufTagGetNRelFileInfo(slot->buftag),
slot->buftag.forkNum,
slot->buftag.blockNum);
request.req.lsn = slot->request_lsns.request_lsn;
request.req.not_modified_since = slot->request_lsns.not_modified_since;
Assert(slot->response == NULL);
Assert(slot->my_ring_index == MyPState->ring_unused);
@@ -742,25 +731,22 @@ prefetch_do_request(PrefetchRequest *slot, XLogRecPtr *force_request_lsn, XLogRe
*
* Register that we may want the contents of BufferTag in the near future.
*
* If force_request_lsn and force_not_modified_since are not NULL, those
* values are sent to the pageserver. If they are NULL, we utilize the
* lastWrittenLsn -infrastructure to fill them in.
* If force_request_lsns is not NULL, those values are sent to the
* pageserver. If NULL, we utilize the lastWrittenLsn -infrastructure
* to calculate the LSNs to send.
*
* NOTE: this function may indirectly update MyPState->pfs_hash; which
* invalidates any active pointers into the hash table.
*/
static uint64
prefetch_register_buffer(BufferTag tag, XLogRecPtr *force_request_lsn,
XLogRecPtr *force_not_modified_since)
prefetch_register_buffer(BufferTag tag, neon_request_lsns *force_request_lsns)
{
uint64 ring_index;
PrefetchRequest req;
PrefetchRequest *slot;
PrfHashEntry *entry;
Assert(((force_request_lsn != NULL) == (force_not_modified_since != NULL)));
/* use an intermediate PrefetchRequest struct to ensure correct alignment */
req.buftag = tag;
Retry:
@@ -781,10 +767,9 @@ Retry:
* If the caller specified a request LSN to use, only accept prefetch
* responses that satisfy that request.
*/
if (force_request_lsn)
if (force_request_lsns)
{
if (!neon_prefetch_response_usable(*force_request_lsn,
*force_not_modified_since, slot))
if (!neon_prefetch_response_usable(*force_request_lsns, slot))
{
/* Wait for the old request to finish and discard it */
if (!prefetch_wait_for(ring_index))
@@ -886,7 +871,7 @@ Retry:
slot->shard_no = get_shard_number(&tag);
slot->my_ring_index = ring_index;
prefetch_do_request(slot, force_request_lsn, force_not_modified_since);
prefetch_do_request(slot, force_request_lsns);
Assert(slot->status == PRFS_REQUESTED);
Assert(MyPState->ring_last <= ring_index &&
ring_index < MyPState->ring_unused);
@@ -1529,11 +1514,11 @@ nm_adjust_lsn(XLogRecPtr lsn)
/*
* Return LSN for requesting pages and number of blocks from page server
*/
static void
neon_get_request_lsn(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno,
XLogRecPtr *request_lsn, XLogRecPtr *not_modified_since)
static neon_request_lsns
neon_get_request_lsns(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno)
{
XLogRecPtr last_written_lsn;
neon_request_lsns result;
last_written_lsn = GetLastWrittenLSN(rinfo, forknum, blkno);
last_written_lsn = nm_adjust_lsn(last_written_lsn);
@@ -1542,12 +1527,13 @@ neon_get_request_lsn(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno,
if (RecoveryInProgress())
{
/* Request the page at the last replayed LSN. */
*request_lsn = GetXLogReplayRecPtr(NULL);
*not_modified_since = last_written_lsn;
Assert(last_written_lsn <= *request_lsn);
result.request_lsn = GetXLogReplayRecPtr(NULL);
result.not_modified_since = last_written_lsn;
result.effective_request_lsn = result.request_lsn;
Assert(last_written_lsn <= result.request_lsn);
neon_log(DEBUG1, "neon_get_request_lsn request lsn %X/%X, not_modified_since %X/%X",
LSN_FORMAT_ARGS(*request_lsn), LSN_FORMAT_ARGS(*not_modified_since));
neon_log(DEBUG1, "neon_get_request_lsns request lsn %X/%X, not_modified_since %X/%X",
LSN_FORMAT_ARGS(result.request_lsn), LSN_FORMAT_ARGS(result.not_modified_since));
}
else
{
@@ -1559,7 +1545,7 @@ neon_get_request_lsn(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno,
* must still in the buffer cache, so our request cannot concern
* those.
*/
neon_log(DEBUG1, "neon_get_request_lsn GetLastWrittenLSN lsn %X/%X ",
neon_log(DEBUG1, "neon_get_request_lsns GetLastWrittenLSN lsn %X/%X",
LSN_FORMAT_ARGS(last_written_lsn));
/*
@@ -1585,16 +1571,33 @@ neon_get_request_lsn(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno,
}
/*
* Request the latest version of the page. The most up-to-date request
* LSN we could use would be the current insert LSN, but to avoid the
* overhead of looking it up, use 'flushlsn' instead. This relies on
* the assumption that if the page was modified since the last WAL
* flush, it should still be in the buffer cache, and we wouldn't be
* requesting it.
* Request the very latest version of the page. In principle we
* want to read the page at the current insert LSN, and we could
* use that value in the request. However, there's a corner case
* with pageserver's garbage collection. If the GC horizon is
* set to a very small value, it's possible that by the time
* that the pageserver processes our request, the GC horizon has
* already moved past the LSN we calculate here. Standby servers
* always have that problem as the can always lag behind the
* primary, but for the primary we can avoid it by always
* requesting the latest page, by setting request LSN to
* UINT64_MAX.
*
* Remember the current LSN, however, so that we can later
* correctly determine if the response to the request is still
* valid. The most up-to-date LSN we could use for that purpose
* would be the current insert LSN, but to avoid the overhead of
* looking it up, use 'flushlsn' instead. This relies on the
* assumption that if the page was modified since the last WAL
* flush, it should still be in the buffer cache, and we
* wouldn't be requesting it.
*/
*request_lsn = flushlsn;
*not_modified_since = last_written_lsn;
result.request_lsn = UINT64_MAX;
result.not_modified_since = last_written_lsn;
result.effective_request_lsn = flushlsn;
}
return result;
}
/*
@@ -1604,12 +1607,16 @@ neon_get_request_lsn(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno,
* satisfy a page read now.
*/
static bool
neon_prefetch_response_usable(XLogRecPtr request_lsn, XLogRecPtr not_modified_since,
neon_prefetch_response_usable(neon_request_lsns request_lsns,
PrefetchRequest *slot)
{
/* sanity check the LSN's on the old and the new request */
Assert(request_lsn >= not_modified_since);
Assert(slot->request_lsn >= slot->not_modified_since);
Assert(request_lsns.request_lsn >= request_lsns.not_modified_since);
Assert(request_lsns.effective_request_lsn >= request_lsns.not_modified_since);
Assert(request_lsns.effective_request_lsn <= request_lsns.request_lsn);
Assert(slot->request_lsns.request_lsn >= slot->request_lsns.not_modified_since);
Assert(slot->request_lsns.effective_request_lsn >= slot->request_lsns.not_modified_since);
Assert(slot->request_lsns.effective_request_lsn <= slot->request_lsns.request_lsn);
Assert(slot->status != PRFS_UNUSED);
/*
@@ -1627,26 +1634,40 @@ neon_prefetch_response_usable(XLogRecPtr request_lsn, XLogRecPtr not_modified_si
* calculate LSNs "out of order" with each other, but the prefetch queue
* is backend-private at the moment.)
*/
if (request_lsn < slot->request_lsn || not_modified_since < slot->not_modified_since)
if (request_lsns.effective_request_lsn < slot->request_lsns.effective_request_lsn ||
request_lsns.not_modified_since < slot->request_lsns.not_modified_since)
{
ereport(LOG,
(errcode(ERRCODE_IO_ERROR),
errmsg(NEON_TAG "request with unexpected LSN after prefetch"),
errdetail("Request %X/%X not_modified_since %X/%X, prefetch %X/%X not_modified_since %X/%X)",
LSN_FORMAT_ARGS(request_lsn), LSN_FORMAT_ARGS(not_modified_since),
LSN_FORMAT_ARGS(slot->request_lsn), LSN_FORMAT_ARGS(slot->not_modified_since))));
LSN_FORMAT_ARGS(request_lsns.effective_request_lsn),
LSN_FORMAT_ARGS(request_lsns.not_modified_since),
LSN_FORMAT_ARGS(slot->request_lsns.effective_request_lsn),
LSN_FORMAT_ARGS(slot->request_lsns.not_modified_since))));
return false;
}
/*---
* Each request to the pageserver carries two LSN values:
* `not_modified_since` and `request_lsn`. The (not_modified_since,
* request_lsn] range of each request is effectively a claim that the page
* has not been modified between those LSNs. If the range of the old
* request in the queue overlaps with the new request, we know that the
* page hasn't been modified in the union of the ranges. We can use the
* response to old request to satisfy the new request in that case. For
* example:
* Each request to the pageserver has three LSN values associated with it:
* `not_modified_since`, `request_lsn`, and 'effective_request_lsn'.
* `not_modified_since` and `request_lsn` are sent to the pageserver, but
* in the primary node, we always use UINT64_MAX as the `request_lsn`, so
* we remember `effective_request_lsn` separately. In a primary,
* `effective_request_lsn` is the last flush WAL position when the request
* was sent to the pageserver. That's logically the LSN that we are
* requesting the page at, but we send UINT64_MAX to the pageserver so
* that if the GC horizon advances past that position, we still get a
* valid response instead of an error.
*
* To determine whether a response to a GetPage request issued earlier is
* still valid to satisfy a new page read, we look at the
* (not_modified_since, effective_request_lsn] range of the request. It is
* effectively a claim that the page has not been modified between those
* LSNs. If the range of the old request in the queue overlaps with the
* new request, we know that the page hasn't been modified in the union of
* the ranges. We can use the response to old request to satisfy the new
* request in that case. For example:
*
* 100 500
* Old request: +--------+
@@ -1675,9 +1696,9 @@ neon_prefetch_response_usable(XLogRecPtr request_lsn, XLogRecPtr not_modified_si
*/
/* this follows from the checks above */
Assert(request_lsn >= slot->not_modified_since);
Assert(request_lsns.effective_request_lsn >= slot->request_lsns.not_modified_since);
return not_modified_since <= slot->request_lsn;
return request_lsns.not_modified_since <= slot->request_lsns.effective_request_lsn;
}
/*
@@ -1689,8 +1710,7 @@ neon_exists(SMgrRelation reln, ForkNumber forkNum)
bool exists;
NeonResponse *resp;
BlockNumber n_blocks;
XLogRecPtr request_lsn;
XLogRecPtr not_modified_since;
neon_request_lsns request_lsns;
switch (reln->smgr_relpersistence)
{
@@ -1745,15 +1765,15 @@ neon_exists(SMgrRelation reln, ForkNumber forkNum)
return false;
}
neon_get_request_lsn(InfoFromSMgrRel(reln), forkNum, REL_METADATA_PSEUDO_BLOCKNO,
&request_lsn, &not_modified_since);
request_lsns = neon_get_request_lsns(InfoFromSMgrRel(reln), forkNum, REL_METADATA_PSEUDO_BLOCKNO);
{
NeonExistsRequest request = {
.req.tag = T_NeonExistsRequest,
.req.lsn = request_lsn,
.req.not_modified_since = not_modified_since,
.req.lsn = request_lsns.request_lsn,
.req.not_modified_since = request_lsns.not_modified_since,
.rinfo = InfoFromSMgrRel(reln),
.forknum = forkNum};
.forknum = forkNum
};
resp = page_server_request(&request);
}
@@ -1770,7 +1790,7 @@ neon_exists(SMgrRelation reln, ForkNumber forkNum)
errmsg(NEON_TAG "could not read relation existence of rel %u/%u/%u.%u from page server at lsn %X/%08X",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forkNum,
(uint32) (request_lsn >> 32), (uint32) request_lsn),
LSN_FORMAT_ARGS(request_lsns.effective_request_lsn)),
errdetail("page server returned error: %s",
((NeonErrorResponse *) resp)->message)));
break;
@@ -2135,7 +2155,7 @@ neon_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
CopyNRelFileInfoToBufTag(tag, InfoFromSMgrRel(reln));
ring_index = prefetch_register_buffer(tag, NULL, NULL);
ring_index = prefetch_register_buffer(tag, NULL);
Assert(ring_index < MyPState->ring_unused &&
MyPState->ring_last <= ring_index);
@@ -2188,10 +2208,10 @@ neon_writeback(SMgrRelation reln, ForkNumber forknum,
void
#if PG_MAJORVERSION_NUM < 16
neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
XLogRecPtr request_lsn, XLogRecPtr not_modified_since, char *buffer)
neon_request_lsns request_lsns, char *buffer)
#else
neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
XLogRecPtr request_lsn, XLogRecPtr not_modified_since, void *buffer)
neon_request_lsns request_lsns, void *buffer)
#endif
{
NeonResponse *resp;
@@ -2223,7 +2243,7 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
* value of the LwLsn cache when the entry is not found.
*/
if (RecoveryInProgress() && !(MyBackendType == B_STARTUP))
XLogWaitForReplayOf(request_lsn);
XLogWaitForReplayOf(request_lsns.request_lsn);
/*
* Try to find prefetched page in the list of received pages.
@@ -2234,7 +2254,7 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
if (entry != NULL)
{
slot = entry->slot;
if (neon_prefetch_response_usable(request_lsn, not_modified_since, slot))
if (neon_prefetch_response_usable(request_lsns, slot))
{
ring_index = slot->my_ring_index;
pgBufferUsage.prefetch.hits += 1;
@@ -2268,8 +2288,7 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
{
pgBufferUsage.prefetch.misses += 1;
ring_index = prefetch_register_buffer(buftag, &request_lsn,
&not_modified_since);
ring_index = prefetch_register_buffer(buftag, &request_lsns);
slot = GetPrfSlot(ring_index);
}
else
@@ -2310,7 +2329,7 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
slot->shard_no, blkno,
RelFileInfoFmt(rinfo),
forkNum,
(uint32) (request_lsn >> 32), (uint32) request_lsn),
LSN_FORMAT_ARGS(request_lsns.effective_request_lsn)),
errdetail("page server returned error: %s",
((NeonErrorResponse *) resp)->message)));
break;
@@ -2333,8 +2352,7 @@ neon_read(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno, char *buffer
neon_read(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno, void *buffer)
#endif
{
XLogRecPtr request_lsn;
XLogRecPtr not_modified_since;
neon_request_lsns request_lsns;
switch (reln->smgr_relpersistence)
{
@@ -2359,9 +2377,8 @@ neon_read(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno, void *buffer
return;
}
neon_get_request_lsn(InfoFromSMgrRel(reln), forkNum, blkno,
&request_lsn, &not_modified_since);
neon_read_at_lsn(InfoFromSMgrRel(reln), forkNum, blkno, request_lsn, not_modified_since, buffer);
request_lsns = neon_get_request_lsns(InfoFromSMgrRel(reln), forkNum, blkno);
neon_read_at_lsn(InfoFromSMgrRel(reln), forkNum, blkno, request_lsns, buffer);
#ifdef DEBUG_COMPARE_LOCAL
if (forkNum == MAIN_FORKNUM && IS_LOCAL_REL(reln))
@@ -2530,8 +2547,7 @@ neon_nblocks(SMgrRelation reln, ForkNumber forknum)
{
NeonResponse *resp;
BlockNumber n_blocks;
XLogRecPtr request_lsn;
XLogRecPtr not_modified_since;
neon_request_lsns request_lsns;
switch (reln->smgr_relpersistence)
{
@@ -2558,13 +2574,12 @@ neon_nblocks(SMgrRelation reln, ForkNumber forknum)
return n_blocks;
}
neon_get_request_lsn(InfoFromSMgrRel(reln), forknum, REL_METADATA_PSEUDO_BLOCKNO,
&request_lsn, &not_modified_since);
request_lsns = neon_get_request_lsns(InfoFromSMgrRel(reln), forknum, REL_METADATA_PSEUDO_BLOCKNO);
{
NeonNblocksRequest request = {
.req.tag = T_NeonNblocksRequest,
.req.lsn = request_lsn,
.req.not_modified_since = not_modified_since,
.req.lsn = request_lsns.request_lsn,
.req.not_modified_since = request_lsns.not_modified_since,
.rinfo = InfoFromSMgrRel(reln),
.forknum = forknum,
};
@@ -2584,7 +2599,7 @@ neon_nblocks(SMgrRelation reln, ForkNumber forknum)
errmsg(NEON_TAG "could not read relation size of rel %u/%u/%u.%u from page server at lsn %X/%08X",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum,
(uint32) (request_lsn >> 32), (uint32) request_lsn),
LSN_FORMAT_ARGS(request_lsns.effective_request_lsn)),
errdetail("page server returned error: %s",
((NeonErrorResponse *) resp)->message)));
break;
@@ -2595,10 +2610,10 @@ neon_nblocks(SMgrRelation reln, ForkNumber forknum)
update_cached_relsize(InfoFromSMgrRel(reln), forknum, n_blocks);
neon_log(SmgrTrace, "neon_nblocks: rel %u/%u/%u fork %u (request LSN %X/%08X): %u blocks",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum,
(uint32) (request_lsn >> 32), (uint32) request_lsn,
n_blocks);
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum,
LSN_FORMAT_ARGS(request_lsns.effective_request_lsn),
n_blocks);
pfree(resp);
return n_blocks;
@@ -2612,17 +2627,15 @@ neon_dbsize(Oid dbNode)
{
NeonResponse *resp;
int64 db_size;
XLogRecPtr request_lsn,
not_modified_since;
neon_request_lsns request_lsns;
NRelFileInfo dummy_node = {0};
neon_get_request_lsn(dummy_node, MAIN_FORKNUM, REL_METADATA_PSEUDO_BLOCKNO,
&request_lsn, &not_modified_since);
request_lsns = neon_get_request_lsns(dummy_node, MAIN_FORKNUM, REL_METADATA_PSEUDO_BLOCKNO);
{
NeonDbSizeRequest request = {
.req.tag = T_NeonDbSizeRequest,
.req.lsn = request_lsn,
.req.not_modified_since = not_modified_since,
.req.lsn = request_lsns.request_lsn,
.req.not_modified_since = request_lsns.not_modified_since,
.dbNode = dbNode,
};
@@ -2639,8 +2652,7 @@ neon_dbsize(Oid dbNode)
ereport(ERROR,
(errcode(ERRCODE_IO_ERROR),
errmsg(NEON_TAG "could not read db size of db %u from page server at lsn %X/%08X",
dbNode,
(uint32) (request_lsn >> 32), (uint32) request_lsn),
dbNode, LSN_FORMAT_ARGS(request_lsns.effective_request_lsn)),
errdetail("page server returned error: %s",
((NeonErrorResponse *) resp)->message)));
break;
@@ -2650,9 +2662,7 @@ neon_dbsize(Oid dbNode)
}
neon_log(SmgrTrace, "neon_dbsize: db %u (request LSN %X/%08X): %ld bytes",
dbNode,
(uint32) (request_lsn >> 32), (uint32) request_lsn,
db_size);
dbNode, LSN_FORMAT_ARGS(request_lsns.effective_request_lsn), db_size);
pfree(resp);
return db_size;
@@ -2897,6 +2907,10 @@ neon_read_slru_segment(SMgrRelation reln, const char* path, int segno, void* buf
XLogRecPtr request_lsn,
not_modified_since;
/*
* Compute a request LSN to use, similar to neon_get_request_lsns() but the
* logic is a bit simpler.
*/
if (RecoveryInProgress())
{
request_lsn = GetXLogReplayRecPtr(NULL);
@@ -2908,10 +2922,10 @@ neon_read_slru_segment(SMgrRelation reln, const char* path, int segno, void* buf
*/
request_lsn = GetRedoStartLsn();
}
request_lsn = nm_adjust_lsn(request_lsn);
}
else
request_lsn = GetXLogInsertRecPtr();
request_lsn = nm_adjust_lsn(request_lsn);
request_lsn = UINT64_MAX;
/*
* GetRedoStartLsn() returns LSN of basebackup. We know that the SLRU

View File

@@ -48,10 +48,10 @@ PG_FUNCTION_INFO_V1(neon_xlogflush);
*/
#if PG_MAJORVERSION_NUM < 16
typedef void (*neon_read_at_lsn_type) (NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
XLogRecPtr request_lsn, XLogRecPtr not_modified_since, char *buffer);
neon_request_lsns request_lsns, char *buffer);
#else
typedef void (*neon_read_at_lsn_type) (NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
XLogRecPtr request_lsn, XLogRecPtr not_modified_since, void *buffer);
neon_request_lsns request_lsns, void *buffer);
#endif
static neon_read_at_lsn_type neon_read_at_lsn_ptr;
@@ -298,9 +298,7 @@ get_raw_page_at_lsn(PG_FUNCTION_ARGS)
text *relname;
text *forkname;
uint32 blkno;
XLogRecPtr request_lsn;
XLogRecPtr not_modified_since;
neon_request_lsns request_lsns;
if (PG_NARGS() != 5)
elog(ERROR, "unexpected number of arguments in SQL function signature");
@@ -312,8 +310,15 @@ get_raw_page_at_lsn(PG_FUNCTION_ARGS)
forkname = PG_GETARG_TEXT_PP(1);
blkno = PG_GETARG_UINT32(2);
request_lsn = PG_ARGISNULL(3) ? GetXLogInsertRecPtr() : PG_GETARG_LSN(3);
not_modified_since = PG_ARGISNULL(4) ? request_lsn : PG_GETARG_LSN(4);
request_lsns.request_lsn = PG_ARGISNULL(3) ? GetXLogInsertRecPtr() : PG_GETARG_LSN(3);
request_lsns.not_modified_since = PG_ARGISNULL(4) ? request_lsns.request_lsn : PG_GETARG_LSN(4);
/*
* For the time being, use the same LSN for request and
* effective request LSN. If any test needed to use UINT64_MAX
* as the request LSN, we'd need to add effective_request_lsn
* as a new argument.
*/
request_lsns.effective_request_lsn = request_lsns.request_lsn;
if (!superuser())
ereport(ERROR,
@@ -367,7 +372,8 @@ get_raw_page_at_lsn(PG_FUNCTION_ARGS)
SET_VARSIZE(raw_page, BLCKSZ + VARHDRSZ);
raw_page_data = VARDATA(raw_page);
neon_read_at_lsn(InfoFromRelation(rel), forknum, blkno, request_lsn, not_modified_since, raw_page_data);
neon_read_at_lsn(InfoFromRelation(rel), forknum, blkno, request_lsns,
raw_page_data);
relation_close(rel, AccessShareLock);
@@ -413,19 +419,25 @@ get_raw_page_at_lsn_ex(PG_FUNCTION_ARGS)
ForkNumber forknum = PG_GETARG_UINT32(3);
uint32 blkno = PG_GETARG_UINT32(4);
XLogRecPtr request_lsn;
XLogRecPtr not_modified_since;
neon_request_lsns request_lsns;
/* Initialize buffer to copy to */
bytea *raw_page = (bytea *) palloc(BLCKSZ + VARHDRSZ);
request_lsn = PG_ARGISNULL(5) ? GetXLogInsertRecPtr() : PG_GETARG_LSN(5);
not_modified_since = PG_ARGISNULL(6) ? request_lsn : PG_GETARG_LSN(6);
request_lsns.request_lsn = PG_ARGISNULL(5) ? GetXLogInsertRecPtr() : PG_GETARG_LSN(5);
request_lsns.not_modified_since = PG_ARGISNULL(6) ? request_lsns.request_lsn : PG_GETARG_LSN(6);
/*
* For the time being, use the same LSN for request
* and effective request LSN. If any test needed to
* use UINT64_MAX as the request LSN, we'd need to add
* effective_request_lsn as a new argument.
*/
request_lsns.effective_request_lsn = request_lsns.request_lsn;
SET_VARSIZE(raw_page, BLCKSZ + VARHDRSZ);
raw_page_data = VARDATA(raw_page);
neon_read_at_lsn(rinfo, forknum, blkno, request_lsn, not_modified_since, raw_page_data);
neon_read_at_lsn(rinfo, forknum, blkno, request_lsns, raw_page_data);
PG_RETURN_BYTEA_P(raw_page);
}
}

18
scripts/check_allowed_errors.sh Executable file
View File

@@ -0,0 +1,18 @@
#!/usr/bin/env bash
set -eu
HELPER_DIR="$(dirname "${BASH_SOURCE[0]}")"
SCRIPT="test_runner/fixtures/pageserver/allowed_errors.py"
# first run to understand all of the errors:
#
# example: ./scripts/check_allowed_errors.sh -i - < pageserver.log
# example: ./scripts/check_allowed_errors.sh -i pageserver.log
#
# then edit the test local allowed_errors to the
# test_runner/fixtures/pageserver/allowed_errors.py, then re-run to make sure
# they are handled.
#
# finally revert any local changes to allowed_errors.py.
poetry run python3 "$HELPER_DIR/../$SCRIPT" $*

View File

@@ -5,10 +5,11 @@ import json
import logging
import os
from collections import defaultdict
from typing import DefaultDict, Dict
from typing import Any, DefaultDict, Dict, Optional
import psycopg2
import psycopg2.extras
import toml
FLAKY_TESTS_QUERY = """
SELECT
@@ -58,6 +59,24 @@ def main(args: argparse.Namespace):
else:
pageserver_virtual_file_io_engine_parameter = ""
# re-use existing records of flaky tests from before parametrization by compaction_algorithm
def get_pageserver_default_tenant_config_compaction_algorithm() -> Optional[Dict[str, Any]]:
"""Duplicated from parametrize.py"""
toml_table = os.getenv("PAGESERVER_DEFAULT_TENANT_CONFIG_COMPACTION_ALGORITHM")
if toml_table is None:
return None
v = toml.loads(toml_table)
assert isinstance(v, dict)
return v
pageserver_default_tenant_config_compaction_algorithm_parameter = ""
if (
explicit_default := get_pageserver_default_tenant_config_compaction_algorithm()
) is not None:
pageserver_default_tenant_config_compaction_algorithm_parameter = (
f"-{explicit_default['kind']}"
)
for row in rows:
# We don't want to automatically rerun tests in a performance suite
if row["parent_suite"] != "test_runner.regress":
@@ -66,10 +85,10 @@ def main(args: argparse.Namespace):
if row["name"].endswith("]"):
parametrized_test = row["name"].replace(
"[",
f"[{build_type}-pg{pg_version}{pageserver_virtual_file_io_engine_parameter}-",
f"[{build_type}-pg{pg_version}{pageserver_virtual_file_io_engine_parameter}{pageserver_default_tenant_config_compaction_algorithm_parameter}-",
)
else:
parametrized_test = f"{row['name']}[{build_type}-pg{pg_version}{pageserver_virtual_file_io_engine_parameter}]"
parametrized_test = f"{row['name']}[{build_type}-pg{pg_version}{pageserver_virtual_file_io_engine_parameter}{pageserver_default_tenant_config_compaction_algorithm_parameter}]"
res[row["parent_suite"]][row["suite"]][parametrized_test] = True

View File

@@ -19,9 +19,9 @@ from _pytest.config.argparsing import Parser
from _pytest.fixtures import FixtureRequest
from _pytest.terminal import TerminalReporter
from fixtures.common_types import TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonPageserver
from fixtures.types import TenantId, TimelineId
"""
This file contains fixtures for micro-benchmarks.

View File

@@ -5,8 +5,8 @@ import pytest
from werkzeug.wrappers.request import Request
from werkzeug.wrappers.response import Response
from fixtures.common_types import TenantId
from fixtures.log_helper import log
from fixtures.types import TenantId
class ComputeReconfigure:

View File

@@ -149,6 +149,7 @@ PAGESERVER_PER_TENANT_METRICS: Tuple[str, ...] = (
"pageserver_storage_operations_seconds_sum_total",
"pageserver_evictions_total",
"pageserver_evictions_with_low_residence_duration_total",
"pageserver_aux_file_estimated_size",
*PAGESERVER_PER_TENANT_REMOTE_TIMELINE_CLIENT_METRICS,
# "pageserver_directory_entries_count", -- only used if above a certain threshold
# "pageserver_broken_tenants_count" -- used only for broken

View File

@@ -47,14 +47,15 @@ from urllib3.util.retry import Retry
from fixtures import overlayfs
from fixtures.broker import NeonBroker
from fixtures.common_types import Lsn, TenantId, TenantShardId, TimelineId
from fixtures.log_helper import log
from fixtures.metrics import Metrics, MetricsGetter, parse_metrics
from fixtures.pageserver.allowed_errors import (
DEFAULT_PAGESERVER_ALLOWED_ERRORS,
DEFAULT_STORAGE_CONTROLLER_ALLOWED_ERRORS,
)
from fixtures.pageserver.common_types import IndexPartDump, LayerName, parse_layer_file_name
from fixtures.pageserver.http import PageserverHttpClient
from fixtures.pageserver.types import IndexPartDump, LayerName, parse_layer_file_name
from fixtures.pageserver.utils import (
wait_for_last_record_lsn,
wait_for_upload,
@@ -72,7 +73,6 @@ from fixtures.remote_storage import (
)
from fixtures.safekeeper.http import SafekeeperHttpClient
from fixtures.safekeeper.utils import are_walreceivers_absent
from fixtures.types import Lsn, TenantId, TenantShardId, TimelineId
from fixtures.utils import (
ATTACHMENT_NAME_REGEX,
allure_add_grafana_links,
@@ -467,6 +467,7 @@ class NeonEnvBuilder:
initial_timeline: Optional[TimelineId] = None,
pageserver_virtual_file_io_engine: Optional[str] = None,
pageserver_aux_file_policy: Optional[AuxFileStore] = None,
pageserver_default_tenant_config_compaction_algorithm: Optional[Dict[str, Any]] = None,
):
self.repo_dir = repo_dir
self.rust_log_override = rust_log_override
@@ -507,6 +508,14 @@ class NeonEnvBuilder:
self.pageserver_virtual_file_io_engine: Optional[str] = pageserver_virtual_file_io_engine
self.pageserver_default_tenant_config_compaction_algorithm: Optional[
Dict[str, Any]
] = pageserver_default_tenant_config_compaction_algorithm
if self.pageserver_default_tenant_config_compaction_algorithm is not None:
log.debug(
f"Overriding pageserver default compaction algorithm to {self.pageserver_default_tenant_config_compaction_algorithm}"
)
self.pageserver_get_vectored_impl: Optional[str] = None
if os.getenv("PAGESERVER_GET_VECTORED_IMPL", "") == "vectored":
self.pageserver_get_vectored_impl = "vectored"
@@ -702,8 +711,7 @@ class NeonEnvBuilder:
config["branch_name_mappings"] = snapshot_config["branch_name_mappings"]
# Update the config with new neon + postgres path in case of compat test
# FIXME: overriding pg_distrib_dir cause storage controller fail to start
# config["pg_distrib_dir"] = str(self.pg_distrib_dir)
config["pg_distrib_dir"] = str(self.pg_distrib_dir)
config["neon_distrib_dir"] = str(self.neon_binpath)
with (self.repo_dir / "config").open("w") as f:
@@ -1104,6 +1112,11 @@ class NeonEnv:
ps_cfg["get_impl"] = config.pageserver_get_impl
if config.pageserver_validate_vectored_get is not None:
ps_cfg["validate_vectored_get"] = config.pageserver_validate_vectored_get
if config.pageserver_default_tenant_config_compaction_algorithm is not None:
tenant_config = ps_cfg.setdefault("tenant_config", {})
tenant_config[
"compaction_algorithm"
] = config.pageserver_default_tenant_config_compaction_algorithm
if self.pageserver_remote_storage is not None:
ps_cfg["remote_storage"] = remote_storage_to_toml_dict(
@@ -1305,6 +1318,7 @@ def _shared_simple_env(
pg_version: PgVersion,
pageserver_virtual_file_io_engine: str,
pageserver_aux_file_policy: Optional[AuxFileStore],
pageserver_default_tenant_config_compaction_algorithm: Optional[Dict[str, Any]],
) -> Iterator[NeonEnv]:
"""
# Internal fixture backing the `neon_simple_env` fixture. If TEST_SHARED_FIXTURES
@@ -1336,6 +1350,7 @@ def _shared_simple_env(
test_output_dir=test_output_dir,
pageserver_virtual_file_io_engine=pageserver_virtual_file_io_engine,
pageserver_aux_file_policy=pageserver_aux_file_policy,
pageserver_default_tenant_config_compaction_algorithm=pageserver_default_tenant_config_compaction_algorithm,
) as builder:
env = builder.init_start()
@@ -1376,6 +1391,7 @@ def neon_env_builder(
top_output_dir: Path,
pageserver_virtual_file_io_engine: str,
pageserver_aux_file_policy: Optional[AuxFileStore] = None,
pageserver_default_tenant_config_compaction_algorithm: Optional[Dict[str, Any]] = None,
) -> Iterator[NeonEnvBuilder]:
"""
Fixture to create a Neon environment for test.
@@ -1410,6 +1426,7 @@ def neon_env_builder(
test_output_dir=test_output_dir,
test_overlay_dir=test_overlay_dir,
pageserver_aux_file_policy=pageserver_aux_file_policy,
pageserver_default_tenant_config_compaction_algorithm=pageserver_default_tenant_config_compaction_algorithm,
) as builder:
yield builder

View File

@@ -131,9 +131,10 @@ if __name__ == "__main__":
"-i",
"--input",
type=argparse.FileType("r"),
default=sys.stdin,
help="Pageserver logs file. Reads from stdin if no file is provided.",
help="Pageserver logs file. Use '-' for stdin.",
required=True,
)
args = parser.parse_args()
errors = _check_allowed_errors(args.input)

View File

@@ -2,7 +2,7 @@ import re
from dataclasses import dataclass
from typing import Any, Dict, Tuple, Union
from fixtures.types import KEY_MAX, KEY_MIN, Key, Lsn
from fixtures.common_types import KEY_MAX, KEY_MIN, Key, Lsn
@dataclass

View File

@@ -11,10 +11,10 @@ import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from fixtures.common_types import Lsn, TenantId, TenantShardId, TimelineId
from fixtures.log_helper import log
from fixtures.metrics import Metrics, MetricsGetter, parse_metrics
from fixtures.pg_version import PgVersion
from fixtures.types import Lsn, TenantId, TenantShardId, TimelineId
from fixtures.utils import Fn

View File

@@ -3,6 +3,7 @@ import time
from typing import Any, Callable, Dict, Tuple
import fixtures.pageserver.remote_storage
from fixtures.common_types import TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
NeonEnv,
@@ -12,7 +13,6 @@ from fixtures.pageserver.utils import (
wait_until_tenant_state,
)
from fixtures.remote_storage import LocalFsStorage, RemoteStorageKind
from fixtures.types import TenantId, TimelineId
def single_timeline(

View File

@@ -6,13 +6,13 @@ import threading
from pathlib import Path
from typing import Any, List, Tuple
from fixtures.common_types import TenantId, TimelineId
from fixtures.neon_fixtures import NeonEnv, Pagectl
from fixtures.pageserver.types import (
from fixtures.pageserver.common_types import (
InvalidFileName,
parse_layer_file_name,
)
from fixtures.remote_storage import LocalFsStorage
from fixtures.types import TenantId, TimelineId
def duplicate_one_tenant(env: NeonEnv, template_tenant: TenantId, new_tenant: TenantId):

View File

@@ -8,10 +8,10 @@ from mypy_boto3_s3.type_defs import (
ObjectTypeDef,
)
from fixtures.common_types import Lsn, TenantId, TenantShardId, TimelineId
from fixtures.log_helper import log
from fixtures.pageserver.http import PageserverApiException, PageserverHttpClient
from fixtures.remote_storage import RemoteStorage, RemoteStorageKind, S3Storage
from fixtures.types import Lsn, TenantId, TenantShardId, TimelineId
from fixtures.utils import wait_until

View File

@@ -1,7 +1,8 @@
import os
from typing import Optional
from typing import Any, Dict, Optional
import pytest
import toml
from _pytest.python import Metafunc
from fixtures.pg_version import PgVersion
@@ -37,6 +38,20 @@ def pageserver_aux_file_policy() -> Optional[AuxFileStore]:
return None
def get_pageserver_default_tenant_config_compaction_algorithm() -> Optional[Dict[str, Any]]:
toml_table = os.getenv("PAGESERVER_DEFAULT_TENANT_CONFIG_COMPACTION_ALGORITHM")
if toml_table is None:
return None
v = toml.loads(toml_table)
assert isinstance(v, dict)
return v
@pytest.fixture(scope="function", autouse=True)
def pageserver_default_tenant_config_compaction_algorithm() -> Optional[Dict[str, Any]]:
return get_pageserver_default_tenant_config_compaction_algorithm()
def pytest_generate_tests(metafunc: Metafunc):
if (bt := os.getenv("BUILD_TYPE")) is None:
build_types = ["debug", "release"]
@@ -60,6 +75,16 @@ def pytest_generate_tests(metafunc: Metafunc):
):
metafunc.parametrize("pageserver_virtual_file_io_engine", [io_engine])
# Same hack for pageserver_default_tenant_config_compaction_algorithm
if (
explicit_default := get_pageserver_default_tenant_config_compaction_algorithm()
) is not None:
metafunc.parametrize(
"pageserver_default_tenant_config_compaction_algorithm",
[explicit_default],
ids=[explicit_default["kind"]],
)
# For performance tests, parametrize also by platform
if (
"test_runner/performance" in metafunc.definition._nodeid

View File

@@ -12,8 +12,8 @@ import boto3
import toml
from mypy_boto3_s3 import S3Client
from fixtures.common_types import TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.types import TenantId, TimelineId
TIMELINE_INDEX_PART_FILE_NAME = "index_part.json"
TENANT_HEATMAP_FILE_NAME = "heatmap-v1.json"

View File

@@ -6,8 +6,8 @@ from typing import Any, Dict, List, Optional, Tuple, Union
import pytest
import requests
from fixtures.common_types import Lsn, TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.types import Lsn, TenantId, TimelineId
# Walreceiver as returned by sk's timeline status endpoint.

View File

@@ -1,6 +1,6 @@
from fixtures.common_types import TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.safekeeper.http import SafekeeperHttpClient
from fixtures.types import TenantId, TimelineId
def are_walreceivers_absent(

View File

@@ -25,14 +25,14 @@ import zstandard
from psycopg2.extensions import cursor
from fixtures.log_helper import log
from fixtures.pageserver.types import (
from fixtures.pageserver.common_types import (
parse_delta_layer,
parse_image_layer,
)
if TYPE_CHECKING:
from fixtures.neon_fixtures import PgBin
from fixtures.types import TimelineId
from fixtures.common_types import TimelineId
Fn = TypeVar("Fn", bound=Callable[..., Any])
@@ -452,6 +452,7 @@ def humantime_to_ms(humantime: str) -> float:
def scan_log_for_errors(input: Iterable[str], allowed_errors: List[str]) -> List[Tuple[int, str]]:
# FIXME: this duplicates test_runner/fixtures/pageserver/allowed_errors.py
error_or_warn = re.compile(r"\s(ERROR|WARN)")
errors = []
for lineno, line in enumerate(input, start=1):
@@ -484,7 +485,7 @@ def assert_no_errors(log_file, service, allowed_errors):
for _lineno, error in errors:
log.info(f"not allowed {service} error: {error.strip()}")
assert not errors, f"Log errors on {service}: {errors[0]}"
assert not errors, f"First log error on {service}: {errors[0]}\nHint: use scripts/check_allowed_errors.sh to test any new allowed_error you add"
@enum.unique

View File

@@ -1,6 +1,7 @@
import threading
from typing import Any, Optional
from fixtures.common_types import TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
Endpoint,
@@ -10,7 +11,6 @@ from fixtures.neon_fixtures import (
wait_for_last_flush_lsn,
)
from fixtures.pageserver.utils import wait_for_last_record_lsn, wait_for_upload
from fixtures.types import TenantId, TimelineId
# neon_local doesn't handle creating/modifying endpoints concurrently, so we use a mutex
# to ensure we don't do that: this enables running lots of Workloads in parallel safely.

View File

@@ -5,13 +5,13 @@ Utilities used by all code in this sub-directory
from typing import Any, Callable, Dict, Tuple
import fixtures.pageserver.many_tenants as many_tenants
from fixtures.common_types import TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
NeonEnv,
NeonEnvBuilder,
)
from fixtures.pageserver.utils import wait_until_all_tenants_state
from fixtures.types import TenantId, TimelineId
def ensure_pageserver_ready_for_benchmarking(env: NeonEnv, n_tenants: int):

View File

@@ -9,11 +9,11 @@ from typing import List
import pytest
from fixtures.benchmark_fixture import MetricReport, NeonBenchmarker
from fixtures.common_types import Lsn
from fixtures.compare_fixtures import NeonCompare
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonPageserver
from fixtures.pageserver.utils import wait_for_last_record_lsn
from fixtures.types import Lsn
from fixtures.utils import wait_until
from prometheus_client.samples import Sample

View File

@@ -2,10 +2,10 @@ from contextlib import closing
import pytest
from fixtures.benchmark_fixture import MetricReport
from fixtures.common_types import Lsn
from fixtures.compare_fixtures import NeonCompare, PgCompare
from fixtures.pageserver.utils import wait_tenant_status_404
from fixtures.pg_version import PgVersion
from fixtures.types import Lsn
#

View File

@@ -3,6 +3,7 @@ import random
import time
import pytest
from fixtures.common_types import TenantId, TenantShardId, TimelineId
from fixtures.compute_reconfigure import ComputeReconfigure
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
@@ -10,7 +11,6 @@ from fixtures.neon_fixtures import (
)
from fixtures.pageserver.http import PageserverHttpClient
from fixtures.pg_version import PgVersion
from fixtures.types import TenantId, TenantShardId, TimelineId
@pytest.mark.timeout(3600) # super long running test: should go down as we optimize

View File

@@ -6,10 +6,10 @@ from typing import Any, Callable, List
import pytest
from fixtures.benchmark_fixture import MetricReport, NeonBenchmarker
from fixtures.common_types import Lsn
from fixtures.compare_fixtures import NeonCompare, PgCompare, VanillaCompare
from fixtures.log_helper import log
from fixtures.neon_fixtures import DEFAULT_BRANCH_NAME, NeonEnvBuilder, PgBin
from fixtures.types import Lsn
from performance.test_perf_pgbench import get_durations_matrix, get_scales_matrix

View File

@@ -1,6 +1,6 @@
from fixtures.common_types import TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder
from fixtures.types import TimelineId
from fixtures.utils import query_scalar

View File

@@ -2,13 +2,13 @@ from dataclasses import dataclass
from typing import Generator, Optional
import pytest
from fixtures.common_types import TenantId
from fixtures.neon_fixtures import (
NeonEnv,
NeonEnvBuilder,
)
from fixtures.pageserver.http import PageserverApiException, TenantConfig
from fixtures.remote_storage import LocalFsStorage, RemoteStorageKind
from fixtures.types import TenantId
from fixtures.utils import wait_until

View File

@@ -4,13 +4,13 @@ from pathlib import Path
import psycopg2
import pytest
from fixtures.common_types import TenantId, TimelineId
from fixtures.neon_fixtures import (
NeonEnv,
NeonEnvBuilder,
PgProtocol,
)
from fixtures.pageserver.http import PageserverApiException, PageserverHttpClient
from fixtures.types import TenantId, TimelineId
def assert_client_authorized(env: NeonEnv, http_client: PageserverHttpClient):

View File

@@ -2,10 +2,10 @@ import threading
import time
import pytest
from fixtures.common_types import Lsn, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnv
from fixtures.pageserver.http import TimelineCreate406
from fixtures.types import Lsn, TimelineId
from fixtures.utils import query_scalar

View File

@@ -1,8 +1,8 @@
import pytest
from fixtures.common_types import Lsn, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder
from fixtures.pageserver.http import TimelineCreate406
from fixtures.types import Lsn, TimelineId
from fixtures.utils import print_gc_result, query_scalar

View File

@@ -5,6 +5,7 @@ from concurrent.futures import ThreadPoolExecutor
from typing import List
import pytest
from fixtures.common_types import Lsn, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
Endpoint,
@@ -14,7 +15,6 @@ from fixtures.neon_fixtures import (
)
from fixtures.pageserver.http import PageserverApiException
from fixtures.pageserver.utils import wait_until_tenant_active
from fixtures.types import Lsn, TimelineId
from fixtures.utils import query_scalar
from performance.test_perf_pgbench import get_scales_matrix
from requests import RequestException

View File

@@ -3,6 +3,7 @@ import os
from typing import List, Tuple
import pytest
from fixtures.common_types import TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
Endpoint,
@@ -11,7 +12,6 @@ from fixtures.neon_fixtures import (
wait_for_last_flush_lsn,
)
from fixtures.pg_version import PgVersion
from fixtures.types import TenantId, TimelineId
# Test restarting page server, while safekeeper and compute node keep

View File

@@ -1,4 +1,5 @@
import os
import re
import shutil
import subprocess
import tempfile
@@ -7,6 +8,7 @@ from typing import List, Optional
import pytest
import toml
from fixtures.common_types import Lsn
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
NeonEnv,
@@ -21,7 +23,6 @@ from fixtures.pageserver.utils import (
)
from fixtures.pg_version import PgVersion
from fixtures.remote_storage import RemoteStorageKind
from fixtures.types import Lsn
#
# A test suite that help to prevent unintentionally breaking backward or forward compatibility between Neon releases.
@@ -245,14 +246,34 @@ def test_forward_compatibility(
compatibility_snapshot_dir / "repo",
)
# not using env.pageserver.version because it was initialized before
prev_pageserver_version_str = env.get_binary_version("pageserver")
prev_pageserver_version_match = re.search(
"Neon page server git-env:(.*) failpoints: (.*), features: (.*)",
prev_pageserver_version_str,
)
if prev_pageserver_version_match is not None:
prev_pageserver_version = prev_pageserver_version_match.group(1)
else:
raise AssertionError(
"cannot find git hash in the version string: " + prev_pageserver_version_str
)
# does not include logs from previous runs
assert not env.pageserver.log_contains("git-env:" + prev_pageserver_version)
neon_env_builder.start()
# ensure the specified pageserver is running
assert env.pageserver.log_contains("git-env:" + prev_pageserver_version)
check_neon_works(
env,
test_output_dir=test_output_dir,
sql_dump_path=compatibility_snapshot_dir / "dump.sql",
repo_dir=env.repo_dir,
)
except Exception:
if breaking_changes_allowed:
pytest.xfail(

View File

@@ -5,6 +5,7 @@ from dataclasses import dataclass
from typing import Any, Dict, Iterable, Tuple
import pytest
from fixtures.common_types import Lsn, TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
NeonEnv,
@@ -16,7 +17,6 @@ from fixtures.neon_fixtures import (
from fixtures.pageserver.http import PageserverHttpClient
from fixtures.pageserver.utils import wait_for_upload_queue_empty
from fixtures.remote_storage import RemoteStorageKind
from fixtures.types import Lsn, TenantId, TimelineId
from fixtures.utils import wait_until
GLOBAL_LRU_LOG_LINE = "tenant_min_resident_size-respecting LRU would not relieve pressure, evicting more following global LRU policy"
@@ -623,15 +623,16 @@ def test_partial_evict_tenant(eviction_env: EvictionEnv, order: EvictionOrder):
ratio = count_now / original_count
abs_diff = abs(ratio - expected_ratio)
assert original_count > count_now
log.info(
f"tenant {tenant_id} layer count {original_count} -> {count_now}, ratio: {ratio}, expecting {abs_diff} < 0.1"
)
expectation = 0.06
log.info(
f"tenant {tenant_id} layer count {original_count} -> {count_now}, ratio: {ratio}, expecting {abs_diff} < {expectation}"
)
# in this test case both relative_spare and relative_equal produce
# the same outcomes; this must be a quantization effect of similar
# sizes (-s4 and -s6) and small (5MB) layer size.
# for pg15 and pg16 the absdiff is < 0.01, for pg14 it is closer to 0.02
assert abs_diff < 0.05
assert abs_diff < expectation
@pytest.mark.parametrize(

View File

@@ -2,7 +2,7 @@ import time
import pytest
from fixtures.neon_fixtures import NeonEnvBuilder, PgBin, wait_for_last_flush_lsn
from fixtures.pageserver.types import parse_layer_file_name
from fixtures.pageserver.common_types import parse_layer_file_name
from fixtures.pageserver.utils import (
wait_for_last_record_lsn,
wait_for_upload_queue_empty,

View File

@@ -1,6 +1,7 @@
import os
from pathlib import Path
from fixtures.common_types import Lsn, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
NeonEnvBuilder,
@@ -8,7 +9,6 @@ from fixtures.neon_fixtures import (
VanillaPostgres,
)
from fixtures.port_distributor import PortDistributor
from fixtures.types import Lsn, TimelineId
from fixtures.utils import query_scalar, subprocess_capture
num_rows = 1000

View File

@@ -2,6 +2,7 @@ import asyncio
import concurrent.futures
import random
from fixtures.common_types import TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
Endpoint,
@@ -10,7 +11,6 @@ from fixtures.neon_fixtures import (
wait_for_last_flush_lsn,
)
from fixtures.remote_storage import RemoteStorageKind
from fixtures.types import TimelineId
# Test configuration
#

View File

@@ -7,6 +7,7 @@ from contextlib import closing
from pathlib import Path
import pytest
from fixtures.common_types import Lsn, TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
Endpoint,
@@ -20,7 +21,6 @@ from fixtures.pageserver.utils import (
wait_for_upload,
)
from fixtures.remote_storage import RemoteStorageKind
from fixtures.types import Lsn, TenantId, TimelineId
from fixtures.utils import subprocess_capture

View File

@@ -7,7 +7,7 @@ from fixtures.neon_fixtures import (
flush_ep_to_pageserver,
wait_for_last_flush_lsn,
)
from fixtures.pageserver.types import parse_layer_file_name
from fixtures.pageserver.common_types import parse_layer_file_name
from fixtures.pageserver.utils import wait_for_upload
from fixtures.remote_storage import RemoteStorageKind

View File

@@ -1,8 +1,9 @@
import time
from fixtures.common_types import Lsn
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder, flush_ep_to_pageserver
from fixtures.pageserver.types import (
from fixtures.pageserver.common_types import (
DeltaLayerName,
ImageLayerName,
is_future_layer,
@@ -13,7 +14,6 @@ from fixtures.pageserver.utils import (
wait_until_tenant_active,
)
from fixtures.remote_storage import LocalFsStorage, RemoteStorageKind
from fixtures.types import Lsn
from fixtures.utils import query_scalar, wait_until

View File

@@ -4,6 +4,7 @@ from random import choice
from string import ascii_lowercase
import pytest
from fixtures.common_types import Lsn
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
AuxFileStore,
@@ -12,7 +13,6 @@ from fixtures.neon_fixtures import (
logical_replication_sync,
wait_for_last_flush_lsn,
)
from fixtures.types import Lsn
from fixtures.utils import query_scalar, wait_until

View File

@@ -2,10 +2,10 @@ import re
import time
from datetime import datetime, timedelta, timezone
from fixtures.common_types import Lsn
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder, wait_for_last_flush_lsn
from fixtures.pageserver.http import PageserverApiException
from fixtures.types import Lsn
from fixtures.utils import query_scalar

View File

@@ -5,6 +5,7 @@ from typing import cast
import pytest
import requests
from fixtures.common_types import TenantId, TimelineId
from fixtures.neon_fixtures import (
DEFAULT_BRANCH_NAME,
NeonEnv,
@@ -13,7 +14,6 @@ from fixtures.neon_fixtures import (
)
from fixtures.pageserver.http import PageserverHttpClient
from fixtures.pg_version import PgVersion, skip_on_postgres
from fixtures.types import TenantId, TimelineId
def helper_compare_timeline_list(

View File

@@ -3,13 +3,13 @@ import os
import time
from pathlib import Path
from fixtures.common_types import Lsn, TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder, wait_for_wal_insert_lsn
from fixtures.pageserver.utils import (
wait_for_last_record_lsn,
)
from fixtures.remote_storage import RemoteStorageKind
from fixtures.types import Lsn, TenantId, TimelineId
from fixtures.utils import query_scalar

View File

@@ -1,6 +1,6 @@
from fixtures.common_types import TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder
from fixtures.types import TimelineId
from fixtures.utils import print_gc_result, query_scalar

View File

@@ -5,6 +5,7 @@ import time
from collections import defaultdict
from typing import Any, DefaultDict, Dict, Tuple
from fixtures.common_types import Lsn
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
NeonEnvBuilder,
@@ -21,7 +22,6 @@ from fixtures.pageserver.utils import (
wait_until_tenant_active,
)
from fixtures.remote_storage import RemoteStorageKind
from fixtures.types import Lsn
from fixtures.utils import query_scalar, wait_until

View File

@@ -1,9 +1,9 @@
from typing import Optional
import pytest
from fixtures.common_types import Lsn
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder, tenant_get_shards
from fixtures.types import Lsn
from fixtures.utils import query_scalar

View File

@@ -3,13 +3,13 @@ from pathlib import Path
from typing import Optional
import toml
from fixtures.common_types import Lsn, TenantId, TimelineId
from fixtures.neon_fixtures import (
DEFAULT_BRANCH_NAME,
NeonEnv,
NeonEnvBuilder,
)
from fixtures.pageserver.http import PageserverHttpClient
from fixtures.types import Lsn, TenantId, TimelineId
from fixtures.utils import wait_until

View File

@@ -16,6 +16,7 @@ import time
from typing import Optional
import pytest
from fixtures.common_types import TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
NeonEnv,
@@ -37,7 +38,6 @@ from fixtures.pageserver.utils import (
from fixtures.remote_storage import (
RemoteStorageKind,
)
from fixtures.types import TenantId, TimelineId
from fixtures.utils import print_gc_result, wait_until
from fixtures.workload import Workload
@@ -703,7 +703,6 @@ def test_multi_attach(
workload.validate(pageservers[2].id)
@pytest.mark.skip(reason="To be enabled after release with new local path style")
def test_upgrade_generationless_local_file_paths(
neon_env_builder: NeonEnvBuilder,
):

View File

@@ -2,10 +2,10 @@ import json
import uuid
from anyio import Path
from fixtures.common_types import TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder, PgBin
from fixtures.pg_version import PgVersion
from fixtures.types import TenantId, TimelineId
from fixtures.utils import wait_until

View File

@@ -5,6 +5,7 @@ from typing import Optional, Tuple
import psutil
import pytest
from fixtures.common_types import Lsn, TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
NeonEnv,
@@ -13,7 +14,6 @@ from fixtures.neon_fixtures import (
)
from fixtures.pageserver.http import PageserverHttpClient
from fixtures.pageserver.utils import wait_for_last_record_lsn, wait_for_upload
from fixtures.types import Lsn, TenantId, TimelineId
from fixtures.utils import wait_until
TIMELINE_COUNT = 10

View File

@@ -7,6 +7,7 @@ from pathlib import Path
from queue import SimpleQueue
from typing import Any, Dict, Set
from fixtures.common_types import TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
NeonEnvBuilder,
@@ -17,7 +18,6 @@ from fixtures.remote_storage import (
RemoteStorageKind,
remote_storage_to_toml_inline_table,
)
from fixtures.types import TenantId, TimelineId
from pytest_httpserver import HTTPServer
from werkzeug.wrappers.request import Request
from werkzeug.wrappers.response import Response

View File

@@ -5,9 +5,10 @@ import time
from typing import Any, Dict, Optional
import pytest
from fixtures.common_types import TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder, NeonPageserver, S3Scrubber
from fixtures.pageserver.types import parse_layer_file_name
from fixtures.pageserver.common_types import parse_layer_file_name
from fixtures.pageserver.utils import (
assert_prefix_empty,
poll_for_remote_storage_iterations,
@@ -15,7 +16,6 @@ from fixtures.pageserver.utils import (
wait_for_upload_queue_empty,
)
from fixtures.remote_storage import LocalFsStorage, RemoteStorageKind, S3Storage
from fixtures.types import TenantId, TimelineId
from fixtures.utils import wait_until
from fixtures.workload import Workload

View File

@@ -1,6 +1,6 @@
from fixtures.common_types import TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder
from fixtures.types import TimelineId
from fixtures.utils import print_gc_result, query_scalar

View File

@@ -1,8 +1,8 @@
from contextlib import closing
from fixtures.common_types import Lsn
from fixtures.neon_fixtures import NeonEnvBuilder
from fixtures.pageserver.utils import wait_for_last_record_lsn
from fixtures.types import Lsn
from fixtures.utils import query_scalar

Some files were not shown because too many files have changed in this diff Show More