Compare commits

...

48 Commits

Author SHA1 Message Date
Alex Chi Z
c32c26d741 add key repo
Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-21 18:10:18 -04:00
Alex Chi Z
6da5c189b5 full impl of local fs encryption
Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-18 15:02:11 -04:00
Alex Chi Z
66c01e46ca disable index_part upload check
Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-18 14:51:00 -04:00
Alex Chi Z
94fe9d00f3 read path
Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-17 16:29:36 -04:00
Alex Chi Z
dc19960cbf feat(pageserver): upload encrypted layer files
Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-17 16:08:18 -04:00
Alex Chi Z
d3b654f6db feat(pageserver): support SSE-C for s3
Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-17 15:24:10 -04:00
Alex Chi Z
4a898cfa2c feat(pageserver): add initial encrypt key fields
Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-17 15:23:55 -04:00
Alex Chi Z.
ad0c5fdae7 fix(test): allow stale generation warnings in storcon (#11624)
## Problem

https://github.com/neondatabase/neon/pull/11531 did not fully fix the
problem because the warning is part of the storcon instead of
pageserver.

## Summary of changes

Allow stale generation error in storcon.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-17 16:12:24 +00:00
Christian Schwarz
2b041964b3 cover direct IO + concurrent IO in unit, regression & perf tests (#11585)
This mirrors the production config.

Thread that discusses the merits of this:
- https://neondb.slack.com/archives/C033RQ5SPDH/p1744742010740569

# Refs
- context
https://neondb.slack.com/archives/C04BLQ4LW7K/p1744724844844589?thread_ts=1744705831.014169&cid=C04BLQ4LW7K
- prep for https://github.com/neondatabase/neon/pull/11558 which adds
new io mode `direct-rw`

# Impact on CI turnaround time

Spot-checking impact on CI timings

- Baseline: [some recent main
commit](https://github.com/neondatabase/neon/actions/runs/14471549758/job/40587837475)
- Comparison: [this
commit](https://github.com/neondatabase/neon/actions/runs/14471945087/job/40589613274)
in this PR here

Impact on CI turnaround time

- Regression tests:
  - x64: very minor, sometimes better; likely in the noise
  - arm64: substantial  30min => 40min
- Benchmarks (x86 only I think): very minor; noise seems higher than
regress tests

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Alex Chi Z. <4198311+skyzh@users.noreply.github.com>
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
Co-authored-by: Alex Chi Z <chi@neon.tech>
2025-04-17 15:53:10 +00:00
John Spray
d4c059a884 tests: use endpoint http wrapper to get auth (#11628)
## Problem

`test_compute_startup_simple` and `test_compute_ondemand_slru_startup`
are failing.

This test implicitly asserts that the metrics.json endpoint succeeds and
returns all expected metrics, but doesn't make it easy to see what went
wrong if it doesn't (e.g. in this failure
https://neon-github-public-dev.s3.amazonaws.com/reports/main/14513210240/index.html#suites/13d8e764c394daadbad415a08454c04e/b0f92a86b2ed309f/)

In this case, it was failing because of a missing auth token, because it
was using `requests` directly instead of using the endpoint http client
type.

## Summary of changes

- Use endpoint http wrapper to get raise_for_status & auth token
2025-04-17 15:03:23 +00:00
Folke Behrens
2c56c46d48 compute: Set max log level for local proxy sql_over_http mod to WARN (#11629)
neondatabase/cloud#27738
2025-04-17 14:38:19 +00:00
Tristan Partin
d1728a6bcd Remove old compatibility hack for remote extensions (#11620)
Control plane has long since been updated to send the right value.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-17 14:08:42 +00:00
John Spray
0a27973584 pageserver: rename Tenant to TenantShard (#11589)
## Problem

`Tenant` isn't really a whole tenant: it's just one shard of a tenant.

## Summary of changes

- Automated rename of Tenant to TenantShard
- Followup commit to change references in comments
2025-04-17 13:29:16 +00:00
Alexander Bayandin
07c2411f6b tests: remove mentions of ALLOW_*_COMPATIBILITY_BREAKAGE (#11618)
## Problem

There are mentions of `ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` and
`ALLOW_FORWARD_COMPATIBILITY_BREAKAGE`, but in reality, this mechanism
doesn't work, so let's remove it to avoid confusion.

The idea behind it was to allow some breaking changes by adding a
special label to a PR that would `xfail` the test. However, in practice,
this means we would need to carry this label through all subsequent PRs
until the release (and artifact regeneration). This approach isn't
really viable, as it increases the risk of missing a compatibility break
in another PR.

## Summary of changes
- Remove mentions and handling of
`ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` /
`ALLOW_FORWARD_COMPATIBILITY_BREAKAGE`
2025-04-17 10:03:21 +00:00
Alexander Bayandin
5819938c93 CI(pg-clients): fix workflow permissions (#11623)
## Problem

`pg-clients` can't start:

```
The workflow is not valid. .github/workflows/pg-clients.yml (Line: 44, Col: 3): Error calling workflow 'neondatabase/neon/.github/workflows/build-build-tools-image.yml@aa19f10e7e958fbe0e0641f2e8c5952ce3be44b3'. The nested job 'check-image' is requesting 'packages: read', but is only allowed 'packages: none'. .github/workflows/pg-clients.yml (Line: 44, Col: 3): Error calling workflow 'neondatabase/neon/.github/workflows/build-build-tools-image.yml@aa19f10e7e958fbe0e0641f2e8c5952ce3be44b3'. The nested job 'build-image' is requesting 'packages: write', but is only allowed 'packages: none'.
```

## Summary of changes
- Grant required `packages: write` permissions to the workflow
2025-04-17 08:54:23 +00:00
Konstantin Knizhnik
b7548de814 Disable autovacuum and increase limit for WS approximation (#11583)
## Problem

Test lfc working set approximation becomes flaky after recent changes in
prefetch.
May be it is caused by updating HLL in `lfc_write`, may be by some other
reasons.

## Summary of changes

1. Disable autovacuum in this test (as possible source of extra page
accesses).
2. Increase upper boundary for WS approximation from 12 to 20.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-04-17 05:07:45 +00:00
Tristan Partin
9794f386f4 Make Postgres 17 the default version (#11619)
This is mostly a documentation update, but a few updates with regard to
neon_local, pageserver, and tests.

17 is our default for users in production, so dropping references to 16
makes sense.

Signed-off-by: Tristan Partin <tristan@neon.tech>

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-16 23:23:37 +00:00
Tristan Partin
79083de61c Remove forward compatibility hacks related to compute_ctl auth (#11621)
These various hacks were needed for the forward compatibility tests.
Enough time has passed since the merge that these are no longer needed.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-16 23:14:24 +00:00
Folke Behrens
ec9079f483 Allow unwrap() in tests when clippy::unwrap_used is denied (#11616)
## Problem

The proxy denies using `unwrap()`s in regular code, but we want to use
it in test code
and so have to allow it for each test block.

## Summary of changes

Set `allow-unwrap-in-tests = true` in clippy.toml and remove all
exceptions.
2025-04-16 20:05:21 +00:00
Ivan Efremov
b9b25e13a0 feat(proxy): Return prefixed errors to testodrome (#11561)
Testodrome measures uptime based on the failed requests and errors. In
case of testodrome request we send back error based on the service. This
will help us distinguish error types in testodrome and rely on the
uptime SLI.
2025-04-16 19:03:23 +00:00
Alex Chi Z.
cf2e695f49 feat(pageserver): gc-compaction meta statistics (#11601)
## Problem

We currently only have gc-compaction statistics for each single
sub-compaction job.

## Summary of changes

Add meta statistics across all sub-compaction jobs scheduled.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-16 18:51:48 +00:00
Conrad Ludgate
fc233794f6 fix(proxy): make sure that sql-over-http is TLS aware (#11612)
I noticed that while auth-broker -> local-proxy is TLS aware, and TCP
proxy -> postgres is TLS aware, HTTP proxy -> postgres is not 😅
2025-04-16 18:37:17 +00:00
Tristan Partin
c002236145 Remove compute_ctl authorization bypass if testing feature was enable (#11596)
We want to exercise the authorization middleware in our regression
tests.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-16 17:54:51 +00:00
Erik Grinaker
4af0b9b387 pageserver: don't recompress images in ImageLayerInner::filter() (#11592)
## Problem

During shard ancestor compaction, we currently recompress all page
images as we move them into a new layer file. This is expensive and
unnecessary.

Resolves #11562.
Requires #11607.

## Summary of changes

Pass through compressed page images in `ImageLayerInner::filter()`.
2025-04-16 17:10:15 +00:00
Vlad Lazar
0e00faf528 tests: stability fixes for test_migration_to_cold_secondary (#11606)
1. Compute may generate WAL on shutdown. The test assumes that after
shutdown,
no further ingest happens. Tweak the compute shutdown to make the
assumption true.
2. Assertion of local layer count post cold migration is not right since
we may have downloaded
layers due to ingest. Remove it.

Closes https://github.com/neondatabase/neon/issues/11587
2025-04-16 16:31:23 +00:00
Anastasia Lubennikova
7747a9619f compute: fix copy-paste typo for neon GUC parameters check (#11610)
fix for commit
[5063151](5063151271)
2025-04-16 15:55:11 +00:00
Erik Grinaker
46100717ad pageserver: add VectoredBlob::raw_with_header (#11607)
## Problem

To avoid recompressing page images during layer filtering, we need
access to the raw header and data from vectored reads such that we can
pass them through to the target layer.

Touches #11562.

## Summary of changes

Adds `VectoredBlob::raw_with_header()` to return a raw view of the
header+data, and updates `read()` to track it.

Also adds `blob_io::Header` with header metadata and decode logic, to
reuse for tests and assertions. This isn't yet widely used.
2025-04-16 15:38:10 +00:00
Erik Grinaker
00eeff9b8d pageserver: add compaction_shard_ancestor to disable shard ancestor compaction (#11608)
## Problem

Splits of large tenants (several TB) can cause a huge amount of shard
ancestor compaction work, which can overload Pageservers.

Touches https://github.com/neondatabase/cloud/issues/22532.

## Summary of changes

Add a setting `compaction_shard_ancestor` (default `true`) to disable
shard ancestor compaction on a per-tenant basis.
2025-04-16 14:41:02 +00:00
Matthias van de Meent
2a46426157 Update neon GUCs with new default settings (#11595)
Staging and prod both have these settings configured like this, so let's
update this so we can eventually drop the overrides in prod.
2025-04-16 13:42:22 +00:00
Tristan Partin
edc11253b6 Fix neon_local public key parsing when create compute JWKS (#11602)
Finally figured out the right incantation. I had had this in my original
go, but due to some refactoring and apparently missed testing, I
committed a mistake. The reason this doesn't currently break anything is
that we bypass the authorization middleware when the "testing" cargo
feature is enabled.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-16 12:51:48 +00:00
Heikki Linnakangas
b4e26a6284 Set last-written LSN as part of smgr_end_unlogged_build() (#11584)
This way, the callers don't need to do it, reducing the footprint of
changes we've had to made to various index AM's build functions.
2025-04-16 12:34:18 +00:00
Vlad Lazar
96b46365e4 tests: attach final metrics to allure report (#11604)
## Problem

Metrics are saved in https://github.com/neondatabase/neon/pull/11559,
but the file is not matched by the attachment regex.

## Summary of changes

Make attachment regex match the metrics file.
2025-04-16 10:26:47 +00:00
Alex Chi Z.
aa19f10e7e fix(test): allow shutdown warning in preempt tests (#11600)
## Problem

test_gc_compaction_preempt is still flaky

## Summary of changes

- allow shutdown warning logs

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-15 21:50:28 +00:00
Konstantin Knizhnik
35170656fe Allocate WalProposerConn using TopMemoryAllocator (#11577)
## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1744659631698609
`WalProposerConn` is allocated using current memory context which life
time is not long enough.

## Summary of changes

Allocate `WalProposerConn`  using `TopMemoryContext`.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-04-15 19:13:12 +00:00
Tristan Partin
cd9ad75797 Remove compute_ctl authorization bypass on localhost (#11597)
For whatever reason, this never worked in production computes anyway.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-15 19:12:34 +00:00
Tristan Partin
eadb05f78e Teach neon_local to pass the Authorization header to compute_ctl (#11490)
This allows us to remove hacks in the compute_ctl authorization
middleware which allowed for bypasses of auth checks.

Fixes: https://github.com/neondatabase/neon/issues/11316

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-15 17:27:49 +00:00
Fedor Dikarev
c5115518e9 remove temp file from repo (#11586)
## Problem
In https://github.com/neondatabase/neon/pull/11409 we added temp file to
the repo.

## Summary of changes
Remove temp file from the repo.
2025-04-15 15:29:15 +00:00
Alex Chi Z.
931f8c4300 fix(pageserver): check if cancelled before waiting logical size (2/2) (#11575)
## Problem

close https://github.com/neondatabase/neon/issues/11486, proceeding
https://github.com/neondatabase/neon/pull/11531

## Summary of changes

This patch fixes the rest 50% of instability of
`test_create_churn_during_restart`. During tenant warmup, we'll request
logical size; however, if the startup gets cancelled, we won't be able
to spawn the initial logical size calculation task that sets the
`cancel_wait_for_background_loop_concurrency_limit_semaphore`.

Therefore, we check `cancelled` before proceeding to get
`cancel_wait_for_background_loop_concurrency_limit_semaphore`. There
will still be a race if the timeline shutdown happens after L5710 and
before L5711, but it should be enough to reduce the flakiness of the
test.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-15 15:16:16 +00:00
Alexander Bayandin
0f7c2cc382 CI(release): add time to RC PR branch names (#11547)
## Problem

We can't have more than one open release PR created on the same day (due
to non-unique enough branch names).

## Summary of changes
- Add time (hours and minutes) to RC PR branch names
- Also make sure we use UTC for releases
2025-04-15 15:08:05 +00:00
Erik Grinaker
983d56502b pageserver: reduce shard ancestor rewrite threshold to 30% (#11582)
## Problem

When doing power-of-two shard splits (i.e. 4 → 8 → 16), we end up
rewriting all layers since half of the pages will be local due to
striping. This causes a lot of resource usage when splitting large
tenants.

## Summary of changes

Drop the threshold of local/total pages to 30%, to reduce the amount of
layer rewrites after splits.
2025-04-15 14:26:29 +00:00
Erik Grinaker
bcef542d5b pageserver: don't rewrite invisible layers during ancestor compaction (#11580)
## Problem

Shard ancestor compaction can be very expensive following shard splits
of large tenants. We currently rewrite garbage layers after shard splits
as well, which can be a significant amount of data.

Touches https://github.com/neondatabase/cloud/issues/22532.

## Summary of changes

Don't rewrite invisible layers after shard splits.
2025-04-15 14:25:58 +00:00
a-masterov
e31455d936 Add the tests for the extensions pg_jsonschema and pg_session_jwt (#11323)
## Problem
`pg_jsonschema` and `pg_session_jwt` are not yet covered by tests
## Summary of changes
Added the tests for these extensions.
2025-04-15 14:06:01 +00:00
Alex Chi Z.
a4ea7d6194 fix(pageserver): gc-compaction verification false failure (#11564)
## Problem

https://github.com/neondatabase/neon/pull/11515 introduced a bug that
some key history cannot be verified.

If a key only exists above the horizon, the verification will fail for
its first occurrence because the history does not exist at that point.

As gc-compaction skips a key range whenever an error occurs, it might be
doing some wasted work in staging/prod now. But I'm not planning a
hotfix this week as the bug doesn't affect correctness/performance.

## Summary of changes

Allow keys with only above horizon history in the verification.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-15 13:58:32 +00:00
Alexander Bayandin
19bea5fd0c CI: do not wait for tests to trigger deploy job (#11548)
## Problem

There is too much delay between merging a PR into `main` and deploying
the changes to staging

## Summary of changes
- Trigger `deploy` job without waiting for `build-and-test-locally` job
2025-04-15 11:23:41 +00:00
a-masterov
5be94e28c4 Update the documentation of the cloud regress test (#11539)
## Problem
The information in the README.md contained errors, and some information
was missing.
## Summary of changes
Found errors are fixed, and new information is added.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2025-04-15 11:00:25 +00:00
Alexander Bayandin
63a106021a CI(allure-report-generate): Install allure to /tmp (#11579)
## Problem

The `/__w/neon/neon` directory is mounted from host to container and
persists between runs.
Sometimes the next workflow run fails to delete it:

```
Deleting the contents of '/__w/neon/neon'
Error: File was unable to be removed Error: EACCES: permission denied, rmdir '/__w/neon/neon/allure-2.32.2/bin'
```

## Summary of changes
- Download and install allure to `/tmp` which exists in container only

Ref https://github.com/neondatabase/cloud/issues/27186
2025-04-15 09:29:36 +00:00
Fedor Dikarev
9a6ace9bde introduce new runners: unit-perf and use them for benchmark jobs (#11409)
## Problem
Benchmarks results are inconsistent on existing small-metal runners

## Summary of changes
Introduce new `unit-perf` runners, and lets run benchmark on them.

The new hardware has slower, but consistent, CPU frequency - if run with
default governor schedutil.
Thus we needed to adjust some testcases' timeouts and add some retry
steps where hard-coded timeouts couldn't be increased without changing
the system under test.
-
[wait_for_last_record_lsn](6592d69a67/test_runner/fixtures/pageserver/utils.py (L193))
1000s -> 2000s
-
[test_branch_creation_many](https://github.com/neondatabase/neon/pull/11409/files#diff-2ebfe76f89004d563c7e53e3ca82462e1d85e92e6d5588e8e8f598bbe119e927)
1000s
-
[test_ingest_insert_bulk](https://github.com/neondatabase/neon/pull/11409/files#diff-e90e685be4a87053bc264a68740969e6a8872c8897b8b748d0e8c5f683a68d9f)
- with back throttling disabled compute becomes unresponsive for more
than 60 seconds (PG hard-coded client authentication connection timeout)
-
[test_sharded_ingest](https://github.com/neondatabase/neon/pull/11409/files#diff-e8d870165bd44acb9a6d8350f8640b301c1385a4108430b8d6d659b697e4a3f1)
600s -> 1200s

Right now there are only 2 runners of that class, and if we decide to go
with them, we have to check how much that type of runners we need, so
jobs not stuck with waiting for that type of runners available.

However we now decided to run those runners with governor performance
instead of schedutil.
This achieves almost same performance as previous runners but still
achieves consistent results for same commit

Related issue to activate performance governor on these runners
https://github.com/neondatabase/runner/pull/138

## Verification that it helps

### analyze runtimes on new runner for same commit

Table of runtimes for the same commit on different runners in
[run](https://github.com/neondatabase/neon/actions/runs/14417589789)

| Run | Benchmarks (1) | Benchmarks (2) |Benchmarks (3) |Benchmarks (4)
| Benchmarks (5) |
|--------|--------|---------|---------|---------|---------|
| 1 | 1950.37s | 6374.55s |  3646.15s |  4149.48s |  2330.22s | 
| 2 | - | 6369.27s |  3666.65s |  4162.42s |  2329.23s | 
| Delta % |  - |  0,07 %  | 0,5 %   |   0,3 % | 0,04 %   |
| with governor performance | 1519.57s |  4131.62s |  - | -  |  - |
| second run gov. perf. | 1513.62s |  4134.67s |  - | -  |  - |
| Delta % |  0,3 % |  0,07 %  |  -  |  - | -   |
| speedup gov. performance | 22 % |  35 % |  - | -  |  - |
| current desktop class hetzner runners (main) | 1487.10s | 3699.67s | -
| - | - |
| slower than desktop class | 2 % |  12 % |  - | -  |  - |


In summary, the runtimes for the same commit on this hardware varies
less than 1 %.

---------

Co-authored-by: BodoBolero <peterbendel@neon.tech>
2025-04-15 08:21:44 +00:00
Erik Grinaker
8c77ccfc01 pageserver: log total progress during shard ancestor compaction (#11565)
## Problem

Shard ancestor compaction doesn't currently log any global progress
information, only for the current batch.

## Summary of changes

Log the number of layers checked for eligibility this iteration, and the
total number of layers to check. This will indicate how far along the
total shard ancestor compaction has gotten for this iteration.
2025-04-15 07:25:09 +00:00
155 changed files with 2375 additions and 783 deletions

View File

@@ -6,6 +6,7 @@ self-hosted-runner:
- small
- small-metal
- small-arm64
- unit-perf
- us-east-2
config-variables:
- AWS_ECR_REGION

View File

@@ -70,6 +70,7 @@ runs:
- name: Install Allure
shell: bash -euxo pipefail {0}
working-directory: /tmp
run: |
if ! which allure; then
ALLURE_ZIP=allure-${ALLURE_VERSION}.zip

View File

@@ -113,8 +113,6 @@ runs:
TEST_OUTPUT: /tmp/test_output
BUILD_TYPE: ${{ inputs.build_type }}
COMPATIBILITY_SNAPSHOT_DIR: /tmp/compatibility_snapshot_pg${{ inputs.pg_version }}
ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE: contains(github.event.pull_request.labels.*.name, 'backward compatibility breakage')
ALLOW_FORWARD_COMPATIBILITY_BREAKAGE: contains(github.event.pull_request.labels.*.name, 'forward compatibility breakage')
RERUN_FAILED: ${{ inputs.rerun_failed }}
PG_VERSION: ${{ inputs.pg_version }}
SANITIZERS: ${{ inputs.sanitizers }}

View File

@@ -272,10 +272,13 @@ jobs:
# run pageserver tests with different settings
for get_vectored_concurrent_io in sequential sidecar-task; do
for io_engine in std-fs tokio-epoll-uring ; do
NEON_PAGESERVER_UNIT_TEST_GET_VECTORED_CONCURRENT_IO=$get_vectored_concurrent_io \
NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IOENGINE=$io_engine \
${cov_prefix} \
cargo nextest run $CARGO_FLAGS $CARGO_FEATURES -E 'package(pageserver)'
for io_mode in buffered direct direct-rw ; do
NEON_PAGESERVER_UNIT_TEST_GET_VECTORED_CONCURRENT_IO=$get_vectored_concurrent_io \
NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IOENGINE=$io_engine \
NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IOMODE=$io_mode \
${cov_prefix} \
cargo nextest run $CARGO_FLAGS $CARGO_FEATURES -E 'package(pageserver)'
done
done
done
@@ -392,6 +395,7 @@ jobs:
BUILD_TAG: ${{ inputs.build-tag }}
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: tokio-epoll-uring
PAGESERVER_GET_VECTORED_CONCURRENT_IO: sidecar-task
PAGESERVER_VIRTUAL_FILE_IO_MODE: direct
USE_LFC: ${{ matrix.lfc_state == 'with-lfc' && 'true' || 'false' }}
# Temporary disable this step until we figure out why it's so flaky

View File

@@ -53,10 +53,13 @@ jobs:
|| inputs.component-name == 'Compute' && 'release-compute'
}}
run: |
today=$(date +'%Y-%m-%d')
echo "title=${COMPONENT_NAME} release ${today}" | tee -a ${GITHUB_OUTPUT}
echo "rc-branch=rc/${RELEASE_BRANCH}/${today}" | tee -a ${GITHUB_OUTPUT}
echo "release-branch=${RELEASE_BRANCH}" | tee -a ${GITHUB_OUTPUT}
now_date=$(date -u +'%Y-%m-%d')
now_time=$(date -u +'%H-%M-%Z')
{
echo "title=${COMPONENT_NAME} release ${now_date}"
echo "rc-branch=rc/${RELEASE_BRANCH}/${now_date}_${now_time}"
echo "release-branch=${RELEASE_BRANCH}"
} | tee -a ${GITHUB_OUTPUT}
- name: Configure git
run: |

View File

@@ -284,7 +284,7 @@ jobs:
statuses: write
contents: write
pull-requests: write
runs-on: [ self-hosted, small-metal ]
runs-on: [ self-hosted, unit-perf ]
container:
image: ${{ needs.build-build-tools-image.outputs.image }}-bookworm
credentials:
@@ -323,6 +323,8 @@ jobs:
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
TEST_RESULT_CONNSTR: "${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}"
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: tokio-epoll-uring
PAGESERVER_GET_VECTORED_CONCURRENT_IO: sidecar-task
PAGESERVER_VIRTUAL_FILE_IO_MODE: direct
SYNC_BETWEEN_TESTS: true
# XXX: no coverage data handling here, since benchmarks are run on release builds,
# while coverage is currently collected for the debug ones
@@ -1271,7 +1273,7 @@ jobs:
exit 1
deploy:
needs: [ check-permissions, push-neon-image-dev, push-compute-image-dev, push-neon-image-prod, push-compute-image-prod, meta, build-and-test-locally, trigger-custom-extensions-build-and-wait ]
needs: [ check-permissions, push-neon-image-dev, push-compute-image-dev, push-neon-image-prod, push-compute-image-prod, meta, trigger-custom-extensions-build-and-wait ]
# `!failure() && !cancelled()` is required because the workflow depends on the job that can be skipped: `push-neon-image-prod` and `push-compute-image-prod`
if: ${{ contains(fromJSON('["push-main", "storage-release", "proxy-release", "compute-release"]'), needs.meta.outputs.run-kind) && !failure() && !cancelled() }}
permissions:

View File

@@ -30,7 +30,7 @@ permissions:
statuses: write # require for posting a status update
env:
DEFAULT_PG_VERSION: 16
DEFAULT_PG_VERSION: 17
PLATFORM: neon-captest-new
AWS_DEFAULT_REGION: eu-central-1
@@ -42,6 +42,8 @@ jobs:
github-event-name: ${{ github.event_name }}
build-build-tools-image:
permissions:
packages: write
needs: [ check-permissions ]
uses: ./.github/workflows/build-build-tools-image.yml
secrets: inherit

28
Cargo.lock generated
View File

@@ -1416,6 +1416,7 @@ name = "control_plane"
version = "0.1.0"
dependencies = [
"anyhow",
"base64 0.13.1",
"camino",
"clap",
"comfy-table",
@@ -1425,10 +1426,12 @@ dependencies = [
"humantime",
"humantime-serde",
"hyper 0.14.30",
"jsonwebtoken",
"nix 0.27.1",
"once_cell",
"pageserver_api",
"pageserver_client",
"pem",
"postgres_backend",
"postgres_connection",
"regex",
@@ -1437,6 +1440,8 @@ dependencies = [
"scopeguard",
"serde",
"serde_json",
"sha2",
"spki 0.7.3",
"storage_broker",
"thiserror 1.0.69",
"tokio",
@@ -2817,6 +2822,7 @@ dependencies = [
"hyper 0.14.30",
"itertools 0.10.5",
"jemalloc_pprof",
"jsonwebtoken",
"metrics",
"once_cell",
"pprof",
@@ -4245,6 +4251,7 @@ dependencies = [
"arc-swap",
"async-compression",
"async-stream",
"base64 0.13.1",
"bincode",
"bit_field",
"byteorder",
@@ -4269,6 +4276,7 @@ dependencies = [
"hyper 0.14.30",
"indoc",
"itertools 0.10.5",
"jsonwebtoken",
"md5",
"metrics",
"nix 0.27.1",
@@ -4291,6 +4299,7 @@ dependencies = [
"rand 0.8.5",
"range-set-blaze",
"regex",
"remote_keys",
"remote_storage",
"reqwest",
"rpds",
@@ -4345,6 +4354,7 @@ dependencies = [
"humantime-serde",
"itertools 0.10.5",
"nix 0.27.1",
"once_cell",
"postgres_backend",
"postgres_ffi",
"rand 0.8.5",
@@ -5495,6 +5505,16 @@ version = "1.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c707298afce11da2efef2f600116fa93ffa7a032b5d7b628aa17711ec81383ca"
[[package]]
name = "remote_keys"
version = "0.1.0"
dependencies = [
"anyhow",
"rand 0.8.5",
"utils",
"workspace_hack",
]
[[package]]
name = "remote_storage"
version = "0.1.0"
@@ -5510,6 +5530,7 @@ dependencies = [
"azure_identity",
"azure_storage",
"azure_storage_blobs",
"base64 0.13.1",
"bytes",
"camino",
"camino-tempfile",
@@ -5520,6 +5541,7 @@ dependencies = [
"humantime-serde",
"hyper 1.4.1",
"itertools 0.10.5",
"md5",
"metrics",
"once_cell",
"pin-project-lite",
@@ -5685,9 +5707,9 @@ dependencies = [
[[package]]
name = "ring"
version = "0.17.13"
version = "0.17.14"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "70ac5d832aa16abd7d1def883a8545280c20a60f523a370aa3a9617c2b8550ee"
checksum = "a4689e6c2294d81e88dc6261c768b63bc4fcdb852be6d1352498b114f61383b7"
dependencies = [
"cc",
"cfg-if",
@@ -5988,6 +6010,7 @@ dependencies = [
"humantime",
"hyper 0.14.30",
"itertools 0.10.5",
"jsonwebtoken",
"metrics",
"once_cell",
"pageserver_api",
@@ -7872,6 +7895,7 @@ dependencies = [
"metrics",
"nix 0.27.1",
"once_cell",
"pem",
"pin-project-lite",
"postgres_connection",
"pprof",

View File

@@ -30,6 +30,7 @@ members = [
"libs/tenant_size_model",
"libs/metrics",
"libs/postgres_connection",
"libs/remote_keys",
"libs/remote_storage",
"libs/tracing-utils",
"libs/postgres_ffi/wal_craft",
@@ -141,6 +142,7 @@ parking_lot = "0.12"
parquet = { version = "53", default-features = false, features = ["zstd"] }
parquet_derive = "53"
pbkdf2 = { version = "0.12.1", features = ["simple", "std"] }
pem = "3.0.3"
pin-project-lite = "0.2"
pprof = { version = "0.14", features = ["criterion", "flamegraph", "frame-pointer", "prost-codec"] }
procfs = "0.16"
@@ -174,6 +176,7 @@ signal-hook = "0.3"
smallvec = "1.11"
smol_str = { version = "0.2.0", features = ["serde"] }
socket2 = "0.5"
spki = "0.7.3"
strum = "0.26"
strum_macros = "0.26"
"subtle" = "2.5.0"
@@ -253,6 +256,7 @@ postgres_ffi = { version = "0.1", path = "./libs/postgres_ffi/" }
postgres_initdb = { path = "./libs/postgres_initdb" }
pq_proto = { version = "0.1", path = "./libs/pq_proto/" }
remote_storage = { version = "0.1", path = "./libs/remote_storage/" }
remote_keys = { version = "0.1", path = "./libs/remote_keys/" }
safekeeper_api = { version = "0.1", path = "./libs/safekeeper_api" }
safekeeper_client = { path = "./safekeeper/client" }
desim = { version = "0.1", path = "./libs/desim" }

View File

@@ -270,7 +270,7 @@ By default, this runs both debug and release modes, and all supported postgres v
testing locally, it is convenient to run just one set of permutations, like this:
```sh
DEFAULT_PG_VERSION=16 BUILD_TYPE=release ./scripts/pytest
DEFAULT_PG_VERSION=17 BUILD_TYPE=release ./scripts/pytest
```
## Flamegraphs

View File

@@ -12,3 +12,5 @@ disallowed-macros = [
# cannot disallow this, because clippy finds used from tokio macros
#"tokio::pin",
]
allow-unwrap-in-tests = true

View File

@@ -15,7 +15,7 @@ index 7a4b88c..56678af 100644
HEADERS = src/halfvec.h src/sparsevec.h src/vector.h
diff --git a/src/hnswbuild.c b/src/hnswbuild.c
index b667478..dc95d89 100644
index b667478..1298aa1 100644
--- a/src/hnswbuild.c
+++ b/src/hnswbuild.c
@@ -843,9 +843,17 @@ HnswParallelBuildMain(dsm_segment *seg, shm_toc *toc)
@@ -36,7 +36,7 @@ index b667478..dc95d89 100644
/* Close relations within worker */
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
@@ -1100,12 +1108,39 @@ BuildIndex(Relation heap, Relation index, IndexInfo *indexInfo,
@@ -1100,13 +1108,25 @@ BuildIndex(Relation heap, Relation index, IndexInfo *indexInfo,
SeedRandom(42);
#endif
@@ -48,32 +48,17 @@ index b667478..dc95d89 100644
BuildGraph(buildstate, forkNum);
- if (RelationNeedsWAL(index) || forkNum == INIT_FORKNUM)
+#ifdef NEON_SMGR
+ smgr_finish_unlogged_build_phase_1(RelationGetSmgr(index));
+#endif
+
+ if (RelationNeedsWAL(index) || forkNum == INIT_FORKNUM) {
if (RelationNeedsWAL(index) || forkNum == INIT_FORKNUM)
log_newpage_range(index, forkNum, 0, RelationGetNumberOfBlocksInFork(index, forkNum), true);
+#ifdef NEON_SMGR
+ {
+#if PG_VERSION_NUM >= 160000
+ RelFileLocator rlocator = RelationGetSmgr(index)->smgr_rlocator.locator;
+#else
+ RelFileNode rlocator = RelationGetSmgr(index)->smgr_rnode.node;
+#endif
+ if (set_lwlsn_block_range_hook)
+ set_lwlsn_block_range_hook(XactLastRecEnd, rlocator,
+ MAIN_FORKNUM, 0, RelationGetNumberOfBlocks(index));
+ if (set_lwlsn_relation_hook)
+ set_lwlsn_relation_hook(XactLastRecEnd, rlocator, MAIN_FORKNUM);
+ }
+#endif
+ }
+
+#ifdef NEON_SMGR
+ smgr_end_unlogged_build(RelationGetSmgr(index));
+#endif
+
FreeBuildState(buildstate);
}

View File

@@ -1,5 +1,5 @@
diff --git a/src/ruminsert.c b/src/ruminsert.c
index 255e616..7a2240f 100644
index 255e616..1c6edb7 100644
--- a/src/ruminsert.c
+++ b/src/ruminsert.c
@@ -628,6 +628,10 @@ rumbuild(Relation heap, Relation index, struct IndexInfo *indexInfo)
@@ -24,24 +24,12 @@ index 255e616..7a2240f 100644
/*
* Write index to xlog
*/
@@ -713,6 +721,22 @@ rumbuild(Relation heap, Relation index, struct IndexInfo *indexInfo)
@@ -713,6 +721,10 @@ rumbuild(Relation heap, Relation index, struct IndexInfo *indexInfo)
UnlockReleaseBuffer(buffer);
}
+#ifdef NEON_SMGR
+ {
+#if PG_VERSION_NUM >= 160000
+ RelFileLocator rlocator = RelationGetSmgr(index)->smgr_rlocator.locator;
+#else
+ RelFileNode rlocator = RelationGetSmgr(index)->smgr_rnode.node;
+#endif
+ if (set_lwlsn_block_range_hook)
+ set_lwlsn_block_range_hook(XactLastRecEnd, rlocator, MAIN_FORKNUM, 0, RelationGetNumberOfBlocks(index));
+ if (set_lwlsn_relation_hook)
+ set_lwlsn_relation_hook(XactLastRecEnd, rlocator, MAIN_FORKNUM);
+
+ smgr_end_unlogged_build(index->rd_smgr);
+ }
+ smgr_end_unlogged_build(index->rd_smgr);
+#endif
+
/*

View File

@@ -22,7 +22,7 @@ commands:
- name: local_proxy
user: postgres
sysvInitAction: respawn
shell: '/usr/local/bin/local_proxy --config-path /etc/local_proxy/config.json --pid-path /etc/local_proxy/pid --http 0.0.0.0:10432'
shell: 'RUST_LOG="info,proxy::serverless::sql_over_http=warn" /usr/local/bin/local_proxy --config-path /etc/local_proxy/config.json --pid-path /etc/local_proxy/pid --http 0.0.0.0:10432'
- name: postgres-exporter
user: nobody
sysvInitAction: respawn

View File

@@ -22,7 +22,7 @@ commands:
- name: local_proxy
user: postgres
sysvInitAction: respawn
shell: '/usr/local/bin/local_proxy --config-path /etc/local_proxy/config.json --pid-path /etc/local_proxy/pid --http 0.0.0.0:10432'
shell: 'RUST_LOG="info,proxy::serverless::sql_over_http=warn" /usr/local/bin/local_proxy --config-path /etc/local_proxy/config.json --pid-path /etc/local_proxy/pid --http 0.0.0.0:10432'
- name: postgres-exporter
user: nobody
sysvInitAction: respawn

View File

@@ -57,24 +57,13 @@ use tracing::{error, info};
use url::Url;
use utils::failpoint_support;
// Compatibility hack: if the control plane specified any remote-ext-config
// use the default value for extension storage proxy gateway.
// Remove this once the control plane is updated to pass the gateway URL
fn parse_remote_ext_config(arg: &str) -> Result<String> {
if arg.starts_with("http") {
Ok(arg.trim_end_matches('/').to_string())
} else {
Ok("http://pg-ext-s3-gateway".to_string())
}
}
#[derive(Parser)]
#[command(rename_all = "kebab-case")]
struct Cli {
#[arg(short = 'b', long, default_value = "postgres", env = "POSTGRES_PATH")]
pub pgbin: String,
#[arg(short = 'r', long, value_parser = parse_remote_ext_config)]
#[arg(short = 'r', long)]
pub remote_ext_config: Option<String>,
/// The port to bind the external listening HTTP server to. Clients running
@@ -116,9 +105,7 @@ struct Cli {
#[arg(long)]
pub set_disk_quota_for_fs: Option<String>,
// TODO(tristan957): remove alias after compatibility tests are no longer
// an issue
#[arg(short = 'c', long, alias = "spec-path")]
#[arg(short = 'c', long)]
pub config: Option<OsString>,
#[arg(short = 'i', long, group = "compute-id")]

View File

@@ -6,4 +6,5 @@ pub(crate) mod request_id;
pub(crate) use json::Json;
pub(crate) use path::Path;
pub(crate) use query::Query;
#[allow(unused)]
pub(crate) use request_id::RequestId;

View File

@@ -1,7 +1,7 @@
use std::{collections::HashSet, net::SocketAddr};
use std::collections::HashSet;
use anyhow::{Result, anyhow};
use axum::{RequestExt, body::Body, extract::ConnectInfo};
use axum::{RequestExt, body::Body};
use axum_extra::{
TypedHeader,
headers::{Authorization, authorization::Bearer},
@@ -13,7 +13,7 @@ use jsonwebtoken::{Algorithm, DecodingKey, TokenData, Validation, jwk::JwkSet};
use tower_http::auth::AsyncAuthorizeRequest;
use tracing::{debug, warn};
use crate::http::{JsonResponse, extract::RequestId};
use crate::http::JsonResponse;
#[derive(Clone, Debug)]
pub(in crate::http) struct Authorize {
@@ -52,31 +52,6 @@ impl AsyncAuthorizeRequest<Body> for Authorize {
let validation = self.validation.clone();
Box::pin(async move {
let request_id = request.extract_parts::<RequestId>().await.unwrap();
// TODO: Remove this stanza after teaching neon_local and the
// regression tests to use a JWT + JWKS.
//
// https://github.com/neondatabase/neon/issues/11316
if cfg!(feature = "testing") {
warn!(%request_id, "Skipping compute_ctl authorization check");
return Ok(request);
}
let connect_info = request
.extract_parts::<ConnectInfo<SocketAddr>>()
.await
.unwrap();
// In the event the request is coming from the loopback interface,
// allow all requests
if connect_info.ip().is_loopback() {
warn!(%request_id, "Bypassed authorization because request is coming from the loopback interface");
return Ok(request);
}
let TypedHeader(Authorization(bearer)) = request
.extract_parts::<TypedHeader<Authorization<Bearer>>>()
.await
@@ -112,6 +87,8 @@ impl Authorize {
token: &str,
validation: &Validation,
) -> Result<TokenData<ComputeClaims>> {
debug_assert!(!jwks.keys.is_empty());
debug!("verifying token {}", token);
for jwk in jwks.keys.iter() {

View File

@@ -6,13 +6,16 @@ license.workspace = true
[dependencies]
anyhow.workspace = true
base64.workspace = true
camino.workspace = true
clap.workspace = true
comfy-table.workspace = true
futures.workspace = true
humantime.workspace = true
jsonwebtoken.workspace = true
nix.workspace = true
once_cell.workspace = true
pem.workspace = true
humantime-serde.workspace = true
hyper0.workspace = true
regex.workspace = true
@@ -20,6 +23,8 @@ reqwest = { workspace = true, features = ["blocking", "json"] }
scopeguard.workspace = true
serde.workspace = true
serde_json.workspace = true
sha2.workspace = true
spki.workspace = true
thiserror.workspace = true
toml.workspace = true
toml_edit.workspace = true

View File

@@ -63,7 +63,7 @@ const DEFAULT_PAGESERVER_ID: NodeId = NodeId(1);
const DEFAULT_BRANCH_NAME: &str = "main";
project_git_version!(GIT_VERSION);
const DEFAULT_PG_VERSION: u32 = 16;
const DEFAULT_PG_VERSION: u32 = 17;
const DEFAULT_PAGESERVER_CONTROL_PLANE_API: &str = "http://127.0.0.1:1234/upcall/v1/";
@@ -552,6 +552,7 @@ enum EndpointCmd {
Start(EndpointStartCmdArgs),
Reconfigure(EndpointReconfigureCmdArgs),
Stop(EndpointStopCmdArgs),
GenerateJwt(EndpointGenerateJwtCmdArgs),
}
#[derive(clap::Args)]
@@ -699,6 +700,13 @@ struct EndpointStopCmdArgs {
mode: String,
}
#[derive(clap::Args)]
#[clap(about = "Generate a JWT for an endpoint")]
struct EndpointGenerateJwtCmdArgs {
#[clap(help = "Postgres endpoint id")]
endpoint_id: String,
}
#[derive(clap::Subcommand)]
#[clap(about = "Manage neon_local branch name mappings")]
enum MappingsCmd {
@@ -1528,6 +1536,16 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
.with_context(|| format!("postgres endpoint {endpoint_id} is not found"))?;
endpoint.stop(&args.mode, args.destroy)?;
}
EndpointCmd::GenerateJwt(args) => {
let endpoint_id = &args.endpoint_id;
let endpoint = cplane
.endpoints
.get(endpoint_id)
.with_context(|| format!("postgres endpoint {endpoint_id} is not found"))?;
let jwt = endpoint.generate_jwt()?;
print!("{jwt}");
}
}
Ok(())

View File

@@ -42,22 +42,30 @@ use std::path::PathBuf;
use std::process::Command;
use std::str::FromStr;
use std::sync::Arc;
use std::time::{Duration, Instant, SystemTime, UNIX_EPOCH};
use std::time::{Duration, Instant};
use anyhow::{Context, Result, anyhow, bail};
use compute_api::requests::ConfigurationRequest;
use compute_api::requests::{ComputeClaims, ConfigurationRequest};
use compute_api::responses::{
ComputeConfig, ComputeCtlConfig, ComputeStatus, ComputeStatusResponse,
ComputeConfig, ComputeCtlConfig, ComputeStatus, ComputeStatusResponse, TlsConfig,
};
use compute_api::spec::{
Cluster, ComputeAudit, ComputeFeature, ComputeMode, ComputeSpec, Database, PgIdent,
RemoteExtSpec, Role,
};
use jsonwebtoken::jwk::{
AlgorithmParameters, CommonParameters, EllipticCurve, Jwk, JwkSet, KeyAlgorithm, KeyOperations,
OctetKeyPairParameters, OctetKeyPairType, PublicKeyUse,
};
use nix::sys::signal::{Signal, kill};
use pageserver_api::shard::ShardStripeSize;
use pem::Pem;
use reqwest::header::CONTENT_TYPE;
use safekeeper_api::membership::SafekeeperGeneration;
use serde::{Deserialize, Serialize};
use sha2::{Digest, Sha256};
use spki::der::Decode;
use spki::{SubjectPublicKeyInfo, SubjectPublicKeyInfoRef};
use tracing::debug;
use url::Host;
use utils::id::{NodeId, TenantId, TimelineId};
@@ -82,6 +90,7 @@ pub struct EndpointConf {
drop_subscriptions_before_start: bool,
features: Vec<ComputeFeature>,
cluster: Option<Cluster>,
compute_ctl_config: ComputeCtlConfig,
}
//
@@ -137,6 +146,37 @@ impl ComputeControlPlane {
.unwrap_or(self.base_port)
}
/// Create a JSON Web Key Set. This ideally matches the way we create a JWKS
/// from the production control plane.
fn create_jwks_from_pem(pem: &Pem) -> Result<JwkSet> {
let spki: SubjectPublicKeyInfoRef = SubjectPublicKeyInfo::from_der(pem.contents())?;
let public_key = spki.subject_public_key.raw_bytes();
let mut hasher = Sha256::new();
hasher.update(public_key);
let key_hash = hasher.finalize();
Ok(JwkSet {
keys: vec![Jwk {
common: CommonParameters {
public_key_use: Some(PublicKeyUse::Signature),
key_operations: Some(vec![KeyOperations::Verify]),
key_algorithm: Some(KeyAlgorithm::EdDSA),
key_id: Some(base64::encode_config(key_hash, base64::URL_SAFE_NO_PAD)),
x509_url: None::<String>,
x509_chain: None::<Vec<String>>,
x509_sha1_fingerprint: None::<String>,
x509_sha256_fingerprint: None::<String>,
},
algorithm: AlgorithmParameters::OctetKeyPair(OctetKeyPairParameters {
key_type: OctetKeyPairType::OctetKeyPair,
curve: EllipticCurve::Ed25519,
x: base64::encode_config(public_key, base64::URL_SAFE_NO_PAD),
}),
}],
})
}
#[allow(clippy::too_many_arguments)]
pub fn new_endpoint(
&mut self,
@@ -154,6 +194,10 @@ impl ComputeControlPlane {
let pg_port = pg_port.unwrap_or_else(|| self.get_port());
let external_http_port = external_http_port.unwrap_or_else(|| self.get_port() + 1);
let internal_http_port = internal_http_port.unwrap_or_else(|| external_http_port + 1);
let compute_ctl_config = ComputeCtlConfig {
jwks: Self::create_jwks_from_pem(&self.env.read_public_key()?)?,
tls: None::<TlsConfig>,
};
let ep = Arc::new(Endpoint {
endpoint_id: endpoint_id.to_owned(),
pg_address: SocketAddr::new(IpAddr::from(Ipv4Addr::LOCALHOST), pg_port),
@@ -181,6 +225,7 @@ impl ComputeControlPlane {
reconfigure_concurrency: 1,
features: vec![],
cluster: None,
compute_ctl_config: compute_ctl_config.clone(),
});
ep.create_endpoint_dir()?;
@@ -200,6 +245,7 @@ impl ComputeControlPlane {
reconfigure_concurrency: 1,
features: vec![],
cluster: None,
compute_ctl_config,
})?,
)?;
std::fs::write(
@@ -242,7 +288,6 @@ impl ComputeControlPlane {
///////////////////////////////////////////////////////////////////////////////
#[derive(Debug)]
pub struct Endpoint {
/// used as the directory name
endpoint_id: String,
@@ -271,6 +316,9 @@ pub struct Endpoint {
features: Vec<ComputeFeature>,
// Cluster settings
cluster: Option<Cluster>,
/// The compute_ctl config for the endpoint's compute.
compute_ctl_config: ComputeCtlConfig,
}
#[derive(PartialEq, Eq)]
@@ -333,6 +381,7 @@ impl Endpoint {
drop_subscriptions_before_start: conf.drop_subscriptions_before_start,
features: conf.features,
cluster: conf.cluster,
compute_ctl_config: conf.compute_ctl_config,
})
}
@@ -580,6 +629,13 @@ impl Endpoint {
Ok(safekeeper_connstrings)
}
/// Generate a JWT with the correct claims.
pub fn generate_jwt(&self) -> Result<String> {
self.env.generate_auth_token(&ComputeClaims {
compute_id: self.endpoint_id.clone(),
})
}
#[allow(clippy::too_many_arguments)]
pub async fn start(
&self,
@@ -706,14 +762,10 @@ impl Endpoint {
ComputeConfig {
spec: Some(spec),
compute_ctl_config: ComputeCtlConfig::default(),
compute_ctl_config: self.compute_ctl_config.clone(),
}
};
// TODO(tristan957): Remove the write to spec.json after compatibility
// tests work themselves out
let spec_path = self.endpoint_path().join("spec.json");
std::fs::write(spec_path, serde_json::to_string_pretty(&config.spec)?)?;
let config_path = self.endpoint_path().join("config.json");
std::fs::write(config_path, serde_json::to_string_pretty(&config)?)?;
@@ -723,16 +775,6 @@ impl Endpoint {
.append(true)
.open(self.endpoint_path().join("compute.log"))?;
// TODO(tristan957): Remove when compatibility tests are no longer an
// issue
let old_compute_ctl = {
let mut cmd = Command::new(self.env.neon_distrib_dir.join("compute_ctl"));
let help_output = cmd.arg("--help").output()?;
let help_output = String::from_utf8_lossy(&help_output.stdout);
!help_output.contains("--config")
};
// Launch compute_ctl
let conn_str = self.connstr("cloud_admin", "postgres");
println!("Starting postgres node at '{}'", conn_str);
@@ -751,19 +793,8 @@ impl Endpoint {
])
.args(["--pgdata", self.pgdata().to_str().unwrap()])
.args(["--connstr", &conn_str])
// TODO(tristan957): Change this to --config when compatibility tests
// are no longer an issue
.args([
"--spec-path",
self.endpoint_path()
.join(if old_compute_ctl {
"spec.json"
} else {
"config.json"
})
.to_str()
.unwrap(),
])
.arg("--config")
.arg(self.endpoint_path().join("config.json").as_os_str())
.args([
"--pgbin",
self.env
@@ -774,16 +805,7 @@ impl Endpoint {
])
// TODO: It would be nice if we generated compute IDs with the same
// algorithm as the real control plane.
.args([
"--compute-id",
&format!(
"compute-{}",
SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap()
.as_secs()
),
])
.args(["--compute-id", &self.endpoint_id])
.stdin(std::process::Stdio::null())
.stderr(logfile.try_clone()?)
.stdout(logfile);
@@ -881,6 +903,7 @@ impl Endpoint {
self.external_http_address.port()
),
)
.bearer_auth(self.generate_jwt()?)
.send()
.await?;
@@ -957,6 +980,7 @@ impl Endpoint {
self.external_http_address.port()
))
.header(CONTENT_TYPE.as_str(), "application/json")
.bearer_auth(self.generate_jwt()?)
.body(
serde_json::to_string(&ConfigurationRequest {
spec,

View File

@@ -12,6 +12,7 @@ use std::{env, fs};
use anyhow::{Context, bail};
use clap::ValueEnum;
use pem::Pem;
use postgres_backend::AuthType;
use reqwest::Url;
use serde::{Deserialize, Serialize};
@@ -22,7 +23,7 @@ use crate::object_storage::{OBJECT_STORAGE_REMOTE_STORAGE_DIR, ObjectStorage};
use crate::pageserver::{PAGESERVER_REMOTE_STORAGE_DIR, PageServerNode};
use crate::safekeeper::SafekeeperNode;
pub const DEFAULT_PG_VERSION: u32 = 16;
pub const DEFAULT_PG_VERSION: u32 = 17;
//
// This data structures represents neon_local CLI config
@@ -56,6 +57,7 @@ pub struct LocalEnv {
// used to issue tokens during e.g pg start
pub private_key_path: PathBuf,
/// Path to environment's public key
pub public_key_path: PathBuf,
pub broker: NeonBroker,
@@ -758,11 +760,11 @@ impl LocalEnv {
// this function is used only for testing purposes in CLI e g generate tokens during init
pub fn generate_auth_token<S: Serialize>(&self, claims: &S) -> anyhow::Result<String> {
let private_key_path = self.get_private_key_path();
let key_data = fs::read(private_key_path)?;
encode_from_key_file(claims, &key_data)
let key = self.read_private_key()?;
encode_from_key_file(claims, &key)
}
/// Get the path to the private key.
pub fn get_private_key_path(&self) -> PathBuf {
if self.private_key_path.is_absolute() {
self.private_key_path.to_path_buf()
@@ -771,6 +773,29 @@ impl LocalEnv {
}
}
/// Get the path to the public key.
pub fn get_public_key_path(&self) -> PathBuf {
if self.public_key_path.is_absolute() {
self.public_key_path.to_path_buf()
} else {
self.base_data_dir.join(&self.public_key_path)
}
}
/// Read the contents of the private key file.
pub fn read_private_key(&self) -> anyhow::Result<Pem> {
let private_key_path = self.get_private_key_path();
let pem = pem::parse(fs::read(private_key_path)?)?;
Ok(pem)
}
/// Read the contents of the public key file.
pub fn read_public_key(&self) -> anyhow::Result<Pem> {
let public_key_path = self.get_public_key_path();
let pem = pem::parse(fs::read(public_key_path)?)?;
Ok(pem)
}
/// Materialize the [`NeonLocalInitConf`] to disk. Called during [`neon_local init`].
pub fn init(conf: NeonLocalInitConf, force: &InitForceMode) -> anyhow::Result<()> {
let base_path = base_path();
@@ -956,6 +981,7 @@ fn generate_auth_keys(private_key_path: &Path, public_key_path: &Path) -> anyhow
String::from_utf8_lossy(&keygen_output.stderr)
);
}
// Extract the public key from the private key file
//
// openssl pkey -in auth_private_key.pem -pubout -out auth_public_key.pem
@@ -972,6 +998,7 @@ fn generate_auth_keys(private_key_path: &Path, public_key_path: &Path) -> anyhow
String::from_utf8_lossy(&keygen_output.stderr)
);
}
Ok(())
}

View File

@@ -413,6 +413,11 @@ impl PageServerNode {
.map(serde_json::from_str)
.transpose()
.context("Failed to parse 'compaction_algorithm' json")?,
compaction_shard_ancestor: settings
.remove("compaction_shard_ancestor")
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'compaction_shard_ancestor' as a bool")?,
compaction_l0_first: settings
.remove("compaction_l0_first")
.map(|x| x.parse::<bool>())

View File

@@ -18,6 +18,7 @@ use pageserver_api::models::{
};
use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api::ResponseErrorMessageExt;
use pem::Pem;
use postgres_backend::AuthType;
use reqwest::{Certificate, Method};
use serde::de::DeserializeOwned;
@@ -34,8 +35,8 @@ use crate::local_env::{LocalEnv, NeonStorageControllerConf};
pub struct StorageController {
env: LocalEnv,
private_key: Option<Vec<u8>>,
public_key: Option<String>,
private_key: Option<Pem>,
public_key: Option<Pem>,
client: reqwest::Client,
config: NeonStorageControllerConf,
@@ -116,7 +117,9 @@ impl StorageController {
AuthType::Trust => (None, None),
AuthType::NeonJWT => {
let private_key_path = env.get_private_key_path();
let private_key = fs::read(private_key_path).expect("failed to read private key");
let private_key =
pem::parse(fs::read(private_key_path).expect("failed to read private key"))
.expect("failed to parse PEM file");
// If pageserver auth is enabled, this implicitly enables auth for this service,
// using the same credentials.
@@ -138,9 +141,13 @@ impl StorageController {
.expect("Empty key dir")
.expect("Error reading key dir");
std::fs::read_to_string(dent.path()).expect("Can't read public key")
pem::parse(std::fs::read_to_string(dent.path()).expect("Can't read public key"))
.expect("Failed to parse PEM file")
} else {
std::fs::read_to_string(&public_key_path).expect("Can't read public key")
pem::parse(
std::fs::read_to_string(&public_key_path).expect("Can't read public key"),
)
.expect("Failed to parse PEM file")
};
(Some(private_key), Some(public_key))
}

View File

@@ -1,4 +1,3 @@
# Example docker compose configuration
The configuration in this directory is used for testing Neon docker images: it is
@@ -8,3 +7,13 @@ you can experiment with a miniature Neon system, use `cargo neon` rather than co
This configuration does not start the storage controller, because the controller
needs a way to reconfigure running computes, and no such thing exists in this setup.
## Generating the JWKS for a compute
```shell
openssl genpkey -algorithm Ed25519 -out private-key.pem
openssl pkey -in private-key.pem -pubout -out public-key.pem
openssl pkey -pubin -inform pem -in public-key.pem -pubout -outform der -out public-key.der
key="$(xxd -plain -cols 32 -s -32 public-key.der)"
key_id="$(printf '%s' "$key" | sha256sum | awk '{ print $1 }' | basenc --base64url --wrap=0)"
x="$(printf '%s' "$key" | basenc --base64url --wrap=0)"
```

View File

@@ -0,0 +1,3 @@
-----BEGIN PRIVATE KEY-----
MC4CAQAwBQYDK2VwBCIEIOmnRbzt2AJ0d+S3aU1hiYOl/tXpvz1FmWBfwHYBgOma
-----END PRIVATE KEY-----

Binary file not shown.

View File

@@ -0,0 +1,3 @@
-----BEGIN PUBLIC KEY-----
MCowBQYDK2VwAyEADY0al/U0bgB3+9fUGk+3PKWnsck9OyxN5DjHIN6Xep0=
-----END PUBLIC KEY-----

View File

@@ -81,19 +81,9 @@ sed -i "s/TIMELINE_ID/${timeline_id}/" ${CONFIG_FILE}
cat ${CONFIG_FILE}
# TODO(tristan957): Remove these workarounds for backwards compatibility after
# the next compute release. That includes these next few lines and the
# --spec-path in the compute_ctl invocation.
if compute_ctl --help | grep --quiet -- '--config'; then
SPEC_PATH="$CONFIG_FILE"
else
jq '.spec' < "$CONFIG_FILE" > /tmp/spec.json
SPEC_PATH=/tmp/spec.json
fi
echo "Start compute node"
/usr/local/bin/compute_ctl --pgdata /var/db/postgres/compute \
-C "postgresql://cloud_admin@localhost:55433/postgres" \
-b /usr/local/bin/postgres \
--compute-id "compute-$RANDOM" \
--spec-path "$SPEC_PATH"
--config "$CONFIG_FILE"

View File

@@ -142,7 +142,19 @@
},
"compute_ctl_config": {
"jwks": {
"keys": []
"keys": [
{
"use": "sig",
"key_ops": [
"verify"
],
"alg": "EdDSA",
"kid": "ZGIxMzAzOGY0YWQwODk2ODU1MTk1NzMxMDFkYmUyOWU2NzZkOWNjNjMyMGRkZGJjOWY0MjdjYWVmNzE1MjUyOAo=",
"kty": "OKP",
"crv": "Ed25519",
"x": "MGQ4ZDFhOTdmNTM0NmUwMDc3ZmJkN2Q0MWE0ZmI3M2NhNWE3YjFjOTNkM2IyYzRkZTQzOGM3MjBkZTk3N2E5ZAo="
}
]
}
}
}

View File

@@ -0,0 +1,8 @@
EXTENSION = pg_jsonschema
DATA = pg_jsonschema--1.0.sql
REGRESS = jsonschema_valid_api jsonschema_edge_cases
REGRESS_OPTS = --load-extension=pg_jsonschema
PG_CONFIG ?= pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

View File

@@ -0,0 +1,87 @@
-- Schema with enums, nulls, extra properties disallowed
SELECT jsonschema_is_valid('{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json);
jsonschema_is_valid
---------------------
t
(1 row)
-- Valid enum and null email
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "active", "email": null}'::json
);
jsonschema_validation_errors
------------------------------
{}
(1 row)
-- Invalid enum value
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "disabled", "email": null}'::json
);
jsonschema_validation_errors
----------------------------------------------------------------------
{"\"disabled\" is not one of [\"active\",\"inactive\",\"pending\"]"}
(1 row)
-- Invalid email format (assuming format is validated)
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "active", "email": "not-an-email"}'::json
);
jsonschema_validation_errors
-----------------------------------------
{"\"not-an-email\" is not a \"email\""}
(1 row)
-- Extra property not allowed
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "active", "extra": "should not be here"}'::json
);
jsonschema_validation_errors
--------------------------------------------------------------------
{"Additional properties are not allowed ('extra' was unexpected)"}
(1 row)

View File

@@ -0,0 +1,65 @@
-- Define schema
SELECT jsonschema_is_valid('{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json);
jsonschema_is_valid
---------------------
t
(1 row)
-- Valid instance
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json,
'{"username": "alice", "age": 25}'::json
);
jsonschema_validation_errors
------------------------------
{}
(1 row)
-- Invalid instance: missing required "username"
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json,
'{"age": 25}'::json
);
jsonschema_validation_errors
-----------------------------------------
{"\"username\" is a required property"}
(1 row)
-- Invalid instance: wrong type for "age"
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json,
'{"username": "bob", "age": "twenty"}'::json
);
jsonschema_validation_errors
-------------------------------------------
{"\"twenty\" is not of type \"integer\""}
(1 row)

View File

@@ -0,0 +1,66 @@
-- Schema with enums, nulls, extra properties disallowed
SELECT jsonschema_is_valid('{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json);
-- Valid enum and null email
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "active", "email": null}'::json
);
-- Invalid enum value
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "disabled", "email": null}'::json
);
-- Invalid email format (assuming format is validated)
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "active", "email": "not-an-email"}'::json
);
-- Extra property not allowed
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "active", "extra": "should not be here"}'::json
);

View File

@@ -0,0 +1,48 @@
-- Define schema
SELECT jsonschema_is_valid('{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json);
-- Valid instance
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json,
'{"username": "alice", "age": 25}'::json
);
-- Invalid instance: missing required "username"
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json,
'{"age": 25}'::json
);
-- Invalid instance: wrong type for "age"
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json,
'{"username": "bob", "age": "twenty"}'::json
);

View File

@@ -0,0 +1,9 @@
EXTENSION = pg_session_jwt
REGRESS = basic_functions
REGRESS_OPTS = --load-extension=$(EXTENSION)
export PGOPTIONS = -c pg_session_jwt.jwk={"crv":"Ed25519","kty":"OKP","x":"R_Abz-63zJ00l-IraL5fQhwkhGVZCSooQFV5ntC3C7M"}
PG_CONFIG ?= pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

View File

@@ -0,0 +1,35 @@
-- Basic functionality tests for pg_session_jwt
-- Test auth.init() function
SELECT auth.init();
init
------
(1 row)
-- Test an invalid JWT
SELECT auth.jwt_session_init('INVALID-JWT');
ERROR: invalid JWT encoding
-- Test creating a session with an expired JWT
SELECT auth.jwt_session_init('eyJhbGciOiJFZERTQSJ9.eyJleHAiOjE3NDI1NjQ0MzIsImlhdCI6MTc0MjU2NDI1MiwianRpIjo0MjQyNDIsInN1YiI6InVzZXIxMjMifQ.A6FwKuaSduHB9O7Gz37g0uoD_U9qVS0JNtT7YABGVgB7HUD1AMFc9DeyhNntWBqncg8k5brv-hrNTuUh5JYMAw');
ERROR: Token used after it has expired
-- Test creating a session with a valid JWT
SELECT auth.jwt_session_init('eyJhbGciOiJFZERTQSJ9.eyJleHAiOjQ4OTYxNjQyNTIsImlhdCI6MTc0MjU2NDI1MiwianRpIjo0MzQzNDMsInN1YiI6InVzZXIxMjMifQ.2TXVgjb6JSUq6_adlvp-m_SdOxZSyGS30RS9TLB0xu2N83dMSs2NybwE1NMU8Fb0tcAZR_ET7M2rSxbTrphfCg');
jwt_session_init
------------------
(1 row)
-- Test auth.session() function
SELECT auth.session();
session
-------------------------------------------------------------------------
{"exp": 4896164252, "iat": 1742564252, "jti": 434343, "sub": "user123"}
(1 row)
-- Test auth.user_id() function
SELECT auth.user_id() AS user_id;
user_id
---------
user123
(1 row)

View File

@@ -0,0 +1,19 @@
-- Basic functionality tests for pg_session_jwt
-- Test auth.init() function
SELECT auth.init();
-- Test an invalid JWT
SELECT auth.jwt_session_init('INVALID-JWT');
-- Test creating a session with an expired JWT
SELECT auth.jwt_session_init('eyJhbGciOiJFZERTQSJ9.eyJleHAiOjE3NDI1NjQ0MzIsImlhdCI6MTc0MjU2NDI1MiwianRpIjo0MjQyNDIsInN1YiI6InVzZXIxMjMifQ.A6FwKuaSduHB9O7Gz37g0uoD_U9qVS0JNtT7YABGVgB7HUD1AMFc9DeyhNntWBqncg8k5brv-hrNTuUh5JYMAw');
-- Test creating a session with a valid JWT
SELECT auth.jwt_session_init('eyJhbGciOiJFZERTQSJ9.eyJleHAiOjQ4OTYxNjQyNTIsImlhdCI6MTc0MjU2NDI1MiwianRpIjo0MzQzNDMsInN1YiI6InVzZXIxMjMifQ.2TXVgjb6JSUq6_adlvp-m_SdOxZSyGS30RS9TLB0xu2N83dMSs2NybwE1NMU8Fb0tcAZR_ET7M2rSxbTrphfCg');
-- Test auth.session() function
SELECT auth.session();
-- Test auth.user_id() function
SELECT auth.user_id() AS user_id;

View File

@@ -160,7 +160,7 @@ pub struct CatalogObjects {
pub databases: Vec<Database>,
}
#[derive(Clone, Debug, Deserialize, Serialize)]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct ComputeCtlConfig {
/// Set of JSON web keys that the compute can use to authenticate
/// communication from the control plane.
@@ -179,7 +179,7 @@ impl Default for ComputeCtlConfig {
}
}
#[derive(Clone, Debug, Deserialize, Serialize)]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct TlsConfig {
pub key_path: String,
pub cert_path: String,

View File

@@ -14,6 +14,7 @@ futures.workspace = true
hyper0.workspace = true
itertools.workspace = true
jemalloc_pprof.workspace = true
jsonwebtoken.workspace = true
once_cell.workspace = true
pprof.workspace = true
regex.workspace = true

View File

@@ -8,6 +8,7 @@ use bytes::{Bytes, BytesMut};
use hyper::header::{AUTHORIZATION, CONTENT_DISPOSITION, CONTENT_TYPE, HeaderName};
use hyper::http::HeaderValue;
use hyper::{Body, Method, Request, Response};
use jsonwebtoken::TokenData;
use metrics::{Encoder, IntCounter, TextEncoder, register_int_counter};
use once_cell::sync::Lazy;
use pprof::ProfilerGuardBuilder;
@@ -618,7 +619,7 @@ pub fn auth_middleware<B: hyper::body::HttpBody + Send + Sync + 'static>(
})?;
let token = parse_token(header_value)?;
let data = auth.decode(token).map_err(|err| {
let data: TokenData<Claims> = auth.decode(token).map_err(|err| {
warn!("Authentication error: {err}");
// Rely on From<AuthError> for ApiError impl
err

View File

@@ -35,6 +35,7 @@ nix = {workspace = true, optional = true}
reqwest.workspace = true
rand.workspace = true
tracing-utils.workspace = true
once_cell.workspace = true
[dev-dependencies]
bincode.workspace = true

View File

@@ -379,6 +379,8 @@ pub struct TenantConfigToml {
/// size exceeds `compaction_upper_limit * checkpoint_distance`.
pub compaction_upper_limit: usize,
pub compaction_algorithm: crate::models::CompactionAlgorithmSettings,
/// If true, enable shard ancestor compaction (enabled by default).
pub compaction_shard_ancestor: bool,
/// If true, compact down L0 across all tenant timelines before doing regular compaction. L0
/// compaction must be responsive to avoid read amp during heavy ingestion. Defaults to true.
pub compaction_l0_first: bool,
@@ -677,6 +679,7 @@ pub mod tenant_conf_defaults {
pub const DEFAULT_COMPACTION_PERIOD: &str = "20 s";
pub const DEFAULT_COMPACTION_THRESHOLD: usize = 10;
pub const DEFAULT_COMPACTION_SHARD_ANCESTOR: bool = true;
// This value needs to be tuned to avoid OOM. We have 3/4*CPUs threads for L0 compaction, that's
// 3/4*16=9 on most of our pageservers. Compacting 20 layers requires about 1 GB memory (could
@@ -734,6 +737,7 @@ impl Default for TenantConfigToml {
compaction_algorithm: crate::models::CompactionAlgorithmSettings {
kind: DEFAULT_COMPACTION_ALGORITHM,
},
compaction_shard_ancestor: DEFAULT_COMPACTION_SHARD_ANCESTOR,
compaction_l0_first: DEFAULT_COMPACTION_L0_FIRST,
compaction_l0_semaphore: DEFAULT_COMPACTION_L0_SEMAPHORE,
l0_flush_delay_threshold: None,

View File

@@ -526,6 +526,8 @@ pub struct TenantConfigPatch {
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub compaction_algorithm: FieldPatch<CompactionAlgorithmSettings>,
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub compaction_shard_ancestor: FieldPatch<bool>,
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub compaction_l0_first: FieldPatch<bool>,
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub compaction_l0_semaphore: FieldPatch<bool>,
@@ -615,6 +617,9 @@ pub struct TenantConfig {
#[serde(skip_serializing_if = "Option::is_none")]
pub compaction_algorithm: Option<CompactionAlgorithmSettings>,
#[serde(skip_serializing_if = "Option::is_none")]
pub compaction_shard_ancestor: Option<bool>,
#[serde(skip_serializing_if = "Option::is_none")]
pub compaction_l0_first: Option<bool>,
@@ -724,6 +729,7 @@ impl TenantConfig {
mut compaction_threshold,
mut compaction_upper_limit,
mut compaction_algorithm,
mut compaction_shard_ancestor,
mut compaction_l0_first,
mut compaction_l0_semaphore,
mut l0_flush_delay_threshold,
@@ -772,6 +778,9 @@ impl TenantConfig {
.compaction_upper_limit
.apply(&mut compaction_upper_limit);
patch.compaction_algorithm.apply(&mut compaction_algorithm);
patch
.compaction_shard_ancestor
.apply(&mut compaction_shard_ancestor);
patch.compaction_l0_first.apply(&mut compaction_l0_first);
patch
.compaction_l0_semaphore
@@ -860,6 +869,7 @@ impl TenantConfig {
compaction_threshold,
compaction_upper_limit,
compaction_algorithm,
compaction_shard_ancestor,
compaction_l0_first,
compaction_l0_semaphore,
l0_flush_delay_threshold,
@@ -920,6 +930,9 @@ impl TenantConfig {
.as_ref()
.unwrap_or(&global_conf.compaction_algorithm)
.clone(),
compaction_shard_ancestor: self
.compaction_shard_ancestor
.unwrap_or(global_conf.compaction_shard_ancestor),
compaction_l0_first: self
.compaction_l0_first
.unwrap_or(global_conf.compaction_l0_first),
@@ -1804,8 +1817,34 @@ pub mod virtual_file {
}
impl IoMode {
pub const fn preferred() -> Self {
Self::Buffered
pub fn preferred() -> Self {
// The default behavior when running Rust unit tests without any further
// flags is to use the newest behavior if available on the platform (Direct).
// The CI uses the following environment variable to unit tests for all
// different modes.
// NB: the Python regression & perf tests have their own defaults management
// that writes pageserver.toml; they do not use this variable.
if cfg!(test) {
use once_cell::sync::Lazy;
static CACHED: Lazy<IoMode> = Lazy::new(|| {
utils::env::var_serde_json_string(
"NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IO_MODE",
)
.unwrap_or({
#[cfg(target_os = "linux")]
{
IoMode::Direct
}
#[cfg(not(target_os = "linux"))]
{
IoMode::Buffered
}
})
});
*CACHED
} else {
IoMode::Buffered
}
}
}

View File

@@ -0,0 +1,13 @@
[package]
name = "remote_keys"
version = "0.1.0"
edition = "2024"
license.workspace = true
[dependencies]
anyhow.workspace = true
utils.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
rand.workspace = true

View File

@@ -0,0 +1,42 @@
//! A module that provides a KMS implementation that generates and unwraps the keys.
//!
/// A KMS implementation that does static wrapping and unwrapping of the keys.
pub struct NaiveKms {
account_id: String,
}
impl NaiveKms {
pub fn new(account_id: String) -> Self {
Self { account_id }
}
pub fn encrypt(&self, plain: &[u8]) -> anyhow::Result<Vec<u8>> {
let wrapped = [self.account_id.as_bytes(), "-wrapped-".as_bytes(), plain].concat();
Ok(wrapped)
}
pub fn decrypt(&self, wrapped: &[u8]) -> anyhow::Result<Vec<u8>> {
let Some(wrapped) = wrapped.strip_prefix(self.account_id.as_bytes()) else {
return Err(anyhow::anyhow!("invalid key"));
};
let Some(plain) = wrapped.strip_prefix(b"-wrapped-") else {
return Err(anyhow::anyhow!("invalid key"));
};
Ok(plain.to_vec())
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_generate_key() {
let kms = NaiveKms::new("test-tenant".to_string());
let data = rand::random::<[u8; 32]>().to_vec();
let encrypted = kms.encrypt(&data).unwrap();
let decrypted = kms.decrypt(&encrypted).unwrap();
assert_eq!(data, decrypted);
}
}

View File

@@ -13,6 +13,7 @@ aws-smithy-async.workspace = true
aws-smithy-types.workspace = true
aws-config.workspace = true
aws-sdk-s3.workspace = true
base64.workspace = true
bytes.workspace = true
camino = { workspace = true, features = ["serde1"] }
humantime-serde.workspace = true
@@ -27,6 +28,7 @@ tokio-util = { workspace = true, features = ["compat"] }
toml_edit.workspace = true
tracing.workspace = true
scopeguard.workspace = true
md5.workspace = true
metrics.workspace = true
utils = { path = "../utils", default-features = false }
pin-project-lite.workspace = true

View File

@@ -550,6 +550,19 @@ impl RemoteStorage for AzureBlobStorage {
self.download_for_builder(builder, timeout, cancel).await
}
#[allow(unused_variables)]
async fn upload_with_encryption(
&self,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
encryption_key: Option<&[u8]>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
unimplemented!()
}
async fn delete(&self, path: &RemotePath, cancel: &CancellationToken) -> anyhow::Result<()> {
self.delete_objects(std::array::from_ref(path), cancel)
.await

View File

@@ -190,6 +190,8 @@ pub struct DownloadOpts {
/// timeouts: for something like an index/manifest/heatmap, we should time out faster than
/// for layer files
pub kind: DownloadKind,
/// The encryption key to use for the download.
pub encryption_key: Option<Vec<u8>>,
}
pub enum DownloadKind {
@@ -204,6 +206,7 @@ impl Default for DownloadOpts {
byte_start: Bound::Unbounded,
byte_end: Bound::Unbounded,
kind: DownloadKind::Large,
encryption_key: None,
}
}
}
@@ -241,6 +244,15 @@ impl DownloadOpts {
None => format!("bytes={start}-"),
})
}
pub fn with_encryption_key(mut self, encryption_key: Option<impl AsRef<[u8]>>) -> Self {
self.encryption_key = encryption_key.map(|k| k.as_ref().to_vec());
self
}
pub fn encryption_key(&self) -> Option<&[u8]> {
self.encryption_key.as_deref()
}
}
/// Storage (potentially remote) API to manage its state.
@@ -331,6 +343,19 @@ pub trait RemoteStorage: Send + Sync + 'static {
cancel: &CancellationToken,
) -> Result<Download, DownloadError>;
/// Same as upload, but with remote encryption if the backend supports it (e.g. SSE-C on AWS).
async fn upload_with_encryption(
&self,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
// S3 PUT request requires the content length to be specified,
// otherwise it starts to fail with the concurrent connection count increasing.
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
encryption_key: Option<&[u8]>,
cancel: &CancellationToken,
) -> anyhow::Result<()>;
/// Delete a single path from remote storage.
///
/// If the operation fails because of timeout or cancellation, the root cause of the error will be
@@ -615,6 +640,63 @@ impl<Other: RemoteStorage> GenericRemoteStorage<Arc<Other>> {
}
}
}
pub async fn upload_with_encryption(
&self,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
encryption_key: Option<&[u8]>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
match self {
Self::LocalFs(s) => {
s.upload_with_encryption(
from,
data_size_bytes,
to,
metadata,
encryption_key,
cancel,
)
.await
}
Self::AwsS3(s) => {
s.upload_with_encryption(
from,
data_size_bytes,
to,
metadata,
encryption_key,
cancel,
)
.await
}
Self::AzureBlob(s) => {
s.upload_with_encryption(
from,
data_size_bytes,
to,
metadata,
encryption_key,
cancel,
)
.await
}
Self::Unreliable(s) => {
s.upload_with_encryption(
from,
data_size_bytes,
to,
metadata,
encryption_key,
cancel,
)
.await
}
}
}
}
impl GenericRemoteStorage {

View File

@@ -198,6 +198,10 @@ impl LocalFs {
let mut entries = cur_folder.read_dir_utf8()?;
while let Some(Ok(entry)) = entries.next() {
let file_name = entry.file_name();
if file_name.ends_with(".metadata") || file_name.ends_with(".enc") {
// ignore metadata and encryption key files
continue;
}
let full_file_name = cur_folder.join(file_name);
if full_file_name.as_str().starts_with(prefix) {
let file_remote_path = self.local_file_to_relative_path(full_file_name.clone());
@@ -218,6 +222,7 @@ impl LocalFs {
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
enctyption_key: Option<&[u8]>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let target_file_path = to.with_base(&self.storage_root);
@@ -306,6 +311,8 @@ impl LocalFs {
)
})?;
// TODO: we might need to make the following writes atomic with the file write operation above
if let Some(storage_metadata) = metadata {
// FIXME: we must not be using metadata much, since this would forget the old metadata
// for new writes? or perhaps metadata is sticky; could consider removing if it's never
@@ -324,6 +331,15 @@ impl LocalFs {
})?;
}
if let Some(encryption_key) = enctyption_key {
let encryption_key_path = storage_encryption_key_path(&target_file_path);
fs::write(&encryption_key_path, encryption_key).await.with_context(|| {
format!(
"Failed to write encryption key to the local storage at '{encryption_key_path}'",
)
})?;
}
Ok(())
}
}
@@ -450,6 +466,7 @@ impl RemoteStorage for LocalFs {
key: &RemotePath,
_cancel: &CancellationToken,
) -> Result<ListingObject, DownloadError> {
// TODO: check encryption key
let target_file_path = key.with_base(&self.storage_root);
let metadata = file_metadata(&target_file_path).await?;
Ok(ListingObject {
@@ -461,34 +478,14 @@ impl RemoteStorage for LocalFs {
async fn upload(
&self,
data: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync,
data: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let cancel = cancel.child_token();
let op = self.upload0(data, data_size_bytes, to, metadata, &cancel);
let mut op = std::pin::pin!(op);
// race the upload0 to the timeout; if it goes over, do a graceful shutdown
let (res, timeout) = tokio::select! {
res = &mut op => (res, false),
_ = tokio::time::sleep(self.timeout) => {
cancel.cancel();
(op.await, true)
}
};
match res {
Err(e) if timeout && TimeoutOrCancel::caused_by_cancel(&e) => {
// we caused this cancel (or they happened simultaneously) -- swap it out to
// Timeout
Err(TimeoutOrCancel::Timeout.into())
}
res => res,
}
self.upload_with_encryption(data, data_size_bytes, to, metadata, None, cancel)
.await
}
async fn download(
@@ -506,6 +503,22 @@ impl RemoteStorage for LocalFs {
return Err(DownloadError::Unmodified);
}
let key = match fs::read(storage_encryption_key_path(&target_path)).await {
Ok(key) => Some(key),
Err(e) if e.kind() == ErrorKind::NotFound => None,
Err(e) => {
return Err(DownloadError::Other(
anyhow::anyhow!(e).context("cannot read encryption key"),
));
}
};
if key != opts.encryption_key {
return Err(DownloadError::Other(anyhow::anyhow!(
"encryption key mismatch"
)));
}
let mut file = fs::OpenOptions::new()
.read(true)
.open(&target_path)
@@ -551,12 +564,53 @@ impl RemoteStorage for LocalFs {
async fn delete(&self, path: &RemotePath, _cancel: &CancellationToken) -> anyhow::Result<()> {
let file_path = path.with_base(&self.storage_root);
match fs::remove_file(&file_path).await {
Ok(()) => Ok(()),
Ok(()) => {}
// The file doesn't exist. This shouldn't yield an error to mirror S3's behaviour.
// See https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObject.html
// > If there isn't a null version, Amazon S3 does not remove any objects but will still respond that the command was successful.
Err(e) if e.kind() == ErrorKind::NotFound => Ok(()),
Err(e) => Err(anyhow::anyhow!(e)),
Err(e) if e.kind() == ErrorKind::NotFound => {}
Err(e) => return Err(anyhow::anyhow!(e)),
};
fs::remove_file(&storage_metadata_path(&file_path))
.await
.ok();
fs::remove_file(&storage_encryption_key_path(&file_path))
.await
.ok();
Ok(())
}
#[allow(unused_variables)]
async fn upload_with_encryption(
&self,
data: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
encryption_key: Option<&[u8]>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let cancel = cancel.child_token();
let op = self.upload0(data, data_size_bytes, to, metadata, encryption_key, &cancel);
let mut op = std::pin::pin!(op);
// race the upload0 to the timeout; if it goes over, do a graceful shutdown
let (res, timeout) = tokio::select! {
res = &mut op => (res, false),
_ = tokio::time::sleep(self.timeout) => {
cancel.cancel();
(op.await, true)
}
};
match res {
Err(e) if timeout && TimeoutOrCancel::caused_by_cancel(&e) => {
// we caused this cancel (or they happened simultaneously) -- swap it out to
// Timeout
Err(TimeoutOrCancel::Timeout.into())
}
res => res,
}
}
@@ -591,6 +645,7 @@ impl RemoteStorage for LocalFs {
to_path = to_path
)
})?;
// TODO: copy metadata and encryption key
Ok(())
}
@@ -609,6 +664,10 @@ fn storage_metadata_path(original_path: &Utf8Path) -> Utf8PathBuf {
path_with_suffix_extension(original_path, "metadata")
}
fn storage_encryption_key_path(original_path: &Utf8Path) -> Utf8PathBuf {
path_with_suffix_extension(original_path, "enc")
}
async fn create_target_directory(target_file_path: &Utf8Path) -> anyhow::Result<()> {
let target_dir = match target_file_path.parent() {
Some(parent_dir) => parent_dir,

View File

@@ -66,7 +66,10 @@ struct GetObjectRequest {
key: String,
etag: Option<String>,
range: Option<String>,
/// Base64 encoded SSE-C key for server-side encryption.
sse_c_key: Option<Vec<u8>>,
}
impl S3Bucket {
/// Creates the S3 storage, errors if incorrect AWS S3 configuration provided.
pub async fn new(remote_storage_config: &S3Config, timeout: Duration) -> anyhow::Result<Self> {
@@ -257,6 +260,13 @@ impl S3Bucket {
builder = builder.if_none_match(etag);
}
if let Some(encryption_key) = request.sse_c_key {
builder = builder.sse_customer_algorithm("AES256");
builder = builder.sse_customer_key(base64::encode(&encryption_key));
builder = builder
.sse_customer_key_md5(base64::encode(md5::compute(&encryption_key).as_slice()));
}
let get_object = builder.send();
let get_object = tokio::select! {
@@ -693,12 +703,13 @@ impl RemoteStorage for S3Bucket {
})
}
async fn upload(
async fn upload_with_encryption(
&self,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
from_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
encryption_key: Option<&[u8]>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let kind = RequestKind::Put;
@@ -709,7 +720,7 @@ impl RemoteStorage for S3Bucket {
let body = StreamBody::new(from.map(|x| x.map(Frame::data)));
let bytes_stream = ByteStream::new(SdkBody::from_body_1_x(body));
let upload = self
let mut upload = self
.client
.put_object()
.bucket(self.bucket_name.clone())
@@ -717,8 +728,17 @@ impl RemoteStorage for S3Bucket {
.set_metadata(metadata.map(|m| m.0))
.set_storage_class(self.upload_storage_class.clone())
.content_length(from_size_bytes.try_into()?)
.body(bytes_stream)
.send();
.body(bytes_stream);
if let Some(encryption_key) = encryption_key {
upload = upload.sse_customer_algorithm("AES256");
let base64_key = base64::encode(encryption_key);
upload = upload.sse_customer_key(&base64_key);
upload = upload
.sse_customer_key_md5(base64::encode(md5::compute(encryption_key).as_slice()));
}
let upload = upload.send();
let upload = tokio::time::timeout(self.timeout, upload);
@@ -742,6 +762,18 @@ impl RemoteStorage for S3Bucket {
}
}
async fn upload(
&self,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
self.upload_with_encryption(from, data_size_bytes, to, metadata, None, cancel)
.await
}
async fn copy(
&self,
from: &RemotePath,
@@ -801,6 +833,7 @@ impl RemoteStorage for S3Bucket {
key: self.relative_path_to_s3_object(from),
etag: opts.etag.as_ref().map(|e| e.to_string()),
range: opts.byte_range_header(),
sse_c_key: opts.encryption_key.clone(),
},
cancel,
)

View File

@@ -178,6 +178,19 @@ impl RemoteStorage for UnreliableWrapper {
self.inner.download(from, opts, cancel).await
}
#[allow(unused_variables)]
async fn upload_with_encryption(
&self,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
encryption_key: Option<&[u8]>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
unimplemented!()
}
async fn delete(&self, path: &RemotePath, cancel: &CancellationToken) -> anyhow::Result<()> {
self.delete_inner(path, true, cancel).await
}

View File

@@ -421,7 +421,7 @@ async fn download_is_timeouted(ctx: &mut MaybeEnabledStorage) {
))
.unwrap();
let len = upload_large_enough_file(&ctx.client, &path, &cancel).await;
let len = upload_large_enough_file(&ctx.client, &path, &cancel, None).await;
let timeout = std::time::Duration::from_secs(5);
@@ -500,7 +500,7 @@ async fn download_is_cancelled(ctx: &mut MaybeEnabledStorage) {
))
.unwrap();
let file_len = upload_large_enough_file(&ctx.client, &path, &cancel).await;
let file_len = upload_large_enough_file(&ctx.client, &path, &cancel, None).await;
{
let stream = ctx
@@ -555,6 +555,7 @@ async fn upload_large_enough_file(
client: &GenericRemoteStorage,
path: &RemotePath,
cancel: &CancellationToken,
encryption_key: Option<&[u8]>,
) -> usize {
let header = bytes::Bytes::from_static("remote blob data content".as_bytes());
let body = bytes::Bytes::from(vec![0u8; 1024]);
@@ -565,9 +566,54 @@ async fn upload_large_enough_file(
let contents = futures::stream::iter(contents.map(std::io::Result::Ok));
client
.upload(contents, len, path, None, cancel)
.upload_with_encryption(contents, len, path, None, encryption_key, cancel)
.await
.expect("upload succeeds");
len
}
#[test_context(MaybeEnabledStorage)]
#[tokio::test]
async fn encryption_works(ctx: &mut MaybeEnabledStorage) {
let MaybeEnabledStorage::Enabled(ctx) = ctx else {
return;
};
let cancel = CancellationToken::new();
let path = RemotePath::new(Utf8Path::new(
format!("{}/file_to_copy", ctx.base_prefix).as_str(),
))
.unwrap();
let key = rand::random::<[u8; 32]>();
let file_len = upload_large_enough_file(&ctx.client, &path, &cancel, Some(&key)).await;
{
let download = ctx
.client
.download(
&path,
&DownloadOpts::default().with_encryption_key(Some(&key)),
&cancel,
)
.await
.expect("should succeed");
let vec = download_to_vec(download).await.expect("should succeed");
assert_eq!(vec.len(), file_len);
}
{
// Download without encryption key should fail
let download = ctx
.client
.download(&path, &DownloadOpts::default(), &cancel)
.await;
assert!(download.is_err());
}
let cancel = CancellationToken::new();
ctx.client.delete_objects(&[path], &cancel).await.unwrap();
}

View File

@@ -29,6 +29,7 @@ futures = { workspace = true }
jsonwebtoken.workspace = true
nix = { workspace = true, features = ["ioctl"] }
once_cell.workspace = true
pem.workspace = true
pin-project-lite.workspace = true
regex.workspace = true
serde.workspace = true

View File

@@ -11,7 +11,8 @@ use camino::Utf8Path;
use jsonwebtoken::{
Algorithm, DecodingKey, EncodingKey, Header, TokenData, Validation, decode, encode,
};
use serde::{Deserialize, Serialize};
use pem::Pem;
use serde::{Deserialize, Serialize, de::DeserializeOwned};
use crate::id::TenantId;
@@ -73,7 +74,10 @@ impl SwappableJwtAuth {
pub fn swap(&self, jwt_auth: JwtAuth) {
self.0.swap(Arc::new(jwt_auth));
}
pub fn decode(&self, token: &str) -> std::result::Result<TokenData<Claims>, AuthError> {
pub fn decode<D: DeserializeOwned>(
&self,
token: &str,
) -> std::result::Result<TokenData<D>, AuthError> {
self.0.load().decode(token)
}
}
@@ -148,7 +152,10 @@ impl JwtAuth {
/// The function tries the stored decoding keys in succession,
/// and returns the first yielding a successful result.
/// If there is no working decoding key, it returns the last error.
pub fn decode(&self, token: &str) -> std::result::Result<TokenData<Claims>, AuthError> {
pub fn decode<D: DeserializeOwned>(
&self,
token: &str,
) -> std::result::Result<TokenData<D>, AuthError> {
let mut res = None;
for decoding_key in &self.decoding_keys {
res = Some(decode(token, decoding_key, &self.validation));
@@ -173,8 +180,8 @@ impl std::fmt::Debug for JwtAuth {
}
// this function is used only for testing purposes in CLI e g generate tokens during init
pub fn encode_from_key_file<S: Serialize>(claims: &S, key_data: &[u8]) -> Result<String> {
let key = EncodingKey::from_ed_pem(key_data)?;
pub fn encode_from_key_file<S: Serialize>(claims: &S, pem: &Pem) -> Result<String> {
let key = EncodingKey::from_ed_der(pem.contents());
Ok(encode(&Header::new(STORAGE_TOKEN_ALGORITHM), claims, &key)?)
}
@@ -188,13 +195,13 @@ mod tests {
//
// openssl genpkey -algorithm ed25519 -out ed25519-priv.pem
// openssl pkey -in ed25519-priv.pem -pubout -out ed25519-pub.pem
const TEST_PUB_KEY_ED25519: &[u8] = br#"
const TEST_PUB_KEY_ED25519: &str = r#"
-----BEGIN PUBLIC KEY-----
MCowBQYDK2VwAyEARYwaNBayR+eGI0iXB4s3QxE3Nl2g1iWbr6KtLWeVD/w=
-----END PUBLIC KEY-----
"#;
const TEST_PRIV_KEY_ED25519: &[u8] = br#"
const TEST_PRIV_KEY_ED25519: &str = r#"
-----BEGIN PRIVATE KEY-----
MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
-----END PRIVATE KEY-----
@@ -222,9 +229,9 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
// Check it can be validated with the public key
let auth = JwtAuth::new(vec![
DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519).unwrap(),
DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519.as_bytes()).unwrap(),
]);
let claims_from_token = auth.decode(encoded_eddsa).unwrap().claims;
let claims_from_token: Claims = auth.decode(encoded_eddsa).unwrap().claims;
assert_eq!(claims_from_token, expected_claims);
}
@@ -235,13 +242,14 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
scope: Scope::Tenant,
};
let encoded = encode_from_key_file(&claims, TEST_PRIV_KEY_ED25519).unwrap();
let pem = pem::parse(TEST_PRIV_KEY_ED25519).unwrap();
let encoded = encode_from_key_file(&claims, &pem).unwrap();
// decode it back
let auth = JwtAuth::new(vec![
DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519).unwrap(),
DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519.as_bytes()).unwrap(),
]);
let decoded = auth.decode(&encoded).unwrap();
let decoded: TokenData<Claims> = auth.decode(&encoded).unwrap();
assert_eq!(decoded.claims, claims);
}

View File

@@ -17,6 +17,7 @@ anyhow.workspace = true
arc-swap.workspace = true
async-compression.workspace = true
async-stream.workspace = true
base64.workspace = true
bit_field.workspace = true
bincode.workspace = true
byteorder.workspace = true
@@ -35,6 +36,7 @@ humantime.workspace = true
humantime-serde.workspace = true
hyper0.workspace = true
itertools.workspace = true
jsonwebtoken.workspace = true
md5.workspace = true
nix.workspace = true
# hack to get the number of worker threads tokio uses
@@ -81,6 +83,7 @@ postgres_connection.workspace = true
postgres_ffi.workspace = true
pq_proto.workspace = true
remote_storage.workspace = true
remote_keys.workspace = true
storage_broker.workspace = true
tenant_size_model.workspace = true
http-utils.workspace = true

View File

@@ -45,6 +45,7 @@ fn bench_upload_queue_next_ready(c: &mut Criterion) {
shard: ShardIndex::new(ShardNumber(1), ShardCount(2)),
generation: Generation::Valid(1),
file_size: 0,
encryption_key: None,
};
// Construct the (initial and uploaded) index with layer0.

View File

@@ -118,13 +118,13 @@ pub struct PageServerConf {
/// A lower value implicitly deprioritizes loading such tenants, vs. other work in the system.
pub concurrent_tenant_warmup: ConfigurableSemaphore,
/// Number of concurrent [`Tenant::gather_size_inputs`](crate::tenant::Tenant::gather_size_inputs) allowed.
/// Number of concurrent [`TenantShard::gather_size_inputs`](crate::tenant::TenantShard::gather_size_inputs) allowed.
pub concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore,
/// Limit of concurrent [`Tenant::gather_size_inputs`] issued by module `eviction_task`.
/// Limit of concurrent [`TenantShard::gather_size_inputs`] issued by module `eviction_task`.
/// The number of permits is the same as `concurrent_tenant_size_logical_size_queries`.
/// See the comment in `eviction_task` for details.
///
/// [`Tenant::gather_size_inputs`]: crate::tenant::Tenant::gather_size_inputs
/// [`TenantShard::gather_size_inputs`]: crate::tenant::TenantShard::gather_size_inputs
pub eviction_task_immitated_concurrent_logical_size_queries: ConfigurableSemaphore,
// How often to collect metrics and send them to the metrics endpoint.
@@ -588,10 +588,10 @@ impl ConfigurableSemaphore {
/// Initializse using a non-zero amount of permits.
///
/// Require a non-zero initial permits, because using permits == 0 is a crude way to disable a
/// feature such as [`Tenant::gather_size_inputs`]. Otherwise any semaphore using future will
/// feature such as [`TenantShard::gather_size_inputs`]. Otherwise any semaphore using future will
/// behave like [`futures::future::pending`], just waiting until new permits are added.
///
/// [`Tenant::gather_size_inputs`]: crate::tenant::Tenant::gather_size_inputs
/// [`TenantShard::gather_size_inputs`]: crate::tenant::TenantShard::gather_size_inputs
pub fn new(initial_permits: NonZeroUsize) -> Self {
ConfigurableSemaphore {
initial_permits,

View File

@@ -24,7 +24,7 @@ use crate::task_mgr::{self, BACKGROUND_RUNTIME, TaskKind};
use crate::tenant::mgr::TenantManager;
use crate::tenant::size::CalculateSyntheticSizeError;
use crate::tenant::tasks::BackgroundLoopKind;
use crate::tenant::{LogicalSizeCalculationCause, Tenant};
use crate::tenant::{LogicalSizeCalculationCause, TenantShard};
mod disk_cache;
mod metrics;
@@ -428,7 +428,7 @@ async fn calculate_synthetic_size_worker(
}
}
async fn calculate_and_log(tenant: &Tenant, cancel: &CancellationToken, ctx: &RequestContext) {
async fn calculate_and_log(tenant: &TenantShard, cancel: &CancellationToken, ctx: &RequestContext) {
const CAUSE: LogicalSizeCalculationCause =
LogicalSizeCalculationCause::ConsumptionMetricsSyntheticSize;

View File

@@ -175,9 +175,9 @@ impl MetricsKey {
.absolute_values()
}
/// [`Tenant::remote_size`]
/// [`TenantShard::remote_size`]
///
/// [`Tenant::remote_size`]: crate::tenant::Tenant::remote_size
/// [`TenantShard::remote_size`]: crate::tenant::TenantShard::remote_size
const fn remote_storage_size(tenant_id: TenantId) -> AbsoluteValueFactory {
MetricsKey {
tenant_id,
@@ -199,9 +199,9 @@ impl MetricsKey {
.absolute_values()
}
/// [`Tenant::cached_synthetic_size`] as refreshed by [`calculate_synthetic_size_worker`].
/// [`TenantShard::cached_synthetic_size`] as refreshed by [`calculate_synthetic_size_worker`].
///
/// [`Tenant::cached_synthetic_size`]: crate::tenant::Tenant::cached_synthetic_size
/// [`TenantShard::cached_synthetic_size`]: crate::tenant::TenantShard::cached_synthetic_size
/// [`calculate_synthetic_size_worker`]: super::calculate_synthetic_size_worker
const fn synthetic_size(tenant_id: TenantId) -> AbsoluteValueFactory {
MetricsKey {
@@ -254,7 +254,7 @@ pub(super) async fn collect_all_metrics(
async fn collect<S>(tenants: S, cache: &Cache, ctx: &RequestContext) -> Vec<NewRawMetric>
where
S: futures::stream::Stream<Item = (TenantId, Arc<crate::tenant::Tenant>)>,
S: futures::stream::Stream<Item = (TenantId, Arc<crate::tenant::TenantShard>)>,
{
let mut current_metrics: Vec<NewRawMetric> = Vec::new();
@@ -308,7 +308,7 @@ impl TenantSnapshot {
///
/// `resident_size` is calculated of the timelines we had access to for other metrics, so we
/// cannot just list timelines here.
fn collect(t: &Arc<crate::tenant::Tenant>, resident_size: u64) -> Self {
fn collect(t: &Arc<crate::tenant::TenantShard>, resident_size: u64) -> Self {
TenantSnapshot {
resident_size,
remote_size: t.remote_size(),

View File

@@ -1873,7 +1873,7 @@ async fn update_tenant_config_handler(
&ShardParameters::default(),
);
crate::tenant::Tenant::persist_tenant_config(state.conf, &tenant_shard_id, &location_conf)
crate::tenant::TenantShard::persist_tenant_config(state.conf, &tenant_shard_id, &location_conf)
.await
.map_err(|e| ApiError::InternalServerError(anyhow::anyhow!(e)))?;
@@ -1917,7 +1917,7 @@ async fn patch_tenant_config_handler(
&ShardParameters::default(),
);
crate::tenant::Tenant::persist_tenant_config(state.conf, &tenant_shard_id, &location_conf)
crate::tenant::TenantShard::persist_tenant_config(state.conf, &tenant_shard_id, &location_conf)
.await
.map_err(|e| ApiError::InternalServerError(anyhow::anyhow!(e)))?;

View File

@@ -49,7 +49,7 @@ use tracing::{info, info_span};
/// backwards-compatible changes to the metadata format.
pub const STORAGE_FORMAT_VERSION: u16 = 3;
pub const DEFAULT_PG_VERSION: u32 = 16;
pub const DEFAULT_PG_VERSION: u32 = 17;
// Magic constants used to identify different kinds of files
pub const IMAGE_FILE_MAGIC: u16 = 0x5A60;

View File

@@ -1086,7 +1086,7 @@ pub(crate) static TIMELINE_EPHEMERAL_BYTES: Lazy<UIntGauge> = Lazy::new(|| {
.expect("Failed to register metric")
});
/// Metrics related to the lifecycle of a [`crate::tenant::Tenant`] object: things
/// Metrics related to the lifecycle of a [`crate::tenant::TenantShard`] object: things
/// like how long it took to load.
///
/// Note that these are process-global metrics, _not_ per-tenant metrics. Per-tenant

View File

@@ -15,6 +15,7 @@ use async_compression::tokio::write::GzipEncoder;
use bytes::Buf;
use futures::FutureExt;
use itertools::Itertools;
use jsonwebtoken::TokenData;
use once_cell::sync::OnceCell;
use pageserver_api::config::{
PageServicePipeliningConfig, PageServicePipeliningConfigPipelined,
@@ -75,7 +76,7 @@ use crate::tenant::timeline::{self, WaitLsnError};
use crate::tenant::{GetTimelineError, PageReconstructError, Timeline};
use crate::{basebackup, timed_after_cancellation};
/// How long we may wait for a [`crate::tenant::mgr::TenantSlot::InProgress`]` and/or a [`crate::tenant::Tenant`] which
/// How long we may wait for a [`crate::tenant::mgr::TenantSlot::InProgress`]` and/or a [`crate::tenant::TenantShard`] which
/// is not yet in state [`TenantState::Active`].
///
/// NB: this is a different value than [`crate::http::routes::ACTIVE_TENANT_TIMEOUT`].
@@ -2837,7 +2838,7 @@ where
) -> Result<(), QueryError> {
// this unwrap is never triggered, because check_auth_jwt only called when auth_type is NeonJWT
// which requires auth to be present
let data = self
let data: TokenData<Claims> = self
.auth
.as_ref()
.unwrap()

View File

@@ -158,7 +158,7 @@ pub struct TenantSharedResources {
pub l0_flush_global_state: L0FlushGlobalState,
}
/// A [`Tenant`] is really an _attached_ tenant. The configuration
/// A [`TenantShard`] is really an _attached_ tenant. The configuration
/// for an attached tenant is a subset of the [`LocationConf`], represented
/// in this struct.
#[derive(Clone)]
@@ -245,7 +245,7 @@ pub(crate) enum SpawnMode {
///
/// Tenant consists of multiple timelines. Keep them in a hash table.
///
pub struct Tenant {
pub struct TenantShard {
// Global pageserver config parameters
pub conf: &'static PageServerConf,
@@ -267,7 +267,7 @@ pub struct Tenant {
shard_identity: ShardIdentity,
/// The remote storage generation, used to protect S3 objects from split-brain.
/// Does not change over the lifetime of the [`Tenant`] object.
/// Does not change over the lifetime of the [`TenantShard`] object.
///
/// This duplicates the generation stored in LocationConf, but that structure is mutable:
/// this copy enforces the invariant that generatio doesn't change during a Tenant's lifetime.
@@ -309,7 +309,7 @@ pub struct Tenant {
// Access to global deletion queue for when this tenant wants to schedule a deletion
deletion_queue_client: DeletionQueueClient,
/// Cached logical sizes updated updated on each [`Tenant::gather_size_inputs`].
/// Cached logical sizes updated updated on each [`TenantShard::gather_size_inputs`].
cached_logical_sizes: tokio::sync::Mutex<HashMap<(TimelineId, Lsn), u64>>,
cached_synthetic_tenant_size: Arc<AtomicU64>,
@@ -337,12 +337,12 @@ pub struct Tenant {
// Timelines' cancellation token.
pub(crate) cancel: CancellationToken,
// Users of the Tenant such as the page service must take this Gate to avoid
// trying to use a Tenant which is shutting down.
// Users of the TenantShard such as the page service must take this Gate to avoid
// trying to use a TenantShard which is shutting down.
pub(crate) gate: Gate,
/// Throttle applied at the top of [`Timeline::get`].
/// All [`Tenant::timelines`] of a given [`Tenant`] instance share the same [`throttle::Throttle`] instance.
/// All [`TenantShard::timelines`] of a given [`TenantShard`] instance share the same [`throttle::Throttle`] instance.
pub(crate) pagestream_throttle: Arc<throttle::Throttle>,
pub(crate) pagestream_throttle_metrics: Arc<crate::metrics::tenant_throttling::Pagestream>,
@@ -362,7 +362,7 @@ pub struct Tenant {
l0_flush_global_state: L0FlushGlobalState,
}
impl std::fmt::Debug for Tenant {
impl std::fmt::Debug for TenantShard {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{} ({})", self.tenant_shard_id, self.current_state())
}
@@ -841,7 +841,7 @@ impl Debug for SetStoppingError {
}
}
/// Arguments to [`Tenant::create_timeline`].
/// Arguments to [`TenantShard::create_timeline`].
///
/// Not usable as an idempotency key for timeline creation because if [`CreateTimelineParamsBranch::ancestor_start_lsn`]
/// is `None`, the result of the timeline create call is not deterministic.
@@ -876,7 +876,7 @@ pub(crate) struct CreateTimelineParamsImportPgdata {
pub(crate) idempotency_key: import_pgdata::index_part_format::IdempotencyKey,
}
/// What is used to determine idempotency of a [`Tenant::create_timeline`] call in [`Tenant::start_creating_timeline`] in [`Tenant::start_creating_timeline`].
/// What is used to determine idempotency of a [`TenantShard::create_timeline`] call in [`TenantShard::start_creating_timeline`] in [`TenantShard::start_creating_timeline`].
///
/// Each [`Timeline`] object holds [`Self`] as an immutable property in [`Timeline::create_idempotency`].
///
@@ -914,7 +914,7 @@ pub(crate) struct CreatingTimelineIdempotencyImportPgdata {
idempotency_key: import_pgdata::index_part_format::IdempotencyKey,
}
/// What is returned by [`Tenant::start_creating_timeline`].
/// What is returned by [`TenantShard::start_creating_timeline`].
#[must_use]
enum StartCreatingTimelineResult {
CreateGuard(TimelineCreateGuard),
@@ -943,13 +943,13 @@ struct TimelineInitAndSyncNeedsSpawnImportPgdata {
guard: TimelineCreateGuard,
}
/// What is returned by [`Tenant::create_timeline`].
/// What is returned by [`TenantShard::create_timeline`].
enum CreateTimelineResult {
Created(Arc<Timeline>),
Idempotent(Arc<Timeline>),
/// IMPORTANT: This [`Arc<Timeline>`] object is not in [`Tenant::timelines`] when
/// IMPORTANT: This [`Arc<Timeline>`] object is not in [`TenantShard::timelines`] when
/// we return this result, nor will this concrete object ever be added there.
/// Cf method comment on [`Tenant::create_timeline_import_pgdata`].
/// Cf method comment on [`TenantShard::create_timeline_import_pgdata`].
ImportSpawned(Arc<Timeline>),
}
@@ -1082,7 +1082,7 @@ pub(crate) enum LoadConfigError {
NotFound(Utf8PathBuf),
}
impl Tenant {
impl TenantShard {
/// Yet another helper for timeline initialization.
///
/// - Initializes the Timeline struct and inserts it into the tenant's hash map
@@ -1303,7 +1303,7 @@ impl Tenant {
init_order: Option<InitializationOrder>,
mode: SpawnMode,
ctx: &RequestContext,
) -> Result<Arc<Tenant>, GlobalShutDown> {
) -> Result<Arc<TenantShard>, GlobalShutDown> {
let wal_redo_manager =
WalRedoManager::new(PostgresRedoManager::new(conf, tenant_shard_id))?;
@@ -1317,7 +1317,7 @@ impl Tenant {
let attach_mode = attached_conf.location.attach_mode;
let generation = attached_conf.location.generation;
let tenant = Arc::new(Tenant::new(
let tenant = Arc::new(TenantShard::new(
TenantState::Attaching,
conf,
attached_conf,
@@ -1334,7 +1334,7 @@ impl Tenant {
let attach_gate_guard = tenant
.gate
.enter()
.expect("We just created the Tenant: nothing else can have shut it down yet");
.expect("We just created the TenantShard: nothing else can have shut it down yet");
// Do all the hard work in the background
let tenant_clone = Arc::clone(&tenant);
@@ -1362,7 +1362,7 @@ impl Tenant {
}
}
fn make_broken_or_stopping(t: &Tenant, err: anyhow::Error) {
fn make_broken_or_stopping(t: &TenantShard, err: anyhow::Error) {
t.state.send_modify(|state| match state {
// TODO: the old code alluded to DeleteTenantFlow sometimes setting
// TenantState::Stopping before we get here, but this may be outdated.
@@ -1627,7 +1627,7 @@ impl Tenant {
/// No background tasks are started as part of this routine.
///
async fn attach(
self: &Arc<Tenant>,
self: &Arc<TenantShard>,
preload: Option<TenantPreload>,
ctx: &RequestContext,
) -> anyhow::Result<()> {
@@ -1957,7 +1957,7 @@ impl Tenant {
}
async fn load_timelines_metadata(
self: &Arc<Tenant>,
self: &Arc<TenantShard>,
timeline_ids: HashSet<TimelineId>,
remote_storage: &GenericRemoteStorage,
heatmap: Option<(HeatMapTenant, std::time::Instant)>,
@@ -2028,7 +2028,7 @@ impl Tenant {
}
fn load_timeline_metadata(
self: &Arc<Tenant>,
self: &Arc<TenantShard>,
timeline_id: TimelineId,
remote_storage: GenericRemoteStorage,
previous_heatmap: Option<PreviousHeatmap>,
@@ -2429,14 +2429,14 @@ impl Tenant {
/// This is used by tests & import-from-basebackup.
///
/// The returned [`UninitializedTimeline`] contains no data nor metadata and it is in
/// a state that will fail [`Tenant::load_remote_timeline`] because `disk_consistent_lsn=Lsn(0)`.
/// a state that will fail [`TenantShard::load_remote_timeline`] because `disk_consistent_lsn=Lsn(0)`.
///
/// The caller is responsible for getting the timeline into a state that will be accepted
/// by [`Tenant::load_remote_timeline`] / [`Tenant::attach`].
/// by [`TenantShard::load_remote_timeline`] / [`TenantShard::attach`].
/// Then they may call [`UninitializedTimeline::finish_creation`] to add the timeline
/// to the [`Tenant::timelines`].
/// to the [`TenantShard::timelines`].
///
/// Tests should use `Tenant::create_test_timeline` to set up the minimum required metadata keys.
/// Tests should use `TenantShard::create_test_timeline` to set up the minimum required metadata keys.
pub(crate) async fn create_empty_timeline(
self: &Arc<Self>,
new_timeline_id: TimelineId,
@@ -2584,7 +2584,7 @@ impl Tenant {
/// the same timeline ID already exists, returns CreateTimelineError::AlreadyExists.
#[allow(clippy::too_many_arguments)]
pub(crate) async fn create_timeline(
self: &Arc<Tenant>,
self: &Arc<TenantShard>,
params: CreateTimelineParams,
broker_client: storage_broker::BrokerClientChannel,
ctx: &RequestContext,
@@ -2751,13 +2751,13 @@ impl Tenant {
Ok(activated_timeline)
}
/// The returned [`Arc<Timeline>`] is NOT in the [`Tenant::timelines`] map until the import
/// The returned [`Arc<Timeline>`] is NOT in the [`TenantShard::timelines`] map until the import
/// completes in the background. A DIFFERENT [`Arc<Timeline>`] will be inserted into the
/// [`Tenant::timelines`] map when the import completes.
/// [`TenantShard::timelines`] map when the import completes.
/// We only return an [`Arc<Timeline>`] here so the API handler can create a [`pageserver_api::models::TimelineInfo`]
/// for the response.
async fn create_timeline_import_pgdata(
self: &Arc<Tenant>,
self: &Arc<Self>,
params: CreateTimelineParamsImportPgdata,
activate: ActivateTimelineArgs,
ctx: &RequestContext,
@@ -2854,7 +2854,7 @@ impl Tenant {
#[instrument(skip_all, fields(tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug(), timeline_id=%timeline.timeline_id))]
async fn create_timeline_import_pgdata_task(
self: Arc<Tenant>,
self: Arc<TenantShard>,
timeline: Arc<Timeline>,
index_part: import_pgdata::index_part_format::Root,
activate: ActivateTimelineArgs,
@@ -2882,7 +2882,7 @@ impl Tenant {
}
async fn create_timeline_import_pgdata_task_impl(
self: Arc<Tenant>,
self: Arc<TenantShard>,
timeline: Arc<Timeline>,
index_part: import_pgdata::index_part_format::Root,
activate: ActivateTimelineArgs,
@@ -2899,10 +2899,10 @@ impl Tenant {
// Reload timeline from remote.
// This proves that the remote state is attachable, and it reuses the code.
//
// TODO: think about whether this is safe to do with concurrent Tenant::shutdown.
// TODO: think about whether this is safe to do with concurrent TenantShard::shutdown.
// timeline_create_guard hols the tenant gate open, so, shutdown cannot _complete_ until we exit.
// But our activate() call might launch new background tasks after Tenant::shutdown
// already went past shutting down the Tenant::timelines, which this timeline here is no part of.
// But our activate() call might launch new background tasks after TenantShard::shutdown
// already went past shutting down the TenantShard::timelines, which this timeline here is no part of.
// I think the same problem exists with the bootstrap & branch mgmt API tasks (tenant shutting
// down while bootstrapping/branching + activating), but, the race condition is much more likely
// to manifest because of the long runtime of this import task.
@@ -2917,7 +2917,7 @@ impl Tenant {
// };
let timeline_id = timeline.timeline_id;
// load from object storage like Tenant::attach does
// load from object storage like TenantShard::attach does
let resources = self.build_timeline_resources(timeline_id);
let index_part = resources
.remote_client
@@ -3938,7 +3938,7 @@ enum ActivateTimelineArgs {
No,
}
impl Tenant {
impl TenantShard {
pub fn tenant_specific_overrides(&self) -> pageserver_api::models::TenantConfig {
self.tenant_conf.load().tenant_conf.clone()
}
@@ -4096,7 +4096,7 @@ impl Tenant {
update: F,
) -> anyhow::Result<pageserver_api::models::TenantConfig> {
// Use read-copy-update in order to avoid overwriting the location config
// state if this races with [`Tenant::set_new_location_config`]. Note that
// state if this races with [`TenantShard::set_new_location_config`]. Note that
// this race is not possible if both request types come from the storage
// controller (as they should!) because an exclusive op lock is required
// on the storage controller side.
@@ -4219,7 +4219,7 @@ impl Tenant {
Ok((timeline, timeline_ctx))
}
/// [`Tenant::shutdown`] must be called before dropping the returned [`Tenant`] object
/// [`TenantShard::shutdown`] must be called before dropping the returned [`TenantShard`] object
/// to ensure proper cleanup of background tasks and metrics.
//
// Allow too_many_arguments because a constructor's argument list naturally grows with the
@@ -4235,7 +4235,7 @@ impl Tenant {
remote_storage: GenericRemoteStorage,
deletion_queue_client: DeletionQueueClient,
l0_flush_global_state: L0FlushGlobalState,
) -> Tenant {
) -> TenantShard {
debug_assert!(
!attached_conf.location.generation.is_none() || conf.control_plane_api.is_none()
);
@@ -4295,7 +4295,7 @@ impl Tenant {
}
});
Tenant {
TenantShard {
tenant_shard_id,
shard_identity,
generation: attached_conf.location.generation,
@@ -4330,7 +4330,7 @@ impl Tenant {
cancel: CancellationToken::default(),
gate: Gate::default(),
pagestream_throttle: Arc::new(throttle::Throttle::new(
Tenant::get_pagestream_throttle_config(conf, &attached_conf.tenant_conf),
TenantShard::get_pagestream_throttle_config(conf, &attached_conf.tenant_conf),
)),
pagestream_throttle_metrics: Arc::new(
crate::metrics::tenant_throttling::Pagestream::new(&tenant_shard_id),
@@ -4466,11 +4466,11 @@ impl Tenant {
// Perform GC for each timeline.
//
// Note that we don't hold the `Tenant::gc_cs` lock here because we don't want to delay the
// Note that we don't hold the `TenantShard::gc_cs` lock here because we don't want to delay the
// branch creation task, which requires the GC lock. A GC iteration can run concurrently
// with branch creation.
//
// See comments in [`Tenant::branch_timeline`] for more information about why branch
// See comments in [`TenantShard::branch_timeline`] for more information about why branch
// creation task can run concurrently with timeline's GC iteration.
for timeline in gc_timelines {
if cancel.is_cancelled() {
@@ -4500,7 +4500,7 @@ impl Tenant {
/// Refreshes the Timeline::gc_info for all timelines, returning the
/// vector of timelines which have [`Timeline::get_last_record_lsn`] past
/// [`Tenant::get_gc_horizon`].
/// [`TenantShard::get_gc_horizon`].
///
/// This is usually executed as part of periodic gc, but can now be triggered more often.
pub(crate) async fn refresh_gc_info(
@@ -5499,7 +5499,7 @@ impl Tenant {
}
}
// The flushes we did above were just writes, but the Tenant might have had
// The flushes we did above were just writes, but the TenantShard might have had
// pending deletions as well from recent compaction/gc: we want to flush those
// as well. This requires flushing the global delete queue. This is cheap
// because it's typically a no-op.
@@ -5517,7 +5517,7 @@ impl Tenant {
/// How much local storage would this tenant like to have? It can cope with
/// less than this (via eviction and on-demand downloads), but this function enables
/// the Tenant to advertise how much storage it would prefer to have to provide fast I/O
/// the TenantShard to advertise how much storage it would prefer to have to provide fast I/O
/// by keeping important things on local disk.
///
/// This is a heuristic, not a guarantee: tenants that are long-idle will actually use less
@@ -5540,11 +5540,11 @@ impl Tenant {
/// manifest in `Self::remote_tenant_manifest`.
///
/// TODO: instead of requiring callers to remember to call `maybe_upload_tenant_manifest` after
/// changing any `Tenant` state that's included in the manifest, consider making the manifest
/// changing any `TenantShard` state that's included in the manifest, consider making the manifest
/// the authoritative source of data with an API that automatically uploads on changes. Revisit
/// this when the manifest is more widely used and we have a better idea of the data model.
pub(crate) async fn maybe_upload_tenant_manifest(&self) -> Result<(), TenantManifestError> {
// Multiple tasks may call this function concurrently after mutating the Tenant runtime
// Multiple tasks may call this function concurrently after mutating the TenantShard runtime
// state, affecting the manifest generated by `build_tenant_manifest`. We use an async mutex
// to serialize these callers. `eq_ignoring_version` acts as a slightly inefficient but
// simple coalescing mechanism.
@@ -5812,7 +5812,7 @@ pub(crate) mod harness {
info_span!("TenantHarness", tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug())
}
pub(crate) async fn load(&self) -> (Arc<Tenant>, RequestContext) {
pub(crate) async fn load(&self) -> (Arc<TenantShard>, RequestContext) {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error)
.with_scope_unit_test();
(
@@ -5827,10 +5827,10 @@ pub(crate) mod harness {
pub(crate) async fn do_try_load(
&self,
ctx: &RequestContext,
) -> anyhow::Result<Arc<Tenant>> {
) -> anyhow::Result<Arc<TenantShard>> {
let walredo_mgr = Arc::new(WalRedoManager::from(TestRedoManager));
let tenant = Arc::new(Tenant::new(
let tenant = Arc::new(TenantShard::new(
TenantState::Attaching,
self.conf,
AttachedTenantConf::try_from(LocationConf::attached_single(
@@ -6046,7 +6046,7 @@ mod tests {
#[cfg(feature = "testing")]
#[allow(clippy::too_many_arguments)]
async fn randomize_timeline(
tenant: &Arc<Tenant>,
tenant: &Arc<TenantShard>,
new_timeline_id: TimelineId,
pg_version: u32,
spec: TestTimelineSpecification,
@@ -6936,7 +6936,7 @@ mod tests {
}
async fn bulk_insert_compact_gc(
tenant: &Tenant,
tenant: &TenantShard,
timeline: &Arc<Timeline>,
ctx: &RequestContext,
lsn: Lsn,
@@ -6948,7 +6948,7 @@ mod tests {
}
async fn bulk_insert_maybe_compact_gc(
tenant: &Tenant,
tenant: &TenantShard,
timeline: &Arc<Timeline>,
ctx: &RequestContext,
mut lsn: Lsn,
@@ -7858,7 +7858,7 @@ mod tests {
let (tline, _ctx) = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION, &ctx)
.await?;
// Leave the timeline ID in [`Tenant::timelines_creating`] to exclude attempting to create it again
// Leave the timeline ID in [`TenantShard::timelines_creating`] to exclude attempting to create it again
let raw_tline = tline.raw_timeline().unwrap();
raw_tline
.shutdown(super::timeline::ShutdownMode::Hard)

View File

@@ -37,6 +37,63 @@ pub struct CompressionInfo {
pub compressed_size: Option<usize>,
}
/// A blob header, with header+data length and compression info.
///
/// TODO: use this more widely, and add an encode() method too.
/// TODO: document the header format.
#[derive(Clone, Copy, Default)]
pub struct Header {
pub header_len: usize,
pub data_len: usize,
pub compression_bits: u8,
}
impl Header {
/// Decodes a header from a byte slice.
pub fn decode(bytes: &[u8]) -> Result<Self, std::io::Error> {
let Some(&first_header_byte) = bytes.first() else {
return Err(std::io::Error::new(
std::io::ErrorKind::InvalidData,
"zero-length blob header",
));
};
// If the first bit is 0, this is just a 1-byte length prefix up to 128 bytes.
if first_header_byte < 0x80 {
return Ok(Self {
header_len: 1, // by definition
data_len: first_header_byte as usize,
compression_bits: BYTE_UNCOMPRESSED,
});
}
// Otherwise, this is a 4-byte header containing compression information and length.
const HEADER_LEN: usize = 4;
let mut header_buf: [u8; HEADER_LEN] = bytes[0..HEADER_LEN].try_into().map_err(|_| {
std::io::Error::new(
std::io::ErrorKind::InvalidData,
format!("blob header too short: {bytes:?}"),
)
})?;
// TODO: verify the compression bits and convert to an enum.
let compression_bits = header_buf[0] & LEN_COMPRESSION_BIT_MASK;
header_buf[0] &= !LEN_COMPRESSION_BIT_MASK;
let data_len = u32::from_be_bytes(header_buf) as usize;
Ok(Self {
header_len: HEADER_LEN,
data_len,
compression_bits,
})
}
/// Returns the total header+data length.
pub fn total_len(&self) -> usize {
self.header_len + self.data_len
}
}
impl BlockCursor<'_> {
/// Read a blob into a new buffer.
pub async fn read_blob(
@@ -389,6 +446,34 @@ impl<const BUFFERED: bool> BlobWriter<BUFFERED> {
};
(srcbuf, res.map(|_| (offset, compression_info)))
}
/// Writes a raw blob containing both header and data, returning its offset.
pub(crate) async fn write_blob_raw<Buf: IoBuf + Send>(
&mut self,
raw_with_header: FullSlice<Buf>,
ctx: &RequestContext,
) -> (FullSlice<Buf>, Result<u64, Error>) {
// Verify the header, to ensure we don't write invalid/corrupt data.
let header = match Header::decode(&raw_with_header) {
Ok(header) => header,
Err(err) => return (raw_with_header, Err(err)),
};
if raw_with_header.len() != header.total_len() {
let header_total_len = header.total_len();
let raw_len = raw_with_header.len();
return (
raw_with_header,
Err(std::io::Error::new(
std::io::ErrorKind::InvalidData,
format!("header length mismatch: {header_total_len} != {raw_len}"),
)),
);
}
let offset = self.offset;
let (raw_with_header, result) = self.write_all(raw_with_header, ctx).await;
(raw_with_header, result.map(|_| offset))
}
}
impl BlobWriter<true> {

View File

@@ -714,7 +714,7 @@ impl LayerMap {
true
}
pub fn iter_historic_layers(&self) -> impl '_ + Iterator<Item = Arc<PersistentLayerDesc>> {
pub fn iter_historic_layers(&self) -> impl ExactSizeIterator<Item = Arc<PersistentLayerDesc>> {
self.historic.iter()
}

View File

@@ -504,7 +504,7 @@ impl<Value: Clone> BufferedHistoricLayerCoverage<Value> {
}
/// Iterate all the layers
pub fn iter(&self) -> impl '_ + Iterator<Item = Value> {
pub fn iter(&self) -> impl ExactSizeIterator<Item = Value> {
// NOTE we can actually perform this without rebuilding,
// but it's not necessary for now.
if !self.buffer.is_empty() {

View File

@@ -564,8 +564,9 @@ mod tests {
Lsn(0),
Lsn(0),
Lsn(0),
// Any version will do here, so use the default
crate::DEFAULT_PG_VERSION,
// Updating this version to 17 will cause the test to fail at the
// next assert_eq!().
16,
);
let expected_bytes = vec![
/* TimelineMetadataHeader */

View File

@@ -52,7 +52,9 @@ use crate::tenant::config::{
use crate::tenant::span::debug_assert_current_span_has_tenant_id;
use crate::tenant::storage_layer::inmemory_layer;
use crate::tenant::timeline::ShutdownMode;
use crate::tenant::{AttachedTenantConf, GcError, LoadConfigError, SpawnMode, Tenant, TenantState};
use crate::tenant::{
AttachedTenantConf, GcError, LoadConfigError, SpawnMode, TenantShard, TenantState,
};
use crate::virtual_file::MaybeFatalIo;
use crate::{InitializationOrder, TEMP_FILE_SUFFIX};
@@ -67,7 +69,7 @@ use crate::{InitializationOrder, TEMP_FILE_SUFFIX};
/// having a properly acquired generation (Secondary doesn't need a generation)
#[derive(Clone)]
pub(crate) enum TenantSlot {
Attached(Arc<Tenant>),
Attached(Arc<TenantShard>),
Secondary(Arc<SecondaryTenant>),
/// In this state, other administrative operations acting on the TenantId should
/// block, or return a retry indicator equivalent to HTTP 503.
@@ -86,7 +88,7 @@ impl std::fmt::Debug for TenantSlot {
impl TenantSlot {
/// Return the `Tenant` in this slot if attached, else None
fn get_attached(&self) -> Option<&Arc<Tenant>> {
fn get_attached(&self) -> Option<&Arc<TenantShard>> {
match self {
Self::Attached(t) => Some(t),
Self::Secondary(_) => None,
@@ -164,7 +166,7 @@ impl TenantStartupMode {
/// Result type for looking up a TenantId to a specific shard
pub(crate) enum ShardResolveResult {
NotFound,
Found(Arc<Tenant>),
Found(Arc<TenantShard>),
// Wait for this barrrier, then query again
InProgress(utils::completion::Barrier),
}
@@ -173,7 +175,7 @@ impl TenantsMap {
/// Convenience function for typical usage, where we want to get a `Tenant` object, for
/// working with attached tenants. If the TenantId is in the map but in Secondary state,
/// None is returned.
pub(crate) fn get(&self, tenant_shard_id: &TenantShardId) -> Option<&Arc<Tenant>> {
pub(crate) fn get(&self, tenant_shard_id: &TenantShardId) -> Option<&Arc<TenantShard>> {
match self {
TenantsMap::Initializing => None,
TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => {
@@ -410,7 +412,7 @@ fn load_tenant_config(
return None;
}
Some(Tenant::load_tenant_config(conf, &tenant_shard_id))
Some(TenantShard::load_tenant_config(conf, &tenant_shard_id))
}
/// Initial stage of load: walk the local tenants directory, clean up any temp files,
@@ -606,7 +608,8 @@ pub async fn init_tenant_mgr(
// Presence of a generation number implies attachment: attach the tenant
// if it wasn't already, and apply the generation number.
config_write_futs.push(async move {
let r = Tenant::persist_tenant_config(conf, &tenant_shard_id, &location_conf).await;
let r =
TenantShard::persist_tenant_config(conf, &tenant_shard_id, &location_conf).await;
(tenant_shard_id, location_conf, r)
});
}
@@ -694,7 +697,7 @@ fn tenant_spawn(
init_order: Option<InitializationOrder>,
mode: SpawnMode,
ctx: &RequestContext,
) -> Result<Arc<Tenant>, GlobalShutDown> {
) -> Result<Arc<TenantShard>, GlobalShutDown> {
// All these conditions should have been satisfied by our caller: the tenant dir exists, is a well formed
// path, and contains a configuration file. Assertions that do synchronous I/O are limited to debug mode
// to avoid impacting prod runtime performance.
@@ -706,7 +709,7 @@ fn tenant_spawn(
.unwrap()
);
Tenant::spawn(
TenantShard::spawn(
conf,
tenant_shard_id,
resources,
@@ -883,12 +886,12 @@ impl TenantManager {
/// Gets the attached tenant from the in-memory data, erroring if it's absent, in secondary mode, or currently
/// undergoing a state change (i.e. slot is InProgress).
///
/// The return Tenant is not guaranteed to be active: check its status after obtaing it, or
/// use [`Tenant::wait_to_become_active`] before using it if you will do I/O on it.
/// The return TenantShard is not guaranteed to be active: check its status after obtaing it, or
/// use [`TenantShard::wait_to_become_active`] before using it if you will do I/O on it.
pub(crate) fn get_attached_tenant_shard(
&self,
tenant_shard_id: TenantShardId,
) -> Result<Arc<Tenant>, GetTenantError> {
) -> Result<Arc<TenantShard>, GetTenantError> {
let locked = self.tenants.read().unwrap();
let peek_slot = tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Read)?;
@@ -937,12 +940,12 @@ impl TenantManager {
flush: Option<Duration>,
mut spawn_mode: SpawnMode,
ctx: &RequestContext,
) -> Result<Option<Arc<Tenant>>, UpsertLocationError> {
) -> Result<Option<Arc<TenantShard>>, UpsertLocationError> {
debug_assert_current_span_has_tenant_id();
info!("configuring tenant location to state {new_location_config:?}");
enum FastPathModified {
Attached(Arc<Tenant>),
Attached(Arc<TenantShard>),
Secondary(Arc<SecondaryTenant>),
}
@@ -999,9 +1002,13 @@ impl TenantManager {
// phase of writing config and/or waiting for flush, before returning.
match fast_path_taken {
Some(FastPathModified::Attached(tenant)) => {
Tenant::persist_tenant_config(self.conf, &tenant_shard_id, &new_location_config)
.await
.fatal_err("write tenant shard config");
TenantShard::persist_tenant_config(
self.conf,
&tenant_shard_id,
&new_location_config,
)
.await
.fatal_err("write tenant shard config");
// Transition to AttachedStale means we may well hold a valid generation
// still, and have been requested to go stale as part of a migration. If
@@ -1030,9 +1037,13 @@ impl TenantManager {
return Ok(Some(tenant));
}
Some(FastPathModified::Secondary(_secondary_tenant)) => {
Tenant::persist_tenant_config(self.conf, &tenant_shard_id, &new_location_config)
.await
.fatal_err("write tenant shard config");
TenantShard::persist_tenant_config(
self.conf,
&tenant_shard_id,
&new_location_config,
)
.await
.fatal_err("write tenant shard config");
return Ok(None);
}
@@ -1122,7 +1133,7 @@ impl TenantManager {
// Before activating either secondary or attached mode, persist the
// configuration, so that on restart we will re-attach (or re-start
// secondary) on the tenant.
Tenant::persist_tenant_config(self.conf, &tenant_shard_id, &new_location_config)
TenantShard::persist_tenant_config(self.conf, &tenant_shard_id, &new_location_config)
.await
.fatal_err("write tenant shard config");
@@ -1262,7 +1273,7 @@ impl TenantManager {
let tenant_path = self.conf.tenant_path(&tenant_shard_id);
let timelines_path = self.conf.timelines_path(&tenant_shard_id);
let config = Tenant::load_tenant_config(self.conf, &tenant_shard_id)?;
let config = TenantShard::load_tenant_config(self.conf, &tenant_shard_id)?;
if drop_cache {
tracing::info!("Dropping local file cache");
@@ -1297,7 +1308,7 @@ impl TenantManager {
Ok(())
}
pub(crate) fn get_attached_active_tenant_shards(&self) -> Vec<Arc<Tenant>> {
pub(crate) fn get_attached_active_tenant_shards(&self) -> Vec<Arc<TenantShard>> {
let locked = self.tenants.read().unwrap();
match &*locked {
TenantsMap::Initializing => Vec::new(),
@@ -1446,7 +1457,7 @@ impl TenantManager {
#[instrument(skip_all, fields(tenant_id=%tenant.get_tenant_shard_id().tenant_id, shard_id=%tenant.get_tenant_shard_id().shard_slug(), new_shard_count=%new_shard_count.literal()))]
pub(crate) async fn shard_split(
&self,
tenant: Arc<Tenant>,
tenant: Arc<TenantShard>,
new_shard_count: ShardCount,
new_stripe_size: Option<ShardStripeSize>,
ctx: &RequestContext,
@@ -1476,7 +1487,7 @@ impl TenantManager {
pub(crate) async fn do_shard_split(
&self,
tenant: Arc<Tenant>,
tenant: Arc<TenantShard>,
new_shard_count: ShardCount,
new_stripe_size: Option<ShardStripeSize>,
ctx: &RequestContext,
@@ -1703,7 +1714,7 @@ impl TenantManager {
/// For each resident layer in the parent shard, we will hard link it into all of the child shards.
async fn shard_split_hardlink(
&self,
parent_shard: &Tenant,
parent_shard: &TenantShard,
child_shards: Vec<TenantShardId>,
) -> anyhow::Result<()> {
debug_assert_current_span_has_tenant_id();
@@ -1988,7 +1999,7 @@ impl TenantManager {
}
let tenant_path = self.conf.tenant_path(&tenant_shard_id);
let config = Tenant::load_tenant_config(self.conf, &tenant_shard_id)
let config = TenantShard::load_tenant_config(self.conf, &tenant_shard_id)
.map_err(|e| Error::DetachReparent(e.into()))?;
let shard_identity = config.shard;

View File

@@ -133,7 +133,7 @@
//! - Initiate upload queue with that [`IndexPart`].
//! - Reschedule all lost operations by comparing the local filesystem state
//! and remote state as per [`IndexPart`]. This is done in
//! [`Tenant::timeline_init_and_sync`].
//! [`TenantShard::timeline_init_and_sync`].
//!
//! Note that if we crash during file deletion between the index update
//! that removes the file from the list of files, and deleting the remote file,
@@ -171,7 +171,7 @@
//! If no remote storage configuration is provided, the [`RemoteTimelineClient`] is
//! not created and the uploads are skipped.
//!
//! [`Tenant::timeline_init_and_sync`]: super::Tenant::timeline_init_and_sync
//! [`TenantShard::timeline_init_and_sync`]: super::TenantShard::timeline_init_and_sync
//! [`Timeline::load_layer_map`]: super::Timeline::load_layer_map
pub(crate) mod download;
@@ -192,11 +192,12 @@ pub(crate) use download::{
download_index_part, download_initdb_tar_zst, download_tenant_manifest, is_temp_download_file,
list_remote_tenant_shards, list_remote_timelines,
};
use index::GcCompactionState;
pub(crate) use index::LayerFileMetadata;
use index::{EncryptionKey, EncryptionKeyId, EncryptionKeyPair, GcCompactionState, KeyVersion};
use pageserver_api::models::{RelSizeMigration, TimelineArchivalState, TimelineVisibilityState};
use pageserver_api::shard::{ShardIndex, TenantShardId};
use regex::Regex;
use remote_keys::NaiveKms;
use remote_storage::{
DownloadError, GenericRemoteStorage, ListingMode, RemotePath, TimeoutOrCancel,
};
@@ -367,6 +368,10 @@ pub(crate) struct RemoteTimelineClient {
config: std::sync::RwLock<RemoteTimelineClientConfig>,
cancel: CancellationToken,
kms_impl: Option<NaiveKms>,
key_repo: std::sync::Mutex<HashMap<EncryptionKeyId, EncryptionKeyPair>>,
}
impl Drop for RemoteTimelineClient {
@@ -411,6 +416,9 @@ impl RemoteTimelineClient {
)),
config: std::sync::RwLock::new(RemoteTimelineClientConfig::from(location_conf)),
cancel: CancellationToken::new(),
// TODO: make this configurable
kms_impl: Some(NaiveKms::new(tenant_shard_id.tenant_id.to_string())),
key_repo: std::sync::Mutex::new(HashMap::new()),
}
}
@@ -727,9 +735,43 @@ impl RemoteTimelineClient {
reason: "no need for a downloads gauge",
},
);
let key_pair = if let Some(ref key_id) = layer_metadata.encryption_key {
let wrapped_key = {
let mut queue = self.upload_queue.lock().unwrap();
let upload_queue = queue.initialized_mut().unwrap();
let encryption_key_pair =
upload_queue.dirty.keys.iter().find(|key| &key.id == key_id);
if let Some(encryption_key_pair) = encryption_key_pair {
// TODO: also check if we have uploaded the key yet; we should never use a key that is not persisted
encryption_key_pair.clone()
} else {
return Err(DownloadError::Other(anyhow::anyhow!(
"Encryption key pair not found in index_part.json"
)));
}
};
let Some(kms) = self.kms_impl.as_ref() else {
return Err(DownloadError::Other(anyhow::anyhow!(
"KMS not configured when downloading encrypted layer file"
)));
};
let plain_key = kms
.decrypt(&wrapped_key.key)
.context("failed to decrypt encryption key")
.map_err(DownloadError::Other)?;
Some(EncryptionKeyPair::new(
wrapped_key.id,
plain_key,
wrapped_key.key,
))
} else {
None
};
download::download_layer_file(
self.conf,
&self.storage_impl,
key_pair.as_ref(),
self.tenant_shard_id,
self.timeline_id,
layer_file_name,
@@ -1250,6 +1292,14 @@ impl RemoteTimelineClient {
upload_queue: &mut UploadQueueInitialized,
layer: ResidentLayer,
) {
let key_pair = {
if let Some(key_id) = layer.metadata().encryption_key {
let guard = self.key_repo.lock().unwrap();
Some(guard.get(&key_id).cloned().unwrap())
} else {
None
}
};
let metadata = layer.metadata();
upload_queue
@@ -1264,7 +1314,7 @@ impl RemoteTimelineClient {
"scheduled layer file upload {layer}",
);
let op = UploadOp::UploadLayer(layer, metadata, None);
let op = UploadOp::UploadLayer(layer, metadata, key_pair, None);
self.metric_begin(&op);
upload_queue.queued_operations.push_back(op);
}
@@ -1446,6 +1496,58 @@ impl RemoteTimelineClient {
upload_queue.queued_operations.push_back(op);
}
#[allow(dead_code)]
fn is_kms_enabled(&self) -> bool {
self.kms_impl.is_some()
}
pub(crate) fn schedule_generate_encryption_key(
self: &Arc<Self>,
) -> Result<Option<EncryptionKeyPair>, NotInitialized> {
let Some(kms_impl) = self.kms_impl.as_ref() else {
return Ok(None);
};
let plain_key = rand::random::<[u8; 32]>().to_vec(); // StdRng is cryptographically secure (?)
let wrapped_key = kms_impl.encrypt(&plain_key).unwrap();
let mut guard = self.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut()?;
let last_key = upload_queue.dirty.keys.last();
let this_key_version = if let Some(last_key) = last_key {
let key_version = EncryptionKeyId {
version: last_key.id.version.next(),
generation: self.generation,
};
assert!(key_version > last_key.id); // ensure key version is strictly increasing; no dup key versions
key_version
} else {
EncryptionKeyId {
version: KeyVersion(1),
generation: self.generation,
}
};
let key_pair = EncryptionKeyPair {
id: this_key_version.clone(),
plain_key: plain_key.clone(),
wrapped_key,
};
upload_queue.dirty.keys.push(EncryptionKey {
key: plain_key,
id: this_key_version,
created_at: Utc::now().naive_utc(),
});
self.key_repo.lock().unwrap().insert(this_key_version, key_pair);
self.schedule_index_upload(upload_queue);
Ok(Some(key_pair))
}
/// Schedules a compaction update to the remote `index_part.json`.
///
/// `compacted_from` represent the L0 names which have been `compacted_to` L1 layers.
@@ -1454,6 +1556,7 @@ impl RemoteTimelineClient {
compacted_from: &[Layer],
compacted_to: &[ResidentLayer],
) -> Result<(), NotInitialized> {
// Use the same key for all layers in a single compaction job
let mut guard = self.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut()?;
@@ -1715,6 +1818,7 @@ impl RemoteTimelineClient {
uploaded.local_path(),
&remote_path,
uploaded.metadata().file_size,
None, // TODO(chi): support encryption for those layer files uploaded using this interface
cancel,
)
.await
@@ -1757,6 +1861,8 @@ impl RemoteTimelineClient {
adopted_as.metadata().generation,
);
// TODO: support encryption for those layer files uploaded using this interface
backoff::retry(
|| async {
upload::copy_timeline_layer(
@@ -1977,7 +2083,7 @@ impl RemoteTimelineClient {
// Prepare upload.
match &mut next_op {
UploadOp::UploadLayer(layer, meta, mode) => {
UploadOp::UploadLayer(layer, meta, _, mode) => {
if upload_queue
.recently_deleted
.remove(&(layer.layer_desc().layer_name().clone(), meta.generation))
@@ -2071,7 +2177,7 @@ impl RemoteTimelineClient {
// Assert that we don't modify a layer that's referenced by the current index.
if cfg!(debug_assertions) {
let modified = match &task.op {
UploadOp::UploadLayer(layer, layer_metadata, _) => {
UploadOp::UploadLayer(layer, layer_metadata, _, _) => {
vec![(layer.layer_desc().layer_name(), layer_metadata)]
}
UploadOp::Delete(delete) => {
@@ -2093,7 +2199,7 @@ impl RemoteTimelineClient {
}
let upload_result: anyhow::Result<()> = match &task.op {
UploadOp::UploadLayer(layer, layer_metadata, mode) => {
UploadOp::UploadLayer(layer, layer_metadata, encryption_key_pair, mode) => {
// TODO: check if this mechanism can be removed now that can_bypass() performs
// conflict checks during scheduling.
if let Some(OpType::FlushDeletion) = mode {
@@ -2174,6 +2280,7 @@ impl RemoteTimelineClient {
local_path,
&remote_path,
layer_metadata.file_size,
encryption_key_pair.clone(),
&self.cancel,
)
.measure_remote_op(
@@ -2324,7 +2431,7 @@ impl RemoteTimelineClient {
upload_queue.inprogress_tasks.remove(&task.task_id);
let lsn_update = match task.op {
UploadOp::UploadLayer(_, _, _) => None,
UploadOp::UploadLayer(_, _, _, _) => None,
UploadOp::UploadMetadata { ref uploaded } => {
// the task id is reused as a monotonicity check for storing the "clean"
// IndexPart.
@@ -2403,7 +2510,7 @@ impl RemoteTimelineClient {
)> {
use RemoteTimelineClientMetricsCallTrackSize::DontTrackSize;
let res = match op {
UploadOp::UploadLayer(_, m, _) => (
UploadOp::UploadLayer(_, m, _, _) => (
RemoteOpFileKind::Layer,
RemoteOpKind::Upload,
RemoteTimelineClientMetricsCallTrackSize::Bytes(m.file_size),
@@ -2743,7 +2850,7 @@ mod tests {
use crate::tenant::config::AttachmentMode;
use crate::tenant::harness::{TIMELINE_ID, TenantHarness};
use crate::tenant::storage_layer::layer::local_layer_path;
use crate::tenant::{Tenant, Timeline};
use crate::tenant::{TenantShard, Timeline};
pub(super) fn dummy_contents(name: &str) -> Vec<u8> {
format!("contents for {name}").into()
@@ -2787,6 +2894,10 @@ mod tests {
for entry in std::fs::read_dir(remote_path).unwrap().flatten() {
let entry_name = entry.file_name();
let fname = entry_name.to_str().unwrap();
if fname.ends_with(".metadata") || fname.ends_with(".enc") {
// ignore metadata and encryption key files; should use local_fs APIs instead in the future
continue;
}
found.push(String::from(fname));
}
found.sort();
@@ -2796,7 +2907,7 @@ mod tests {
struct TestSetup {
harness: TenantHarness,
tenant: Arc<Tenant>,
tenant: Arc<TenantShard>,
timeline: Arc<Timeline>,
tenant_ctx: RequestContext,
}
@@ -2840,6 +2951,8 @@ mod tests {
)),
config: std::sync::RwLock::new(RemoteTimelineClientConfig::from(&location_conf)),
cancel: CancellationToken::new(),
kms_impl: None,
key_repo: std::sync::Mutex::new(HashMap::new()),
})
}

View File

@@ -23,7 +23,7 @@ use utils::crashsafe::path_with_suffix_extension;
use utils::id::{TenantId, TimelineId};
use utils::{backoff, pausable_failpoint};
use super::index::{IndexPart, LayerFileMetadata};
use super::index::{EncryptionKeyPair, IndexPart, LayerFileMetadata};
use super::manifest::TenantManifest;
use super::{
FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_REMOTE_OP_RETRIES, INITDB_PATH, parse_remote_index_path,
@@ -51,6 +51,7 @@ use crate::virtual_file::{MaybeFatalIo, VirtualFile, on_fatal_io_error};
pub async fn download_layer_file<'a>(
conf: &'static PageServerConf,
storage: &'a GenericRemoteStorage,
key_pair: Option<&'a EncryptionKeyPair>,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
layer_file_name: &'a LayerName,
@@ -86,7 +87,16 @@ pub async fn download_layer_file<'a>(
let bytes_amount = download_retry(
|| async {
download_object(storage, &remote_path, &temp_file_path, gate, cancel, ctx).await
download_object(
storage,
key_pair,
&remote_path,
&temp_file_path,
gate,
cancel,
ctx,
)
.await
},
&format!("download {remote_path:?}"),
cancel,
@@ -145,6 +155,7 @@ pub async fn download_layer_file<'a>(
/// The unlinking has _not_ been made durable.
async fn download_object(
storage: &GenericRemoteStorage,
encryption_key_pair: Option<&EncryptionKeyPair>,
src_path: &RemotePath,
dst_path: &Utf8PathBuf,
#[cfg_attr(target_os = "macos", allow(unused_variables))] gate: &utils::sync::gate::Gate,
@@ -160,9 +171,12 @@ async fn download_object(
.with_context(|| format!("create a destination file for layer '{dst_path}'"))
.map_err(DownloadError::Other)?;
let download = storage
.download(src_path, &DownloadOpts::default(), cancel)
.await?;
let mut opts = DownloadOpts::default();
if let Some(encryption_key_pair) = encryption_key_pair {
opts.encryption_key = Some(encryption_key_pair.plain_key.to_vec());
}
let download = storage.download(src_path, &opts, cancel).await?;
pausable_failpoint!("before-downloading-layer-stream-pausable");
@@ -452,7 +466,7 @@ async fn do_download_index_part(
/// generation (normal case when migrating/restarting). Only if both of these return 404 do we fall back
/// to listing objects.
///
/// * `my_generation`: the value of `[crate::tenant::Tenant::generation]`
/// * `my_generation`: the value of `[crate::tenant::TenantShard::generation]`
/// * `what`: for logging, what object are we downloading
/// * `prefix`: when listing objects, use this prefix (i.e. the part of the object path before the generation)
/// * `do_download`: a GET of the object in a particular generation, which should **retry indefinitely** unless

View File

@@ -10,6 +10,8 @@ use pageserver_api::models::AuxFilePolicy;
use pageserver_api::models::RelSizeMigration;
use pageserver_api::shard::ShardIndex;
use serde::{Deserialize, Serialize};
use serde_with::base64::Base64;
use serde_with::serde_as;
use utils::id::TimelineId;
use utils::lsn::Lsn;
@@ -114,6 +116,70 @@ pub struct IndexPart {
/// The timestamp when the timeline was marked invisible in synthetic size calculations.
#[serde(skip_serializing_if = "Option::is_none", default)]
pub(crate) marked_invisible_at: Option<NaiveDateTime>,
/// The encryption key used to encrypt the timeline layer files.
#[serde(skip_serializing_if = "Vec::is_empty", default)]
pub(crate) keys: Vec<EncryptionKey>,
}
#[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize, Ord, PartialOrd, Hash)]
pub struct KeyVersion(pub u32);
impl KeyVersion {
pub fn next(&self) -> Self {
Self(self.0 + 1)
}
}
/// An identifier for an encryption key. The scope of the key is the timeline (TBD).
#[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize, Ord, PartialOrd, Hash)]
pub struct EncryptionKeyId {
pub version: KeyVersion,
pub generation: Generation,
}
#[derive(Clone)]
pub struct EncryptionKeyPair {
pub id: EncryptionKeyId,
pub plain_key: Vec<u8>,
pub wrapped_key: Vec<u8>,
}
impl EncryptionKeyPair {
pub fn new(id: EncryptionKeyId, plain_key: Vec<u8>, wrapped_key: Vec<u8>) -> Self {
Self {
id,
plain_key,
wrapped_key,
}
}
}
impl std::fmt::Debug for EncryptionKeyPair {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
let display =
base64::display::Base64Display::with_config(&self.wrapped_key, base64::STANDARD);
struct DisplayAsDebug<T: std::fmt::Display>(T);
impl<T: std::fmt::Display> std::fmt::Debug for DisplayAsDebug<T> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}", self.0)
}
}
f.debug_struct("EncryptionKeyPair")
.field("id", &self.id)
.field("plain_key", &"<REDACTED>")
.field("wrapped_key", &DisplayAsDebug(&display))
.finish()
}
}
#[serde_as]
#[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize)]
pub struct EncryptionKey {
#[serde_as(as = "Base64")]
pub key: Vec<u8>,
pub id: EncryptionKeyId,
pub created_at: NaiveDateTime,
}
#[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize)]
@@ -142,10 +208,12 @@ impl IndexPart {
/// - 12: +l2_lsn
/// - 13: +gc_compaction
/// - 14: +marked_invisible_at
const LATEST_VERSION: usize = 14;
/// - 15: +keys and encryption_key in layer_metadata
const LATEST_VERSION: usize = 15;
// Versions we may see when reading from a bucket.
pub const KNOWN_VERSIONS: &'static [usize] = &[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14];
pub const KNOWN_VERSIONS: &'static [usize] =
&[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
pub const FILE_NAME: &'static str = "index_part.json";
@@ -165,6 +233,7 @@ impl IndexPart {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
}
}
@@ -205,14 +274,16 @@ impl IndexPart {
/// Check for invariants in the index: this is useful when uploading an index to ensure that if
/// we encounter a bug, we do not persist buggy metadata.
pub(crate) fn validate(&self) -> Result<(), String> {
if self.import_pgdata.is_none()
&& self.metadata.ancestor_timeline().is_none()
&& self.layer_metadata.is_empty()
{
// Unless we're in the middle of a raw pgdata import, or this is a child timeline,the index must
// always have at least one layer.
return Err("Index has no ancestor and no layers".to_string());
}
// We have to disable this check: we might need to upload an empty index part with new keys, or new `reldirv2` flag.
// if self.import_pgdata.is_none()
// && self.metadata.ancestor_timeline().is_none()
// && self.layer_metadata.is_empty()
// {
// // Unless we're in the middle of a raw pgdata import, or this is a child timeline,the index must
// // always have at least one layer.
// return Err("Index has no ancestor and no layers".to_string());
// }
Ok(())
}
@@ -222,7 +293,7 @@ impl IndexPart {
///
/// Fields have to be `Option`s because remote [`IndexPart`]'s can be from different version, which
/// might have less or more metadata depending if upgrading or rolling back an upgrade.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub struct LayerFileMetadata {
pub file_size: u64,
@@ -233,6 +304,9 @@ pub struct LayerFileMetadata {
#[serde(default = "ShardIndex::unsharded")]
#[serde(skip_serializing_if = "ShardIndex::is_unsharded")]
pub shard: ShardIndex,
#[serde(skip_serializing_if = "Option::is_none", default)]
pub encryption_key: Option<EncryptionKeyId>,
}
impl LayerFileMetadata {
@@ -241,6 +315,7 @@ impl LayerFileMetadata {
file_size,
generation,
shard,
encryption_key: None,
}
}
/// Helper to get both generation and file size in a tuple
@@ -453,14 +528,16 @@ mod tests {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -475,6 +552,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -502,14 +580,16 @@ mod tests {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -524,6 +604,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -552,14 +633,16 @@ mod tests {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -574,6 +657,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -627,6 +711,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
};
let empty_layers_parsed = IndexPart::from_json_bytes(empty_layers_json.as_bytes()).unwrap();
@@ -653,14 +738,16 @@ mod tests {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -675,6 +762,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -703,11 +791,13 @@ mod tests {
file_size: 23289856,
generation: Generation::new(1),
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000014EF499-00000000015A7619".parse().unwrap(), LayerFileMetadata {
file_size: 1015808,
generation: Generation::new(1),
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: Lsn::from_str("0/15A7618").unwrap(),
@@ -726,6 +816,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -756,14 +847,16 @@ mod tests {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -782,6 +875,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -815,12 +909,14 @@ mod tests {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -843,6 +939,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -877,12 +974,14 @@ mod tests {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -905,6 +1004,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -941,12 +1041,14 @@ mod tests {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -972,6 +1074,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -1017,12 +1120,14 @@ mod tests {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -1052,6 +1157,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -1098,12 +1204,14 @@ mod tests {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -1133,6 +1241,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -1183,12 +1292,14 @@ mod tests {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -1220,6 +1331,7 @@ mod tests {
last_completed_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
}),
marked_invisible_at: None,
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -1271,12 +1383,14 @@ mod tests {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
shard: ShardIndex::unsharded(),
encryption_key: None,
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
@@ -1308,6 +1422,139 @@ mod tests {
last_completed_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
}),
marked_invisible_at: Some(parse_naive_datetime("2023-07-31T09:00:00.123000000")),
keys: Vec::new(),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
assert_eq!(part, expected);
}
#[test]
fn v15_keys_are_parsed() {
let example = r#"{
"version": 15,
"layer_metadata":{
"000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9": { "file_size": 25600000, "encryption_key": { "version": 1, "generation": 5 } },
"000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51": { "file_size": 9007199254741001, "encryption_key": { "version": 2, "generation": 6 } }
},
"disk_consistent_lsn":"0/16960E8",
"metadata": {
"disk_consistent_lsn": "0/16960E8",
"prev_record_lsn": "0/1696070",
"ancestor_timeline": "e45a7f37d3ee2ff17dc14bf4f4e3f52e",
"ancestor_lsn": "0/0",
"latest_gc_cutoff_lsn": "0/1696070",
"initdb_lsn": "0/1696070",
"pg_version": 14
},
"gc_blocking": {
"started_at": "2024-07-19T09:00:00.123",
"reasons": ["DetachAncestor"]
},
"import_pgdata": {
"V1": {
"Done": {
"idempotency_key": "specified-by-client-218a5213-5044-4562-a28d-d024c5f057f5",
"started_at": "2024-11-13T09:23:42.123",
"finished_at": "2024-11-13T09:42:23.123"
}
}
},
"rel_size_migration": "legacy",
"l2_lsn": "0/16960E8",
"gc_compaction": {
"last_completed_lsn": "0/16960E8"
},
"marked_invisible_at": "2023-07-31T09:00:00.123",
"keys": [
{
"key": "dGVzdF9rZXk=",
"id": {
"version": 1,
"generation": 5
},
"created_at": "2024-07-19T09:00:00.123"
},
{
"key": "dGVzdF9rZXlfMg==",
"id": {
"version": 2,
"generation": 6
},
"created_at": "2024-07-19T10:00:00.123"
}
]
}"#;
let expected = IndexPart {
version: 15,
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded(),
encryption_key: Some(EncryptionKeyId {
version: KeyVersion(1),
generation: Generation::Valid(5),
}),
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded(),
encryption_key: Some(EncryptionKeyId {
version: KeyVersion(2),
generation: Generation::Valid(6),
}),
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
metadata: TimelineMetadata::new(
Lsn::from_str("0/16960E8").unwrap(),
Some(Lsn::from_str("0/1696070").unwrap()),
Some(TimelineId::from_str("e45a7f37d3ee2ff17dc14bf4f4e3f52e").unwrap()),
Lsn::INVALID,
Lsn::from_str("0/1696070").unwrap(),
Lsn::from_str("0/1696070").unwrap(),
14,
).with_recalculated_checksum().unwrap(),
deleted_at: None,
lineage: Default::default(),
gc_blocking: Some(GcBlocking {
started_at: parse_naive_datetime("2024-07-19T09:00:00.123000000"),
reasons: enumset::EnumSet::from_iter([GcBlockingReason::DetachAncestor]),
}),
last_aux_file_policy: Default::default(),
archived_at: None,
import_pgdata: Some(import_pgdata::index_part_format::Root::V1(import_pgdata::index_part_format::V1::Done(import_pgdata::index_part_format::Done{
started_at: parse_naive_datetime("2024-11-13T09:23:42.123000000"),
finished_at: parse_naive_datetime("2024-11-13T09:42:23.123000000"),
idempotency_key: import_pgdata::index_part_format::IdempotencyKey::new("specified-by-client-218a5213-5044-4562-a28d-d024c5f057f5".to_string()),
}))),
rel_size_migration: Some(RelSizeMigration::Legacy),
l2_lsn: Some("0/16960E8".parse::<Lsn>().unwrap()),
gc_compaction: Some(GcCompactionState {
last_completed_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
}),
marked_invisible_at: Some(parse_naive_datetime("2023-07-31T09:00:00.123000000")),
keys: vec![
EncryptionKey {
key: "test_key".as_bytes().to_vec(),
id: EncryptionKeyId {
version: KeyVersion(1),
generation: Generation::Valid(5),
},
created_at: parse_naive_datetime("2024-07-19T09:00:00.123000000"),
},
EncryptionKey {
key: "test_key_2".as_bytes().to_vec(),
id: EncryptionKeyId {
version: KeyVersion(2),
generation: Generation::Valid(6),
},
created_at: parse_naive_datetime("2024-07-19T10:00:00.123000000"),
}
],
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();

View File

@@ -17,7 +17,7 @@ use utils::id::{TenantId, TimelineId};
use utils::{backoff, pausable_failpoint};
use super::Generation;
use super::index::IndexPart;
use super::index::{EncryptionKeyPair, IndexPart};
use super::manifest::TenantManifest;
use crate::tenant::remote_timeline_client::{
remote_index_path, remote_initdb_archive_path, remote_initdb_preserved_archive_path,
@@ -101,6 +101,7 @@ pub(super) async fn upload_timeline_layer<'a>(
local_path: &'a Utf8Path,
remote_path: &'a RemotePath,
metadata_size: u64,
encryption_key_pair: Option<EncryptionKeyPair>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
fail_point!("before-upload-layer", |_| {
@@ -144,7 +145,14 @@ pub(super) async fn upload_timeline_layer<'a>(
let reader = tokio_util::io::ReaderStream::with_capacity(source_file, super::BUFFER_SIZE);
storage
.upload(reader, fs_size, remote_path, None, cancel)
.upload_with_encryption(
reader,
fs_size,
remote_path,
None,
encryption_key_pair.as_ref().map(|k| k.plain_key.as_slice()),
cancel,
)
.await
.with_context(|| format!("upload layer from local path '{local_path}'"))
}

View File

@@ -1310,6 +1310,7 @@ impl<'a> TenantDownloader<'a> {
let downloaded_bytes = download_layer_file(
self.conf,
self.remote_storage,
None, // TODO: add encryption key pair
*tenant_shard_id,
*timeline_id,
&layer.name,

View File

@@ -21,7 +21,7 @@ use super::scheduler::{
use super::{CommandRequest, SecondaryTenantError, UploadCommand};
use crate::TEMP_FILE_SUFFIX;
use crate::metrics::SECONDARY_MODE;
use crate::tenant::Tenant;
use crate::tenant::TenantShard;
use crate::tenant::config::AttachmentMode;
use crate::tenant::mgr::{GetTenantError, TenantManager};
use crate::tenant::remote_timeline_client::remote_heatmap_path;
@@ -74,7 +74,7 @@ impl RunningJob for WriteInProgress {
}
struct UploadPending {
tenant: Arc<Tenant>,
tenant: Arc<TenantShard>,
last_upload: Option<LastUploadState>,
target_time: Option<Instant>,
period: Option<Duration>,
@@ -106,7 +106,7 @@ impl scheduler::Completion for WriteComplete {
struct UploaderTenantState {
// This Weak only exists to enable culling idle instances of this type
// when the Tenant has been deallocated.
tenant: Weak<Tenant>,
tenant: Weak<TenantShard>,
/// Digest of the serialized heatmap that we last successfully uploaded
last_upload_state: Option<LastUploadState>,
@@ -357,7 +357,7 @@ struct LastUploadState {
/// of the object we would have uploaded.
async fn upload_tenant_heatmap(
remote_storage: GenericRemoteStorage,
tenant: &Arc<Tenant>,
tenant: &Arc<TenantShard>,
last_upload: Option<LastUploadState>,
) -> Result<UploadHeatmapOutcome, UploadHeatmapError> {
debug_assert_current_span_has_tenant_id();

View File

@@ -360,7 +360,7 @@ where
/// Periodic execution phase: inspect all attached tenants and schedule any work they require.
///
/// The type in `tenants` should be a tenant-like structure, e.g. [`crate::tenant::Tenant`] or [`crate::tenant::secondary::SecondaryTenant`]
/// The type in `tenants` should be a tenant-like structure, e.g. [`crate::tenant::TenantShard`] or [`crate::tenant::secondary::SecondaryTenant`]
///
/// This function resets the pending list: it is assumed that the caller may change their mind about
/// which tenants need work between calls to schedule_iteration.

View File

@@ -12,7 +12,7 @@ use tracing::*;
use utils::id::TimelineId;
use utils::lsn::Lsn;
use super::{GcError, LogicalSizeCalculationCause, Tenant};
use super::{GcError, LogicalSizeCalculationCause, TenantShard};
use crate::context::RequestContext;
use crate::pgdatadir_mapping::CalculateLogicalSizeError;
use crate::tenant::{MaybeOffloaded, Timeline};
@@ -156,7 +156,7 @@ pub struct TimelineInputs {
/// initdb_lsn branchpoints* next_pitr_cutoff latest
/// ```
pub(super) async fn gather_inputs(
tenant: &Tenant,
tenant: &TenantShard,
limit: &Arc<Semaphore>,
max_retention_period: Option<u64>,
logical_size_cache: &mut HashMap<(TimelineId, Lsn), u64>,

View File

@@ -1620,7 +1620,7 @@ pub(crate) mod test {
use crate::tenant::harness::{TIMELINE_ID, TenantHarness};
use crate::tenant::storage_layer::{Layer, ResidentLayer};
use crate::tenant::vectored_blob_io::StreamingVectoredReadPlanner;
use crate::tenant::{Tenant, Timeline};
use crate::tenant::{TenantShard, Timeline};
/// Construct an index for a fictional delta layer and and then
/// traverse in order to plan vectored reads for a query. Finally,
@@ -2209,7 +2209,7 @@ pub(crate) mod test {
}
pub(crate) async fn produce_delta_layer(
tenant: &Tenant,
tenant: &TenantShard,
tline: &Arc<Timeline>,
mut deltas: Vec<(Key, Lsn, Value)>,
ctx: &RequestContext,

View File

@@ -559,11 +559,12 @@ impl ImageLayerInner {
let view = BufView::new_slice(&blobs_buf.buf);
for meta in blobs_buf.blobs.iter() {
let img_buf = meta.read(&view).await?;
// Just read the raw header+data and pass it through to the target layer, without
// decoding and recompressing it.
let raw = meta.raw_with_header(&view);
key_count += 1;
writer
.put_image(meta.meta.key, img_buf.into_bytes(), ctx)
.put_image_raw(meta.meta.key, raw.into_bytes(), ctx)
.await
.context(format!("Storing key {}", meta.meta.key))?;
}
@@ -853,6 +854,41 @@ impl ImageLayerWriterInner {
Ok(())
}
///
/// Write the next image to the file, as a raw blob header and data.
///
/// The page versions must be appended in blknum order.
///
async fn put_image_raw(
&mut self,
key: Key,
raw_with_header: Bytes,
ctx: &RequestContext,
) -> anyhow::Result<()> {
ensure!(self.key_range.contains(&key));
// NB: we don't update the (un)compressed metrics, since we can't determine them without
// decompressing the image. This seems okay.
self.num_keys += 1;
let (_, res) = self
.blob_writer
.write_blob_raw(raw_with_header.slice_len(), ctx)
.await;
let offset = res?;
let mut keybuf: [u8; KEY_SIZE] = [0u8; KEY_SIZE];
key.write_to_byte_slice(&mut keybuf);
self.tree.append(&keybuf, offset)?;
#[cfg(feature = "testing")]
{
self.last_written_key = key;
}
Ok(())
}
///
/// Finish writing the image layer.
///
@@ -888,7 +924,13 @@ impl ImageLayerWriterInner {
crate::metrics::COMPRESSION_IMAGE_INPUT_BYTES_CONSIDERED
.inc_by(self.uncompressed_bytes_eligible);
crate::metrics::COMPRESSION_IMAGE_INPUT_BYTES_CHOSEN.inc_by(self.uncompressed_bytes_chosen);
crate::metrics::COMPRESSION_IMAGE_OUTPUT_BYTES.inc_by(compressed_size);
// NB: filter() may pass through raw pages from a different layer, without looking at
// whether these are compressed or not. We don't track metrics for these, so avoid
// increasing `COMPRESSION_IMAGE_OUTPUT_BYTES` in this case too.
if self.uncompressed_bytes > 0 {
crate::metrics::COMPRESSION_IMAGE_OUTPUT_BYTES.inc_by(compressed_size);
};
let mut file = self.blob_writer.into_inner();
@@ -1034,6 +1076,25 @@ impl ImageLayerWriter {
self.inner.as_mut().unwrap().put_image(key, img, ctx).await
}
///
/// Write the next value to the file, as a raw header and data. This allows passing through a
/// raw, potentially compressed image from a different layer file without recompressing it.
///
/// The page versions must be appended in blknum order.
///
pub async fn put_image_raw(
&mut self,
key: Key,
raw_with_header: Bytes,
ctx: &RequestContext,
) -> anyhow::Result<()> {
self.inner
.as_mut()
.unwrap()
.put_image_raw(key, raw_with_header, ctx)
.await
}
/// Estimated size of the image layer.
pub(crate) fn estimated_size(&self) -> u64 {
let inner = self.inner.as_ref().unwrap();
@@ -1167,7 +1228,7 @@ mod test {
use crate::tenant::harness::{TIMELINE_ID, TenantHarness};
use crate::tenant::storage_layer::{Layer, ResidentLayer};
use crate::tenant::vectored_blob_io::StreamingVectoredReadPlanner;
use crate::tenant::{Tenant, Timeline};
use crate::tenant::{TenantShard, Timeline};
#[tokio::test]
async fn image_layer_rewrite() {
@@ -1349,7 +1410,7 @@ mod test {
}
async fn produce_image_layer(
tenant: &Tenant,
tenant: &TenantShard,
tline: &Arc<Timeline>,
mut images: Vec<(Key, Bytes)>,
lsn: Lsn,

View File

@@ -24,7 +24,7 @@ use crate::task_mgr::{self, BACKGROUND_RUNTIME, TOKIO_WORKER_THREADS, TaskKind};
use crate::tenant::throttle::Stats;
use crate::tenant::timeline::CompactionError;
use crate::tenant::timeline::compaction::CompactionOutcome;
use crate::tenant::{Tenant, TenantState};
use crate::tenant::{TenantShard, TenantState};
/// Semaphore limiting concurrent background tasks (across all tenants).
///
@@ -117,7 +117,7 @@ pub(crate) async fn acquire_concurrency_permit(
}
/// Start per tenant background loops: compaction, GC, and ingest housekeeping.
pub fn start_background_loops(tenant: &Arc<Tenant>, can_start: Option<&Barrier>) {
pub fn start_background_loops(tenant: &Arc<TenantShard>, can_start: Option<&Barrier>) {
let tenant_shard_id = tenant.tenant_shard_id;
task_mgr::spawn(
@@ -198,7 +198,7 @@ pub fn start_background_loops(tenant: &Arc<Tenant>, can_start: Option<&Barrier>)
}
/// Compaction task's main loop.
async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
async fn compaction_loop(tenant: Arc<TenantShard>, cancel: CancellationToken) {
const BASE_BACKOFF_SECS: f64 = 1.0;
const MAX_BACKOFF_SECS: f64 = 300.0;
const RECHECK_CONFIG_INTERVAL: Duration = Duration::from_secs(10);
@@ -348,7 +348,7 @@ pub(crate) fn log_compaction_error(
}
/// GC task's main loop.
async fn gc_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
async fn gc_loop(tenant: Arc<TenantShard>, cancel: CancellationToken) {
const MAX_BACKOFF_SECS: f64 = 300.0;
let mut error_run = 0; // consecutive errors
@@ -432,7 +432,7 @@ async fn gc_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
}
/// Tenant housekeeping's main loop.
async fn tenant_housekeeping_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
async fn tenant_housekeeping_loop(tenant: Arc<TenantShard>, cancel: CancellationToken) {
let mut last_throttle_flag_reset_at = Instant::now();
loop {
if wait_for_active_tenant(&tenant, &cancel).await.is_break() {
@@ -483,7 +483,7 @@ async fn tenant_housekeeping_loop(tenant: Arc<Tenant>, cancel: CancellationToken
/// Waits until the tenant becomes active, or returns `ControlFlow::Break()` to shut down.
async fn wait_for_active_tenant(
tenant: &Arc<Tenant>,
tenant: &Arc<TenantShard>,
cancel: &CancellationToken,
) -> ControlFlow<()> {
if tenant.current_state() == TenantState::Active {

View File

@@ -412,7 +412,7 @@ pub struct Timeline {
/// Timeline deletion will acquire both compaction and gc locks in whatever order.
gc_lock: tokio::sync::Mutex<()>,
/// Cloned from [`super::Tenant::pagestream_throttle`] on construction.
/// Cloned from [`super::TenantShard::pagestream_throttle`] on construction.
pub(crate) pagestream_throttle: Arc<crate::tenant::throttle::Throttle>,
/// Size estimator for aux file v2
@@ -2065,7 +2065,7 @@ impl Timeline {
pub(crate) fn activate(
self: &Arc<Self>,
parent: Arc<crate::tenant::Tenant>,
parent: Arc<crate::tenant::TenantShard>,
broker_client: BrokerClientChannel,
background_jobs_can_start: Option<&completion::Barrier>,
ctx: &RequestContext,
@@ -2702,6 +2702,14 @@ impl Timeline {
.clone()
}
pub fn get_compaction_shard_ancestor(&self) -> bool {
let tenant_conf = self.tenant_conf.load();
tenant_conf
.tenant_conf
.compaction_shard_ancestor
.unwrap_or(self.conf.default_tenant_conf.compaction_shard_ancestor)
}
fn get_eviction_policy(&self) -> EvictionPolicy {
let tenant_conf = self.tenant_conf.load();
tenant_conf
@@ -3317,7 +3325,7 @@ impl Timeline {
// (1) and (4)
// TODO: this is basically a no-op now, should we remove it?
self.remote_client.schedule_barrier()?;
// Tenant::create_timeline will wait for these uploads to happen before returning, or
// TenantShard::create_timeline will wait for these uploads to happen before returning, or
// on retry.
// Now that we have the full layer map, we may calculate the visibility of layers within it (a global scan)
@@ -4856,6 +4864,7 @@ impl Timeline {
else {
panic!("delta layer cannot be empty if no filter is applied");
};
(
// FIXME: even though we have a single image and single delta layer assumption
// we push them to vec
@@ -5702,6 +5711,12 @@ impl Timeline {
return;
}
if self.cancel.is_cancelled() {
// We already requested stopping the tenant, so we cannot wait for the logical size
// calculation to complete given the task might have been already cancelled.
return;
}
if let Some(await_bg_cancel) = self
.current_logical_size
.cancel_wait_for_background_loop_concurrency_limit_semaphore
@@ -5740,7 +5755,7 @@ impl Timeline {
/// from our ancestor to be branches of this timeline.
pub(crate) async fn prepare_to_detach_from_ancestor(
self: &Arc<Timeline>,
tenant: &crate::tenant::Tenant,
tenant: &crate::tenant::TenantShard,
options: detach_ancestor::Options,
behavior: DetachBehavior,
ctx: &RequestContext,
@@ -5759,7 +5774,7 @@ impl Timeline {
/// resetting the tenant.
pub(crate) async fn detach_from_ancestor_and_reparent(
self: &Arc<Timeline>,
tenant: &crate::tenant::Tenant,
tenant: &crate::tenant::TenantShard,
prepared: detach_ancestor::PreparedTimelineDetach,
ancestor_timeline_id: TimelineId,
ancestor_lsn: Lsn,
@@ -5783,7 +5798,7 @@ impl Timeline {
/// The tenant must've been reset if ancestry was modified previously (in tenant manager).
pub(crate) async fn complete_detaching_timeline_ancestor(
self: &Arc<Timeline>,
tenant: &crate::tenant::Tenant,
tenant: &crate::tenant::TenantShard,
attempt: detach_ancestor::Attempt,
ctx: &RequestContext,
) -> Result<(), detach_ancestor::Error> {
@@ -6845,14 +6860,14 @@ impl Timeline {
/// Persistently blocks gc for `Manual` reason.
///
/// Returns true if no such block existed before, false otherwise.
pub(crate) async fn block_gc(&self, tenant: &super::Tenant) -> anyhow::Result<bool> {
pub(crate) async fn block_gc(&self, tenant: &super::TenantShard) -> anyhow::Result<bool> {
use crate::tenant::remote_timeline_client::index::GcBlockingReason;
assert_eq!(self.tenant_shard_id, tenant.tenant_shard_id);
tenant.gc_block.insert(self, GcBlockingReason::Manual).await
}
/// Persistently unblocks gc for `Manual` reason.
pub(crate) async fn unblock_gc(&self, tenant: &super::Tenant) -> anyhow::Result<()> {
pub(crate) async fn unblock_gc(&self, tenant: &super::TenantShard) -> anyhow::Result<()> {
use crate::tenant::remote_timeline_client::index::GcBlockingReason;
assert_eq!(self.tenant_shard_id, tenant.tenant_shard_id);
tenant.gc_block.remove(self, GcBlockingReason::Manual).await
@@ -6870,8 +6885,8 @@ impl Timeline {
/// Force create an image layer and place it into the layer map.
///
/// DO NOT use this function directly. Use [`Tenant::branch_timeline_test_with_layers`]
/// or [`Tenant::create_test_timeline_with_layers`] to ensure all these layers are
/// DO NOT use this function directly. Use [`TenantShard::branch_timeline_test_with_layers`]
/// or [`TenantShard::create_test_timeline_with_layers`] to ensure all these layers are
/// placed into the layer map in one run AND be validated.
#[cfg(test)]
pub(super) async fn force_create_image_layer(
@@ -6918,17 +6933,15 @@ impl Timeline {
}
// Update remote_timeline_client state to reflect existence of this layer
self.remote_client
.schedule_layer_file_upload(image_layer)
.unwrap();
self.remote_client.schedule_layer_file_upload(image_layer)?;
Ok(())
}
/// Force create a delta layer and place it into the layer map.
///
/// DO NOT use this function directly. Use [`Tenant::branch_timeline_test_with_layers`]
/// or [`Tenant::create_test_timeline_with_layers`] to ensure all these layers are
/// DO NOT use this function directly. Use [`TenantShard::branch_timeline_test_with_layers`]
/// or [`TenantShard::create_test_timeline_with_layers`] to ensure all these layers are
/// placed into the layer map in one run AND be validated.
#[cfg(test)]
pub(super) async fn force_create_delta_layer(
@@ -6981,9 +6994,7 @@ impl Timeline {
}
// Update remote_timeline_client state to reflect existence of this layer
self.remote_client
.schedule_layer_file_upload(delta_layer)
.unwrap();
self.remote_client.schedule_layer_file_upload(delta_layer)?;
Ok(())
}

View File

@@ -7,7 +7,7 @@
use std::collections::{BinaryHeap, HashMap, HashSet, VecDeque};
use std::ops::{Deref, Range};
use std::sync::Arc;
use std::time::{Duration, Instant};
use std::time::{Duration, Instant, SystemTime};
use super::layer_manager::LayerManager;
use super::{
@@ -56,7 +56,8 @@ use crate::tenant::storage_layer::batch_split_writer::{
use crate::tenant::storage_layer::filter_iterator::FilterIterator;
use crate::tenant::storage_layer::merge_iterator::MergeIterator;
use crate::tenant::storage_layer::{
AsLayerDesc, PersistentLayerDesc, PersistentLayerKey, ValueReconstructState,
AsLayerDesc, LayerVisibilityHint, PersistentLayerDesc, PersistentLayerKey,
ValueReconstructState,
};
use crate::tenant::tasks::log_compaction_error;
use crate::tenant::timeline::{
@@ -69,7 +70,14 @@ use crate::virtual_file::{MaybeFatalIo, VirtualFile};
/// Maximum number of deltas before generating an image layer in bottom-most compaction.
const COMPACTION_DELTA_THRESHOLD: usize = 5;
#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq)]
/// Ratio of shard-local pages below which we trigger shard ancestor layer rewrites. 0.3 means that
/// <= 30% of layer pages must belong to the descendant shard to rewrite the layer.
///
/// We choose a value < 0.5 to avoid rewriting all visible layers every time we do a power-of-two
/// shard split, which gets expensive for large tenants.
const ANCESTOR_COMPACTION_REWRITE_THRESHOLD: f64 = 0.3;
#[derive(Default, Debug, Clone, Copy, Hash, PartialEq, Eq, Serialize)]
pub struct GcCompactionJobId(pub usize);
impl std::fmt::Display for GcCompactionJobId {
@@ -97,6 +105,43 @@ pub enum GcCompactionQueueItem {
Notify(GcCompactionJobId, Option<Lsn>),
}
/// Statistics for gc-compaction meta jobs, which contains several sub compaction jobs.
#[derive(Debug, Clone, Serialize, Default)]
pub struct GcCompactionMetaStatistics {
/// The total number of sub compaction jobs.
pub total_sub_compaction_jobs: usize,
/// The total number of sub compaction jobs that failed.
pub failed_sub_compaction_jobs: usize,
/// The total number of sub compaction jobs that succeeded.
pub succeeded_sub_compaction_jobs: usize,
/// The layer size before compaction.
pub before_compaction_layer_size: u64,
/// The layer size after compaction.
pub after_compaction_layer_size: u64,
/// The start time of the meta job.
pub start_time: Option<SystemTime>,
/// The end time of the meta job.
pub end_time: Option<SystemTime>,
/// The duration of the meta job.
pub duration_secs: f64,
/// The id of the meta job.
pub meta_job_id: GcCompactionJobId,
/// The LSN below which the layers are compacted, used to compute the statistics.
pub below_lsn: Lsn,
}
impl GcCompactionMetaStatistics {
fn finalize(&mut self) {
let end_time = SystemTime::now();
if let Some(start_time) = self.start_time {
if let Ok(duration) = end_time.duration_since(start_time) {
self.duration_secs = duration.as_secs_f64();
}
}
self.end_time = Some(end_time);
}
}
impl GcCompactionQueueItem {
pub fn into_compact_info_resp(
self,
@@ -134,6 +179,7 @@ struct GcCompactionQueueInner {
queued: VecDeque<(GcCompactionJobId, GcCompactionQueueItem)>,
guards: HashMap<GcCompactionJobId, GcCompactionGuardItems>,
last_id: GcCompactionJobId,
meta_statistics: Option<GcCompactionMetaStatistics>,
}
impl GcCompactionQueueInner {
@@ -165,6 +211,7 @@ impl GcCompactionQueue {
queued: VecDeque::new(),
guards: HashMap::new(),
last_id: GcCompactionJobId(0),
meta_statistics: None,
}),
consumer_lock: tokio::sync::Mutex::new(()),
}
@@ -349,6 +396,23 @@ impl GcCompactionQueue {
Ok(())
}
async fn collect_layer_below_lsn(
&self,
timeline: &Arc<Timeline>,
lsn: Lsn,
) -> Result<u64, CompactionError> {
let guard = timeline.layers.read().await;
let layer_map = guard.layer_map()?;
let layers = layer_map.iter_historic_layers().collect_vec();
let mut size = 0;
for layer in layers {
if layer.lsn_range.start <= lsn {
size += layer.file_size();
}
}
Ok(size)
}
/// Notify the caller the job has finished and unblock GC.
fn notify_and_unblock(&self, id: GcCompactionJobId) {
info!("compaction job id={} finished", id);
@@ -358,6 +422,16 @@ impl GcCompactionQueue {
let _ = tx.send(());
}
}
if let Some(ref meta_statistics) = guard.meta_statistics {
if meta_statistics.meta_job_id == id {
if let Ok(stats) = serde_json::to_string(&meta_statistics) {
info!(
"gc-compaction meta statistics for job id = {}: {}",
id, stats
);
}
}
}
}
fn clear_running_job(&self) {
@@ -397,7 +471,11 @@ impl GcCompactionQueue {
let mut pending_tasks = Vec::new();
// gc-compaction might pick more layers or fewer layers to compact. The L2 LSN does not need to be accurate.
// And therefore, we simply assume the maximum LSN of all jobs is the expected L2 LSN.
let expected_l2_lsn = jobs.iter().map(|job| job.compact_lsn_range.end).max();
let expected_l2_lsn = jobs
.iter()
.map(|job| job.compact_lsn_range.end)
.max()
.unwrap();
for job in jobs {
// Unfortunately we need to convert the `GcCompactJob` back to `CompactionOptions`
// until we do further refactors to allow directly call `compact_with_gc`.
@@ -422,9 +500,13 @@ impl GcCompactionQueue {
if !auto {
pending_tasks.push(GcCompactionQueueItem::Notify(id, None));
} else {
pending_tasks.push(GcCompactionQueueItem::Notify(id, expected_l2_lsn));
pending_tasks.push(GcCompactionQueueItem::Notify(id, Some(expected_l2_lsn)));
}
let layer_size = self
.collect_layer_below_lsn(timeline, expected_l2_lsn)
.await?;
{
let mut guard = self.inner.lock().unwrap();
let mut tasks = Vec::new();
@@ -436,7 +518,16 @@ impl GcCompactionQueue {
for item in tasks {
guard.queued.push_front(item);
}
guard.meta_statistics = Some(GcCompactionMetaStatistics {
meta_job_id: id,
start_time: Some(SystemTime::now()),
before_compaction_layer_size: layer_size,
below_lsn: expected_l2_lsn,
total_sub_compaction_jobs: jobs_len,
..Default::default()
});
}
info!(
"scheduled enhanced gc bottom-most compaction with sub-compaction, split into {} jobs",
jobs_len
@@ -565,6 +656,10 @@ impl GcCompactionQueue {
Err(err) => {
warn!(%err, "failed to run gc-compaction subcompaction job");
self.clear_running_job();
let mut guard = self.inner.lock().unwrap();
if let Some(ref mut meta_statistics) = guard.meta_statistics {
meta_statistics.failed_sub_compaction_jobs += 1;
}
return Err(err);
}
};
@@ -574,8 +669,34 @@ impl GcCompactionQueue {
// we need to clean things up before returning from the function.
yield_for_l0 = true;
}
{
let mut guard = self.inner.lock().unwrap();
if let Some(ref mut meta_statistics) = guard.meta_statistics {
meta_statistics.succeeded_sub_compaction_jobs += 1;
}
}
}
GcCompactionQueueItem::Notify(id, l2_lsn) => {
let below_lsn = {
let mut guard = self.inner.lock().unwrap();
if let Some(ref mut meta_statistics) = guard.meta_statistics {
meta_statistics.below_lsn
} else {
Lsn::INVALID
}
};
let layer_size = if below_lsn != Lsn::INVALID {
self.collect_layer_below_lsn(timeline, below_lsn).await?
} else {
0
};
{
let mut guard = self.inner.lock().unwrap();
if let Some(ref mut meta_statistics) = guard.meta_statistics {
meta_statistics.after_compaction_layer_size = layer_size;
meta_statistics.finalize();
}
}
self.notify_and_unblock(id);
if let Some(l2_lsn) = l2_lsn {
let current_l2_lsn = timeline
@@ -819,7 +940,15 @@ impl KeyHistoryRetention {
base_img: &Option<(Lsn, &Bytes)>,
history: &[(Lsn, &NeonWalRecord)],
tline: &Arc<Timeline>,
skip_empty: bool,
) -> anyhow::Result<()> {
if base_img.is_none() && history.is_empty() {
if skip_empty {
return Ok(());
}
anyhow::bail!("verification failed: key {} has no history at {}", key, lsn);
};
let mut records = history
.iter()
.map(|(lsn, val)| (*lsn, (*val).clone()))
@@ -860,17 +989,12 @@ impl KeyHistoryRetention {
if *retain_lsn >= min_lsn {
// Only verify after the key appears in the full history for the first time.
if base_img.is_none() && history.is_empty() {
anyhow::bail!(
"verificatoin failed: key {} has no history at {}",
key,
retain_lsn
);
};
// We don't modify history: in theory, we could replace the history with a single
// image as in `generate_key_retention` to make redos at later LSNs faster. But we
// want to verify everything as if they are read from the real layer map.
collect_and_verify(key, *retain_lsn, &base_img, &history, tline).await?;
collect_and_verify(key, *retain_lsn, &base_img, &history, tline, false)
.await
.context("below horizon retain_lsn")?;
}
}
@@ -878,13 +1002,17 @@ impl KeyHistoryRetention {
match val {
Value::Image(img) => {
// Above the GC horizon, we verify every time we see an image.
collect_and_verify(key, *lsn, &base_img, &history, tline).await?;
collect_and_verify(key, *lsn, &base_img, &history, tline, true)
.await
.context("above horizon full image")?;
base_img = Some((*lsn, img));
history.clear();
}
Value::WalRecord(rec) if val.will_init() => {
// Above the GC horizon, we verify every time we see an init record.
collect_and_verify(key, *lsn, &base_img, &history, tline).await?;
collect_and_verify(key, *lsn, &base_img, &history, tline, true)
.await
.context("above horizon init record")?;
base_img = None;
history.clear();
history.push((*lsn, rec));
@@ -895,7 +1023,9 @@ impl KeyHistoryRetention {
}
}
// Ensure the latest record is readable.
collect_and_verify(key, max_lsn, &base_img, &history, tline).await?;
collect_and_verify(key, max_lsn, &base_img, &history, tline, false)
.await
.context("latest record")?;
Ok(())
}
}
@@ -1161,6 +1291,7 @@ impl Timeline {
.parts
.extend(sparse_partitioning.into_dense().parts);
// 3. Create new image layers for partitions that have been modified "enough".
let (image_layers, outcome) = self
.create_image_layers(
@@ -1222,8 +1353,7 @@ impl Timeline {
let partition_count = self.partitioning.read().0.0.parts.len();
// 4. Shard ancestor compaction
if self.shard_identity.count >= ShardCount::new(2) {
if self.get_compaction_shard_ancestor() && self.shard_identity.count >= ShardCount::new(2) {
// Limit the number of layer rewrites to the number of partitions: this means its
// runtime should be comparable to a full round of image layer creations, rather than
// being potentially much longer.
@@ -1273,7 +1403,10 @@ impl Timeline {
let pitr_cutoff = self.gc_info.read().unwrap().cutoffs.time;
let layers = self.layers.read().await;
for layer_desc in layers.layer_map()?.iter_historic_layers() {
let layers_iter = layers.layer_map()?.iter_historic_layers();
let (layers_total, mut layers_checked) = (layers_iter.len(), 0);
for layer_desc in layers_iter {
layers_checked += 1;
let layer = layers.get_from_desc(&layer_desc);
if layer.metadata().shard.shard_count == self.shard_identity.count {
// This layer does not belong to a historic ancestor, no need to re-image it.
@@ -1317,14 +1450,15 @@ impl Timeline {
continue;
}
// Don't bother re-writing a layer unless it will at least halve its size
// Only rewrite a layer if we can reclaim significant space.
if layer_local_page_count != u32::MAX
&& layer_local_page_count > layer_raw_page_count / 2
&& layer_local_page_count as f64 / layer_raw_page_count as f64
<= ANCESTOR_COMPACTION_REWRITE_THRESHOLD
{
debug!(%layer,
"layer is already mostly local ({}/{}), not rewriting",
layer_local_page_count,
layer_raw_page_count
"layer has a large share of local pages \
({layer_local_page_count}/{layer_raw_page_count} > \
{ANCESTOR_COMPACTION_REWRITE_THRESHOLD}), not rewriting",
);
}
@@ -1336,12 +1470,19 @@ impl Timeline {
continue;
}
// We do not yet implement rewrite of delta layers.
if layer_desc.is_delta() {
// We do not yet implement rewrite of delta layers
debug!(%layer, "Skipping rewrite of delta layer");
continue;
}
// We don't bother rewriting layers that aren't visible, since these won't be needed by
// reads and will likely be garbage collected soon.
if layer.visibility() != LayerVisibilityHint::Visible {
debug!(%layer, "Skipping rewrite of invisible layer");
continue;
}
// Only rewrite layers if their generations differ. This guarantees:
// - that local rewrite is safe, as local layer paths will differ between existing layer and rewritten one
// - that the layer is persistent in remote storage, as we only see old-generation'd layer via loading from remote storage
@@ -1371,7 +1512,8 @@ impl Timeline {
}
info!(
"starting shard ancestor compaction, rewriting {} layers and dropping {} layers \
"starting shard ancestor compaction, rewriting {} layers and dropping {} layers, \
checked {layers_checked}/{layers_total} layers \
(latest_gc_cutoff={} pitr_cutoff={})",
layers_to_rewrite.len(),
drop_layers.len(),

View File

@@ -18,8 +18,8 @@ use crate::tenant::remote_timeline_client::{
PersistIndexPartWithDeletedFlagError, RemoteTimelineClient,
};
use crate::tenant::{
CreateTimelineCause, DeleteTimelineError, MaybeDeletedIndexPart, Tenant, TenantManifestError,
Timeline, TimelineOrOffloaded,
CreateTimelineCause, DeleteTimelineError, MaybeDeletedIndexPart, TenantManifestError,
TenantShard, Timeline, TimelineOrOffloaded,
};
use crate::virtual_file::MaybeFatalIo;
@@ -113,7 +113,7 @@ pub(super) async fn delete_local_timeline_directory(
/// It is important that this gets called when DeletionGuard is being held.
/// For more context see comments in [`make_timeline_delete_guard`]
async fn remove_maybe_offloaded_timeline_from_tenant(
tenant: &Tenant,
tenant: &TenantShard,
timeline: &TimelineOrOffloaded,
_: &DeletionGuard, // using it as a witness
) -> anyhow::Result<()> {
@@ -192,7 +192,7 @@ impl DeleteTimelineFlow {
// error out if some of the shutdown tasks have already been completed!
#[instrument(skip_all)]
pub async fn run(
tenant: &Arc<Tenant>,
tenant: &Arc<TenantShard>,
timeline_id: TimelineId,
) -> Result<(), DeleteTimelineError> {
super::debug_assert_current_span_has_tenant_and_timeline_id();
@@ -288,7 +288,7 @@ impl DeleteTimelineFlow {
/// Shortcut to create Timeline in stopping state and spawn deletion task.
#[instrument(skip_all, fields(%timeline_id))]
pub(crate) async fn resume_deletion(
tenant: Arc<Tenant>,
tenant: Arc<TenantShard>,
timeline_id: TimelineId,
local_metadata: &TimelineMetadata,
remote_client: RemoteTimelineClient,
@@ -338,7 +338,7 @@ impl DeleteTimelineFlow {
fn schedule_background(
guard: DeletionGuard,
conf: &'static PageServerConf,
tenant: Arc<Tenant>,
tenant: Arc<TenantShard>,
timeline: TimelineOrOffloaded,
remote_client: Arc<RemoteTimelineClient>,
) {
@@ -381,7 +381,7 @@ impl DeleteTimelineFlow {
async fn background(
mut guard: DeletionGuard,
conf: &PageServerConf,
tenant: &Tenant,
tenant: &TenantShard,
timeline: &TimelineOrOffloaded,
remote_client: Arc<RemoteTimelineClient>,
) -> Result<(), DeleteTimelineError> {
@@ -435,7 +435,7 @@ pub(super) enum TimelineDeleteGuardKind {
}
pub(super) fn make_timeline_delete_guard(
tenant: &Tenant,
tenant: &TenantShard,
timeline_id: TimelineId,
guard_kind: TimelineDeleteGuardKind,
) -> Result<(TimelineOrOffloaded, DeletionGuard), DeleteTimelineError> {

View File

@@ -23,7 +23,7 @@ use super::layer_manager::LayerManager;
use super::{FlushLayerError, Timeline};
use crate::context::{DownloadBehavior, RequestContext};
use crate::task_mgr::TaskKind;
use crate::tenant::Tenant;
use crate::tenant::TenantShard;
use crate::tenant::remote_timeline_client::index::GcBlockingReason::DetachAncestor;
use crate::tenant::storage_layer::layer::local_layer_path;
use crate::tenant::storage_layer::{
@@ -265,7 +265,7 @@ async fn generate_tombstone_image_layer(
/// See [`Timeline::prepare_to_detach_from_ancestor`]
pub(super) async fn prepare(
detached: &Arc<Timeline>,
tenant: &Tenant,
tenant: &TenantShard,
behavior: DetachBehavior,
options: Options,
ctx: &RequestContext,
@@ -590,7 +590,7 @@ pub(super) async fn prepare(
async fn start_new_attempt(
detached: &Timeline,
tenant: &Tenant,
tenant: &TenantShard,
ancestor_timeline_id: TimelineId,
ancestor_lsn: Lsn,
) -> Result<Attempt, Error> {
@@ -611,7 +611,7 @@ async fn start_new_attempt(
async fn continue_with_blocked_gc(
detached: &Timeline,
tenant: &Tenant,
tenant: &TenantShard,
ancestor_timeline_id: TimelineId,
ancestor_lsn: Lsn,
) -> Result<Attempt, Error> {
@@ -622,7 +622,7 @@ async fn continue_with_blocked_gc(
fn obtain_exclusive_attempt(
detached: &Timeline,
tenant: &Tenant,
tenant: &TenantShard,
ancestor_timeline_id: TimelineId,
ancestor_lsn: Lsn,
) -> Result<Attempt, Error> {
@@ -655,7 +655,7 @@ fn obtain_exclusive_attempt(
fn reparented_direct_children(
detached: &Arc<Timeline>,
tenant: &Tenant,
tenant: &TenantShard,
) -> Result<HashSet<TimelineId>, Error> {
let mut all_direct_children = tenant
.timelines
@@ -950,7 +950,7 @@ impl DetachingAndReparenting {
/// See [`Timeline::detach_from_ancestor_and_reparent`].
pub(super) async fn detach_and_reparent(
detached: &Arc<Timeline>,
tenant: &Tenant,
tenant: &TenantShard,
prepared: PreparedTimelineDetach,
ancestor_timeline_id: TimelineId,
ancestor_lsn: Lsn,
@@ -1184,7 +1184,7 @@ pub(super) async fn detach_and_reparent(
pub(super) async fn complete(
detached: &Arc<Timeline>,
tenant: &Tenant,
tenant: &TenantShard,
mut attempt: Attempt,
_ctx: &RequestContext,
) -> Result<(), Error> {
@@ -1258,7 +1258,7 @@ where
}
fn check_no_archived_children_of_ancestor(
tenant: &Tenant,
tenant: &TenantShard,
detached: &Arc<Timeline>,
ancestor: &Arc<Timeline>,
ancestor_lsn: Lsn,

View File

@@ -33,7 +33,7 @@ use crate::tenant::size::CalculateSyntheticSizeError;
use crate::tenant::storage_layer::LayerVisibilityHint;
use crate::tenant::tasks::{BackgroundLoopKind, BackgroundLoopSemaphorePermit, sleep_random};
use crate::tenant::timeline::EvictionError;
use crate::tenant::{LogicalSizeCalculationCause, Tenant};
use crate::tenant::{LogicalSizeCalculationCause, TenantShard};
#[derive(Default)]
pub struct EvictionTaskTimelineState {
@@ -48,7 +48,7 @@ pub struct EvictionTaskTenantState {
impl Timeline {
pub(super) fn launch_eviction_task(
self: &Arc<Self>,
parent: Arc<Tenant>,
parent: Arc<TenantShard>,
background_tasks_can_start: Option<&completion::Barrier>,
) {
let self_clone = Arc::clone(self);
@@ -75,7 +75,7 @@ impl Timeline {
}
#[instrument(skip_all, fields(tenant_id = %self.tenant_shard_id.tenant_id, shard_id = %self.tenant_shard_id.shard_slug(), timeline_id = %self.timeline_id))]
async fn eviction_task(self: Arc<Self>, tenant: Arc<Tenant>) {
async fn eviction_task(self: Arc<Self>, tenant: Arc<TenantShard>) {
// acquire the gate guard only once within a useful span
let Ok(guard) = self.gate.enter() else {
return;
@@ -118,7 +118,7 @@ impl Timeline {
#[instrument(skip_all, fields(policy_kind = policy.discriminant_str()))]
async fn eviction_iteration(
self: &Arc<Self>,
tenant: &Tenant,
tenant: &TenantShard,
policy: &EvictionPolicy,
cancel: &CancellationToken,
gate: &GateGuard,
@@ -175,7 +175,7 @@ impl Timeline {
async fn eviction_iteration_threshold(
self: &Arc<Self>,
tenant: &Tenant,
tenant: &TenantShard,
p: &EvictionPolicyLayerAccessThreshold,
cancel: &CancellationToken,
gate: &GateGuard,
@@ -309,7 +309,7 @@ impl Timeline {
/// disk usage based eviction task.
async fn imitiate_only(
self: &Arc<Self>,
tenant: &Tenant,
tenant: &TenantShard,
p: &EvictionPolicyLayerAccessThreshold,
cancel: &CancellationToken,
gate: &GateGuard,
@@ -363,7 +363,7 @@ impl Timeline {
#[instrument(skip_all)]
async fn imitate_layer_accesses(
&self,
tenant: &Tenant,
tenant: &TenantShard,
p: &EvictionPolicyLayerAccessThreshold,
cancel: &CancellationToken,
gate: &GateGuard,
@@ -499,7 +499,7 @@ impl Timeline {
#[instrument(skip_all)]
async fn imitate_synthetic_size_calculation_worker(
&self,
tenant: &Tenant,
tenant: &TenantShard,
cancel: &CancellationToken,
ctx: &RequestContext,
) {

View File

@@ -244,7 +244,8 @@ impl RemoteStorageWrapper {
kind: DownloadKind::Large,
etag: None,
byte_start: Bound::Included(start_inclusive),
byte_end: Bound::Excluded(end_exclusive)
byte_end: Bound::Excluded(end_exclusive),
encryption_key: None,
},
&self.cancel)
.await?;

View File

@@ -8,7 +8,7 @@ use crate::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::tenant::remote_timeline_client::ShutdownIfArchivedError;
use crate::tenant::timeline::delete::{TimelineDeleteGuardKind, make_timeline_delete_guard};
use crate::tenant::{
DeleteTimelineError, OffloadedTimeline, Tenant, TenantManifestError, TimelineOrOffloaded,
DeleteTimelineError, OffloadedTimeline, TenantManifestError, TenantShard, TimelineOrOffloaded,
};
#[derive(thiserror::Error, Debug)]
@@ -33,7 +33,7 @@ impl From<TenantManifestError> for OffloadError {
}
pub(crate) async fn offload_timeline(
tenant: &Tenant,
tenant: &TenantShard,
timeline: &Arc<Timeline>,
) -> Result<(), OffloadError> {
debug_assert_current_span_has_tenant_and_timeline_id();
@@ -123,7 +123,7 @@ pub(crate) async fn offload_timeline(
///
/// Returns the strong count of the timeline `Arc`
fn remove_timeline_from_tenant(
tenant: &Tenant,
tenant: &TenantShard,
timeline: &Timeline,
_: &DeletionGuard, // using it as a witness
) -> usize {

View File

@@ -15,17 +15,19 @@ use super::Timeline;
use crate::context::RequestContext;
use crate::import_datadir;
use crate::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::tenant::{CreateTimelineError, CreateTimelineIdempotency, Tenant, TimelineOrOffloaded};
use crate::tenant::{
CreateTimelineError, CreateTimelineIdempotency, TenantShard, TimelineOrOffloaded,
};
/// A timeline with some of its files on disk, being initialized.
/// This struct ensures the atomicity of the timeline init: it's either properly created and inserted into pageserver's memory, or
/// its local files are removed. If we crash while this class exists, then the timeline's local
/// state is cleaned up during [`Tenant::clean_up_timelines`], because the timeline's content isn't in remote storage.
/// state is cleaned up during [`TenantShard::clean_up_timelines`], because the timeline's content isn't in remote storage.
///
/// The caller is responsible for proper timeline data filling before the final init.
#[must_use]
pub struct UninitializedTimeline<'t> {
pub(crate) owning_tenant: &'t Tenant,
pub(crate) owning_tenant: &'t TenantShard,
timeline_id: TimelineId,
raw_timeline: Option<(Arc<Timeline>, TimelineCreateGuard)>,
/// Whether we spawned the inner Timeline's tasks such that we must later shut it down
@@ -35,7 +37,7 @@ pub struct UninitializedTimeline<'t> {
impl<'t> UninitializedTimeline<'t> {
pub(crate) fn new(
owning_tenant: &'t Tenant,
owning_tenant: &'t TenantShard,
timeline_id: TimelineId,
raw_timeline: Option<(Arc<Timeline>, TimelineCreateGuard)>,
) -> Self {
@@ -156,7 +158,7 @@ impl<'t> UninitializedTimeline<'t> {
/// Prepares timeline data by loading it from the basebackup archive.
pub(crate) async fn import_basebackup_from_tar(
mut self,
tenant: Arc<Tenant>,
tenant: Arc<TenantShard>,
copyin_read: &mut (impl tokio::io::AsyncRead + Send + Sync + Unpin),
base_lsn: Lsn,
broker_client: storage_broker::BrokerClientChannel,
@@ -227,17 +229,17 @@ pub(crate) fn cleanup_timeline_directory(create_guard: TimelineCreateGuard) {
error!("Failed to clean up uninitialized timeline directory {timeline_path:?}: {e:?}")
}
}
// Having cleaned up, we can release this TimelineId in `[Tenant::timelines_creating]` to allow other
// Having cleaned up, we can release this TimelineId in `[TenantShard::timelines_creating]` to allow other
// timeline creation attempts under this TimelineId to proceed
drop(create_guard);
}
/// A guard for timeline creations in process: as long as this object exists, the timeline ID
/// is kept in `[Tenant::timelines_creating]` to exclude concurrent attempts to create the same timeline.
/// is kept in `[TenantShard::timelines_creating]` to exclude concurrent attempts to create the same timeline.
#[must_use]
pub(crate) struct TimelineCreateGuard {
pub(crate) _tenant_gate_guard: GateGuard,
pub(crate) owning_tenant: Arc<Tenant>,
pub(crate) owning_tenant: Arc<TenantShard>,
pub(crate) timeline_id: TimelineId,
pub(crate) timeline_path: Utf8PathBuf,
pub(crate) idempotency: CreateTimelineIdempotency,
@@ -263,7 +265,7 @@ pub(crate) enum TimelineExclusionError {
impl TimelineCreateGuard {
pub(crate) fn new(
owning_tenant: &Arc<Tenant>,
owning_tenant: &Arc<TenantShard>,
timeline_id: TimelineId,
timeline_path: Utf8PathBuf,
idempotency: CreateTimelineIdempotency,

View File

@@ -9,6 +9,7 @@ use tracing::info;
use utils::generation::Generation;
use utils::lsn::{AtomicLsn, Lsn};
use super::remote_timeline_client::index::EncryptionKeyPair;
use super::remote_timeline_client::is_same_remote_layer_path;
use super::storage_layer::{AsLayerDesc as _, LayerName, ResidentLayer};
use crate::tenant::metadata::TimelineMetadata;
@@ -245,7 +246,7 @@ impl UploadQueueInitialized {
pub(crate) fn num_inprogress_layer_uploads(&self) -> usize {
self.inprogress_tasks
.iter()
.filter(|(_, t)| matches!(t.op, UploadOp::UploadLayer(_, _, _)))
.filter(|(_, t)| matches!(t.op, UploadOp::UploadLayer(_, _, _, _)))
.count()
}
@@ -461,7 +462,12 @@ pub struct Delete {
#[derive(Clone, Debug)]
pub enum UploadOp {
/// Upload a layer file. The last field indicates the last operation for thie file.
UploadLayer(ResidentLayer, LayerFileMetadata, Option<OpType>),
UploadLayer(
ResidentLayer,
LayerFileMetadata,
Option<EncryptionKeyPair>,
Option<OpType>,
),
/// Upload a index_part.json file
UploadMetadata {
@@ -483,7 +489,7 @@ pub enum UploadOp {
impl std::fmt::Display for UploadOp {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
match self {
UploadOp::UploadLayer(layer, metadata, mode) => {
UploadOp::UploadLayer(layer, metadata, _, mode) => {
write!(
f,
"UploadLayer({}, size={:?}, gen={:?}, mode={:?})",
@@ -517,13 +523,13 @@ impl UploadOp {
(UploadOp::Shutdown, _) | (_, UploadOp::Shutdown) => false,
// Uploads and deletes can bypass each other unless they're for the same file.
(UploadOp::UploadLayer(a, ameta, _), UploadOp::UploadLayer(b, bmeta, _)) => {
(UploadOp::UploadLayer(a, ameta, _, _), UploadOp::UploadLayer(b, bmeta, _, _)) => {
let aname = &a.layer_desc().layer_name();
let bname = &b.layer_desc().layer_name();
!is_same_remote_layer_path(aname, ameta, bname, bmeta)
}
(UploadOp::UploadLayer(u, umeta, _), UploadOp::Delete(d))
| (UploadOp::Delete(d), UploadOp::UploadLayer(u, umeta, _)) => {
(UploadOp::UploadLayer(u, umeta, _, _), UploadOp::Delete(d))
| (UploadOp::Delete(d), UploadOp::UploadLayer(u, umeta, _, _)) => {
d.layers.iter().all(|(dname, dmeta)| {
!is_same_remote_layer_path(&u.layer_desc().layer_name(), umeta, dname, dmeta)
})
@@ -539,8 +545,8 @@ impl UploadOp {
// Similarly, index uploads can bypass uploads and deletes as long as neither the
// uploaded index nor the active index references the file (the latter would be
// incorrect use by the caller).
(UploadOp::UploadLayer(u, umeta, _), UploadOp::UploadMetadata { uploaded: i })
| (UploadOp::UploadMetadata { uploaded: i }, UploadOp::UploadLayer(u, umeta, _)) => {
(UploadOp::UploadLayer(u, umeta, _, _), UploadOp::UploadMetadata { uploaded: i })
| (UploadOp::UploadMetadata { uploaded: i }, UploadOp::UploadLayer(u, umeta, _, _)) => {
let uname = u.layer_desc().layer_name();
!i.references(&uname, umeta) && !index.references(&uname, umeta)
}
@@ -577,7 +583,7 @@ mod tests {
fn assert_same_op(a: &UploadOp, b: &UploadOp) {
use UploadOp::*;
match (a, b) {
(UploadLayer(a, ameta, atype), UploadLayer(b, bmeta, btype)) => {
(UploadLayer(a, ameta, _, atype), UploadLayer(b, bmeta, _, btype)) => {
assert_eq!(a.layer_desc().layer_name(), b.layer_desc().layer_name());
assert_eq!(ameta, bmeta);
assert_eq!(atype, btype);
@@ -641,6 +647,7 @@ mod tests {
generation: timeline.generation,
shard: timeline.get_shard_index(),
file_size: size as u64,
encryption_key: None,
};
make_layer_with_metadata(timeline, name, metadata)
}
@@ -710,7 +717,7 @@ mod tests {
// Enqueue non-conflicting upload, delete, and index before and after a barrier.
let ops = [
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None),
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None, None),
UploadOp::Delete(Delete {
layers: vec![(layer1.layer_desc().layer_name(), layer1.metadata())],
}),
@@ -718,7 +725,7 @@ mod tests {
uploaded: index.clone(),
},
UploadOp::Barrier(barrier),
UploadOp::UploadLayer(layer2.clone(), layer2.metadata(), None),
UploadOp::UploadLayer(layer2.clone(), layer2.metadata(), None, None),
UploadOp::Delete(Delete {
layers: vec![(layer3.layer_desc().layer_name(), layer3.metadata())],
}),
@@ -843,9 +850,9 @@ mod tests {
);
let ops = [
UploadOp::UploadLayer(layer0a.clone(), layer0a.metadata(), None),
UploadOp::UploadLayer(layer0b.clone(), layer0b.metadata(), None),
UploadOp::UploadLayer(layer0c.clone(), layer0c.metadata(), None),
UploadOp::UploadLayer(layer0a.clone(), layer0a.metadata(), None, None),
UploadOp::UploadLayer(layer0b.clone(), layer0b.metadata(), None, None),
UploadOp::UploadLayer(layer0c.clone(), layer0c.metadata(), None, None),
];
queue.queued_operations.extend(ops.clone());
@@ -882,14 +889,14 @@ mod tests {
);
let ops = [
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None),
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None, None),
UploadOp::Delete(Delete {
layers: vec![
(layer0.layer_desc().layer_name(), layer0.metadata()),
(layer1.layer_desc().layer_name(), layer1.metadata()),
],
}),
UploadOp::UploadLayer(layer1.clone(), layer1.metadata(), None),
UploadOp::UploadLayer(layer1.clone(), layer1.metadata(), None, None),
];
queue.queued_operations.extend(ops.clone());
@@ -938,15 +945,15 @@ mod tests {
);
let ops = [
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None),
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None, None),
UploadOp::Delete(Delete {
layers: vec![
(layer0.layer_desc().layer_name(), layer0.metadata()),
(layer1.layer_desc().layer_name(), layer1.metadata()),
],
}),
UploadOp::UploadLayer(layer1.clone(), layer1.metadata(), None),
UploadOp::UploadLayer(layer2.clone(), layer2.metadata(), None),
UploadOp::UploadLayer(layer1.clone(), layer1.metadata(), None, None),
UploadOp::UploadLayer(layer2.clone(), layer2.metadata(), None, None),
UploadOp::Delete(Delete {
layers: vec![(layer3.layer_desc().layer_name(), layer3.metadata())],
}),
@@ -984,9 +991,9 @@ mod tests {
);
let ops = [
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None),
UploadOp::UploadLayer(layer1.clone(), layer1.metadata(), None),
UploadOp::UploadLayer(layer2.clone(), layer2.metadata(), None),
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None, None),
UploadOp::UploadLayer(layer1.clone(), layer1.metadata(), None, None),
UploadOp::UploadLayer(layer2.clone(), layer2.metadata(), None, None),
];
queue.queued_operations.extend(ops.clone());
@@ -1061,15 +1068,15 @@ mod tests {
let index2 = index_with(&index1, &layer2);
let ops = [
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None),
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None, None),
UploadOp::UploadMetadata {
uploaded: index0.clone(),
},
UploadOp::UploadLayer(layer1.clone(), layer1.metadata(), None),
UploadOp::UploadLayer(layer1.clone(), layer1.metadata(), None, None),
UploadOp::UploadMetadata {
uploaded: index1.clone(),
},
UploadOp::UploadLayer(layer2.clone(), layer2.metadata(), None),
UploadOp::UploadLayer(layer2.clone(), layer2.metadata(), None, None),
UploadOp::UploadMetadata {
uploaded: index2.clone(),
},
@@ -1128,7 +1135,7 @@ mod tests {
let ops = [
// Initial upload, with a barrier to prevent index coalescing.
UploadOp::UploadLayer(layer.clone(), layer.metadata(), None),
UploadOp::UploadLayer(layer.clone(), layer.metadata(), None, None),
UploadOp::UploadMetadata {
uploaded: index_upload.clone(),
},
@@ -1177,7 +1184,7 @@ mod tests {
let ops = [
// Initial upload, with a barrier to prevent index coalescing.
UploadOp::UploadLayer(layer.clone(), layer.metadata(), None),
UploadOp::UploadLayer(layer.clone(), layer.metadata(), None, None),
UploadOp::UploadMetadata {
uploaded: index_upload.clone(),
},
@@ -1187,7 +1194,7 @@ mod tests {
uploaded: index_deref.clone(),
},
// Replace and reference the layer.
UploadOp::UploadLayer(layer.clone(), layer.metadata(), None),
UploadOp::UploadLayer(layer.clone(), layer.metadata(), None, None),
UploadOp::UploadMetadata {
uploaded: index_ref.clone(),
},
@@ -1235,7 +1242,7 @@ mod tests {
// Enqueue non-conflicting upload, delete, and index before and after a shutdown.
let ops = [
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None),
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None, None),
UploadOp::Delete(Delete {
layers: vec![(layer1.layer_desc().layer_name(), layer1.metadata())],
}),
@@ -1243,7 +1250,7 @@ mod tests {
uploaded: index.clone(),
},
UploadOp::Shutdown,
UploadOp::UploadLayer(layer2.clone(), layer2.metadata(), None),
UploadOp::UploadLayer(layer2.clone(), layer2.metadata(), None, None),
UploadOp::Delete(Delete {
layers: vec![(layer3.layer_desc().layer_name(), layer3.metadata())],
}),
@@ -1305,10 +1312,10 @@ mod tests {
);
let ops = [
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None),
UploadOp::UploadLayer(layer1.clone(), layer1.metadata(), None),
UploadOp::UploadLayer(layer2.clone(), layer2.metadata(), None),
UploadOp::UploadLayer(layer3.clone(), layer3.metadata(), None),
UploadOp::UploadLayer(layer0.clone(), layer0.metadata(), None, None),
UploadOp::UploadLayer(layer1.clone(), layer1.metadata(), None, None),
UploadOp::UploadLayer(layer2.clone(), layer2.metadata(), None, None),
UploadOp::UploadLayer(layer3.clone(), layer3.metadata(), None, None),
];
queue.queued_operations.extend(ops.clone());
@@ -1359,7 +1366,7 @@ mod tests {
.layer_metadata
.insert(layer.layer_desc().layer_name(), layer.metadata());
vec![
UploadOp::UploadLayer(layer.clone(), layer.metadata(), None),
UploadOp::UploadLayer(layer.clone(), layer.metadata(), None, None),
UploadOp::Delete(Delete {
layers: vec![(layer.layer_desc().layer_name(), layer.metadata())],
}),
@@ -1378,6 +1385,7 @@ mod tests {
shard,
generation: Generation::Valid(generation),
file_size: 0,
encryption_key: None,
};
make_layer_with_metadata(&tli, name, metadata)
};

View File

@@ -26,7 +26,7 @@ use utils::lsn::Lsn;
use utils::vec_map::VecMap;
use crate::context::RequestContext;
use crate::tenant::blob_io::{BYTE_UNCOMPRESSED, BYTE_ZSTD, LEN_COMPRESSION_BIT_MASK};
use crate::tenant::blob_io::{BYTE_UNCOMPRESSED, BYTE_ZSTD, Header};
use crate::virtual_file::{self, IoBufferMut, VirtualFile};
/// Metadata bundled with the start and end offset of a blob.
@@ -111,18 +111,20 @@ impl From<Bytes> for BufView<'_> {
pub struct VectoredBlob {
/// Blob metadata.
pub meta: BlobMeta,
/// Start offset.
start: usize,
/// Header start offset.
header_start: usize,
/// Data start offset.
data_start: usize,
/// End offset.
end: usize,
/// Compression used on the the blob.
/// Compression used on the data, extracted from the header.
compression_bits: u8,
}
impl VectoredBlob {
/// Reads a decompressed view of the blob.
pub(crate) async fn read<'a>(&self, buf: &BufView<'a>) -> Result<BufView<'a>, std::io::Error> {
let view = buf.view(self.start..self.end);
let view = buf.view(self.data_start..self.end);
match self.compression_bits {
BYTE_UNCOMPRESSED => Ok(view),
@@ -140,13 +142,18 @@ impl VectoredBlob {
std::io::ErrorKind::InvalidData,
format!(
"Failed to decompress blob for {}@{}, {}..{}: invalid compression byte {bits:x}",
self.meta.key, self.meta.lsn, self.start, self.end
self.meta.key, self.meta.lsn, self.data_start, self.end
),
);
Err(error)
}
}
}
/// Returns the raw blob including header.
pub(crate) fn raw_with_header<'a>(&self, buf: &BufView<'a>) -> BufView<'a> {
buf.view(self.header_start..self.end)
}
}
impl std::fmt::Display for VectoredBlob {
@@ -154,7 +161,7 @@ impl std::fmt::Display for VectoredBlob {
write!(
f,
"{}@{}, {}..{}",
self.meta.key, self.meta.lsn, self.start, self.end
self.meta.key, self.meta.lsn, self.data_start, self.end
)
}
}
@@ -493,50 +500,28 @@ impl<'a> VectoredBlobReader<'a> {
let blobs_at = read.blobs_at.as_slice();
let start_offset = read.start;
let mut metas = Vec::with_capacity(blobs_at.len());
let mut blobs = Vec::with_capacity(blobs_at.len());
// Blobs in `read` only provide their starting offset. The end offset
// of a blob is implicit: the start of the next blob if one exists
// or the end of the read.
for (blob_start, meta) in blobs_at {
let blob_start_in_buf = blob_start - start_offset;
let first_len_byte = buf[blob_start_in_buf as usize];
for (blob_start, meta) in blobs_at.iter().copied() {
let header_start = (blob_start - read.start) as usize;
let header = Header::decode(&buf[header_start..])?;
let data_start = header_start + header.header_len;
let end = data_start + header.data_len;
let compression_bits = header.compression_bits;
// Each blob is prefixed by a header containing its size and compression information.
// Extract the size and skip that header to find the start of the data.
// The size can be 1 or 4 bytes. The most significant bit is 0 in the
// 1 byte case and 1 in the 4 byte case.
let (size_length, blob_size, compression_bits) = if first_len_byte < 0x80 {
(1, first_len_byte as u64, BYTE_UNCOMPRESSED)
} else {
let mut blob_size_buf = [0u8; 4];
let offset_in_buf = blob_start_in_buf as usize;
blob_size_buf.copy_from_slice(&buf[offset_in_buf..offset_in_buf + 4]);
blob_size_buf[0] &= !LEN_COMPRESSION_BIT_MASK;
let compression_bits = first_len_byte & LEN_COMPRESSION_BIT_MASK;
(
4,
u32::from_be_bytes(blob_size_buf) as u64,
compression_bits,
)
};
let start = (blob_start_in_buf + size_length) as usize;
let end = start + blob_size as usize;
metas.push(VectoredBlob {
start,
blobs.push(VectoredBlob {
header_start,
data_start,
end,
meta: *meta,
meta,
compression_bits,
});
}
Ok(VectoredBlobsBuf { buf, blobs: metas })
Ok(VectoredBlobsBuf { buf, blobs })
}
}
@@ -997,6 +982,15 @@ mod tests {
&read_buf[..],
"mismatch for idx={idx} at offset={offset}"
);
// Check that raw_with_header returns a valid header.
let raw = read_blob.raw_with_header(&view);
let header = Header::decode(&raw)?;
if !compression || header.header_len == 1 {
assert_eq!(header.compression_bits, BYTE_UNCOMPRESSED);
}
assert_eq!(raw.len(), header.total_len());
buf = result.buf;
}
Ok(())

View File

@@ -1366,7 +1366,8 @@ pub(crate) type IoBuffer = AlignedBuffer<ConstAlign<{ get_io_buffer_alignment()
pub(crate) type IoPageSlice<'a> =
AlignedSlice<'a, PAGE_SZ, ConstAlign<{ get_io_buffer_alignment() }>>;
static IO_MODE: AtomicU8 = AtomicU8::new(IoMode::preferred() as u8);
static IO_MODE: once_cell::sync::Lazy<AtomicU8> =
once_cell::sync::Lazy::new(|| AtomicU8::new(IoMode::preferred() as u8));
pub(crate) fn set_io_mode(mode: IoMode) {
IO_MODE.store(mode as u8, std::sync::atomic::Ordering::Relaxed);

View File

@@ -95,7 +95,7 @@ static uint32 local_request_counter;
* Various settings related to prompt (fast) handling of PageStream responses
* at any CHECK_FOR_INTERRUPTS point.
*/
int readahead_getpage_pull_timeout_ms = 0;
int readahead_getpage_pull_timeout_ms = 50;
static int PS_TIMEOUT_ID = 0;
static bool timeout_set = false;
static bool timeout_signaled = false;

View File

@@ -75,7 +75,7 @@ char *neon_auth_token;
int readahead_buffer_size = 128;
int flush_every_n_requests = 8;
int neon_protocol_version = 2;
int neon_protocol_version = 3;
static int neon_compute_mode = 0;
static int max_reconnect_attempts = 60;
@@ -1362,7 +1362,7 @@ pg_init_libpagestore(void)
"",
PGC_POSTMASTER,
0, /* no flags required */
check_neon_id, NULL, NULL);
NULL, NULL, NULL);
DefineCustomStringVariable("neon.branch_id",
"Neon branch_id the server is running on",
NULL,
@@ -1370,7 +1370,7 @@ pg_init_libpagestore(void)
"",
PGC_POSTMASTER,
0, /* no flags required */
check_neon_id, NULL, NULL);
NULL, NULL, NULL);
DefineCustomStringVariable("neon.endpoint_id",
"Neon endpoint_id the server is running on",
NULL,
@@ -1378,7 +1378,7 @@ pg_init_libpagestore(void)
"",
PGC_POSTMASTER,
0, /* no flags required */
check_neon_id, NULL, NULL);
NULL, NULL, NULL);
DefineCustomIntVariable("neon.stripe_size",
"sharding stripe size",
@@ -1432,7 +1432,7 @@ pg_init_libpagestore(void)
"PageStream connection when we have pages which "
"were read ahead but not yet received.",
&readahead_getpage_pull_timeout_ms,
0, 0, 5 * 60 * 1000,
50, 0, 5 * 60 * 1000,
PGC_USERSET,
GUC_UNIT_MS,
NULL, NULL, NULL);
@@ -1440,7 +1440,7 @@ pg_init_libpagestore(void)
"Version of compute<->page server protocol",
NULL,
&neon_protocol_version,
2, /* use protocol version 2 */
3, /* use protocol version 3 */
2, /* min */
3, /* max */
PGC_SU_BACKEND,

View File

@@ -2040,7 +2040,7 @@ neon_finish_unlogged_build_phase_1(SMgrRelation reln)
/*
* neon_end_unlogged_build() -- Finish an unlogged rel build.
*
* Call this after you have finished WAL-logging an relation that was
* Call this after you have finished WAL-logging a relation that was
* first populated without WAL-logging.
*
* This removes the local copy of the rel, since it's now been fully
@@ -2059,14 +2059,35 @@ neon_end_unlogged_build(SMgrRelation reln)
if (unlogged_build_phase != UNLOGGED_BUILD_NOT_PERMANENT)
{
XLogRecPtr recptr;
BlockNumber nblocks;
Assert(unlogged_build_phase == UNLOGGED_BUILD_PHASE_2);
Assert(reln->smgr_relpersistence == RELPERSISTENCE_UNLOGGED);
/*
* Update the last-written LSN cache.
*
* The relation is still on local disk so we can get the size by
* calling mdnblocks() directly. For the LSN, GetXLogInsertRecPtr() is
* very conservative. If we could assume that this function is called
* from the same backend that WAL-logged the contents, we could use
* XactLastRecEnd here. But better safe than sorry.
*/
nblocks = mdnblocks(reln, MAIN_FORKNUM);
recptr = GetXLogInsertRecPtr();
neon_set_lwlsn_block_range(recptr,
InfoFromNInfoB(rinfob),
MAIN_FORKNUM, 0, nblocks);
neon_set_lwlsn_relation(recptr,
InfoFromNInfoB(rinfob),
MAIN_FORKNUM);
/* Make the relation look permanent again */
reln->smgr_relpersistence = RELPERSISTENCE_PERMANENT;
/* Remove local copy */
rinfob = InfoBFromSMgrRel(reln);
for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++)
{
neon_log(SmgrTrace, "forgetting cached relsize for %u/%u/%u.%u",

View File

@@ -890,7 +890,7 @@ libpqwp_connect_start(char *conninfo)
* palloc will exit on failure though, so there's not much we could do if
* it *did* fail.
*/
conn = palloc(sizeof(WalProposerConn));
conn = (WalProposerConn*)MemoryContextAllocZero(TopMemoryContext, sizeof(WalProposerConn));
conn->pg_conn = pg_conn;
conn->is_nonblocking = false; /* connections always start in blocking
* mode */

View File

@@ -776,7 +776,6 @@ impl From<&jose_jwk::Key> for KeyType {
}
#[cfg(test)]
#[expect(clippy::unwrap_used)]
mod tests {
use std::future::IntoFuture;
use std::net::SocketAddr;

View File

@@ -253,7 +253,6 @@ fn project_name_valid(name: &str) -> bool {
}
#[cfg(test)]
#[expect(clippy::unwrap_used)]
mod tests {
use ComputeUserInfoParseError::*;
use serde_json::json;

View File

@@ -258,7 +258,7 @@ async fn ssl_handshake<S: AsyncRead + AsyncWrite + Unpin>(
"unexpected startup packet, rejecting connection"
);
stream
.throw_error_str(ERR_INSECURE_CONNECTION, crate::error::ErrorKind::User)
.throw_error_str(ERR_INSECURE_CONNECTION, crate::error::ErrorKind::User, None)
.await?
}
}

View File

@@ -259,7 +259,6 @@ impl EndpointsCache {
}
#[cfg(test)]
#[expect(clippy::unwrap_used)]
mod tests {
use super::*;

Some files were not shown because too many files have changed in this diff Show More