Compare commits

..

86 Commits

Author SHA1 Message Date
Tristan Partin
0df4286056 Update docs/rfcs/2025-04-23-architecture-naming-conventions.md
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2025-07-22 10:32:20 -05:00
Tristan Partin
800f90f6a6 Add Architecture Naming Scheme RFC
Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-23 13:07:30 -05:00
devin-ai-integration[bot]
8e09ecf2ab Fix KeyError in physical replication benchmark test (#11675)
# Fix KeyError in physical replication benchmark test

This PR fixes the failing physical replication benchmark test that was
encountering a KeyError: 'endpoints'.

The issue was in accessing `project["project"]["endpoints"][0]["id"]`
when it should be `project["endpoints"][0]["id"]`, consistent with how
endpoints are accessed elsewhere in the codebase.

Fixed the issue in both test functions:
- test_ro_replica_lag
- test_replication_start_stop

Link to Devin run:
https://app.devin.ai/sessions/be3fe9a9ee5942e4b12e74a7055f541b
Requested by: Peter Bendel

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: peterbendel@neon.tech <peterbendel@neon.tech>
2025-04-23 14:51:08 +00:00
Mikhail Kot
c3534cea39 Rename object_storage->endpoint_storage (#11678)
1. Rename service to avoid ambiguity as discussed in Slack
2. Ignore endpoint_id in read paths as requested in
https://github.com/neondatabase/cloud/issues/26346#issuecomment-2806758224
2025-04-23 14:03:19 +00:00
Folke Behrens
21d3d60cef proxy/pglb: Add in-process connection support (#11677)
Define a `Connection` and a `Stream` type that resemble simple QUIC
connections
and (multiplexed) streams.
2025-04-23 12:18:30 +00:00
Tristan Partin
b00db536bb Add CPU architecture to the remote extensions object key (#11590)
ARM computes are incoming and we need to account for that in remote
extensions. Previously, we just blindly assumed that all computes were
x86_64.

Note that we use the Go architecture naming convention instead of the
Rust one directly to do our best and be consistent across the stack.

Part-of: https://github.com/neondatabase/cloud/issues/23148

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-22 22:47:22 +00:00
Arpad Müller
149cbd1e0a Support single and two safekeeper scenarios (#11483)
In tests and when one safekeeper is down in small regions, we need to
contend with one or two safekeepers. Before, we gave an error in
`safekeepers_for_new_timeline`. Now we just silently allow the timeline
to be created on one or two safekeepers.

Part of #9011
2025-04-22 21:27:01 +00:00
Alexander Lakhin
7b949daf13 fix(test): allow reconcile errors in test_storage_controller_heartbeats (#11665)
## Problem

test_storage_controller_heartbeats is flaky because of unallowed
reconciler errors (#11625)

## Summary of changes

Allow reconcile errors as in other tests in test_storage_controller.py.
2025-04-22 18:13:16 +00:00
Konstantin Knizhnik
132b6154bb Unlogged build debug compare local v2 (#11554)
## Problem

Init fork is used in DEBUG_COMPARE_LOCAL to determine unlogged relation
or unlogged build.
But it is created only after the relation is initialized and so can be
swapped out, producing `Page is evicted with zero LSN` error.

## Summary of changes

Create init fork together with main fork for unlogged relations in
DEBUG_COMPARE_LOCAL mode.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-04-22 18:07:45 +00:00
Alex Chi Z.
ad3519ebcb fix(pageserver): report synthetic size = 1 if all tls offloaded (#11648)
## Problem

A quick workaround for https://github.com/neondatabase/neon/issues/11631

## Summary of changes

Report synthetic size == 1 if all timelines are offloaded.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-22 14:28:22 +00:00
Dmitrii Kovalkov
6173c0f44c safekeeper: add enable_tls_wal_service_api (#11520)
## Problem
Safekeeper doesn't use TLS in wal service
- Closes: https://github.com/neondatabase/cloud/issues/27302

## Summary of changes
- Add `enable_tls_wal_service_api` option to safekeeper's cmd arguments
- Propagate `tls_server_config` to `wal_service` if the option is
enabled
- Create `BACKGROUND_RUNTIME` for small background tasks and offload SSL
certificate reloader to it.

No integration tests for now because support from compute side is
required: https://github.com/neondatabase/cloud/issues/25823
2025-04-22 13:19:03 +00:00
a-masterov
fd916abf25 Remove NOTICE messages, which can make the pg_repack regression test fail. (#11659)
## Problem
The pg_repack test can be flaky due to unpredictable `NOTICE` messages
about waiting for some processes.
E.g., 
```
 INFO: repacking table "public.issue3_2"
+NOTICE: Waiting for 1 transactions to finish. First PID: 427
```
## Summary of changes
The `client_min_messages` set to `warning` for the regression tests.
2025-04-22 11:43:45 +00:00
Alexander Bayandin
cd2e1fbc7c CI(benchmarks): upload perf results for passed tests (#11649)
## Problem

We run benchmarks in batches (five parallel jobs on different runners).
If any test in a batch fails, we won’t upload any results for that
batch, even for the tests that passed.

## Summary of changes
- Move the results upload to a separate step in the run-python-test-set
action, and execute this step even if tests fail.
2025-04-22 09:41:28 +00:00
Tristan Partin
5df4a747e6 Update pgbouncer in compute images to 1.24.1 (#11651)
Fixes CVE-2025-2291.

Link:
https://www.postgresql.org/about/news/pgbouncer-1241-released-fixes-cve-2025-2291-3059/

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-21 17:49:17 +00:00
Vlad Lazar
cbf442292b pageserver: handle empty get vectored queries (#11652)
## Problem

If all batched requests are excluded from the query by
`Timeine::get_rel_page_at_lsn_batched` (e.g. because they are past the
end of the relation), the read path would panic since it doesn't expect
empty queries. This is a change in behaviour that was introduced with
the scattered query implementation.

## Summary of Changes

Handle empty queries explicitly.
2025-04-21 17:45:16 +00:00
Heikki Linnakangas
4d0c1e8b78 refactor: Extract some code in pagebench getpage command to function (#11563)
This makes it easier to add a different client implementation alongside
the current one. I started working on a new gRPC-based protocol to
replace the libpq protocol, which will introduce a new function like
`client_libpq`, but for the new protocol.

It's a little more readable with less indentation anyway.
2025-04-19 08:38:03 +00:00
JC Grünhage
3158442a59 fix(ci): set token for fast-forward failure comments and allow merging with state unstable (#11647)
## Problem

https://github.com/neondatabase/neon/actions/runs/14538136318/job/40790985693?pr=11645
failed, even though the relevant parts of the CI had passed and
auto-merge determined the PR is ready to merge. After that, commenting
failed.

## Summary of changes
- set GH_TOKEN for commenting after fast-forward failure
- allow merging with mergeable_state unstable
2025-04-18 17:49:34 +00:00
JC Grünhage
f006879fb7 fix(ci): make regex to find rc branches less strict (#11646)
## Problem

https://github.com/neondatabase/neon/actions/runs/14537161022/job/40787763965
failed to find the correct RC PR run, preventing artifact re-use. This
broke in https://github.com/neondatabase/neon/pull/11547.

There's a hotfix release containing this in
https://github.com/neondatabase/neon/pull/11645.

## Summary of changes
Make the regex for finding the RC PR run less strict, it was needlessly
precise.
2025-04-18 16:39:18 +00:00
Dmitrii Kovalkov
a0d844dfed pageserver + safekeeper: pass ssl ca certs to broker client (#11635)
## Problem
Pageservers and safakeepers do not pass CA certificates to broker
client, so the client do not trust locally issued certificates.
- Part of https://github.com/neondatabase/cloud/issues/27492

## Summary of changes
- Change `ssl_ca_certs` type in PS/SK's config to `Pem` which may be
converted to both `reqwest` and `tonic` certificates.
- Pass CA certificates to storage broker client in PS and SK
2025-04-18 06:27:23 +00:00
Alex Chi Z.
5073e46df4 feat(pageserver): use rfc3339 time and print ratio in gc-compact stats (#11638)
## Problem

follow-up on https://github.com/neondatabase/neon/pull/11601

## Summary of changes

- serialize the start/end time using rfc3339 time string
- compute the size ratio of the compaction

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-18 05:28:01 +00:00
Alexander Bayandin
182bd95a4e CI(regress-tests): run tests on large-metal (#11634)
## Problem

Regression tests are more flaky on virtualised (`qemu-x64-*`) runners

See https://neondb.slack.com/archives/C069Z2199DL/p1744891865307769
Ref https://github.com/neondatabase/neon/issues/11627

## Summary of changes
- Switch `regress-tests` to metal-only large runners to mitigate flaky
behaviour
2025-04-18 01:25:38 +00:00
Anastasia Lubennikova
ce7795a67d compute: use project_id, endpoint_id as tag (#11556)
for compute audit logs

part of https://github.com/neondatabase/cloud/issues/21955
2025-04-17 23:32:38 +00:00
Suhas Thalanki
134d01c771 remove pg_anon.patch (#11636)
This PR removes `pg_anon.patch` as the `anon` v1 extension has been
removed and the patch is not being used anywhere
2025-04-17 22:08:16 +00:00
Arpad Müller
c1e4befd56 Additional fixes and improvements to storcon safekeeper timelines (#11477)
This delivers some additional fixes and improvements to storcon managed
safekeeper timelines:

* use `i32::MAX` for the generation number of timeline deletion
* start the generation for new timelines at 1 instead of 0: this ensures
that the other components actually are generation enabled
* fix database operations we use for metrics
* use join in list_pending_ops to prevent the classical ORM issue where
one does many db queries
* use enums in `test_storcon_create_delete_sk_down`. we are adding a
second parameter, and having two bool parameters is weird.
* extend `test_storcon_create_delete_sk_down` with a test of whole
tenant deletion. this hasn't been tested before.
* remove some redundant logging contexts
* Don't require mutable access to the service lock for scheduling
pending ops in memory. In order to pull this off, create reconcilers
eagerly. The advantage is that we don't need mutable access to the
service lock that way any more.

Part of #9011

---------

Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2025-04-17 20:25:30 +00:00
a-masterov
6c2e5c044c random operations test (#10986)
## Problem
We need to test the stability of Neon.

## Summary of changes
The test runs random operations on a Neon project. It performs via the
Public API calls the following operations: `create a branch`, `delete a
branch`, `add a read-only endpoint`, `delete a read-only endpoint`,
`restore a branch to a random position in the past`. All the branches
and endpoints are loaded with `pgbench`.

---------

Co-authored-by: Peter Bendel <peterbendel@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2025-04-17 19:59:35 +00:00
Alex Chi Z.
748539b222 fix(pageserver): lower L0 compaction threshold (#11617)
## Problem

We saw OOMs due to L0 compaction happening simultaneously for all shards
of the same tenant right after the shard split.

## Summary of changes

Lower the threshold so that we compact fewer files.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-17 19:51:28 +00:00
Alex Chi Z.
ad0c5fdae7 fix(test): allow stale generation warnings in storcon (#11624)
## Problem

https://github.com/neondatabase/neon/pull/11531 did not fully fix the
problem because the warning is part of the storcon instead of
pageserver.

## Summary of changes

Allow stale generation error in storcon.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-17 16:12:24 +00:00
Christian Schwarz
2b041964b3 cover direct IO + concurrent IO in unit, regression & perf tests (#11585)
This mirrors the production config.

Thread that discusses the merits of this:
- https://neondb.slack.com/archives/C033RQ5SPDH/p1744742010740569

# Refs
- context
https://neondb.slack.com/archives/C04BLQ4LW7K/p1744724844844589?thread_ts=1744705831.014169&cid=C04BLQ4LW7K
- prep for https://github.com/neondatabase/neon/pull/11558 which adds
new io mode `direct-rw`

# Impact on CI turnaround time

Spot-checking impact on CI timings

- Baseline: [some recent main
commit](https://github.com/neondatabase/neon/actions/runs/14471549758/job/40587837475)
- Comparison: [this
commit](https://github.com/neondatabase/neon/actions/runs/14471945087/job/40589613274)
in this PR here

Impact on CI turnaround time

- Regression tests:
  - x64: very minor, sometimes better; likely in the noise
  - arm64: substantial  30min => 40min
- Benchmarks (x86 only I think): very minor; noise seems higher than
regress tests

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Alex Chi Z. <4198311+skyzh@users.noreply.github.com>
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
Co-authored-by: Alex Chi Z <chi@neon.tech>
2025-04-17 15:53:10 +00:00
John Spray
d4c059a884 tests: use endpoint http wrapper to get auth (#11628)
## Problem

`test_compute_startup_simple` and `test_compute_ondemand_slru_startup`
are failing.

This test implicitly asserts that the metrics.json endpoint succeeds and
returns all expected metrics, but doesn't make it easy to see what went
wrong if it doesn't (e.g. in this failure
https://neon-github-public-dev.s3.amazonaws.com/reports/main/14513210240/index.html#suites/13d8e764c394daadbad415a08454c04e/b0f92a86b2ed309f/)

In this case, it was failing because of a missing auth token, because it
was using `requests` directly instead of using the endpoint http client
type.

## Summary of changes

- Use endpoint http wrapper to get raise_for_status & auth token
2025-04-17 15:03:23 +00:00
Folke Behrens
2c56c46d48 compute: Set max log level for local proxy sql_over_http mod to WARN (#11629)
neondatabase/cloud#27738
2025-04-17 14:38:19 +00:00
Tristan Partin
d1728a6bcd Remove old compatibility hack for remote extensions (#11620)
Control plane has long since been updated to send the right value.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-17 14:08:42 +00:00
John Spray
0a27973584 pageserver: rename Tenant to TenantShard (#11589)
## Problem

`Tenant` isn't really a whole tenant: it's just one shard of a tenant.

## Summary of changes

- Automated rename of Tenant to TenantShard
- Followup commit to change references in comments
2025-04-17 13:29:16 +00:00
Alexander Bayandin
07c2411f6b tests: remove mentions of ALLOW_*_COMPATIBILITY_BREAKAGE (#11618)
## Problem

There are mentions of `ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` and
`ALLOW_FORWARD_COMPATIBILITY_BREAKAGE`, but in reality, this mechanism
doesn't work, so let's remove it to avoid confusion.

The idea behind it was to allow some breaking changes by adding a
special label to a PR that would `xfail` the test. However, in practice,
this means we would need to carry this label through all subsequent PRs
until the release (and artifact regeneration). This approach isn't
really viable, as it increases the risk of missing a compatibility break
in another PR.

## Summary of changes
- Remove mentions and handling of
`ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` /
`ALLOW_FORWARD_COMPATIBILITY_BREAKAGE`
2025-04-17 10:03:21 +00:00
Alexander Bayandin
5819938c93 CI(pg-clients): fix workflow permissions (#11623)
## Problem

`pg-clients` can't start:

```
The workflow is not valid. .github/workflows/pg-clients.yml (Line: 44, Col: 3): Error calling workflow 'neondatabase/neon/.github/workflows/build-build-tools-image.yml@aa19f10e7e958fbe0e0641f2e8c5952ce3be44b3'. The nested job 'check-image' is requesting 'packages: read', but is only allowed 'packages: none'. .github/workflows/pg-clients.yml (Line: 44, Col: 3): Error calling workflow 'neondatabase/neon/.github/workflows/build-build-tools-image.yml@aa19f10e7e958fbe0e0641f2e8c5952ce3be44b3'. The nested job 'build-image' is requesting 'packages: write', but is only allowed 'packages: none'.
```

## Summary of changes
- Grant required `packages: write` permissions to the workflow
2025-04-17 08:54:23 +00:00
Konstantin Knizhnik
b7548de814 Disable autovacuum and increase limit for WS approximation (#11583)
## Problem

Test lfc working set approximation becomes flaky after recent changes in
prefetch.
May be it is caused by updating HLL in `lfc_write`, may be by some other
reasons.

## Summary of changes

1. Disable autovacuum in this test (as possible source of extra page
accesses).
2. Increase upper boundary for WS approximation from 12 to 20.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-04-17 05:07:45 +00:00
Tristan Partin
9794f386f4 Make Postgres 17 the default version (#11619)
This is mostly a documentation update, but a few updates with regard to
neon_local, pageserver, and tests.

17 is our default for users in production, so dropping references to 16
makes sense.

Signed-off-by: Tristan Partin <tristan@neon.tech>

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-16 23:23:37 +00:00
Tristan Partin
79083de61c Remove forward compatibility hacks related to compute_ctl auth (#11621)
These various hacks were needed for the forward compatibility tests.
Enough time has passed since the merge that these are no longer needed.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-16 23:14:24 +00:00
Folke Behrens
ec9079f483 Allow unwrap() in tests when clippy::unwrap_used is denied (#11616)
## Problem

The proxy denies using `unwrap()`s in regular code, but we want to use
it in test code
and so have to allow it for each test block.

## Summary of changes

Set `allow-unwrap-in-tests = true` in clippy.toml and remove all
exceptions.
2025-04-16 20:05:21 +00:00
Ivan Efremov
b9b25e13a0 feat(proxy): Return prefixed errors to testodrome (#11561)
Testodrome measures uptime based on the failed requests and errors. In
case of testodrome request we send back error based on the service. This
will help us distinguish error types in testodrome and rely on the
uptime SLI.
2025-04-16 19:03:23 +00:00
Alex Chi Z.
cf2e695f49 feat(pageserver): gc-compaction meta statistics (#11601)
## Problem

We currently only have gc-compaction statistics for each single
sub-compaction job.

## Summary of changes

Add meta statistics across all sub-compaction jobs scheduled.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-16 18:51:48 +00:00
Conrad Ludgate
fc233794f6 fix(proxy): make sure that sql-over-http is TLS aware (#11612)
I noticed that while auth-broker -> local-proxy is TLS aware, and TCP
proxy -> postgres is TLS aware, HTTP proxy -> postgres is not 😅
2025-04-16 18:37:17 +00:00
Tristan Partin
c002236145 Remove compute_ctl authorization bypass if testing feature was enable (#11596)
We want to exercise the authorization middleware in our regression
tests.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-16 17:54:51 +00:00
Erik Grinaker
4af0b9b387 pageserver: don't recompress images in ImageLayerInner::filter() (#11592)
## Problem

During shard ancestor compaction, we currently recompress all page
images as we move them into a new layer file. This is expensive and
unnecessary.

Resolves #11562.
Requires #11607.

## Summary of changes

Pass through compressed page images in `ImageLayerInner::filter()`.
2025-04-16 17:10:15 +00:00
Vlad Lazar
0e00faf528 tests: stability fixes for test_migration_to_cold_secondary (#11606)
1. Compute may generate WAL on shutdown. The test assumes that after
shutdown,
no further ingest happens. Tweak the compute shutdown to make the
assumption true.
2. Assertion of local layer count post cold migration is not right since
we may have downloaded
layers due to ingest. Remove it.

Closes https://github.com/neondatabase/neon/issues/11587
2025-04-16 16:31:23 +00:00
Anastasia Lubennikova
7747a9619f compute: fix copy-paste typo for neon GUC parameters check (#11610)
fix for commit
[5063151](5063151271)
2025-04-16 15:55:11 +00:00
Erik Grinaker
46100717ad pageserver: add VectoredBlob::raw_with_header (#11607)
## Problem

To avoid recompressing page images during layer filtering, we need
access to the raw header and data from vectored reads such that we can
pass them through to the target layer.

Touches #11562.

## Summary of changes

Adds `VectoredBlob::raw_with_header()` to return a raw view of the
header+data, and updates `read()` to track it.

Also adds `blob_io::Header` with header metadata and decode logic, to
reuse for tests and assertions. This isn't yet widely used.
2025-04-16 15:38:10 +00:00
Erik Grinaker
00eeff9b8d pageserver: add compaction_shard_ancestor to disable shard ancestor compaction (#11608)
## Problem

Splits of large tenants (several TB) can cause a huge amount of shard
ancestor compaction work, which can overload Pageservers.

Touches https://github.com/neondatabase/cloud/issues/22532.

## Summary of changes

Add a setting `compaction_shard_ancestor` (default `true`) to disable
shard ancestor compaction on a per-tenant basis.
2025-04-16 14:41:02 +00:00
Matthias van de Meent
2a46426157 Update neon GUCs with new default settings (#11595)
Staging and prod both have these settings configured like this, so let's
update this so we can eventually drop the overrides in prod.
2025-04-16 13:42:22 +00:00
Tristan Partin
edc11253b6 Fix neon_local public key parsing when create compute JWKS (#11602)
Finally figured out the right incantation. I had had this in my original
go, but due to some refactoring and apparently missed testing, I
committed a mistake. The reason this doesn't currently break anything is
that we bypass the authorization middleware when the "testing" cargo
feature is enabled.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-16 12:51:48 +00:00
Heikki Linnakangas
b4e26a6284 Set last-written LSN as part of smgr_end_unlogged_build() (#11584)
This way, the callers don't need to do it, reducing the footprint of
changes we've had to made to various index AM's build functions.
2025-04-16 12:34:18 +00:00
Vlad Lazar
96b46365e4 tests: attach final metrics to allure report (#11604)
## Problem

Metrics are saved in https://github.com/neondatabase/neon/pull/11559,
but the file is not matched by the attachment regex.

## Summary of changes

Make attachment regex match the metrics file.
2025-04-16 10:26:47 +00:00
Alex Chi Z.
aa19f10e7e fix(test): allow shutdown warning in preempt tests (#11600)
## Problem

test_gc_compaction_preempt is still flaky

## Summary of changes

- allow shutdown warning logs

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-15 21:50:28 +00:00
Konstantin Knizhnik
35170656fe Allocate WalProposerConn using TopMemoryAllocator (#11577)
## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1744659631698609
`WalProposerConn` is allocated using current memory context which life
time is not long enough.

## Summary of changes

Allocate `WalProposerConn`  using `TopMemoryContext`.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-04-15 19:13:12 +00:00
Tristan Partin
cd9ad75797 Remove compute_ctl authorization bypass on localhost (#11597)
For whatever reason, this never worked in production computes anyway.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-15 19:12:34 +00:00
Tristan Partin
eadb05f78e Teach neon_local to pass the Authorization header to compute_ctl (#11490)
This allows us to remove hacks in the compute_ctl authorization
middleware which allowed for bypasses of auth checks.

Fixes: https://github.com/neondatabase/neon/issues/11316

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-15 17:27:49 +00:00
Fedor Dikarev
c5115518e9 remove temp file from repo (#11586)
## Problem
In https://github.com/neondatabase/neon/pull/11409 we added temp file to
the repo.

## Summary of changes
Remove temp file from the repo.
2025-04-15 15:29:15 +00:00
Alex Chi Z.
931f8c4300 fix(pageserver): check if cancelled before waiting logical size (2/2) (#11575)
## Problem

close https://github.com/neondatabase/neon/issues/11486, proceeding
https://github.com/neondatabase/neon/pull/11531

## Summary of changes

This patch fixes the rest 50% of instability of
`test_create_churn_during_restart`. During tenant warmup, we'll request
logical size; however, if the startup gets cancelled, we won't be able
to spawn the initial logical size calculation task that sets the
`cancel_wait_for_background_loop_concurrency_limit_semaphore`.

Therefore, we check `cancelled` before proceeding to get
`cancel_wait_for_background_loop_concurrency_limit_semaphore`. There
will still be a race if the timeline shutdown happens after L5710 and
before L5711, but it should be enough to reduce the flakiness of the
test.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-15 15:16:16 +00:00
Alexander Bayandin
0f7c2cc382 CI(release): add time to RC PR branch names (#11547)
## Problem

We can't have more than one open release PR created on the same day (due
to non-unique enough branch names).

## Summary of changes
- Add time (hours and minutes) to RC PR branch names
- Also make sure we use UTC for releases
2025-04-15 15:08:05 +00:00
Erik Grinaker
983d56502b pageserver: reduce shard ancestor rewrite threshold to 30% (#11582)
## Problem

When doing power-of-two shard splits (i.e. 4 → 8 → 16), we end up
rewriting all layers since half of the pages will be local due to
striping. This causes a lot of resource usage when splitting large
tenants.

## Summary of changes

Drop the threshold of local/total pages to 30%, to reduce the amount of
layer rewrites after splits.
2025-04-15 14:26:29 +00:00
Erik Grinaker
bcef542d5b pageserver: don't rewrite invisible layers during ancestor compaction (#11580)
## Problem

Shard ancestor compaction can be very expensive following shard splits
of large tenants. We currently rewrite garbage layers after shard splits
as well, which can be a significant amount of data.

Touches https://github.com/neondatabase/cloud/issues/22532.

## Summary of changes

Don't rewrite invisible layers after shard splits.
2025-04-15 14:25:58 +00:00
a-masterov
e31455d936 Add the tests for the extensions pg_jsonschema and pg_session_jwt (#11323)
## Problem
`pg_jsonschema` and `pg_session_jwt` are not yet covered by tests
## Summary of changes
Added the tests for these extensions.
2025-04-15 14:06:01 +00:00
Alex Chi Z.
a4ea7d6194 fix(pageserver): gc-compaction verification false failure (#11564)
## Problem

https://github.com/neondatabase/neon/pull/11515 introduced a bug that
some key history cannot be verified.

If a key only exists above the horizon, the verification will fail for
its first occurrence because the history does not exist at that point.

As gc-compaction skips a key range whenever an error occurs, it might be
doing some wasted work in staging/prod now. But I'm not planning a
hotfix this week as the bug doesn't affect correctness/performance.

## Summary of changes

Allow keys with only above horizon history in the verification.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-15 13:58:32 +00:00
Alexander Bayandin
19bea5fd0c CI: do not wait for tests to trigger deploy job (#11548)
## Problem

There is too much delay between merging a PR into `main` and deploying
the changes to staging

## Summary of changes
- Trigger `deploy` job without waiting for `build-and-test-locally` job
2025-04-15 11:23:41 +00:00
a-masterov
5be94e28c4 Update the documentation of the cloud regress test (#11539)
## Problem
The information in the README.md contained errors, and some information
was missing.
## Summary of changes
Found errors are fixed, and new information is added.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2025-04-15 11:00:25 +00:00
Alexander Bayandin
63a106021a CI(allure-report-generate): Install allure to /tmp (#11579)
## Problem

The `/__w/neon/neon` directory is mounted from host to container and
persists between runs.
Sometimes the next workflow run fails to delete it:

```
Deleting the contents of '/__w/neon/neon'
Error: File was unable to be removed Error: EACCES: permission denied, rmdir '/__w/neon/neon/allure-2.32.2/bin'
```

## Summary of changes
- Download and install allure to `/tmp` which exists in container only

Ref https://github.com/neondatabase/cloud/issues/27186
2025-04-15 09:29:36 +00:00
Fedor Dikarev
9a6ace9bde introduce new runners: unit-perf and use them for benchmark jobs (#11409)
## Problem
Benchmarks results are inconsistent on existing small-metal runners

## Summary of changes
Introduce new `unit-perf` runners, and lets run benchmark on them.

The new hardware has slower, but consistent, CPU frequency - if run with
default governor schedutil.
Thus we needed to adjust some testcases' timeouts and add some retry
steps where hard-coded timeouts couldn't be increased without changing
the system under test.
-
[wait_for_last_record_lsn](6592d69a67/test_runner/fixtures/pageserver/utils.py (L193))
1000s -> 2000s
-
[test_branch_creation_many](https://github.com/neondatabase/neon/pull/11409/files#diff-2ebfe76f89004d563c7e53e3ca82462e1d85e92e6d5588e8e8f598bbe119e927)
1000s
-
[test_ingest_insert_bulk](https://github.com/neondatabase/neon/pull/11409/files#diff-e90e685be4a87053bc264a68740969e6a8872c8897b8b748d0e8c5f683a68d9f)
- with back throttling disabled compute becomes unresponsive for more
than 60 seconds (PG hard-coded client authentication connection timeout)
-
[test_sharded_ingest](https://github.com/neondatabase/neon/pull/11409/files#diff-e8d870165bd44acb9a6d8350f8640b301c1385a4108430b8d6d659b697e4a3f1)
600s -> 1200s

Right now there are only 2 runners of that class, and if we decide to go
with them, we have to check how much that type of runners we need, so
jobs not stuck with waiting for that type of runners available.

However we now decided to run those runners with governor performance
instead of schedutil.
This achieves almost same performance as previous runners but still
achieves consistent results for same commit

Related issue to activate performance governor on these runners
https://github.com/neondatabase/runner/pull/138

## Verification that it helps

### analyze runtimes on new runner for same commit

Table of runtimes for the same commit on different runners in
[run](https://github.com/neondatabase/neon/actions/runs/14417589789)

| Run | Benchmarks (1) | Benchmarks (2) |Benchmarks (3) |Benchmarks (4)
| Benchmarks (5) |
|--------|--------|---------|---------|---------|---------|
| 1 | 1950.37s | 6374.55s |  3646.15s |  4149.48s |  2330.22s | 
| 2 | - | 6369.27s |  3666.65s |  4162.42s |  2329.23s | 
| Delta % |  - |  0,07 %  | 0,5 %   |   0,3 % | 0,04 %   |
| with governor performance | 1519.57s |  4131.62s |  - | -  |  - |
| second run gov. perf. | 1513.62s |  4134.67s |  - | -  |  - |
| Delta % |  0,3 % |  0,07 %  |  -  |  - | -   |
| speedup gov. performance | 22 % |  35 % |  - | -  |  - |
| current desktop class hetzner runners (main) | 1487.10s | 3699.67s | -
| - | - |
| slower than desktop class | 2 % |  12 % |  - | -  |  - |


In summary, the runtimes for the same commit on this hardware varies
less than 1 %.

---------

Co-authored-by: BodoBolero <peterbendel@neon.tech>
2025-04-15 08:21:44 +00:00
Erik Grinaker
8c77ccfc01 pageserver: log total progress during shard ancestor compaction (#11565)
## Problem

Shard ancestor compaction doesn't currently log any global progress
information, only for the current batch.

## Summary of changes

Log the number of layers checked for eligibility this iteration, and the
total number of layers to check. This will indicate how far along the
total shard ancestor compaction has gotten for this iteration.
2025-04-15 07:25:09 +00:00
Tristan Partin
cbd2fc2395 Clean up logs and error messages in compute_ctl authorize middleware (#11576)
Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-15 01:21:18 +00:00
Tristan Partin
028a191040 Continue with s/spec/config changes (#11574)
Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-14 21:18:21 +00:00
Vlad Lazar
8cce27bedb pageserver: add a randomized read path test (#11519)
## Problem

Every time we make changes to the read path to fix a bug or add a
feature,
we end up adding another incomprehensible test.

## Summary of changes

Add some generic infrastructure for generating a layer map from a type
spec
and use that for a read path test. The test is randomized but uses a
fixed seed
by default. A fuzzing mode is available for confidence building.

See [Notion
page](https://www.notion.so/neondatabase/Read-Path-Unit-Testing-Fuzzing-1d1f189e0047806c8e5cd37781b0a350?pvs=4)
for a diagram of the layer map
used.

Just for fun I tried removing [this
commit](9990199cb4)
from https://github.com/neondatabase/neon/pull/11494
and it caught the bug in the normal mode (no fuzzing required).
2025-04-14 15:31:32 +00:00
Vlad Lazar
90b706cd96 tests: save pageserver metrics at the end of the test (#11559)
## Problem

Sometimes it's useful to see the pageserver metrics after a test in
order to debug stuff.
For example, for https://github.com/neondatabase/neon/issues/11465 I'd
like to know
what the remote storage latencies are from the client.

## Summary of changes

When stopping the env, record the pageserver metrics into a file in the
pageserver's workdir.
2025-04-14 15:13:20 +00:00
Alex Chi Z.
057ce115de fix(test): allow stale generation errors (1/2) (#11531)
## Problem

Part of https://github.com/neondatabase/neon/issues/11486

## Summary of changes

50% of the test instability of `test_create_churn_during_restart` are
due to error message gets changed. Allow the new error message.

Still need to fix other errors due to failure to acquire semaphore in
this or the next patch.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-14 14:51:17 +00:00
Vlad Lazar
e85607eed8 tests: remove config tweak allowing old versions to start with a batching config (#11560)
## Problem

Pageservers now ignore unknown config fields, so this config tweaking is
no longer needed.

## Summary of changes

Get rid of the hack.

Closes https://github.com/neondatabase/neon/issues/11524
2025-04-14 14:42:35 +00:00
Tristan Partin
437071888e Fix logging in nightly physical replication benchmarks (#11541)
Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-14 13:57:33 +00:00
Vlad Lazar
148b3701cf pageserver: add metrics for get page batch breaking reasons (#11545)
## Problem

https://github.com/neondatabase/neon/pull/11494 changes the batching
logic, but we don't have a way to evaluate it.

## Summary of changes

This PR introduces a global and per timeline metric which tracks the
reason for
which a batch was broken.
2025-04-14 13:24:47 +00:00
Christian Schwarz
daebe50e19 refactor: plumb gate and cancellation down to to blob_io::BlobWriter (#11543)
In #10063 we will switch BlobWriter to use the owned buffers IO buffered
writer, which implements double-buffering by virtue of a background task
that performs the flushing.

That task's lifecylce must be contained within the Timeline lifecycle,
so, it must hold the timeline gate open and respect Timeline::cancel.

This PR does the noisy plumbing to reduce the #10063 diff.

Refs
- extracted from https://github.com/neondatabase/neon/pull/10063
- epic https://github.com/neondatabase/neon/issues/9868
2025-04-14 11:51:01 +00:00
Arpad Müller
e0ee6fbeff Remove deprecated --compute-hook-url storcon param (#11551)
We have already migrated the storage controller to
`--control-plane-url`, added in #11173. The new param was added to
support also safekeeper specific endpoints. See the docs changes in
#11195 for further details.

Part of #11163
2025-04-14 10:36:40 +00:00
Konstantin Knizhnik
307fa2ceb7 Remove unused n_synced variable from HandleSafekeeperResponse (#11553)
## Problem

clang produce warning about unused variable `n_synced` in
HandleSafekeeperResponse

## Summary of changes

Remove local variable.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-04-14 09:45:13 +00:00
Vlad Lazar
a338984dc7 pageserver: support keys at different LSNs in one get page batch (#11494)
## Problem

Get page batching stops when we encounter requests at different LSNs.
We are leaving batching factor on the table.

## Summary of changes

The goal is to support keys with different LSNs in a single batch and
still serve them with a single vectored get.
Important restriction: the same key at different LSNs is not supported
in one batch. Returning different key
versions is a much more intrusive change.

Firstly, the read path is changed to support "scattered" queries. This
is a conceptually simple step from
https://github.com/neondatabase/neon/pull/11463. Instead of initializing
the fringe for one keyspace,
we do it for multiple at different LSNs and let the logic already
present into the fringe handle selection.

Secondly, page service code is updated to support batching at different
LSNs. Eeach request parsed from the wire determines its effective
request LSN and keeps it in mem for the batcher toinspect. The batcher
allows keys at
different LSNs in one batch as long one key is not requested at
different LSNs.

I'd suggest doing the first pass commit by commit to get a feel for the
changes.

## Results

I used the batching test from [Christian's
PR](https://github.com/neondatabase/neon/pull/11391) which increases the
change of batch breaks. Looking at the logs I think the new code is at
the max batching factor for the workload (we
only break batches due to them being oversized or because the executor
is idle).

```
Main:
Reasons for stopping batching: {'LSN changed': 22843, 'of batch size': 33417}
test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 14.6662

My branch:
Reasons for stopping batching: {'of batch size': 37024}
test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 19.8333
```

Related: https://github.com/neondatabase/neon/issues/10765
2025-04-14 09:05:29 +00:00
Konstantin Knizhnik
8936a7abd8 Increase limit for worker processes for isolation test (#11504)
## Problem

See https://github.com/neondatabase/neon/issues/10652

Neon extension launches 2 BGW which reduce limit for parallel workers
and so affecting parallel_deadlock isolation test.

## Summary of changes

Increase `max_worker_processes` from default 8 to 16 for isolation test.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-04-12 18:09:12 +00:00
Conrad Ludgate
946e971df8 feat(proxy): add batching to cancellation queue processing (#10607)
Add batching to the redis queue, which allows us to clear it out quicker
should it slow down temporarily.
2025-04-12 09:16:22 +00:00
Dmitrii Kovalkov
d109bf8c1d neon_local: use ed25519 to gen local ssl certs (#11542)
## Problem
neon_local uses rsa to generate local SSL certs, which is slow
Follow-up on:
- https://github.com/neondatabase/neon/pull/11025#discussion_r1989453785
- https://github.com/neondatabase/neon/pull/11538

## Summary of changes
- Change key from rsa to ed25519 in neon_local
2025-04-11 17:49:15 +00:00
Alex Chi Z.
4f7b2cdd4f feat(pageserver): gc-compaction result verification (#11515)
## Problem

Part of #9114 

There was a debug-mode verification mode that verifies at every
retain_lsn. However, the code was tangled within the actual history
generation itself and it's hard to reason about correctness. This patch
adds a separate post-verification of the gc-compaction result that redos
logs at every retain_lsn and every record above the GC horizon. This
ensures that all key history we produce with gc-compaction is readable,
and if there're read errors after gc-compaction, it can only be
read-path errors instead of gc-compaction bugs.

## Summary of changes

* Add gc_compaction_verification flag, default to true.
* Implement a post-verification process.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-11 15:50:29 +00:00
Alex Chi Z.
66f56ddaec fix(pageserver): allow shutdown errors for gc compaction tests (#11530)
## Problem

`test_pageserver_compaction_preempt` is flaky.

## Summary of changes

Allow the shutdown errors.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-11 15:20:51 +00:00
Erik Grinaker
fd16caa7d0 pageserver: yield for L0 during ancestor compaction (#11536)
## Problem

Shard ancestor compaction does not yield for L0 compaction, potentially
starving it.

close https://github.com/neondatabase/neon/issues/11125

## Summary of changes

* Yield for L0 during shard ancestor compaction.
* Return `CompactionOutcome::Pending` when limited by `rewrite_max`, for
eager rescheduling.
2025-04-11 15:09:28 +00:00
Tristan Partin
ff5a527167 Consolidate compute_ctl configuration structures (#11514)
Previously, the structure of the spec file was just the compute spec.
However, the response from the control plane get spec request included
the compute spec and the compute_ctl config. This divergence was
hindering other work such as adding regression tests for compute_ctl
HTTP authorization.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-11 15:06:29 +00:00
216 changed files with 6214 additions and 2814 deletions

View File

@@ -19,7 +19,7 @@
!pageserver/
!pgxn/
!proxy/
!object_storage/
!endpoint_storage/
!storage_scrubber/
!safekeeper/
!storage_broker/

View File

@@ -6,6 +6,7 @@ self-hosted-runner:
- small
- small-metal
- small-arm64
- unit-perf
- us-east-2
config-variables:
- AWS_ECR_REGION

View File

@@ -70,6 +70,7 @@ runs:
- name: Install Allure
shell: bash -euxo pipefail {0}
working-directory: /tmp
run: |
if ! which allure; then
ALLURE_ZIP=allure-${ALLURE_VERSION}.zip

View File

@@ -113,8 +113,6 @@ runs:
TEST_OUTPUT: /tmp/test_output
BUILD_TYPE: ${{ inputs.build_type }}
COMPATIBILITY_SNAPSHOT_DIR: /tmp/compatibility_snapshot_pg${{ inputs.pg_version }}
ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE: contains(github.event.pull_request.labels.*.name, 'backward compatibility breakage')
ALLOW_FORWARD_COMPATIBILITY_BREAKAGE: contains(github.event.pull_request.labels.*.name, 'forward compatibility breakage')
RERUN_FAILED: ${{ inputs.rerun_failed }}
PG_VERSION: ${{ inputs.pg_version }}
SANITIZERS: ${{ inputs.sanitizers }}
@@ -135,6 +133,7 @@ runs:
fi
PERF_REPORT_DIR="$(realpath test_runner/perf-report-local)"
echo "PERF_REPORT_DIR=${PERF_REPORT_DIR}" >> ${GITHUB_ENV}
rm -rf $PERF_REPORT_DIR
TEST_SELECTION="test_runner/${{ inputs.test_selection }}"
@@ -211,11 +210,12 @@ runs:
--verbose \
-rA $TEST_SELECTION $EXTRA_PARAMS
if [[ "${{ inputs.save_perf_report }}" == "true" ]]; then
export REPORT_FROM="$PERF_REPORT_DIR"
export REPORT_TO="$PLATFORM"
scripts/generate_and_push_perf_report.sh
fi
- name: Upload performance report
if: ${{ !cancelled() && inputs.save_perf_report == 'true' }}
shell: bash -euxo pipefail {0}
run: |
export REPORT_FROM="${PERF_REPORT_DIR}"
scripts/generate_and_push_perf_report.sh
- name: Upload compatibility snapshot
# Note, that we use `github.base_ref` which is a target branch for a PR

View File

@@ -272,10 +272,13 @@ jobs:
# run pageserver tests with different settings
for get_vectored_concurrent_io in sequential sidecar-task; do
for io_engine in std-fs tokio-epoll-uring ; do
NEON_PAGESERVER_UNIT_TEST_GET_VECTORED_CONCURRENT_IO=$get_vectored_concurrent_io \
NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IOENGINE=$io_engine \
${cov_prefix} \
cargo nextest run $CARGO_FLAGS $CARGO_FEATURES -E 'package(pageserver)'
for io_mode in buffered direct direct-rw ; do
NEON_PAGESERVER_UNIT_TEST_GET_VECTORED_CONCURRENT_IO=$get_vectored_concurrent_io \
NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IOENGINE=$io_engine \
NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IOMODE=$io_mode \
${cov_prefix} \
cargo nextest run $CARGO_FLAGS $CARGO_FEATURES -E 'package(pageserver)'
done
done
done
@@ -346,7 +349,7 @@ jobs:
contents: read
statuses: write
needs: [ build-neon ]
runs-on: ${{ fromJSON(format('["self-hosted", "{0}"]', inputs.arch == 'arm64' && 'large-arm64' || 'large')) }}
runs-on: ${{ fromJSON(format('["self-hosted", "{0}"]', inputs.arch == 'arm64' && 'large-arm64' || 'large-metal')) }}
container:
image: ${{ inputs.build-tools-image }}
credentials:
@@ -392,6 +395,7 @@ jobs:
BUILD_TAG: ${{ inputs.build-tag }}
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: tokio-epoll-uring
PAGESERVER_GET_VECTORED_CONCURRENT_IO: sidecar-task
PAGESERVER_VIRTUAL_FILE_IO_MODE: direct
USE_LFC: ${{ matrix.lfc_state == 'with-lfc' && 'true' || 'false' }}
# Temporary disable this step until we figure out why it's so flaky

View File

@@ -53,10 +53,13 @@ jobs:
|| inputs.component-name == 'Compute' && 'release-compute'
}}
run: |
today=$(date +'%Y-%m-%d')
echo "title=${COMPONENT_NAME} release ${today}" | tee -a ${GITHUB_OUTPUT}
echo "rc-branch=rc/${RELEASE_BRANCH}/${today}" | tee -a ${GITHUB_OUTPUT}
echo "release-branch=${RELEASE_BRANCH}" | tee -a ${GITHUB_OUTPUT}
now_date=$(date -u +'%Y-%m-%d')
now_time=$(date -u +'%H-%M-%Z')
{
echo "title=${COMPONENT_NAME} release ${now_date}"
echo "rc-branch=rc/${RELEASE_BRANCH}/${now_date}_${now_time}"
echo "release-branch=${RELEASE_BRANCH}"
} | tee -a ${GITHUB_OUTPUT}
- name: Configure git
run: |

View File

@@ -165,5 +165,5 @@ jobs:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
CURRENT_SHA: ${{ github.sha }}
run: |
RELEASE_PR_RUN_ID=$(gh api "/repos/${GITHUB_REPOSITORY}/actions/runs?head_sha=$CURRENT_SHA" | jq '[.workflow_runs[] | select(.name == "Build and Test") | select(.head_branch | test("^rc/release(-(proxy|compute))?/[0-9]{4}-[0-9]{2}-[0-9]{2}$"; "s"))] | first | .id // ("Failed to find Build and Test run from RC PR!" | halt_error(1))')
RELEASE_PR_RUN_ID=$(gh api "/repos/${GITHUB_REPOSITORY}/actions/runs?head_sha=$CURRENT_SHA" | jq '[.workflow_runs[] | select(.name == "Build and Test") | select(.head_branch | test("^rc/release.*$"; "s"))] | first | .id // ("Failed to find Build and Test run from RC PR!" | halt_error(1))')
echo "release-pr-run-id=$RELEASE_PR_RUN_ID" | tee -a $GITHUB_OUTPUT

View File

@@ -284,7 +284,7 @@ jobs:
statuses: write
contents: write
pull-requests: write
runs-on: [ self-hosted, small-metal ]
runs-on: [ self-hosted, unit-perf ]
container:
image: ${{ needs.build-build-tools-image.outputs.image }}-bookworm
credentials:
@@ -323,6 +323,8 @@ jobs:
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
TEST_RESULT_CONNSTR: "${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}"
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: tokio-epoll-uring
PAGESERVER_GET_VECTORED_CONCURRENT_IO: sidecar-task
PAGESERVER_VIRTUAL_FILE_IO_MODE: direct
SYNC_BETWEEN_TESTS: true
# XXX: no coverage data handling here, since benchmarks are run on release builds,
# while coverage is currently collected for the debug ones
@@ -1271,7 +1273,7 @@ jobs:
exit 1
deploy:
needs: [ check-permissions, push-neon-image-dev, push-compute-image-dev, push-neon-image-prod, push-compute-image-prod, meta, build-and-test-locally, trigger-custom-extensions-build-and-wait ]
needs: [ check-permissions, push-neon-image-dev, push-compute-image-dev, push-neon-image-prod, push-compute-image-prod, meta, trigger-custom-extensions-build-and-wait ]
# `!failure() && !cancelled()` is required because the workflow depends on the job that can be skipped: `push-neon-image-prod` and `push-compute-image-prod`
if: ${{ contains(fromJSON('["push-main", "storage-release", "proxy-release", "compute-release"]'), needs.meta.outputs.run-kind) && !failure() && !cancelled() }}
permissions:

View File

@@ -27,15 +27,17 @@ jobs:
- name: Fast forwarding
uses: sequoia-pgp/fast-forward@ea7628bedcb0b0b96e94383ada458d812fca4979
# See https://docs.github.com/en/graphql/reference/enums#mergestatestatus
if: ${{ github.event.pull_request.mergeable_state == 'clean' }}
if: ${{ contains(fromJSON('["clean", "unstable"]'), github.event.pull_request.mergeable_state) }}
with:
merge: true
comment: on-error
github_token: ${{ secrets.CI_ACCESS_TOKEN }}
- name: Comment if mergeable_state is not clean
if: ${{ github.event.pull_request.mergeable_state != 'clean' }}
if: ${{ !contains(fromJSON('["clean", "unstable"]'), github.event.pull_request.mergeable_state) }}
env:
GH_TOKEN: ${{ secrets.CI_ACCESS_TOKEN }}
run: |
gh pr comment ${{ github.event.pull_request.number }} \
--repo "${GITHUB_REPOSITORY}" \
--body "Not trying to forward pull-request, because \`mergeable_state\` is \`${{ github.event.pull_request.mergeable_state }}\`, not \`clean\`."
--body "Not trying to forward pull-request, because \`mergeable_state\` is \`${{ github.event.pull_request.mergeable_state }}\`, not \`clean\` or \`unstable\`."

View File

@@ -30,7 +30,7 @@ permissions:
statuses: write # require for posting a status update
env:
DEFAULT_PG_VERSION: 16
DEFAULT_PG_VERSION: 17
PLATFORM: neon-captest-new
AWS_DEFAULT_REGION: eu-central-1
@@ -42,6 +42,8 @@ jobs:
github-event-name: ${{ github.event_name }}
build-build-tools-image:
permissions:
packages: write
needs: [ check-permissions ]
uses: ./.github/workflows/build-build-tools-image.yml
secrets: inherit

93
.github/workflows/random-ops-test.yml vendored Normal file
View File

@@ -0,0 +1,93 @@
name: Random Operations Test
on:
schedule:
# * is a special character in YAML so you have to quote this string
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
# │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
- cron: '23 */2 * * *' # runs every 2 hours
workflow_dispatch:
inputs:
random_seed:
type: number
description: 'The random seed'
required: false
default: 0
num_operations:
type: number
description: "The number of operations to test"
default: 250
defaults:
run:
shell: bash -euxo pipefail {0}
permissions: {}
env:
DEFAULT_PG_VERSION: 16
PLATFORM: neon-captest-new
AWS_DEFAULT_REGION: eu-central-1
jobs:
run-random-rests:
env:
POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
runs-on: small
permissions:
id-token: write
statuses: write
strategy:
fail-fast: false
matrix:
pg-version: [16, 17]
container:
image: ghcr.io/neondatabase/build-tools:pinned-bookworm
credentials:
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
options: --init
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@4d991eb9b905ef189e4c376166672c3f2f230481 # v2.11.0
with:
egress-policy: audit
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Download Neon artifact
uses: ./.github/actions/download
with:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Run tests
uses: ./.github/actions/run-python-test-set
with:
build_type: remote
test_selection: random_ops
run_in_parallel: false
extra_params: -m remote_cluster
pg_version: ${{ matrix.pg-version }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
NEON_API_KEY: ${{ secrets.NEON_STAGING_API_KEY }}
RANDOM_SEED: ${{ inputs.random_seed }}
NUM_OPERATIONS: ${{ inputs.num_operations }}
- name: Create Allure report
if: ${{ !cancelled() }}
id: create-allure-report
uses: ./.github/actions/allure-report-generate
with:
store-test-results-into-db: true
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
REGRESS_TEST_RESULT_CONNSTR_NEW: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}

70
Cargo.lock generated
View File

@@ -1416,6 +1416,7 @@ name = "control_plane"
version = "0.1.0"
dependencies = [
"anyhow",
"base64 0.13.1",
"camino",
"clap",
"comfy-table",
@@ -1425,10 +1426,12 @@ dependencies = [
"humantime",
"humantime-serde",
"hyper 0.14.30",
"jsonwebtoken",
"nix 0.27.1",
"once_cell",
"pageserver_api",
"pageserver_client",
"pem",
"postgres_backend",
"postgres_connection",
"regex",
@@ -1437,6 +1440,8 @@ dependencies = [
"scopeguard",
"serde",
"serde_json",
"sha2",
"spki 0.7.3",
"storage_broker",
"thiserror 1.0.69",
"tokio",
@@ -2032,6 +2037,33 @@ dependencies = [
"zeroize",
]
[[package]]
name = "endpoint_storage"
version = "0.0.1"
dependencies = [
"anyhow",
"axum",
"axum-extra",
"camino",
"camino-tempfile",
"futures",
"http-body-util",
"itertools 0.10.5",
"jsonwebtoken",
"prometheus",
"rand 0.8.5",
"remote_storage",
"serde",
"serde_json",
"test-log",
"tokio",
"tokio-util",
"tower 0.5.2",
"tracing",
"utils",
"workspace_hack",
]
[[package]]
name = "enum-map"
version = "2.5.0"
@@ -2817,6 +2849,7 @@ dependencies = [
"hyper 0.14.30",
"itertools 0.10.5",
"jemalloc_pprof",
"jsonwebtoken",
"metrics",
"once_cell",
"pprof",
@@ -3992,33 +4025,6 @@ dependencies = [
"memchr",
]
[[package]]
name = "object_storage"
version = "0.0.1"
dependencies = [
"anyhow",
"axum",
"axum-extra",
"camino",
"camino-tempfile",
"futures",
"http-body-util",
"itertools 0.10.5",
"jsonwebtoken",
"prometheus",
"rand 0.8.5",
"remote_storage",
"serde",
"serde_json",
"test-log",
"tokio",
"tokio-util",
"tower 0.5.2",
"tracing",
"utils",
"workspace_hack",
]
[[package]]
name = "once_cell"
version = "1.20.2"
@@ -4269,6 +4275,7 @@ dependencies = [
"hyper 0.14.30",
"indoc",
"itertools 0.10.5",
"jsonwebtoken",
"md5",
"metrics",
"nix 0.27.1",
@@ -4278,6 +4285,7 @@ dependencies = [
"pageserver_api",
"pageserver_client",
"pageserver_compaction",
"pem",
"pin-project-lite",
"postgres-protocol",
"postgres-types",
@@ -4345,6 +4353,7 @@ dependencies = [
"humantime-serde",
"itertools 0.10.5",
"nix 0.27.1",
"once_cell",
"postgres_backend",
"postgres_ffi",
"rand 0.8.5",
@@ -5685,9 +5694,9 @@ dependencies = [
[[package]]
name = "ring"
version = "0.17.13"
version = "0.17.14"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "70ac5d832aa16abd7d1def883a8545280c20a60f523a370aa3a9617c2b8550ee"
checksum = "a4689e6c2294d81e88dc6261c768b63bc4fcdb852be6d1352498b114f61383b7"
dependencies = [
"cc",
"cfg-if",
@@ -5988,10 +5997,12 @@ dependencies = [
"humantime",
"hyper 0.14.30",
"itertools 0.10.5",
"jsonwebtoken",
"metrics",
"once_cell",
"pageserver_api",
"parking_lot 0.12.1",
"pem",
"postgres-protocol",
"postgres_backend",
"postgres_ffi",
@@ -7872,6 +7883,7 @@ dependencies = [
"metrics",
"nix 0.27.1",
"once_cell",
"pem",
"pin-project-lite",
"postgres_connection",
"pprof",

View File

@@ -40,7 +40,7 @@ members = [
"libs/proxy/postgres-protocol2",
"libs/proxy/postgres-types2",
"libs/proxy/tokio-postgres2",
"object_storage",
"endpoint_storage",
]
[workspace.package]
@@ -141,6 +141,7 @@ parking_lot = "0.12"
parquet = { version = "53", default-features = false, features = ["zstd"] }
parquet_derive = "53"
pbkdf2 = { version = "0.12.1", features = ["simple", "std"] }
pem = "3.0.3"
pin-project-lite = "0.2"
pprof = { version = "0.14", features = ["criterion", "flamegraph", "frame-pointer", "prost-codec"] }
procfs = "0.16"
@@ -174,6 +175,7 @@ signal-hook = "0.3"
smallvec = "1.11"
smol_str = { version = "0.2.0", features = ["serde"] }
socket2 = "0.5"
spki = "0.7.3"
strum = "0.26"
strum_macros = "0.26"
"subtle" = "2.5.0"

View File

@@ -89,7 +89,7 @@ RUN set -e \
--bin storage_broker \
--bin storage_controller \
--bin proxy \
--bin object_storage \
--bin endpoint_storage \
--bin neon_local \
--bin storage_scrubber \
--locked --release
@@ -122,7 +122,7 @@ COPY --from=build --chown=neon:neon /home/nonroot/target/release/safekeeper
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_broker /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_controller /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/proxy /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/object_storage /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/endpoint_storage /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/neon_local /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_scrubber /usr/local/bin

View File

@@ -270,7 +270,7 @@ By default, this runs both debug and release modes, and all supported postgres v
testing locally, it is convenient to run just one set of permutations, like this:
```sh
DEFAULT_PG_VERSION=16 BUILD_TYPE=release ./scripts/pytest
DEFAULT_PG_VERSION=17 BUILD_TYPE=release ./scripts/pytest
```
## Flamegraphs

View File

@@ -12,3 +12,5 @@ disallowed-macros = [
# cannot disallow this, because clippy finds used from tokio macros
#"tokio::pin",
]
allow-unwrap-in-tests = true

View File

@@ -1677,7 +1677,7 @@ RUN set -e \
&& apt clean && rm -rf /var/lib/apt/lists/*
# Use `dist_man_MANS=` to skip manpage generation (which requires python3/pandoc)
ENV PGBOUNCER_TAG=pgbouncer_1_22_1
ENV PGBOUNCER_TAG=pgbouncer_1_24_1
RUN set -e \
&& git clone --recurse-submodules --depth 1 --branch ${PGBOUNCER_TAG} https://github.com/pgbouncer/pgbouncer.git pgbouncer \
&& cd pgbouncer \

View File

@@ -1,265 +0,0 @@
commit 00aa659afc9c7336ab81036edec3017168aabf40
Author: Heikki Linnakangas <heikki@neon.tech>
Date: Tue Nov 12 16:59:19 2024 +0200
Temporarily disable test that depends on timezone
diff --git a/tests/expected/generalization.out b/tests/expected/generalization.out
index 23ef5fa..9e60deb 100644
--- a/ext-src/pg_anon-src/tests/expected/generalization.out
+++ b/ext-src/pg_anon-src/tests/expected/generalization.out
@@ -284,12 +284,9 @@ SELECT anon.generalize_tstzrange('19041107','century');
["Tue Jan 01 00:00:00 1901 PST","Mon Jan 01 00:00:00 2001 PST")
(1 row)
-SELECT anon.generalize_tstzrange('19041107','millennium');
- generalize_tstzrange
------------------------------------------------------------------
- ["Thu Jan 01 00:00:00 1001 PST","Mon Jan 01 00:00:00 2001 PST")
-(1 row)
-
+-- temporarily disabled, see:
+-- https://gitlab.com/dalibo/postgresql_anonymizer/-/commit/199f0a392b37c59d92ae441fb8f037e094a11a52#note_2148017485
+--SELECT anon.generalize_tstzrange('19041107','millennium');
-- generalize_daterange
SELECT anon.generalize_daterange('19041107');
generalize_daterange
diff --git a/tests/sql/generalization.sql b/tests/sql/generalization.sql
index b868344..b4fc977 100644
--- a/ext-src/pg_anon-src/tests/sql/generalization.sql
+++ b/ext-src/pg_anon-src/tests/sql/generalization.sql
@@ -61,7 +61,9 @@ SELECT anon.generalize_tstzrange('19041107','month');
SELECT anon.generalize_tstzrange('19041107','year');
SELECT anon.generalize_tstzrange('19041107','decade');
SELECT anon.generalize_tstzrange('19041107','century');
-SELECT anon.generalize_tstzrange('19041107','millennium');
+-- temporarily disabled, see:
+-- https://gitlab.com/dalibo/postgresql_anonymizer/-/commit/199f0a392b37c59d92ae441fb8f037e094a11a52#note_2148017485
+--SELECT anon.generalize_tstzrange('19041107','millennium');
-- generalize_daterange
SELECT anon.generalize_daterange('19041107');
commit 7dd414ee75f2875cffb1d6ba474df1f135a6fc6f
Author: Alexey Masterov <alexeymasterov@neon.tech>
Date: Fri May 31 06:34:26 2024 +0000
These alternative expected files were added to consider the neon features
diff --git a/ext-src/pg_anon-src/tests/expected/permissions_masked_role_1.out b/ext-src/pg_anon-src/tests/expected/permissions_masked_role_1.out
new file mode 100644
index 0000000..2539cfd
--- /dev/null
+++ b/ext-src/pg_anon-src/tests/expected/permissions_masked_role_1.out
@@ -0,0 +1,101 @@
+BEGIN;
+CREATE EXTENSION anon CASCADE;
+NOTICE: installing required extension "pgcrypto"
+SELECT anon.init();
+ init
+------
+ t
+(1 row)
+
+CREATE ROLE mallory_the_masked_user;
+SECURITY LABEL FOR anon ON ROLE mallory_the_masked_user IS 'MASKED';
+CREATE TABLE t1(i INT);
+ALTER TABLE t1 ADD COLUMN t TEXT;
+SECURITY LABEL FOR anon ON COLUMN t1.t
+IS 'MASKED WITH VALUE NULL';
+INSERT INTO t1 VALUES (1,'test');
+--
+-- We're checking the owner's permissions
+--
+-- see
+-- https://postgresql-anonymizer.readthedocs.io/en/latest/SECURITY/#permissions
+--
+SET ROLE mallory_the_masked_user;
+SELECT anon.pseudo_first_name(0) IS NOT NULL;
+ ?column?
+----------
+ t
+(1 row)
+
+-- SHOULD FAIL
+DO $$
+BEGIN
+ PERFORM anon.init();
+ EXCEPTION WHEN insufficient_privilege
+ THEN RAISE NOTICE 'insufficient_privilege';
+END$$;
+NOTICE: insufficient_privilege
+-- SHOULD FAIL
+DO $$
+BEGIN
+ PERFORM anon.anonymize_table('t1');
+ EXCEPTION WHEN insufficient_privilege
+ THEN RAISE NOTICE 'insufficient_privilege';
+END$$;
+NOTICE: insufficient_privilege
+-- SHOULD FAIL
+SAVEPOINT fail_start_engine;
+SELECT anon.start_dynamic_masking();
+ERROR: Only supersusers can start the dynamic masking engine.
+CONTEXT: PL/pgSQL function anon.start_dynamic_masking(boolean) line 18 at RAISE
+ROLLBACK TO fail_start_engine;
+RESET ROLE;
+SELECT anon.start_dynamic_masking();
+ start_dynamic_masking
+-----------------------
+ t
+(1 row)
+
+SET ROLE mallory_the_masked_user;
+SELECT * FROM mask.t1;
+ i | t
+---+---
+ 1 |
+(1 row)
+
+-- SHOULD FAIL
+DO $$
+BEGIN
+ SELECT * FROM public.t1;
+ EXCEPTION WHEN insufficient_privilege
+ THEN RAISE NOTICE 'insufficient_privilege';
+END$$;
+NOTICE: insufficient_privilege
+-- SHOULD FAIL
+SAVEPOINT fail_stop_engine;
+SELECT anon.stop_dynamic_masking();
+ERROR: Only supersusers can stop the dynamic masking engine.
+CONTEXT: PL/pgSQL function anon.stop_dynamic_masking() line 18 at RAISE
+ROLLBACK TO fail_stop_engine;
+RESET ROLE;
+SELECT anon.stop_dynamic_masking();
+NOTICE: The previous priviledges of 'mallory_the_masked_user' are not restored. You need to grant them manually.
+ stop_dynamic_masking
+----------------------
+ t
+(1 row)
+
+SET ROLE mallory_the_masked_user;
+SELECT COUNT(*)=1 FROM anon.pg_masking_rules;
+ ?column?
+----------
+ t
+(1 row)
+
+-- SHOULD FAIL
+SAVEPOINT fail_seclabel_on_role;
+SECURITY LABEL FOR anon ON ROLE mallory_the_masked_user IS NULL;
+ERROR: permission denied
+DETAIL: The current user must have the CREATEROLE attribute.
+ROLLBACK TO fail_seclabel_on_role;
+ROLLBACK;
diff --git a/ext-src/pg_anon-src/tests/expected/permissions_owner_1.out b/ext-src/pg_anon-src/tests/expected/permissions_owner_1.out
new file mode 100644
index 0000000..8b090fe
--- /dev/null
+++ b/ext-src/pg_anon-src/tests/expected/permissions_owner_1.out
@@ -0,0 +1,104 @@
+BEGIN;
+CREATE EXTENSION anon CASCADE;
+NOTICE: installing required extension "pgcrypto"
+SELECT anon.init();
+ init
+------
+ t
+(1 row)
+
+CREATE ROLE oscar_the_owner;
+ALTER DATABASE :DBNAME OWNER TO oscar_the_owner;
+CREATE ROLE mallory_the_masked_user;
+SECURITY LABEL FOR anon ON ROLE mallory_the_masked_user IS 'MASKED';
+--
+-- We're checking the owner's permissions
+--
+-- see
+-- https://postgresql-anonymizer.readthedocs.io/en/latest/SECURITY/#permissions
+--
+SET ROLE oscar_the_owner;
+SELECT anon.pseudo_first_name(0) IS NOT NULL;
+ ?column?
+----------
+ t
+(1 row)
+
+-- SHOULD FAIL
+DO $$
+BEGIN
+ PERFORM anon.init();
+ EXCEPTION WHEN insufficient_privilege
+ THEN RAISE NOTICE 'insufficient_privilege';
+END$$;
+NOTICE: insufficient_privilege
+CREATE TABLE t1(i INT);
+ALTER TABLE t1 ADD COLUMN t TEXT;
+SECURITY LABEL FOR anon ON COLUMN t1.t
+IS 'MASKED WITH VALUE NULL';
+INSERT INTO t1 VALUES (1,'test');
+SELECT anon.anonymize_table('t1');
+ anonymize_table
+-----------------
+ t
+(1 row)
+
+SELECT * FROM t1;
+ i | t
+---+---
+ 1 |
+(1 row)
+
+UPDATE t1 SET t='test' WHERE i=1;
+-- SHOULD FAIL
+SAVEPOINT fail_start_engine;
+SELECT anon.start_dynamic_masking();
+ start_dynamic_masking
+-----------------------
+ t
+(1 row)
+
+ROLLBACK TO fail_start_engine;
+RESET ROLE;
+SELECT anon.start_dynamic_masking();
+ start_dynamic_masking
+-----------------------
+ t
+(1 row)
+
+SET ROLE oscar_the_owner;
+SELECT * FROM t1;
+ i | t
+---+------
+ 1 | test
+(1 row)
+
+--SELECT * FROM mask.t1;
+-- SHOULD FAIL
+SAVEPOINT fail_stop_engine;
+SELECT anon.stop_dynamic_masking();
+ERROR: permission denied for schema mask
+CONTEXT: SQL statement "DROP VIEW mask.t1;"
+PL/pgSQL function anon.mask_drop_view(oid) line 3 at EXECUTE
+SQL statement "SELECT anon.mask_drop_view(oid)
+ FROM pg_catalog.pg_class
+ WHERE relnamespace=quote_ident(pg_catalog.current_setting('anon.sourceschema'))::REGNAMESPACE
+ AND relkind IN ('r','p','f')"
+PL/pgSQL function anon.stop_dynamic_masking() line 22 at PERFORM
+ROLLBACK TO fail_stop_engine;
+RESET ROLE;
+SELECT anon.stop_dynamic_masking();
+NOTICE: The previous priviledges of 'mallory_the_masked_user' are not restored. You need to grant them manually.
+ stop_dynamic_masking
+----------------------
+ t
+(1 row)
+
+SET ROLE oscar_the_owner;
+-- SHOULD FAIL
+SAVEPOINT fail_seclabel_on_role;
+SECURITY LABEL FOR anon ON ROLE mallory_the_masked_user IS NULL;
+ERROR: permission denied
+DETAIL: The current user must have the CREATEROLE attribute.
+ROLLBACK TO fail_seclabel_on_role;
+ROLLBACK;

View File

@@ -11,6 +11,14 @@ index bf6edcb..89b4c7f 100644
USE_PGXS = 1 # use pgxs if not in contrib directory
PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/regress/expected/init-extension.out b/regress/expected/init-extension.out
index 9f2e171..f6e4f8d 100644
--- a/regress/expected/init-extension.out
+++ b/regress/expected/init-extension.out
@@ -1,3 +1,2 @@
SET client_min_messages = warning;
CREATE EXTENSION pg_repack;
-RESET client_min_messages;
diff --git a/regress/expected/nosuper.out b/regress/expected/nosuper.out
index 8d0a94e..63b68bf 100644
--- a/regress/expected/nosuper.out
@@ -42,6 +50,14 @@ index 8d0a94e..63b68bf 100644
INFO: repacking table "public.tbl_cluster"
ERROR: query failed: ERROR: current transaction is aborted, commands ignored until end of transaction block
DETAIL: query was: RESET lock_timeout
diff --git a/regress/sql/init-extension.sql b/regress/sql/init-extension.sql
index 9f2e171..f6e4f8d 100644
--- a/regress/sql/init-extension.sql
+++ b/regress/sql/init-extension.sql
@@ -1,3 +1,2 @@
SET client_min_messages = warning;
CREATE EXTENSION pg_repack;
-RESET client_min_messages;
diff --git a/regress/sql/nosuper.sql b/regress/sql/nosuper.sql
index 072f0fa..dbe60f8 100644
--- a/regress/sql/nosuper.sql

View File

@@ -15,7 +15,7 @@ index 7a4b88c..56678af 100644
HEADERS = src/halfvec.h src/sparsevec.h src/vector.h
diff --git a/src/hnswbuild.c b/src/hnswbuild.c
index b667478..dc95d89 100644
index b667478..1298aa1 100644
--- a/src/hnswbuild.c
+++ b/src/hnswbuild.c
@@ -843,9 +843,17 @@ HnswParallelBuildMain(dsm_segment *seg, shm_toc *toc)
@@ -36,7 +36,7 @@ index b667478..dc95d89 100644
/* Close relations within worker */
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
@@ -1100,12 +1108,39 @@ BuildIndex(Relation heap, Relation index, IndexInfo *indexInfo,
@@ -1100,13 +1108,25 @@ BuildIndex(Relation heap, Relation index, IndexInfo *indexInfo,
SeedRandom(42);
#endif
@@ -48,32 +48,17 @@ index b667478..dc95d89 100644
BuildGraph(buildstate, forkNum);
- if (RelationNeedsWAL(index) || forkNum == INIT_FORKNUM)
+#ifdef NEON_SMGR
+ smgr_finish_unlogged_build_phase_1(RelationGetSmgr(index));
+#endif
+
+ if (RelationNeedsWAL(index) || forkNum == INIT_FORKNUM) {
if (RelationNeedsWAL(index) || forkNum == INIT_FORKNUM)
log_newpage_range(index, forkNum, 0, RelationGetNumberOfBlocksInFork(index, forkNum), true);
+#ifdef NEON_SMGR
+ {
+#if PG_VERSION_NUM >= 160000
+ RelFileLocator rlocator = RelationGetSmgr(index)->smgr_rlocator.locator;
+#else
+ RelFileNode rlocator = RelationGetSmgr(index)->smgr_rnode.node;
+#endif
+ if (set_lwlsn_block_range_hook)
+ set_lwlsn_block_range_hook(XactLastRecEnd, rlocator,
+ MAIN_FORKNUM, 0, RelationGetNumberOfBlocks(index));
+ if (set_lwlsn_relation_hook)
+ set_lwlsn_relation_hook(XactLastRecEnd, rlocator, MAIN_FORKNUM);
+ }
+#endif
+ }
+
+#ifdef NEON_SMGR
+ smgr_end_unlogged_build(RelationGetSmgr(index));
+#endif
+
FreeBuildState(buildstate);
}

View File

@@ -1,5 +1,5 @@
diff --git a/src/ruminsert.c b/src/ruminsert.c
index 255e616..7a2240f 100644
index 255e616..1c6edb7 100644
--- a/src/ruminsert.c
+++ b/src/ruminsert.c
@@ -628,6 +628,10 @@ rumbuild(Relation heap, Relation index, struct IndexInfo *indexInfo)
@@ -24,24 +24,12 @@ index 255e616..7a2240f 100644
/*
* Write index to xlog
*/
@@ -713,6 +721,22 @@ rumbuild(Relation heap, Relation index, struct IndexInfo *indexInfo)
@@ -713,6 +721,10 @@ rumbuild(Relation heap, Relation index, struct IndexInfo *indexInfo)
UnlockReleaseBuffer(buffer);
}
+#ifdef NEON_SMGR
+ {
+#if PG_VERSION_NUM >= 160000
+ RelFileLocator rlocator = RelationGetSmgr(index)->smgr_rlocator.locator;
+#else
+ RelFileNode rlocator = RelationGetSmgr(index)->smgr_rnode.node;
+#endif
+ if (set_lwlsn_block_range_hook)
+ set_lwlsn_block_range_hook(XactLastRecEnd, rlocator, MAIN_FORKNUM, 0, RelationGetNumberOfBlocks(index));
+ if (set_lwlsn_relation_hook)
+ set_lwlsn_relation_hook(XactLastRecEnd, rlocator, MAIN_FORKNUM);
+
+ smgr_end_unlogged_build(index->rd_smgr);
+ }
+ smgr_end_unlogged_build(index->rd_smgr);
+#endif
+
/*

View File

@@ -22,7 +22,7 @@ commands:
- name: local_proxy
user: postgres
sysvInitAction: respawn
shell: '/usr/local/bin/local_proxy --config-path /etc/local_proxy/config.json --pid-path /etc/local_proxy/pid --http 0.0.0.0:10432'
shell: 'RUST_LOG="info,proxy::serverless::sql_over_http=warn" /usr/local/bin/local_proxy --config-path /etc/local_proxy/config.json --pid-path /etc/local_proxy/pid --http 0.0.0.0:10432'
- name: postgres-exporter
user: nobody
sysvInitAction: respawn

View File

@@ -22,7 +22,7 @@ commands:
- name: local_proxy
user: postgres
sysvInitAction: respawn
shell: '/usr/local/bin/local_proxy --config-path /etc/local_proxy/config.json --pid-path /etc/local_proxy/pid --http 0.0.0.0:10432'
shell: 'RUST_LOG="info,proxy::serverless::sql_over_http=warn" /usr/local/bin/local_proxy --config-path /etc/local_proxy/config.json --pid-path /etc/local_proxy/pid --http 0.0.0.0:10432'
- name: postgres-exporter
user: nobody
sysvInitAction: respawn

View File

@@ -29,13 +29,12 @@
//! ```sh
//! compute_ctl -D /var/db/postgres/compute \
//! -C 'postgresql://cloud_admin@localhost/postgres' \
//! -S /var/db/postgres/specs/current.json \
//! -c /var/db/postgres/configs/config.json \
//! -b /usr/local/bin/postgres \
//! -r http://pg-ext-s3-gateway \
//! ```
use std::ffi::OsString;
use std::fs::File;
use std::path::Path;
use std::process::exit;
use std::sync::mpsc;
use std::thread;
@@ -43,8 +42,7 @@ use std::time::Duration;
use anyhow::{Context, Result};
use clap::Parser;
use compute_api::responses::ComputeCtlConfig;
use compute_api::spec::ComputeSpec;
use compute_api::responses::ComputeConfig;
use compute_tools::compute::{
BUILD_TAG, ComputeNode, ComputeNodeParams, forward_termination_signal,
};
@@ -59,24 +57,13 @@ use tracing::{error, info};
use url::Url;
use utils::failpoint_support;
// Compatibility hack: if the control plane specified any remote-ext-config
// use the default value for extension storage proxy gateway.
// Remove this once the control plane is updated to pass the gateway URL
fn parse_remote_ext_config(arg: &str) -> Result<String> {
if arg.starts_with("http") {
Ok(arg.trim_end_matches('/').to_string())
} else {
Ok("http://pg-ext-s3-gateway".to_string())
}
}
#[derive(Parser)]
#[command(rename_all = "kebab-case")]
struct Cli {
#[arg(short = 'b', long, default_value = "postgres", env = "POSTGRES_PATH")]
pub pgbin: String,
#[arg(short = 'r', long, value_parser = parse_remote_ext_config)]
#[arg(short = 'r', long)]
pub remote_ext_config: Option<String>,
/// The port to bind the external listening HTTP server to. Clients running
@@ -118,8 +105,8 @@ struct Cli {
#[arg(long)]
pub set_disk_quota_for_fs: Option<String>,
#[arg(short = 'S', long, group = "spec-path")]
pub spec_path: Option<OsString>,
#[arg(short = 'c', long)]
pub config: Option<OsString>,
#[arg(short = 'i', long, group = "compute-id")]
pub compute_id: String,
@@ -127,8 +114,9 @@ struct Cli {
#[arg(
short = 'p',
long,
conflicts_with = "spec-path",
value_name = "CONTROL_PLANE_API_BASE_URL"
conflicts_with = "config",
value_name = "CONTROL_PLANE_API_BASE_URL",
requires = "compute-id"
)]
pub control_plane_uri: Option<String>,
}
@@ -138,7 +126,7 @@ fn main() -> Result<()> {
let scenario = failpoint_support::init();
// For historical reasons, the main thread that processes the spec and launches postgres
// For historical reasons, the main thread that processes the config and launches postgres
// is synchronous, but we always have this tokio runtime available and we "enter" it so
// that you can use tokio::spawn() and tokio::runtime::Handle::current().block_on(...)
// from all parts of compute_ctl.
@@ -154,7 +142,7 @@ fn main() -> Result<()> {
let connstr = Url::parse(&cli.connstr).context("cannot parse connstr as a URL")?;
let cli_spec = try_spec_from_cli(&cli)?;
let config = get_config(&cli)?;
let compute_node = ComputeNode::new(
ComputeNodeParams {
@@ -175,8 +163,7 @@ fn main() -> Result<()> {
#[cfg(target_os = "linux")]
vm_monitor_addr: cli.vm_monitor_addr,
},
cli_spec.spec,
cli_spec.compute_ctl_config,
config,
)?;
let exit_code = compute_node.run()?;
@@ -201,27 +188,17 @@ async fn init() -> Result<()> {
Ok(())
}
fn try_spec_from_cli(cli: &Cli) -> Result<CliSpecParams> {
// First, read spec from the path if provided
if let Some(ref spec_path) = cli.spec_path {
let file = File::open(Path::new(spec_path))?;
return Ok(CliSpecParams {
spec: Some(serde_json::from_reader(file)?),
compute_ctl_config: ComputeCtlConfig::default(),
});
fn get_config(cli: &Cli) -> Result<ComputeConfig> {
// First, read the config from the path if provided
if let Some(ref config) = cli.config {
let file = File::open(config)?;
return Ok(serde_json::from_reader(&file)?);
}
if cli.control_plane_uri.is_none() {
panic!("must specify --control-plane-uri");
};
// If the spec wasn't provided in the CLI arguments, then retrieve it from
// If the config wasn't provided in the CLI arguments, then retrieve it from
// the control plane
match get_spec_from_control_plane(cli.control_plane_uri.as_ref().unwrap(), &cli.compute_id) {
Ok(resp) => Ok(CliSpecParams {
spec: resp.0,
compute_ctl_config: resp.1,
}),
match get_config_from_control_plane(cli.control_plane_uri.as_ref().unwrap(), &cli.compute_id) {
Ok(config) => Ok(config),
Err(e) => {
error!(
"cannot get response from control plane: {}\n\
@@ -233,13 +210,6 @@ fn try_spec_from_cli(cli: &Cli) -> Result<CliSpecParams> {
}
}
struct CliSpecParams {
/// If a spec was provided via CLI or file, the [`ComputeSpec`]
spec: Option<ComputeSpec>,
#[allow(dead_code)]
compute_ctl_config: ComputeCtlConfig,
}
fn deinit_and_exit(exit_code: Option<i32>) -> ! {
// Shutdown trace pipeline gracefully, so that it has a chance to send any
// pending traces before we exit. Shutting down OTEL tracing provider may

View File

@@ -11,7 +11,7 @@ use std::{env, fs};
use anyhow::{Context, Result};
use chrono::{DateTime, Utc};
use compute_api::privilege::Privilege;
use compute_api::responses::{ComputeCtlConfig, ComputeMetrics, ComputeStatus};
use compute_api::responses::{ComputeConfig, ComputeCtlConfig, ComputeMetrics, ComputeStatus};
use compute_api::spec::{
ComputeAudit, ComputeFeature, ComputeMode, ComputeSpec, ExtVersion, PgIdent,
};
@@ -303,11 +303,7 @@ struct StartVmMonitorResult {
}
impl ComputeNode {
pub fn new(
params: ComputeNodeParams,
cli_spec: Option<ComputeSpec>,
compute_ctl_config: ComputeCtlConfig,
) -> Result<Self> {
pub fn new(params: ComputeNodeParams, config: ComputeConfig) -> Result<Self> {
let connstr = params.connstr.as_str();
let conn_conf = postgres::config::Config::from_str(connstr)
.context("cannot build postgres config from connstr")?;
@@ -315,8 +311,8 @@ impl ComputeNode {
.context("cannot build tokio postgres config from connstr")?;
let mut new_state = ComputeState::new();
if let Some(cli_spec) = cli_spec {
let pspec = ParsedSpec::try_from(cli_spec).map_err(|msg| anyhow::anyhow!(msg))?;
if let Some(spec) = config.spec {
let pspec = ParsedSpec::try_from(spec).map_err(|msg| anyhow::anyhow!(msg))?;
new_state.pspec = Some(pspec);
}
@@ -327,7 +323,7 @@ impl ComputeNode {
state: Mutex::new(new_state),
state_changed: Condvar::new(),
ext_download_progress: RwLock::new(HashMap::new()),
compute_ctl_config,
compute_ctl_config: config.compute_ctl_config,
})
}
@@ -645,7 +641,26 @@ impl ComputeNode {
let log_directory_path = Path::new(&self.params.pgdata).join("log");
let log_directory_path = log_directory_path.to_string_lossy().to_string();
configure_audit_rsyslog(log_directory_path.clone(), "hipaa", &remote_endpoint)?;
// Add project_id,endpoint_id tag to identify the logs.
//
// These ids are passed from cplane,
// for backwards compatibility (old computes that don't have them),
// we set them to None.
// TODO: Clean up this code when all computes have them.
let tag: Option<String> = match (
pspec.spec.project_id.as_deref(),
pspec.spec.endpoint_id.as_deref(),
) {
(Some(project_id), Some(endpoint_id)) => {
Some(format!("{project_id}/{endpoint_id}"))
}
(Some(project_id), None) => Some(format!("{project_id}/None")),
(None, Some(endpoint_id)) => Some(format!("None,{endpoint_id}")),
(None, None) => None,
};
configure_audit_rsyslog(log_directory_path.clone(), tag, &remote_endpoint)?;
// Launch a background task to clean up the audit logs
launch_pgaudit_gc(log_directory_path);

View File

@@ -6,4 +6,5 @@ pub(crate) mod request_id;
pub(crate) use json::Json;
pub(crate) use path::Path;
pub(crate) use query::Query;
#[allow(unused)]
pub(crate) use request_id::RequestId;

View File

@@ -1,7 +1,7 @@
use std::{collections::HashSet, net::SocketAddr};
use std::collections::HashSet;
use anyhow::{Result, anyhow};
use axum::{RequestExt, body::Body, extract::ConnectInfo};
use axum::{RequestExt, body::Body};
use axum_extra::{
TypedHeader,
headers::{Authorization, authorization::Bearer},
@@ -11,9 +11,9 @@ use futures::future::BoxFuture;
use http::{Request, Response, StatusCode};
use jsonwebtoken::{Algorithm, DecodingKey, TokenData, Validation, jwk::JwkSet};
use tower_http::auth::AsyncAuthorizeRequest;
use tracing::warn;
use tracing::{debug, warn};
use crate::http::{JsonResponse, extract::RequestId};
use crate::http::JsonResponse;
#[derive(Clone, Debug)]
pub(in crate::http) struct Authorize {
@@ -52,31 +52,6 @@ impl AsyncAuthorizeRequest<Body> for Authorize {
let validation = self.validation.clone();
Box::pin(async move {
let request_id = request.extract_parts::<RequestId>().await.unwrap();
// TODO: Remove this stanza after teaching neon_local and the
// regression tests to use a JWT + JWKS.
//
// https://github.com/neondatabase/neon/issues/11316
if cfg!(feature = "testing") {
warn!(%request_id, "Skipping compute_ctl authorization check");
return Ok(request);
}
let connect_info = request
.extract_parts::<ConnectInfo<SocketAddr>>()
.await
.unwrap();
// In the event the request is coming from the loopback interface,
// allow all requests
if connect_info.ip().is_loopback() {
warn!(%request_id, "Bypassed authorization because request is coming from the loopback interface");
return Ok(request);
}
let TypedHeader(Authorization(bearer)) = request
.extract_parts::<TypedHeader<Authorization<Bearer>>>()
.await
@@ -92,7 +67,7 @@ impl AsyncAuthorizeRequest<Body> for Authorize {
if data.claims.compute_id != compute_id {
return Err(JsonResponse::error(
StatusCode::UNAUTHORIZED,
"invalid claims in authorization token",
"invalid compute ID in authorization token claims",
));
}
@@ -112,12 +87,16 @@ impl Authorize {
token: &str,
validation: &Validation,
) -> Result<TokenData<ComputeClaims>> {
debug_assert!(!jwks.keys.is_empty());
debug!("verifying token {}", token);
for jwk in jwks.keys.iter() {
let decoding_key = match DecodingKey::from_jwk(jwk) {
Ok(key) => key,
Err(e) => {
warn!(
"Failed to construct decoding key from {}: {}",
"failed to construct decoding key from {}: {}",
jwk.common.key_id.as_ref().unwrap(),
e
);
@@ -130,7 +109,7 @@ impl Authorize {
Ok(data) => return Ok(data),
Err(e) => {
warn!(
"Failed to decode authorization token using {}: {}",
"failed to decode authorization token using {}: {}",
jwk.common.key_id.as_ref().unwrap(),
e
);
@@ -140,6 +119,6 @@ impl Authorize {
}
}
Err(anyhow!("Failed to verify authorization token"))
Err(anyhow!("failed to verify authorization token"))
}
}

View File

@@ -19,13 +19,13 @@ pub(crate) static INSTALLED_EXTENSIONS: Lazy<UIntGaugeVec> = Lazy::new(|| {
// but for all our APIs we defined a 'slug'/method/operationId in the OpenAPI spec.
// And it's fair to call it a 'RPC' (Remote Procedure Call).
pub enum CPlaneRequestRPC {
GetSpec,
GetConfig,
}
impl CPlaneRequestRPC {
pub fn as_str(&self) -> &str {
match self {
CPlaneRequestRPC::GetSpec => "GetSpec",
CPlaneRequestRPC::GetConfig => "GetConfig",
}
}
}

View File

@@ -50,13 +50,13 @@ fn restart_rsyslog() -> Result<()> {
pub fn configure_audit_rsyslog(
log_directory: String,
tag: &str,
tag: Option<String>,
remote_endpoint: &str,
) -> Result<()> {
let config_content: String = format!(
include_str!("config_template/compute_audit_rsyslog_template.conf"),
log_directory = log_directory,
tag = tag,
tag = tag.unwrap_or("".to_string()),
remote_endpoint = remote_endpoint
);

View File

@@ -3,9 +3,8 @@ use std::path::Path;
use anyhow::{Result, anyhow, bail};
use compute_api::responses::{
ComputeCtlConfig, ControlPlaneComputeStatus, ControlPlaneSpecResponse,
ComputeConfig, ControlPlaneComputeStatus, ControlPlaneConfigResponse,
};
use compute_api::spec::ComputeSpec;
use reqwest::StatusCode;
use tokio_postgres::Client;
use tracing::{error, info, instrument};
@@ -21,7 +20,7 @@ use crate::params::PG_HBA_ALL_MD5;
fn do_control_plane_request(
uri: &str,
jwt: &str,
) -> Result<ControlPlaneSpecResponse, (bool, String, String)> {
) -> Result<ControlPlaneConfigResponse, (bool, String, String)> {
let resp = reqwest::blocking::Client::new()
.get(uri)
.header("Authorization", format!("Bearer {}", jwt))
@@ -29,14 +28,14 @@ fn do_control_plane_request(
.map_err(|e| {
(
true,
format!("could not perform spec request to control plane: {:?}", e),
format!("could not perform request to control plane: {:?}", e),
UNKNOWN_HTTP_STATUS.to_string(),
)
})?;
let status = resp.status();
match status {
StatusCode::OK => match resp.json::<ControlPlaneSpecResponse>() {
StatusCode::OK => match resp.json::<ControlPlaneConfigResponse>() {
Ok(spec_resp) => Ok(spec_resp),
Err(e) => Err((
true,
@@ -69,40 +68,35 @@ fn do_control_plane_request(
}
}
/// Request spec from the control-plane by compute_id. If `NEON_CONTROL_PLANE_TOKEN`
/// env variable is set, it will be used for authorization.
pub fn get_spec_from_control_plane(
base_uri: &str,
compute_id: &str,
) -> Result<(Option<ComputeSpec>, ComputeCtlConfig)> {
/// Request config from the control-plane by compute_id. If
/// `NEON_CONTROL_PLANE_TOKEN` env variable is set, it will be used for
/// authorization.
pub fn get_config_from_control_plane(base_uri: &str, compute_id: &str) -> Result<ComputeConfig> {
let cp_uri = format!("{base_uri}/compute/api/v2/computes/{compute_id}/spec");
let jwt: String = match std::env::var("NEON_CONTROL_PLANE_TOKEN") {
Ok(v) => v,
Err(_) => "".to_string(),
};
let jwt: String = std::env::var("NEON_CONTROL_PLANE_TOKEN").unwrap_or_default();
let mut attempt = 1;
info!("getting spec from control plane: {}", cp_uri);
info!("getting config from control plane: {}", cp_uri);
// Do 3 attempts to get spec from the control plane using the following logic:
// - network error -> then retry
// - compute id is unknown or any other error -> bail out
// - no spec for compute yet (Empty state) -> return Ok(None)
// - got spec -> return Ok(Some(spec))
// - got config -> return Ok(Some(config))
while attempt < 4 {
let result = match do_control_plane_request(&cp_uri, &jwt) {
Ok(spec_resp) => {
Ok(config_resp) => {
CPLANE_REQUESTS_TOTAL
.with_label_values(&[
CPlaneRequestRPC::GetSpec.as_str(),
CPlaneRequestRPC::GetConfig.as_str(),
&StatusCode::OK.to_string(),
])
.inc();
match spec_resp.status {
ControlPlaneComputeStatus::Empty => Ok((None, spec_resp.compute_ctl_config)),
match config_resp.status {
ControlPlaneComputeStatus::Empty => Ok(config_resp.into()),
ControlPlaneComputeStatus::Attached => {
if let Some(spec) = spec_resp.spec {
Ok((Some(spec), spec_resp.compute_ctl_config))
if config_resp.spec.is_some() {
Ok(config_resp.into())
} else {
bail!("compute is attached, but spec is empty")
}
@@ -111,7 +105,7 @@ pub fn get_spec_from_control_plane(
}
Err((retry, msg, status)) => {
CPLANE_REQUESTS_TOTAL
.with_label_values(&[CPlaneRequestRPC::GetSpec.as_str(), &status])
.with_label_values(&[CPlaneRequestRPC::GetConfig.as_str(), &status])
.inc();
if retry {
Err(anyhow!(msg))
@@ -122,7 +116,7 @@ pub fn get_spec_from_control_plane(
};
if let Err(e) = &result {
error!("attempt {} to get spec failed with: {}", attempt, e);
error!("attempt {} to get config failed with: {}", attempt, e);
} else {
return result;
}
@@ -133,13 +127,13 @@ pub fn get_spec_from_control_plane(
// All attempts failed, return error.
Err(anyhow::anyhow!(
"Exhausted all attempts to retrieve the spec from the control plane"
"Exhausted all attempts to retrieve the config from the control plane"
))
}
/// Check `pg_hba.conf` and update if needed to allow external connections.
pub fn update_pg_hba(pgdata_path: &Path) -> Result<()> {
// XXX: consider making it a part of spec.json
// XXX: consider making it a part of config.json
let pghba_path = pgdata_path.join("pg_hba.conf");
if config::line_in_file(&pghba_path, PG_HBA_ALL_MD5)? {
@@ -153,7 +147,7 @@ pub fn update_pg_hba(pgdata_path: &Path) -> Result<()> {
/// Create a standby.signal file
pub fn add_standby_signal(pgdata_path: &Path) -> Result<()> {
// XXX: consider making it a part of spec.json
// XXX: consider making it a part of config.json
let signalfile = pgdata_path.join("standby.signal");
if !signalfile.exists() {

View File

@@ -6,13 +6,16 @@ license.workspace = true
[dependencies]
anyhow.workspace = true
base64.workspace = true
camino.workspace = true
clap.workspace = true
comfy-table.workspace = true
futures.workspace = true
humantime.workspace = true
jsonwebtoken.workspace = true
nix.workspace = true
once_cell.workspace = true
pem.workspace = true
humantime-serde.workspace = true
hyper0.workspace = true
regex.workspace = true
@@ -20,6 +23,8 @@ reqwest = { workspace = true, features = ["blocking", "json"] }
scopeguard.workspace = true
serde.workspace = true
serde_json.workspace = true
sha2.workspace = true
spki.workspace = true
thiserror.workspace = true
toml.workspace = true
toml_edit.workspace = true

View File

@@ -18,12 +18,11 @@ use anyhow::{Context, Result, anyhow, bail};
use clap::Parser;
use compute_api::spec::ComputeMode;
use control_plane::endpoint::ComputeControlPlane;
use control_plane::endpoint_storage::{ENDPOINT_STORAGE_DEFAULT_PORT, EndpointStorage};
use control_plane::local_env::{
InitForceMode, LocalEnv, NeonBroker, NeonLocalInitConf, NeonLocalInitPageserverConf,
ObjectStorageConf, SafekeeperConf,
EndpointStorageConf, InitForceMode, LocalEnv, NeonBroker, NeonLocalInitConf,
NeonLocalInitPageserverConf, SafekeeperConf,
};
use control_plane::object_storage::OBJECT_STORAGE_DEFAULT_PORT;
use control_plane::object_storage::ObjectStorage;
use control_plane::pageserver::PageServerNode;
use control_plane::safekeeper::SafekeeperNode;
use control_plane::storage_controller::{
@@ -63,7 +62,7 @@ const DEFAULT_PAGESERVER_ID: NodeId = NodeId(1);
const DEFAULT_BRANCH_NAME: &str = "main";
project_git_version!(GIT_VERSION);
const DEFAULT_PG_VERSION: u32 = 16;
const DEFAULT_PG_VERSION: u32 = 17;
const DEFAULT_PAGESERVER_CONTROL_PLANE_API: &str = "http://127.0.0.1:1234/upcall/v1/";
@@ -93,7 +92,7 @@ enum NeonLocalCmd {
#[command(subcommand)]
Safekeeper(SafekeeperCmd),
#[command(subcommand)]
ObjectStorage(ObjectStorageCmd),
EndpointStorage(EndpointStorageCmd),
#[command(subcommand)]
Endpoint(EndpointCmd),
#[command(subcommand)]
@@ -460,14 +459,14 @@ enum SafekeeperCmd {
#[derive(clap::Subcommand)]
#[clap(about = "Manage object storage")]
enum ObjectStorageCmd {
Start(ObjectStorageStartCmd),
Stop(ObjectStorageStopCmd),
enum EndpointStorageCmd {
Start(EndpointStorageStartCmd),
Stop(EndpointStorageStopCmd),
}
#[derive(clap::Args)]
#[clap(about = "Start object storage")]
struct ObjectStorageStartCmd {
struct EndpointStorageStartCmd {
#[clap(short = 't', long, help = "timeout until we fail the command")]
#[arg(default_value = "10s")]
start_timeout: humantime::Duration,
@@ -475,7 +474,7 @@ struct ObjectStorageStartCmd {
#[derive(clap::Args)]
#[clap(about = "Stop object storage")]
struct ObjectStorageStopCmd {
struct EndpointStorageStopCmd {
#[arg(value_enum, default_value = "fast")]
#[clap(
short = 'm',
@@ -552,6 +551,7 @@ enum EndpointCmd {
Start(EndpointStartCmdArgs),
Reconfigure(EndpointReconfigureCmdArgs),
Stop(EndpointStopCmdArgs),
GenerateJwt(EndpointGenerateJwtCmdArgs),
}
#[derive(clap::Args)]
@@ -699,6 +699,13 @@ struct EndpointStopCmdArgs {
mode: String,
}
#[derive(clap::Args)]
#[clap(about = "Generate a JWT for an endpoint")]
struct EndpointGenerateJwtCmdArgs {
#[clap(help = "Postgres endpoint id")]
endpoint_id: String,
}
#[derive(clap::Subcommand)]
#[clap(about = "Manage neon_local branch name mappings")]
enum MappingsCmd {
@@ -789,7 +796,9 @@ fn main() -> Result<()> {
}
NeonLocalCmd::StorageBroker(subcmd) => rt.block_on(handle_storage_broker(&subcmd, env)),
NeonLocalCmd::Safekeeper(subcmd) => rt.block_on(handle_safekeeper(&subcmd, env)),
NeonLocalCmd::ObjectStorage(subcmd) => rt.block_on(handle_object_storage(&subcmd, env)),
NeonLocalCmd::EndpointStorage(subcmd) => {
rt.block_on(handle_endpoint_storage(&subcmd, env))
}
NeonLocalCmd::Endpoint(subcmd) => rt.block_on(handle_endpoint(&subcmd, env)),
NeonLocalCmd::Mappings(subcmd) => handle_mappings(&subcmd, env),
};
@@ -1006,8 +1015,8 @@ fn handle_init(args: &InitCmdArgs) -> anyhow::Result<LocalEnv> {
}
})
.collect(),
object_storage: ObjectStorageConf {
port: OBJECT_STORAGE_DEFAULT_PORT,
endpoint_storage: EndpointStorageConf {
port: ENDPOINT_STORAGE_DEFAULT_PORT,
},
pg_distrib_dir: None,
neon_distrib_dir: None,
@@ -1528,6 +1537,16 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
.with_context(|| format!("postgres endpoint {endpoint_id} is not found"))?;
endpoint.stop(&args.mode, args.destroy)?;
}
EndpointCmd::GenerateJwt(args) => {
let endpoint_id = &args.endpoint_id;
let endpoint = cplane
.endpoints
.get(endpoint_id)
.with_context(|| format!("postgres endpoint {endpoint_id} is not found"))?;
let jwt = endpoint.generate_jwt()?;
print!("{jwt}");
}
}
Ok(())
@@ -1717,12 +1736,15 @@ async fn handle_safekeeper(subcmd: &SafekeeperCmd, env: &local_env::LocalEnv) ->
Ok(())
}
async fn handle_object_storage(subcmd: &ObjectStorageCmd, env: &local_env::LocalEnv) -> Result<()> {
use ObjectStorageCmd::*;
let storage = ObjectStorage::from_env(env);
async fn handle_endpoint_storage(
subcmd: &EndpointStorageCmd,
env: &local_env::LocalEnv,
) -> Result<()> {
use EndpointStorageCmd::*;
let storage = EndpointStorage::from_env(env);
// In tests like test_forward_compatibility or test_graceful_cluster_restart
// old neon binaries (without object_storage) are present
// old neon binaries (without endpoint_storage) are present
if !storage.bin.exists() {
eprintln!(
"{} binary not found. Ignore if this is a compatibility test",
@@ -1732,13 +1754,13 @@ async fn handle_object_storage(subcmd: &ObjectStorageCmd, env: &local_env::Local
}
match subcmd {
Start(ObjectStorageStartCmd { start_timeout }) => {
Start(EndpointStorageStartCmd { start_timeout }) => {
if let Err(e) = storage.start(start_timeout).await {
eprintln!("object_storage start failed: {e}");
eprintln!("endpoint_storage start failed: {e}");
exit(1);
}
}
Stop(ObjectStorageStopCmd { stop_mode }) => {
Stop(EndpointStorageStopCmd { stop_mode }) => {
let immediate = match stop_mode {
StopMode::Fast => false,
StopMode::Immediate => true,
@@ -1848,10 +1870,10 @@ async fn handle_start_all_impl(
}
js.spawn(async move {
ObjectStorage::from_env(env)
EndpointStorage::from_env(env)
.start(&retry_timeout)
.await
.map_err(|e| e.context("start object_storage"))
.map_err(|e| e.context("start endpoint_storage"))
});
})();
@@ -1950,9 +1972,9 @@ async fn try_stop_all(env: &local_env::LocalEnv, immediate: bool) {
}
}
let storage = ObjectStorage::from_env(env);
let storage = EndpointStorage::from_env(env);
if let Err(e) = storage.stop(immediate) {
eprintln!("object_storage stop failed: {:#}", e);
eprintln!("endpoint_storage stop failed: {:#}", e);
}
for ps_conf in &env.pageservers {

View File

@@ -29,7 +29,7 @@
//! compute.log - log output of `compute_ctl` and `postgres`
//! endpoint.json - serialized `EndpointConf` struct
//! postgresql.conf - postgresql settings
//! spec.json - passed to `compute_ctl`
//! config.json - passed to `compute_ctl`
//! pgdata/
//! postgresql.conf - copy of postgresql.conf created by `compute_ctl`
//! zenith.signal
@@ -42,20 +42,30 @@ use std::path::PathBuf;
use std::process::Command;
use std::str::FromStr;
use std::sync::Arc;
use std::time::{Duration, Instant, SystemTime, UNIX_EPOCH};
use std::time::{Duration, Instant};
use anyhow::{Context, Result, anyhow, bail};
use compute_api::requests::ConfigurationRequest;
use compute_api::responses::{ComputeCtlConfig, ComputeStatus, ComputeStatusResponse};
use compute_api::requests::{ComputeClaims, ConfigurationRequest};
use compute_api::responses::{
ComputeConfig, ComputeCtlConfig, ComputeStatus, ComputeStatusResponse, TlsConfig,
};
use compute_api::spec::{
Cluster, ComputeAudit, ComputeFeature, ComputeMode, ComputeSpec, Database, PgIdent,
RemoteExtSpec, Role,
};
use jsonwebtoken::jwk::{
AlgorithmParameters, CommonParameters, EllipticCurve, Jwk, JwkSet, KeyAlgorithm, KeyOperations,
OctetKeyPairParameters, OctetKeyPairType, PublicKeyUse,
};
use nix::sys::signal::{Signal, kill};
use pageserver_api::shard::ShardStripeSize;
use pem::Pem;
use reqwest::header::CONTENT_TYPE;
use safekeeper_api::membership::SafekeeperGeneration;
use serde::{Deserialize, Serialize};
use sha2::{Digest, Sha256};
use spki::der::Decode;
use spki::{SubjectPublicKeyInfo, SubjectPublicKeyInfoRef};
use tracing::debug;
use url::Host;
use utils::id::{NodeId, TenantId, TimelineId};
@@ -80,6 +90,7 @@ pub struct EndpointConf {
drop_subscriptions_before_start: bool,
features: Vec<ComputeFeature>,
cluster: Option<Cluster>,
compute_ctl_config: ComputeCtlConfig,
}
//
@@ -135,6 +146,37 @@ impl ComputeControlPlane {
.unwrap_or(self.base_port)
}
/// Create a JSON Web Key Set. This ideally matches the way we create a JWKS
/// from the production control plane.
fn create_jwks_from_pem(pem: &Pem) -> Result<JwkSet> {
let spki: SubjectPublicKeyInfoRef = SubjectPublicKeyInfo::from_der(pem.contents())?;
let public_key = spki.subject_public_key.raw_bytes();
let mut hasher = Sha256::new();
hasher.update(public_key);
let key_hash = hasher.finalize();
Ok(JwkSet {
keys: vec![Jwk {
common: CommonParameters {
public_key_use: Some(PublicKeyUse::Signature),
key_operations: Some(vec![KeyOperations::Verify]),
key_algorithm: Some(KeyAlgorithm::EdDSA),
key_id: Some(base64::encode_config(key_hash, base64::URL_SAFE_NO_PAD)),
x509_url: None::<String>,
x509_chain: None::<Vec<String>>,
x509_sha1_fingerprint: None::<String>,
x509_sha256_fingerprint: None::<String>,
},
algorithm: AlgorithmParameters::OctetKeyPair(OctetKeyPairParameters {
key_type: OctetKeyPairType::OctetKeyPair,
curve: EllipticCurve::Ed25519,
x: base64::encode_config(public_key, base64::URL_SAFE_NO_PAD),
}),
}],
})
}
#[allow(clippy::too_many_arguments)]
pub fn new_endpoint(
&mut self,
@@ -152,6 +194,10 @@ impl ComputeControlPlane {
let pg_port = pg_port.unwrap_or_else(|| self.get_port());
let external_http_port = external_http_port.unwrap_or_else(|| self.get_port() + 1);
let internal_http_port = internal_http_port.unwrap_or_else(|| external_http_port + 1);
let compute_ctl_config = ComputeCtlConfig {
jwks: Self::create_jwks_from_pem(&self.env.read_public_key()?)?,
tls: None::<TlsConfig>,
};
let ep = Arc::new(Endpoint {
endpoint_id: endpoint_id.to_owned(),
pg_address: SocketAddr::new(IpAddr::from(Ipv4Addr::LOCALHOST), pg_port),
@@ -179,6 +225,7 @@ impl ComputeControlPlane {
reconfigure_concurrency: 1,
features: vec![],
cluster: None,
compute_ctl_config: compute_ctl_config.clone(),
});
ep.create_endpoint_dir()?;
@@ -198,6 +245,7 @@ impl ComputeControlPlane {
reconfigure_concurrency: 1,
features: vec![],
cluster: None,
compute_ctl_config,
})?,
)?;
std::fs::write(
@@ -240,7 +288,6 @@ impl ComputeControlPlane {
///////////////////////////////////////////////////////////////////////////////
#[derive(Debug)]
pub struct Endpoint {
/// used as the directory name
endpoint_id: String,
@@ -269,6 +316,9 @@ pub struct Endpoint {
features: Vec<ComputeFeature>,
// Cluster settings
cluster: Option<Cluster>,
/// The compute_ctl config for the endpoint's compute.
compute_ctl_config: ComputeCtlConfig,
}
#[derive(PartialEq, Eq)]
@@ -331,6 +381,7 @@ impl Endpoint {
drop_subscriptions_before_start: conf.drop_subscriptions_before_start,
features: conf.features,
cluster: conf.cluster,
compute_ctl_config: conf.compute_ctl_config,
})
}
@@ -578,6 +629,13 @@ impl Endpoint {
Ok(safekeeper_connstrings)
}
/// Generate a JWT with the correct claims.
pub fn generate_jwt(&self) -> Result<String> {
self.env.generate_auth_token(&ComputeClaims {
compute_id: self.endpoint_id.clone(),
})
}
#[allow(clippy::too_many_arguments)]
pub async fn start(
&self,
@@ -619,90 +677,97 @@ impl Endpoint {
remote_extensions = None;
};
// Create spec file
let mut spec = ComputeSpec {
skip_pg_catalog_updates: self.skip_pg_catalog_updates,
format_version: 1.0,
operation_uuid: None,
features: self.features.clone(),
swap_size_bytes: None,
disk_quota_bytes: None,
disable_lfc_resizing: None,
cluster: Cluster {
cluster_id: None, // project ID: not used
name: None, // project name: not used
state: None,
roles: if create_test_user {
vec![Role {
// Create config file
let config = {
let mut spec = ComputeSpec {
skip_pg_catalog_updates: self.skip_pg_catalog_updates,
format_version: 1.0,
operation_uuid: None,
features: self.features.clone(),
swap_size_bytes: None,
disk_quota_bytes: None,
disable_lfc_resizing: None,
cluster: Cluster {
cluster_id: None, // project ID: not used
name: None, // project name: not used
state: None,
roles: if create_test_user {
vec![Role {
name: PgIdent::from_str("test").unwrap(),
encrypted_password: None,
options: None,
}]
} else {
Vec::new()
},
databases: if create_test_user {
vec![Database {
name: PgIdent::from_str("neondb").unwrap(),
owner: PgIdent::from_str("test").unwrap(),
options: None,
restrict_conn: false,
invalid: false,
}]
} else {
Vec::new()
},
settings: None,
postgresql_conf: Some(postgresql_conf.clone()),
},
delta_operations: None,
tenant_id: Some(self.tenant_id),
timeline_id: Some(self.timeline_id),
project_id: None,
branch_id: None,
endpoint_id: Some(self.endpoint_id.clone()),
mode: self.mode,
pageserver_connstring: Some(pageserver_connstring),
safekeepers_generation: safekeepers_generation.map(|g| g.into_inner()),
safekeeper_connstrings,
storage_auth_token: auth_token.clone(),
remote_extensions,
pgbouncer_settings: None,
shard_stripe_size: Some(shard_stripe_size),
local_proxy_config: None,
reconfigure_concurrency: self.reconfigure_concurrency,
drop_subscriptions_before_start: self.drop_subscriptions_before_start,
audit_log_level: ComputeAudit::Disabled,
logs_export_host: None::<String>,
};
// this strange code is needed to support respec() in tests
if self.cluster.is_some() {
debug!("Cluster is already set in the endpoint spec, using it");
spec.cluster = self.cluster.clone().unwrap();
debug!("spec.cluster {:?}", spec.cluster);
// fill missing fields again
if create_test_user {
spec.cluster.roles.push(Role {
name: PgIdent::from_str("test").unwrap(),
encrypted_password: None,
options: None,
}]
} else {
Vec::new()
},
databases: if create_test_user {
vec![Database {
});
spec.cluster.databases.push(Database {
name: PgIdent::from_str("neondb").unwrap(),
owner: PgIdent::from_str("test").unwrap(),
options: None,
restrict_conn: false,
invalid: false,
}]
} else {
Vec::new()
},
settings: None,
postgresql_conf: Some(postgresql_conf.clone()),
},
delta_operations: None,
tenant_id: Some(self.tenant_id),
timeline_id: Some(self.timeline_id),
project_id: None,
branch_id: None,
endpoint_id: Some(self.endpoint_id.clone()),
mode: self.mode,
pageserver_connstring: Some(pageserver_connstring),
safekeepers_generation: safekeepers_generation.map(|g| g.into_inner()),
safekeeper_connstrings,
storage_auth_token: auth_token.clone(),
remote_extensions,
pgbouncer_settings: None,
shard_stripe_size: Some(shard_stripe_size),
local_proxy_config: None,
reconfigure_concurrency: self.reconfigure_concurrency,
drop_subscriptions_before_start: self.drop_subscriptions_before_start,
audit_log_level: ComputeAudit::Disabled,
logs_export_host: None::<String>,
});
}
spec.cluster.postgresql_conf = Some(postgresql_conf);
}
ComputeConfig {
spec: Some(spec),
compute_ctl_config: self.compute_ctl_config.clone(),
}
};
// this strange code is needed to support respec() in tests
if self.cluster.is_some() {
debug!("Cluster is already set in the endpoint spec, using it");
spec.cluster = self.cluster.clone().unwrap();
debug!("spec.cluster {:?}", spec.cluster);
// fill missing fields again
if create_test_user {
spec.cluster.roles.push(Role {
name: PgIdent::from_str("test").unwrap(),
encrypted_password: None,
options: None,
});
spec.cluster.databases.push(Database {
name: PgIdent::from_str("neondb").unwrap(),
owner: PgIdent::from_str("test").unwrap(),
options: None,
restrict_conn: false,
invalid: false,
});
}
spec.cluster.postgresql_conf = Some(postgresql_conf);
}
let spec_path = self.endpoint_path().join("spec.json");
std::fs::write(spec_path, serde_json::to_string_pretty(&spec)?)?;
let config_path = self.endpoint_path().join("config.json");
std::fs::write(config_path, serde_json::to_string_pretty(&config)?)?;
// Open log file. We'll redirect the stdout and stderr of `compute_ctl` to it.
let logfile = std::fs::OpenOptions::new()
@@ -728,10 +793,8 @@ impl Endpoint {
])
.args(["--pgdata", self.pgdata().to_str().unwrap()])
.args(["--connstr", &conn_str])
.args([
"--spec-path",
self.endpoint_path().join("spec.json").to_str().unwrap(),
])
.arg("--config")
.arg(self.endpoint_path().join("config.json").as_os_str())
.args([
"--pgbin",
self.env
@@ -742,16 +805,7 @@ impl Endpoint {
])
// TODO: It would be nice if we generated compute IDs with the same
// algorithm as the real control plane.
.args([
"--compute-id",
&format!(
"compute-{}",
SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap()
.as_secs()
),
])
.args(["--compute-id", &self.endpoint_id])
.stdin(std::process::Stdio::null())
.stderr(logfile.try_clone()?)
.stdout(logfile);
@@ -849,6 +903,7 @@ impl Endpoint {
self.external_http_address.port()
),
)
.bearer_auth(self.generate_jwt()?)
.send()
.await?;
@@ -873,10 +928,12 @@ impl Endpoint {
stripe_size: Option<ShardStripeSize>,
safekeepers: Option<Vec<NodeId>>,
) -> Result<()> {
let mut spec: ComputeSpec = {
let spec_path = self.endpoint_path().join("spec.json");
let file = std::fs::File::open(spec_path)?;
serde_json::from_reader(file)?
let (mut spec, compute_ctl_config) = {
let config_path = self.endpoint_path().join("config.json");
let file = std::fs::File::open(config_path)?;
let config: ComputeConfig = serde_json::from_reader(file)?;
(config.spec.unwrap(), config.compute_ctl_config)
};
let postgresql_conf = self.read_postgresql_conf()?;
@@ -923,10 +980,11 @@ impl Endpoint {
self.external_http_address.port()
))
.header(CONTENT_TYPE.as_str(), "application/json")
.bearer_auth(self.generate_jwt()?)
.body(
serde_json::to_string(&ConfigurationRequest {
spec,
compute_ctl_config: ComputeCtlConfig::default(),
compute_ctl_config,
})
.unwrap(),
)

View File

@@ -1,34 +1,33 @@
use crate::background_process::{self, start_process, stop_process};
use crate::local_env::LocalEnv;
use anyhow::anyhow;
use anyhow::{Context, Result};
use camino::Utf8PathBuf;
use std::io::Write;
use std::time::Duration;
/// Directory within .neon which will be used by default for LocalFs remote storage.
pub const OBJECT_STORAGE_REMOTE_STORAGE_DIR: &str = "local_fs_remote_storage/object_storage";
pub const OBJECT_STORAGE_DEFAULT_PORT: u16 = 9993;
pub const ENDPOINT_STORAGE_REMOTE_STORAGE_DIR: &str = "local_fs_remote_storage/endpoint_storage";
pub const ENDPOINT_STORAGE_DEFAULT_PORT: u16 = 9993;
pub struct ObjectStorage {
pub struct EndpointStorage {
pub bin: Utf8PathBuf,
pub data_dir: Utf8PathBuf,
pub pemfile: Utf8PathBuf,
pub port: u16,
}
impl ObjectStorage {
pub fn from_env(env: &LocalEnv) -> ObjectStorage {
ObjectStorage {
bin: Utf8PathBuf::from_path_buf(env.object_storage_bin()).unwrap(),
data_dir: Utf8PathBuf::from_path_buf(env.object_storage_data_dir()).unwrap(),
impl EndpointStorage {
pub fn from_env(env: &LocalEnv) -> EndpointStorage {
EndpointStorage {
bin: Utf8PathBuf::from_path_buf(env.endpoint_storage_bin()).unwrap(),
data_dir: Utf8PathBuf::from_path_buf(env.endpoint_storage_data_dir()).unwrap(),
pemfile: Utf8PathBuf::from_path_buf(env.public_key_path.clone()).unwrap(),
port: env.object_storage.port,
port: env.endpoint_storage.port,
}
}
fn config_path(&self) -> Utf8PathBuf {
self.data_dir.join("object_storage.json")
self.data_dir.join("endpoint_storage.json")
}
fn listen_addr(&self) -> Utf8PathBuf {
@@ -49,7 +48,7 @@ impl ObjectStorage {
let cfg = Cfg {
listen: self.listen_addr(),
pemfile: parent.join(self.pemfile.clone()),
local_path: parent.join(OBJECT_STORAGE_REMOTE_STORAGE_DIR),
local_path: parent.join(ENDPOINT_STORAGE_REMOTE_STORAGE_DIR),
r#type: "LocalFs".to_string(),
};
std::fs::create_dir_all(self.config_path().parent().unwrap())?;
@@ -59,24 +58,19 @@ impl ObjectStorage {
}
pub async fn start(&self, retry_timeout: &Duration) -> Result<()> {
println!("Starting s3 proxy at {}", self.listen_addr());
println!("Starting endpoint_storage at {}", self.listen_addr());
std::io::stdout().flush().context("flush stdout")?;
let process_status_check = || async {
tokio::time::sleep(Duration::from_millis(500)).await;
let res = reqwest::Client::new()
.get(format!("http://{}/metrics", self.listen_addr()))
.send()
.await;
match res {
Ok(response) if response.status().is_success() => Ok(true),
Ok(_) => Err(anyhow!("Failed to query /metrics")),
Err(e) => Err(anyhow!("Failed to check node status: {e}")),
let res = reqwest::Client::new().get(format!("http://{}/metrics", self.listen_addr()));
match res.send().await {
Ok(res) => Ok(res.status().is_success()),
Err(_) => Ok(false),
}
};
let res = start_process(
"object_storage",
"endpoint_storage",
&self.data_dir.clone().into_std_path_buf(),
&self.bin.clone().into_std_path_buf(),
vec![self.config_path().to_string()],
@@ -94,14 +88,14 @@ impl ObjectStorage {
}
pub fn stop(&self, immediate: bool) -> anyhow::Result<()> {
stop_process(immediate, "object_storage", &self.pid_file())
stop_process(immediate, "endpoint_storage", &self.pid_file())
}
fn log_file(&self) -> Utf8PathBuf {
self.data_dir.join("object_storage.log")
self.data_dir.join("endpoint_storage.log")
}
fn pid_file(&self) -> Utf8PathBuf {
self.data_dir.join("object_storage.pid")
self.data_dir.join("endpoint_storage.pid")
}
}

View File

@@ -9,8 +9,8 @@
mod background_process;
pub mod broker;
pub mod endpoint;
pub mod endpoint_storage;
pub mod local_env;
pub mod object_storage;
pub mod pageserver;
pub mod postgresql_conf;
pub mod safekeeper;

View File

@@ -12,17 +12,18 @@ use std::{env, fs};
use anyhow::{Context, bail};
use clap::ValueEnum;
use pem::Pem;
use postgres_backend::AuthType;
use reqwest::Url;
use serde::{Deserialize, Serialize};
use utils::auth::encode_from_key_file;
use utils::id::{NodeId, TenantId, TenantTimelineId, TimelineId};
use crate::object_storage::{OBJECT_STORAGE_REMOTE_STORAGE_DIR, ObjectStorage};
use crate::endpoint_storage::{ENDPOINT_STORAGE_REMOTE_STORAGE_DIR, EndpointStorage};
use crate::pageserver::{PAGESERVER_REMOTE_STORAGE_DIR, PageServerNode};
use crate::safekeeper::SafekeeperNode;
pub const DEFAULT_PG_VERSION: u32 = 16;
pub const DEFAULT_PG_VERSION: u32 = 17;
//
// This data structures represents neon_local CLI config
@@ -56,6 +57,7 @@ pub struct LocalEnv {
// used to issue tokens during e.g pg start
pub private_key_path: PathBuf,
/// Path to environment's public key
pub public_key_path: PathBuf,
pub broker: NeonBroker,
@@ -70,7 +72,7 @@ pub struct LocalEnv {
pub safekeepers: Vec<SafekeeperConf>,
pub object_storage: ObjectStorageConf,
pub endpoint_storage: EndpointStorageConf,
// Control plane upcall API for pageserver: if None, we will not run storage_controller If set, this will
// be propagated into each pageserver's configuration.
@@ -108,7 +110,7 @@ pub struct OnDiskConfig {
)]
pub pageservers: Vec<PageServerConf>,
pub safekeepers: Vec<SafekeeperConf>,
pub object_storage: ObjectStorageConf,
pub endpoint_storage: EndpointStorageConf,
pub control_plane_api: Option<Url>,
pub control_plane_hooks_api: Option<Url>,
pub control_plane_compute_hook_api: Option<Url>,
@@ -142,7 +144,7 @@ pub struct NeonLocalInitConf {
pub storage_controller: Option<NeonStorageControllerConf>,
pub pageservers: Vec<NeonLocalInitPageserverConf>,
pub safekeepers: Vec<SafekeeperConf>,
pub object_storage: ObjectStorageConf,
pub endpoint_storage: EndpointStorageConf,
pub control_plane_api: Option<Url>,
pub control_plane_hooks_api: Option<Url>,
pub generate_local_ssl_certs: bool,
@@ -150,7 +152,7 @@ pub struct NeonLocalInitConf {
#[derive(Serialize, Default, Deserialize, PartialEq, Eq, Clone, Debug)]
#[serde(default)]
pub struct ObjectStorageConf {
pub struct EndpointStorageConf {
pub port: u16,
}
@@ -411,8 +413,8 @@ impl LocalEnv {
self.pg_dir(pg_version, "lib")
}
pub fn object_storage_bin(&self) -> PathBuf {
self.neon_distrib_dir.join("object_storage")
pub fn endpoint_storage_bin(&self) -> PathBuf {
self.neon_distrib_dir.join("endpoint_storage")
}
pub fn pageserver_bin(&self) -> PathBuf {
@@ -448,8 +450,8 @@ impl LocalEnv {
self.base_data_dir.join("safekeepers").join(data_dir_name)
}
pub fn object_storage_data_dir(&self) -> PathBuf {
self.base_data_dir.join("object_storage")
pub fn endpoint_storage_data_dir(&self) -> PathBuf {
self.base_data_dir.join("endpoint_storage")
}
pub fn get_pageserver_conf(&self, id: NodeId) -> anyhow::Result<&PageServerConf> {
@@ -613,7 +615,7 @@ impl LocalEnv {
control_plane_compute_hook_api: _,
branch_name_mappings,
generate_local_ssl_certs,
object_storage,
endpoint_storage,
} = on_disk_config;
LocalEnv {
base_data_dir: repopath.to_owned(),
@@ -630,7 +632,7 @@ impl LocalEnv {
control_plane_hooks_api,
branch_name_mappings,
generate_local_ssl_certs,
object_storage,
endpoint_storage,
}
};
@@ -740,7 +742,7 @@ impl LocalEnv {
control_plane_compute_hook_api: None,
branch_name_mappings: self.branch_name_mappings.clone(),
generate_local_ssl_certs: self.generate_local_ssl_certs,
object_storage: self.object_storage.clone(),
endpoint_storage: self.endpoint_storage.clone(),
},
)
}
@@ -758,11 +760,11 @@ impl LocalEnv {
// this function is used only for testing purposes in CLI e g generate tokens during init
pub fn generate_auth_token<S: Serialize>(&self, claims: &S) -> anyhow::Result<String> {
let private_key_path = self.get_private_key_path();
let key_data = fs::read(private_key_path)?;
encode_from_key_file(claims, &key_data)
let key = self.read_private_key()?;
encode_from_key_file(claims, &key)
}
/// Get the path to the private key.
pub fn get_private_key_path(&self) -> PathBuf {
if self.private_key_path.is_absolute() {
self.private_key_path.to_path_buf()
@@ -771,6 +773,29 @@ impl LocalEnv {
}
}
/// Get the path to the public key.
pub fn get_public_key_path(&self) -> PathBuf {
if self.public_key_path.is_absolute() {
self.public_key_path.to_path_buf()
} else {
self.base_data_dir.join(&self.public_key_path)
}
}
/// Read the contents of the private key file.
pub fn read_private_key(&self) -> anyhow::Result<Pem> {
let private_key_path = self.get_private_key_path();
let pem = pem::parse(fs::read(private_key_path)?)?;
Ok(pem)
}
/// Read the contents of the public key file.
pub fn read_public_key(&self) -> anyhow::Result<Pem> {
let public_key_path = self.get_public_key_path();
let pem = pem::parse(fs::read(public_key_path)?)?;
Ok(pem)
}
/// Materialize the [`NeonLocalInitConf`] to disk. Called during [`neon_local init`].
pub fn init(conf: NeonLocalInitConf, force: &InitForceMode) -> anyhow::Result<()> {
let base_path = base_path();
@@ -824,7 +849,7 @@ impl LocalEnv {
control_plane_api,
generate_local_ssl_certs,
control_plane_hooks_api,
object_storage,
endpoint_storage,
} = conf;
// Find postgres binaries.
@@ -876,7 +901,7 @@ impl LocalEnv {
control_plane_hooks_api,
branch_name_mappings: Default::default(),
generate_local_ssl_certs,
object_storage,
endpoint_storage,
};
if generate_local_ssl_certs {
@@ -904,13 +929,13 @@ impl LocalEnv {
.context("pageserver init failed")?;
}
ObjectStorage::from_env(&env)
EndpointStorage::from_env(&env)
.init()
.context("object storage init failed")?;
// setup remote remote location for default LocalFs remote storage
std::fs::create_dir_all(env.base_data_dir.join(PAGESERVER_REMOTE_STORAGE_DIR))?;
std::fs::create_dir_all(env.base_data_dir.join(OBJECT_STORAGE_REMOTE_STORAGE_DIR))?;
std::fs::create_dir_all(env.base_data_dir.join(ENDPOINT_STORAGE_REMOTE_STORAGE_DIR))?;
env.persist_config()
}
@@ -956,6 +981,7 @@ fn generate_auth_keys(private_key_path: &Path, public_key_path: &Path) -> anyhow
String::from_utf8_lossy(&keygen_output.stderr)
);
}
// Extract the public key from the private key file
//
// openssl pkey -in auth_private_key.pem -pubout -out auth_public_key.pem
@@ -972,6 +998,7 @@ fn generate_auth_keys(private_key_path: &Path, public_key_path: &Path) -> anyhow
String::from_utf8_lossy(&keygen_output.stderr)
);
}
Ok(())
}
@@ -980,7 +1007,7 @@ fn generate_ssl_ca_cert(cert_path: &Path, key_path: &Path) -> anyhow::Result<()>
// -out rootCA.crt -keyout rootCA.key
let keygen_output = Command::new("openssl")
.args([
"req", "-x509", "-newkey", "rsa:2048", "-nodes", "-days", "36500",
"req", "-x509", "-newkey", "ed25519", "-nodes", "-days", "36500",
])
.args(["-subj", "/CN=Neon Local CA"])
.args(["-out", cert_path.to_str().unwrap()])
@@ -1010,7 +1037,7 @@ fn generate_ssl_cert(
// -subj "/CN=localhost" -addext "subjectAltName=DNS:localhost,IP:127.0.0.1"
let keygen_output = Command::new("openssl")
.args(["req", "-new", "-nodes"])
.args(["-newkey", "rsa:2048"])
.args(["-newkey", "ed25519"])
.args(["-subj", "/CN=localhost"])
.args(["-addext", "subjectAltName=DNS:localhost,IP:127.0.0.1"])
.args(["-keyout", key_path.to_str().unwrap()])

View File

@@ -413,6 +413,11 @@ impl PageServerNode {
.map(serde_json::from_str)
.transpose()
.context("Failed to parse 'compaction_algorithm' json")?,
compaction_shard_ancestor: settings
.remove("compaction_shard_ancestor")
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'compaction_shard_ancestor' as a bool")?,
compaction_l0_first: settings
.remove("compaction_l0_first")
.map(|x| x.parse::<bool>())
@@ -535,6 +540,11 @@ impl PageServerNode {
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'gc_compaction_enabled' as bool")?,
gc_compaction_verification: settings
.remove("gc_compaction_verification")
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'gc_compaction_verification' as bool")?,
gc_compaction_initial_threshold_kb: settings
.remove("gc_compaction_initial_threshold_kb")
.map(|x| x.parse::<u64>())

View File

@@ -18,6 +18,7 @@ use pageserver_api::models::{
};
use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api::ResponseErrorMessageExt;
use pem::Pem;
use postgres_backend::AuthType;
use reqwest::{Certificate, Method};
use serde::de::DeserializeOwned;
@@ -34,8 +35,8 @@ use crate::local_env::{LocalEnv, NeonStorageControllerConf};
pub struct StorageController {
env: LocalEnv,
private_key: Option<Vec<u8>>,
public_key: Option<String>,
private_key: Option<Pem>,
public_key: Option<Pem>,
client: reqwest::Client,
config: NeonStorageControllerConf,
@@ -116,7 +117,9 @@ impl StorageController {
AuthType::Trust => (None, None),
AuthType::NeonJWT => {
let private_key_path = env.get_private_key_path();
let private_key = fs::read(private_key_path).expect("failed to read private key");
let private_key =
pem::parse(fs::read(private_key_path).expect("failed to read private key"))
.expect("failed to parse PEM file");
// If pageserver auth is enabled, this implicitly enables auth for this service,
// using the same credentials.
@@ -138,9 +141,13 @@ impl StorageController {
.expect("Empty key dir")
.expect("Error reading key dir");
std::fs::read_to_string(dent.path()).expect("Can't read public key")
pem::parse(std::fs::read_to_string(dent.path()).expect("Can't read public key"))
.expect("Failed to parse PEM file")
} else {
std::fs::read_to_string(&public_key_path).expect("Can't read public key")
pem::parse(
std::fs::read_to_string(&public_key_path).expect("Can't read public key"),
)
.expect("Failed to parse PEM file")
};
(Some(private_key), Some(public_key))
}

View File

@@ -1,4 +1,3 @@
# Example docker compose configuration
The configuration in this directory is used for testing Neon docker images: it is
@@ -8,3 +7,13 @@ you can experiment with a miniature Neon system, use `cargo neon` rather than co
This configuration does not start the storage controller, because the controller
needs a way to reconfigure running computes, and no such thing exists in this setup.
## Generating the JWKS for a compute
```shell
openssl genpkey -algorithm Ed25519 -out private-key.pem
openssl pkey -in private-key.pem -pubout -out public-key.pem
openssl pkey -pubin -inform pem -in public-key.pem -pubout -outform der -out public-key.der
key="$(xxd -plain -cols 32 -s -32 public-key.der)"
key_id="$(printf '%s' "$key" | sha256sum | awk '{ print $1 }' | basenc --base64url --wrap=0)"
x="$(printf '%s' "$key" | basenc --base64url --wrap=0)"
```

View File

@@ -0,0 +1,3 @@
-----BEGIN PRIVATE KEY-----
MC4CAQAwBQYDK2VwBCIEIOmnRbzt2AJ0d+S3aU1hiYOl/tXpvz1FmWBfwHYBgOma
-----END PRIVATE KEY-----

Binary file not shown.

View File

@@ -0,0 +1,3 @@
-----BEGIN PUBLIC KEY-----
MCowBQYDK2VwAyEADY0al/U0bgB3+9fUGk+3PKWnsck9OyxN5DjHIN6Xep0=
-----END PUBLIC KEY-----

View File

@@ -11,8 +11,8 @@ generate_id() {
PG_VERSION=${PG_VERSION:-14}
SPEC_FILE_ORG=/var/db/postgres/specs/spec.json
SPEC_FILE=/tmp/spec.json
CONFIG_FILE_ORG=/var/db/postgres/configs/config.json
CONFIG_FILE=/tmp/config.json
echo "Waiting pageserver become ready."
while ! nc -z pageserver 6400; do
@@ -20,7 +20,7 @@ while ! nc -z pageserver 6400; do
done
echo "Page server is ready."
cp ${SPEC_FILE_ORG} ${SPEC_FILE}
cp ${CONFIG_FILE_ORG} ${CONFIG_FILE}
if [ -n "${TENANT_ID:-}" ] && [ -n "${TIMELINE_ID:-}" ]; then
tenant_id=${TENANT_ID}
@@ -73,17 +73,17 @@ else
ulid_extension=ulid
fi
echo "Adding pgx_ulid"
shared_libraries=$(jq -r '.cluster.settings[] | select(.name=="shared_preload_libraries").value' ${SPEC_FILE})
sed -i "s/${shared_libraries}/${shared_libraries},${ulid_extension}/" ${SPEC_FILE}
shared_libraries=$(jq -r '.spec.cluster.settings[] | select(.name=="shared_preload_libraries").value' ${CONFIG_FILE})
sed -i "s/${shared_libraries}/${shared_libraries},${ulid_extension}/" ${CONFIG_FILE}
echo "Overwrite tenant id and timeline id in spec file"
sed -i "s/TENANT_ID/${tenant_id}/" ${SPEC_FILE}
sed -i "s/TIMELINE_ID/${timeline_id}/" ${SPEC_FILE}
sed -i "s/TENANT_ID/${tenant_id}/" ${CONFIG_FILE}
sed -i "s/TIMELINE_ID/${timeline_id}/" ${CONFIG_FILE}
cat ${SPEC_FILE}
cat ${CONFIG_FILE}
echo "Start compute node"
/usr/local/bin/compute_ctl --pgdata /var/db/postgres/compute \
-C "postgresql://cloud_admin@localhost:55433/postgres" \
-b /usr/local/bin/postgres \
--compute-id "compute-$RANDOM" \
-S ${SPEC_FILE}
--config "$CONFIG_FILE"

View File

@@ -0,0 +1,160 @@
{
"spec": {
"format_version": 1.0,
"timestamp": "2022-10-12T18:00:00.000Z",
"operation_uuid": "0f657b36-4b0f-4a2d-9c2e-1dcd615e7d8c",
"cluster": {
"cluster_id": "docker_compose",
"name": "docker_compose_test",
"state": "restarted",
"roles": [
{
"name": "cloud_admin",
"encrypted_password": "b093c0d3b281ba6da1eacc608620abd8",
"options": null
}
],
"databases": [
],
"settings": [
{
"name": "fsync",
"value": "off",
"vartype": "bool"
},
{
"name": "wal_level",
"value": "logical",
"vartype": "enum"
},
{
"name": "wal_log_hints",
"value": "on",
"vartype": "bool"
},
{
"name": "log_connections",
"value": "on",
"vartype": "bool"
},
{
"name": "port",
"value": "55433",
"vartype": "integer"
},
{
"name": "shared_buffers",
"value": "1MB",
"vartype": "string"
},
{
"name": "max_connections",
"value": "100",
"vartype": "integer"
},
{
"name": "listen_addresses",
"value": "0.0.0.0",
"vartype": "string"
},
{
"name": "max_wal_senders",
"value": "10",
"vartype": "integer"
},
{
"name": "max_replication_slots",
"value": "10",
"vartype": "integer"
},
{
"name": "wal_sender_timeout",
"value": "5s",
"vartype": "string"
},
{
"name": "wal_keep_size",
"value": "0",
"vartype": "integer"
},
{
"name": "password_encryption",
"value": "md5",
"vartype": "enum"
},
{
"name": "restart_after_crash",
"value": "off",
"vartype": "bool"
},
{
"name": "synchronous_standby_names",
"value": "walproposer",
"vartype": "string"
},
{
"name": "shared_preload_libraries",
"value": "neon,pg_cron,timescaledb,pg_stat_statements",
"vartype": "string"
},
{
"name": "neon.safekeepers",
"value": "safekeeper1:5454,safekeeper2:5454,safekeeper3:5454",
"vartype": "string"
},
{
"name": "neon.timeline_id",
"value": "TIMELINE_ID",
"vartype": "string"
},
{
"name": "neon.tenant_id",
"value": "TENANT_ID",
"vartype": "string"
},
{
"name": "neon.pageserver_connstring",
"value": "host=pageserver port=6400",
"vartype": "string"
},
{
"name": "max_replication_write_lag",
"value": "500MB",
"vartype": "string"
},
{
"name": "max_replication_flush_lag",
"value": "10GB",
"vartype": "string"
},
{
"name": "cron.database",
"value": "postgres",
"vartype": "string"
}
]
},
"delta_operations": [
]
},
"compute_ctl_config": {
"jwks": {
"keys": [
{
"use": "sig",
"key_ops": [
"verify"
],
"alg": "EdDSA",
"kid": "ZGIxMzAzOGY0YWQwODk2ODU1MTk1NzMxMDFkYmUyOWU2NzZkOWNjNjMyMGRkZGJjOWY0MjdjYWVmNzE1MjUyOAo=",
"kty": "OKP",
"crv": "Ed25519",
"x": "MGQ4ZDFhOTdmNTM0NmUwMDc3ZmJkN2Q0MWE0ZmI3M2NhNWE3YjFjOTNkM2IyYzRkZTQzOGM3MjBkZTk3N2E5ZAo="
}
]
}
}
}

View File

@@ -1,141 +0,0 @@
{
"format_version": 1.0,
"timestamp": "2022-10-12T18:00:00.000Z",
"operation_uuid": "0f657b36-4b0f-4a2d-9c2e-1dcd615e7d8c",
"cluster": {
"cluster_id": "docker_compose",
"name": "docker_compose_test",
"state": "restarted",
"roles": [
{
"name": "cloud_admin",
"encrypted_password": "b093c0d3b281ba6da1eacc608620abd8",
"options": null
}
],
"databases": [
],
"settings": [
{
"name": "fsync",
"value": "off",
"vartype": "bool"
},
{
"name": "wal_level",
"value": "logical",
"vartype": "enum"
},
{
"name": "wal_log_hints",
"value": "on",
"vartype": "bool"
},
{
"name": "log_connections",
"value": "on",
"vartype": "bool"
},
{
"name": "port",
"value": "55433",
"vartype": "integer"
},
{
"name": "shared_buffers",
"value": "1MB",
"vartype": "string"
},
{
"name": "max_connections",
"value": "100",
"vartype": "integer"
},
{
"name": "listen_addresses",
"value": "0.0.0.0",
"vartype": "string"
},
{
"name": "max_wal_senders",
"value": "10",
"vartype": "integer"
},
{
"name": "max_replication_slots",
"value": "10",
"vartype": "integer"
},
{
"name": "wal_sender_timeout",
"value": "5s",
"vartype": "string"
},
{
"name": "wal_keep_size",
"value": "0",
"vartype": "integer"
},
{
"name": "password_encryption",
"value": "md5",
"vartype": "enum"
},
{
"name": "restart_after_crash",
"value": "off",
"vartype": "bool"
},
{
"name": "synchronous_standby_names",
"value": "walproposer",
"vartype": "string"
},
{
"name": "shared_preload_libraries",
"value": "neon,pg_cron,timescaledb,pg_stat_statements",
"vartype": "string"
},
{
"name": "neon.safekeepers",
"value": "safekeeper1:5454,safekeeper2:5454,safekeeper3:5454",
"vartype": "string"
},
{
"name": "neon.timeline_id",
"value": "TIMELINE_ID",
"vartype": "string"
},
{
"name": "neon.tenant_id",
"value": "TENANT_ID",
"vartype": "string"
},
{
"name": "neon.pageserver_connstring",
"value": "host=pageserver port=6400",
"vartype": "string"
},
{
"name": "max_replication_write_lag",
"value": "500MB",
"vartype": "string"
},
{
"name": "max_replication_flush_lag",
"value": "10GB",
"vartype": "string"
},
{
"name": "cron.database",
"value": "postgres",
"vartype": "string"
}
]
},
"delta_operations": [
]
}

View File

@@ -159,7 +159,7 @@ services:
#- RUST_BACKTRACE=1
# Mount the test files directly, for faster editing cycle.
volumes:
- ./compute_wrapper/var/db/postgres/specs/:/var/db/postgres/specs/
- ./compute_wrapper/var/db/postgres/configs/:/var/db/postgres/configs/
- ./compute_wrapper/shell/:/shell/
ports:
- 55433:55433 # pg protocol handler

View File

@@ -0,0 +1,8 @@
EXTENSION = pg_jsonschema
DATA = pg_jsonschema--1.0.sql
REGRESS = jsonschema_valid_api jsonschema_edge_cases
REGRESS_OPTS = --load-extension=pg_jsonschema
PG_CONFIG ?= pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

View File

@@ -0,0 +1,87 @@
-- Schema with enums, nulls, extra properties disallowed
SELECT jsonschema_is_valid('{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json);
jsonschema_is_valid
---------------------
t
(1 row)
-- Valid enum and null email
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "active", "email": null}'::json
);
jsonschema_validation_errors
------------------------------
{}
(1 row)
-- Invalid enum value
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "disabled", "email": null}'::json
);
jsonschema_validation_errors
----------------------------------------------------------------------
{"\"disabled\" is not one of [\"active\",\"inactive\",\"pending\"]"}
(1 row)
-- Invalid email format (assuming format is validated)
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "active", "email": "not-an-email"}'::json
);
jsonschema_validation_errors
-----------------------------------------
{"\"not-an-email\" is not a \"email\""}
(1 row)
-- Extra property not allowed
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "active", "extra": "should not be here"}'::json
);
jsonschema_validation_errors
--------------------------------------------------------------------
{"Additional properties are not allowed ('extra' was unexpected)"}
(1 row)

View File

@@ -0,0 +1,65 @@
-- Define schema
SELECT jsonschema_is_valid('{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json);
jsonschema_is_valid
---------------------
t
(1 row)
-- Valid instance
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json,
'{"username": "alice", "age": 25}'::json
);
jsonschema_validation_errors
------------------------------
{}
(1 row)
-- Invalid instance: missing required "username"
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json,
'{"age": 25}'::json
);
jsonschema_validation_errors
-----------------------------------------
{"\"username\" is a required property"}
(1 row)
-- Invalid instance: wrong type for "age"
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json,
'{"username": "bob", "age": "twenty"}'::json
);
jsonschema_validation_errors
-------------------------------------------
{"\"twenty\" is not of type \"integer\""}
(1 row)

View File

@@ -0,0 +1,66 @@
-- Schema with enums, nulls, extra properties disallowed
SELECT jsonschema_is_valid('{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json);
-- Valid enum and null email
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "active", "email": null}'::json
);
-- Invalid enum value
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "disabled", "email": null}'::json
);
-- Invalid email format (assuming format is validated)
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "active", "email": "not-an-email"}'::json
);
-- Extra property not allowed
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"status": { "type": "string", "enum": ["active", "inactive", "pending"] },
"email": { "type": ["string", "null"], "format": "email" }
},
"required": ["status"],
"additionalProperties": false
}'::json,
'{"status": "active", "extra": "should not be here"}'::json
);

View File

@@ -0,0 +1,48 @@
-- Define schema
SELECT jsonschema_is_valid('{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json);
-- Valid instance
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json,
'{"username": "alice", "age": 25}'::json
);
-- Invalid instance: missing required "username"
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json,
'{"age": 25}'::json
);
-- Invalid instance: wrong type for "age"
SELECT jsonschema_validation_errors(
'{
"type": "object",
"properties": {
"username": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["username"]
}'::json,
'{"username": "bob", "age": "twenty"}'::json
);

View File

@@ -0,0 +1,9 @@
EXTENSION = pg_session_jwt
REGRESS = basic_functions
REGRESS_OPTS = --load-extension=$(EXTENSION)
export PGOPTIONS = -c pg_session_jwt.jwk={"crv":"Ed25519","kty":"OKP","x":"R_Abz-63zJ00l-IraL5fQhwkhGVZCSooQFV5ntC3C7M"}
PG_CONFIG ?= pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

View File

@@ -0,0 +1,35 @@
-- Basic functionality tests for pg_session_jwt
-- Test auth.init() function
SELECT auth.init();
init
------
(1 row)
-- Test an invalid JWT
SELECT auth.jwt_session_init('INVALID-JWT');
ERROR: invalid JWT encoding
-- Test creating a session with an expired JWT
SELECT auth.jwt_session_init('eyJhbGciOiJFZERTQSJ9.eyJleHAiOjE3NDI1NjQ0MzIsImlhdCI6MTc0MjU2NDI1MiwianRpIjo0MjQyNDIsInN1YiI6InVzZXIxMjMifQ.A6FwKuaSduHB9O7Gz37g0uoD_U9qVS0JNtT7YABGVgB7HUD1AMFc9DeyhNntWBqncg8k5brv-hrNTuUh5JYMAw');
ERROR: Token used after it has expired
-- Test creating a session with a valid JWT
SELECT auth.jwt_session_init('eyJhbGciOiJFZERTQSJ9.eyJleHAiOjQ4OTYxNjQyNTIsImlhdCI6MTc0MjU2NDI1MiwianRpIjo0MzQzNDMsInN1YiI6InVzZXIxMjMifQ.2TXVgjb6JSUq6_adlvp-m_SdOxZSyGS30RS9TLB0xu2N83dMSs2NybwE1NMU8Fb0tcAZR_ET7M2rSxbTrphfCg');
jwt_session_init
------------------
(1 row)
-- Test auth.session() function
SELECT auth.session();
session
-------------------------------------------------------------------------
{"exp": 4896164252, "iat": 1742564252, "jti": 434343, "sub": "user123"}
(1 row)
-- Test auth.user_id() function
SELECT auth.user_id() AS user_id;
user_id
---------
user123
(1 row)

View File

@@ -0,0 +1,19 @@
-- Basic functionality tests for pg_session_jwt
-- Test auth.init() function
SELECT auth.init();
-- Test an invalid JWT
SELECT auth.jwt_session_init('INVALID-JWT');
-- Test creating a session with an expired JWT
SELECT auth.jwt_session_init('eyJhbGciOiJFZERTQSJ9.eyJleHAiOjE3NDI1NjQ0MzIsImlhdCI6MTc0MjU2NDI1MiwianRpIjo0MjQyNDIsInN1YiI6InVzZXIxMjMifQ.A6FwKuaSduHB9O7Gz37g0uoD_U9qVS0JNtT7YABGVgB7HUD1AMFc9DeyhNntWBqncg8k5brv-hrNTuUh5JYMAw');
-- Test creating a session with a valid JWT
SELECT auth.jwt_session_init('eyJhbGciOiJFZERTQSJ9.eyJleHAiOjQ4OTYxNjQyNTIsImlhdCI6MTc0MjU2NDI1MiwianRpIjo0MzQzNDMsInN1YiI6InVzZXIxMjMifQ.2TXVgjb6JSUq6_adlvp-m_SdOxZSyGS30RS9TLB0xu2N83dMSs2NybwE1NMU8Fb0tcAZR_ET7M2rSxbTrphfCg');
-- Test auth.session() function
SELECT auth.session();
-- Test auth.user_id() function
SELECT auth.user_id() AS user_id;

View File

@@ -0,0 +1,53 @@
# Architecture Naming Scheme
## Summary
Neon computes are going to support multiple CPU architectures, including 64-bit
x86 and ARM. Architecture naming schemes are fairly inconsistent, at least when
it comes to these two architectures in particular. Sometimes 64-bit x86 is known
as `amd64` and at other times `x86_64`. For 64-bit ARM, it's a similar story
with `arm64` and `aarch64`.
## Motivation
Consistency when referring to these architectures across the platform in
important not only so that everyone is on the same page, but also because we
have architecture-dependent portions of the platform. One notable example is
remote extensions. Remote extensions need to be compiled and packaged into
compressed tarballs that are then downloaded by computes on demand. Downloading
a tarball for the wrong architecture will cause `dlopen(3)` to fail when
Postgres attempts to load the library. The compressed tarballs are located in an
S3 bucket with object keys of the form
`$BUILD_TAG/$ARCH/$PG_VERSION_NUM/${EXTENSION_NAME}.tar.zst` to mitigate the
potential for failure. The build and deployment pipeline for remote extensions
needs to be in lock step with the code in the compute that actually fetches
remote extensions.
## Impacted Components
- Control Plane when persisting the target architecture in compute flags if
specified.
- Austocaling when scheduling compute pods.
- `compute_ctl` when downloading remote extensions.
- Build and deployment pipeline for remote extensions.
## Prior Art for CPU Architecture Names
- [Rust](https://doc.rust-lang.org/std/env/consts/constant.ARCH.html)
- `x86_64`
- `aarch64`
- [Go](https://pkg.go.dev/internal/goarch#pkg-constants)
- `amd64`
- `arm64`
- Kubernetes
- Because Kubernetes is written in Go, it inherits the naming scheme.
Going all in on either the Rust naming scheme or the Go naming scheme can save
us a branch when getting the architecture, so we should pick one and force the
other side to conform.
## Decision
The heart of our platform is Kubernetes. We will inherit the Go naming
scheme just like Kubernetes. This makes things easy for the autoscaling team
when they do pod scheduling on nodes.

View File

@@ -1,5 +1,5 @@
[package]
name = "object_storage"
name = "endpoint_storage"
version = "0.0.1"
edition.workspace = true
license.workspace = true

View File

@@ -2,7 +2,7 @@ use anyhow::anyhow;
use axum::body::{Body, Bytes};
use axum::response::{IntoResponse, Response};
use axum::{Router, http::StatusCode};
use object_storage::{PrefixS3Path, S3Path, Storage, bad_request, internal_error, not_found, ok};
use endpoint_storage::{PrefixS3Path, S3Path, Storage, bad_request, internal_error, not_found, ok};
use remote_storage::TimeoutOrCancel;
use remote_storage::{DownloadError, DownloadOpts, GenericRemoteStorage, RemotePath};
use std::{sync::Arc, time::SystemTime, time::UNIX_EPOCH};
@@ -46,12 +46,12 @@ async fn metrics() -> Result {
async fn get(S3Path { path }: S3Path, state: State) -> Result {
info!(%path, "downloading");
let download_err = |e| {
if let DownloadError::NotFound = e {
info!(%path, %e, "downloading"); // 404 is not an issue of _this_ service
let download_err = |err| {
if let DownloadError::NotFound = err {
info!(%path, %err, "downloading"); // 404 is not an issue of _this_ service
return not_found(&path);
}
internal_error(e, &path, "downloading")
internal_error(err, &path, "downloading")
};
let cancel = state.cancel.clone();
let opts = &DownloadOpts::default();
@@ -249,7 +249,7 @@ mod tests {
};
let proxy = Storage {
auth: object_storage::JwtAuth::new(TEST_PUB_KEY_ED25519).unwrap(),
auth: endpoint_storage::JwtAuth::new(TEST_PUB_KEY_ED25519).unwrap(),
storage,
cancel: cancel.clone(),
max_upload_file_limit: usize::MAX,
@@ -343,14 +343,14 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
TimelineId::from_array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 7]);
const ENDPOINT_ID: &str = "ep-winter-frost-a662z3vg";
fn token() -> String {
let claims = object_storage::Claims {
let claims = endpoint_storage::Claims {
tenant_id: TENANT_ID,
timeline_id: TIMELINE_ID,
endpoint_id: ENDPOINT_ID.into(),
exp: u64::MAX,
};
let key = jsonwebtoken::EncodingKey::from_ed_pem(TEST_PRIV_KEY_ED25519).unwrap();
let header = jsonwebtoken::Header::new(object_storage::VALIDATION_ALGO);
let header = jsonwebtoken::Header::new(endpoint_storage::VALIDATION_ALGO);
jsonwebtoken::encode(&header, &claims, &key).unwrap()
}
@@ -364,7 +364,10 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
vec![TIMELINE_ID.to_string(), TimelineId::generate().to_string()],
vec![ENDPOINT_ID, "ep-ololo"]
)
.skip(1);
// first one is fully valid path, second path is valid for GET as
// read paths may have different endpoint if tenant and timeline matches
// (needed for prewarming RO->RW replica)
.skip(2);
for ((uri, method), (tenant, timeline, endpoint)) in iproduct!(routes(), args) {
info!(%uri, %method, %tenant, %timeline, %endpoint);
@@ -475,6 +478,16 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
requests_chain(chain.into_iter(), |_| token()).await;
}
#[testlog(tokio::test)]
async fn read_other_endpoint_data() {
let uri = format!("/{TENANT_ID}/{TIMELINE_ID}/other_endpoint/key");
let chain = vec![
(uri.clone(), "GET", "", StatusCode::NOT_FOUND, false),
(uri.clone(), "PUT", "", StatusCode::UNAUTHORIZED, false),
];
requests_chain(chain.into_iter(), |_| token()).await;
}
fn delete_prefix_token(uri: &str) -> String {
use serde::Serialize;
let parts = uri.split("/").collect::<Vec<&str>>();
@@ -482,7 +495,7 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
struct PrefixClaims {
tenant_id: TenantId,
timeline_id: Option<TimelineId>,
endpoint_id: Option<object_storage::EndpointId>,
endpoint_id: Option<endpoint_storage::EndpointId>,
exp: u64,
}
let claims = PrefixClaims {
@@ -492,7 +505,7 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
exp: u64::MAX,
};
let key = jsonwebtoken::EncodingKey::from_ed_pem(TEST_PRIV_KEY_ED25519).unwrap();
let header = jsonwebtoken::Header::new(object_storage::VALIDATION_ALGO);
let header = jsonwebtoken::Header::new(endpoint_storage::VALIDATION_ALGO);
jsonwebtoken::encode(&header, &claims, &key).unwrap()
}

View File

@@ -169,10 +169,19 @@ impl FromRequestParts<Arc<Storage>> for S3Path {
.auth
.decode(bearer.token())
.map_err(|e| bad_request(e, "decoding token"))?;
// Read paths may have different endpoint ids. For readonly -> readwrite replica
// prewarming, endpoint must read other endpoint's data.
let endpoint_id = if parts.method == axum::http::Method::GET {
claims.endpoint_id.clone()
} else {
path.endpoint_id.clone()
};
let route = Claims {
tenant_id: path.tenant_id,
timeline_id: path.timeline_id,
endpoint_id: path.endpoint_id.clone(),
endpoint_id,
exp: claims.exp,
};
if route != claims {

View File

@@ -1,4 +1,4 @@
//! `object_storage` is a service which provides API for uploading and downloading
//! `endpoint_storage` is a service which provides API for uploading and downloading
//! files. It is used by compute and control plane for accessing LFC prewarm data.
//! This service is deployed either as a separate component or as part of compute image
//! for large computes.
@@ -33,7 +33,7 @@ async fn main() -> anyhow::Result<()> {
let config: String = std::env::args().skip(1).take(1).collect();
if config.is_empty() {
anyhow::bail!("Usage: object_storage config.json")
anyhow::bail!("Usage: endpoint_storage config.json")
}
info!("Reading config from {config}");
let config = std::fs::read_to_string(config.clone())?;
@@ -41,7 +41,7 @@ async fn main() -> anyhow::Result<()> {
info!("Reading pemfile from {}", config.pemfile.clone());
let pemfile = std::fs::read(config.pemfile.clone())?;
info!("Loading public key from {}", config.pemfile.clone());
let auth = object_storage::JwtAuth::new(&pemfile)?;
let auth = endpoint_storage::JwtAuth::new(&pemfile)?;
let listener = tokio::net::TcpListener::bind(config.listen).await.unwrap();
info!("listening on {}", listener.local_addr().unwrap());
@@ -50,7 +50,7 @@ async fn main() -> anyhow::Result<()> {
let cancel = tokio_util::sync::CancellationToken::new();
app::check_storage_permissions(&storage, cancel.clone()).await?;
let proxy = std::sync::Arc::new(object_storage::Storage {
let proxy = std::sync::Arc::new(endpoint_storage::Storage {
auth,
storage,
cancel: cancel.clone(),

View File

@@ -14,6 +14,32 @@ pub struct GenericAPIError {
pub error: String,
}
/// All configuration parameters necessary for a compute. When
/// [`ComputeConfig::spec`] is provided, it means that the compute is attached
/// to a tenant. [`ComputeConfig::compute_ctl_config`] will always be provided
/// and contains parameters necessary for operating `compute_ctl` independently
/// of whether a tenant is attached to the compute or not.
///
/// This also happens to be the body of `compute_ctl`'s /configure request.
#[derive(Debug, Deserialize, Serialize)]
pub struct ComputeConfig {
/// The compute spec
pub spec: Option<ComputeSpec>,
/// The compute_ctl configuration
#[allow(dead_code)]
pub compute_ctl_config: ComputeCtlConfig,
}
impl From<ControlPlaneConfigResponse> for ComputeConfig {
fn from(value: ControlPlaneConfigResponse) -> Self {
Self {
spec: value.spec,
compute_ctl_config: value.compute_ctl_config,
}
}
}
#[derive(Debug, Clone, Serialize)]
pub struct ExtensionInstallResponse {
pub extension: PgIdent,
@@ -134,7 +160,7 @@ pub struct CatalogObjects {
pub databases: Vec<Database>,
}
#[derive(Clone, Debug, Deserialize, Serialize)]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct ComputeCtlConfig {
/// Set of JSON web keys that the compute can use to authenticate
/// communication from the control plane.
@@ -153,7 +179,7 @@ impl Default for ComputeCtlConfig {
}
}
#[derive(Clone, Debug, Deserialize, Serialize)]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct TlsConfig {
pub key_path: String,
pub cert_path: String,
@@ -161,7 +187,7 @@ pub struct TlsConfig {
/// Response of the `/computes/{compute_id}/spec` control-plane API.
#[derive(Deserialize, Debug)]
pub struct ControlPlaneSpecResponse {
pub struct ControlPlaneConfigResponse {
pub spec: Option<ComputeSpec>,
pub status: ControlPlaneComputeStatus,
pub compute_ctl_config: ComputeCtlConfig,

View File

@@ -1,8 +1,8 @@
//! `ComputeSpec` represents the contents of the spec.json file.
//!
//! The spec.json file is used to pass information to 'compute_ctl'. It contains
//! all the information needed to start up the right version of PostgreSQL,
//! and connect it to the storage nodes.
//! The ComputeSpec contains all the information needed to start up
//! the right version of PostgreSQL, and connect it to the storage nodes.
//! It can be passed as part of the `config.json`, or the control plane can
//! provide it by calling the compute_ctl's `/compute_ctl` endpoint, or
//! compute_ctl can fetch it by calling the control plane's API.
use std::collections::HashMap;
use indexmap::IndexMap;
@@ -242,13 +242,22 @@ impl RemoteExtSpec {
match self.extension_data.get(real_ext_name) {
Some(_ext_data) => {
// We have decided to use the Go naming convention due to Kubernetes.
let arch = match std::env::consts::ARCH {
"x86_64" => "amd64",
"aarch64" => "arm64",
arch => arch,
};
// Construct the path to the extension archive
// BUILD_TAG/PG_MAJOR_VERSION/extensions/EXTENSION_NAME.tar.zst
//
// Keep it in sync with path generation in
// https://github.com/neondatabase/build-custom-extensions/tree/main
let archive_path_str =
format!("{build_tag}/{pg_major_version}/extensions/{real_ext_name}.tar.zst");
let archive_path_str = format!(
"{build_tag}/{arch}/{pg_major_version}/extensions/{real_ext_name}.tar.zst"
);
Ok((
real_ext_name.to_string(),
RemotePath::from_string(&archive_path_str)?,

View File

@@ -14,6 +14,7 @@ futures.workspace = true
hyper0.workspace = true
itertools.workspace = true
jemalloc_pprof.workspace = true
jsonwebtoken.workspace = true
once_cell.workspace = true
pprof.workspace = true
regex.workspace = true

View File

@@ -8,6 +8,7 @@ use bytes::{Bytes, BytesMut};
use hyper::header::{AUTHORIZATION, CONTENT_DISPOSITION, CONTENT_TYPE, HeaderName};
use hyper::http::HeaderValue;
use hyper::{Body, Method, Request, Response};
use jsonwebtoken::TokenData;
use metrics::{Encoder, IntCounter, TextEncoder, register_int_counter};
use once_cell::sync::Lazy;
use pprof::ProfilerGuardBuilder;
@@ -618,7 +619,7 @@ pub fn auth_middleware<B: hyper::body::HttpBody + Send + Sync + 'static>(
})?;
let token = parse_token(header_value)?;
let data = auth.decode(token).map_err(|err| {
let data: TokenData<Claims> = auth.decode(token).map_err(|err| {
warn!("Authentication error: {err}");
// Rely on From<AuthError> for ApiError impl
err

View File

@@ -35,6 +35,7 @@ nix = {workspace = true, optional = true}
reqwest.workspace = true
rand.workspace = true
tracing-utils.workspace = true
once_cell.workspace = true
[dev-dependencies]
bincode.workspace = true

View File

@@ -207,6 +207,10 @@ pub struct PageServicePipeliningConfigPipelined {
/// Causes runtime errors if larger than max get_vectored batch size.
pub max_batch_size: NonZeroUsize,
pub execution: PageServiceProtocolPipelinedExecutionStrategy,
// The default below is such that new versions of the software can start
// with the old configuration.
#[serde(default)]
pub batching: PageServiceProtocolPipelinedBatchingStrategy,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
@@ -216,6 +220,19 @@ pub enum PageServiceProtocolPipelinedExecutionStrategy {
Tasks,
}
#[derive(Default, Debug, Clone, Copy, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
#[serde(rename_all = "kebab-case")]
pub enum PageServiceProtocolPipelinedBatchingStrategy {
/// All get page requests in a batch will be at the same LSN
#[default]
UniformLsn,
/// Get page requests in a batch may be at different LSN
///
/// One key cannot be present more than once at different LSNs in
/// the same batch.
ScatteredLsn,
}
#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
#[serde(tag = "mode", rename_all = "kebab-case")]
pub enum GetVectoredConcurrentIo {
@@ -362,6 +379,8 @@ pub struct TenantConfigToml {
/// size exceeds `compaction_upper_limit * checkpoint_distance`.
pub compaction_upper_limit: usize,
pub compaction_algorithm: crate::models::CompactionAlgorithmSettings,
/// If true, enable shard ancestor compaction (enabled by default).
pub compaction_shard_ancestor: bool,
/// If true, compact down L0 across all tenant timelines before doing regular compaction. L0
/// compaction must be responsive to avoid read amp during heavy ingestion. Defaults to true.
pub compaction_l0_first: bool,
@@ -452,6 +471,8 @@ pub struct TenantConfigToml {
// gc-compaction related configs
/// Enable automatic gc-compaction trigger on this tenant.
pub gc_compaction_enabled: bool,
/// Enable verification of gc-compaction results.
pub gc_compaction_verification: bool,
/// The initial threshold for gc-compaction in KB. Once the total size of layers below the gc-horizon is above this threshold,
/// gc-compaction will be triggered.
pub gc_compaction_initial_threshold_kb: u64,
@@ -613,9 +634,12 @@ impl Default for ConfigToml {
page_service_pipelining: if !cfg!(test) {
PageServicePipeliningConfig::Serial
} else {
// Do not turn this into the default until scattered reads have been
// validated and rolled-out fully.
PageServicePipeliningConfig::Pipelined(PageServicePipeliningConfigPipelined {
max_batch_size: NonZeroUsize::new(32).unwrap(),
execution: PageServiceProtocolPipelinedExecutionStrategy::ConcurrentFutures,
batching: PageServiceProtocolPipelinedBatchingStrategy::ScatteredLsn,
})
},
get_vectored_concurrent_io: if !cfg!(test) {
@@ -655,12 +679,13 @@ pub mod tenant_conf_defaults {
pub const DEFAULT_COMPACTION_PERIOD: &str = "20 s";
pub const DEFAULT_COMPACTION_THRESHOLD: usize = 10;
pub const DEFAULT_COMPACTION_SHARD_ANCESTOR: bool = true;
// This value needs to be tuned to avoid OOM. We have 3/4*CPUs threads for L0 compaction, that's
// 3/4*16=9 on most of our pageservers. Compacting 20 layers requires about 1 GB memory (could
// be reduced later by optimizing L0 hole calculation to avoid loading all keys into memory). So
// with this config, we can get a maximum peak compaction usage of 9 GB.
pub const DEFAULT_COMPACTION_UPPER_LIMIT: usize = 20;
// 3/4*8=6 on most of our pageservers. Compacting 10 layers requires a maximum of
// DEFAULT_CHECKPOINT_DISTANCE*10 memory, that's 2560MB. So with this config, we can get a maximum peak
// compaction usage of 15360MB.
pub const DEFAULT_COMPACTION_UPPER_LIMIT: usize = 10;
// Enable L0 compaction pass and semaphore by default. L0 compaction must be responsive to avoid
// read amp.
pub const DEFAULT_COMPACTION_L0_FIRST: bool = true;
@@ -677,8 +702,11 @@ pub mod tenant_conf_defaults {
// Relevant: https://github.com/neondatabase/neon/issues/3394
pub const DEFAULT_GC_PERIOD: &str = "1 hr";
pub const DEFAULT_IMAGE_CREATION_THRESHOLD: usize = 3;
// If there are more than threshold * compaction_threshold (that is 3 * 10 in the default config) L0 layers, image
// layer creation will end immediately. Set to 0 to disable.
// Currently, any value other than 0 will trigger image layer creation preemption immediately with L0 backpressure
// without looking at the exact number of L0 layers.
// It was expected to have the following behavior:
// > If there are more than threshold * compaction_threshold (that is 3 * 10 in the default config) L0 layers, image
// > layer creation will end immediately. Set to 0 to disable.
pub const DEFAULT_IMAGE_CREATION_PREEMPT_THRESHOLD: usize = 3;
pub const DEFAULT_PITR_INTERVAL: &str = "7 days";
pub const DEFAULT_WALRECEIVER_CONNECT_TIMEOUT: &str = "10 seconds";
@@ -692,6 +720,7 @@ pub mod tenant_conf_defaults {
// image layers should be created.
pub const DEFAULT_IMAGE_LAYER_CREATION_CHECK_THRESHOLD: u8 = 2;
pub const DEFAULT_GC_COMPACTION_ENABLED: bool = false;
pub const DEFAULT_GC_COMPACTION_VERIFICATION: bool = true;
pub const DEFAULT_GC_COMPACTION_INITIAL_THRESHOLD_KB: u64 = 5 * 1024 * 1024; // 5GB
pub const DEFAULT_GC_COMPACTION_RATIO_PERCENT: u64 = 100;
}
@@ -711,6 +740,7 @@ impl Default for TenantConfigToml {
compaction_algorithm: crate::models::CompactionAlgorithmSettings {
kind: DEFAULT_COMPACTION_ALGORITHM,
},
compaction_shard_ancestor: DEFAULT_COMPACTION_SHARD_ANCESTOR,
compaction_l0_first: DEFAULT_COMPACTION_L0_FIRST,
compaction_l0_semaphore: DEFAULT_COMPACTION_L0_SEMAPHORE,
l0_flush_delay_threshold: None,
@@ -746,6 +776,7 @@ impl Default for TenantConfigToml {
wal_receiver_protocol_override: None,
rel_size_v2_enabled: false,
gc_compaction_enabled: DEFAULT_GC_COMPACTION_ENABLED,
gc_compaction_verification: DEFAULT_GC_COMPACTION_VERIFICATION,
gc_compaction_initial_threshold_kb: DEFAULT_GC_COMPACTION_INITIAL_THRESHOLD_KB,
gc_compaction_ratio_percent: DEFAULT_GC_COMPACTION_RATIO_PERCENT,
sampling_ratio: None,

View File

@@ -526,6 +526,8 @@ pub struct TenantConfigPatch {
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub compaction_algorithm: FieldPatch<CompactionAlgorithmSettings>,
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub compaction_shard_ancestor: FieldPatch<bool>,
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub compaction_l0_first: FieldPatch<bool>,
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub compaction_l0_semaphore: FieldPatch<bool>,
@@ -576,6 +578,8 @@ pub struct TenantConfigPatch {
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub gc_compaction_enabled: FieldPatch<bool>,
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub gc_compaction_verification: FieldPatch<bool>,
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub gc_compaction_initial_threshold_kb: FieldPatch<u64>,
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub gc_compaction_ratio_percent: FieldPatch<u64>,
@@ -613,6 +617,9 @@ pub struct TenantConfig {
#[serde(skip_serializing_if = "Option::is_none")]
pub compaction_algorithm: Option<CompactionAlgorithmSettings>,
#[serde(skip_serializing_if = "Option::is_none")]
pub compaction_shard_ancestor: Option<bool>,
#[serde(skip_serializing_if = "Option::is_none")]
pub compaction_l0_first: Option<bool>,
@@ -696,6 +703,9 @@ pub struct TenantConfig {
#[serde(skip_serializing_if = "Option::is_none")]
pub gc_compaction_enabled: Option<bool>,
#[serde(skip_serializing_if = "Option::is_none")]
pub gc_compaction_verification: Option<bool>,
#[serde(skip_serializing_if = "Option::is_none")]
pub gc_compaction_initial_threshold_kb: Option<u64>,
@@ -719,6 +729,7 @@ impl TenantConfig {
mut compaction_threshold,
mut compaction_upper_limit,
mut compaction_algorithm,
mut compaction_shard_ancestor,
mut compaction_l0_first,
mut compaction_l0_semaphore,
mut l0_flush_delay_threshold,
@@ -744,6 +755,7 @@ impl TenantConfig {
mut wal_receiver_protocol_override,
mut rel_size_v2_enabled,
mut gc_compaction_enabled,
mut gc_compaction_verification,
mut gc_compaction_initial_threshold_kb,
mut gc_compaction_ratio_percent,
mut sampling_ratio,
@@ -766,6 +778,9 @@ impl TenantConfig {
.compaction_upper_limit
.apply(&mut compaction_upper_limit);
patch.compaction_algorithm.apply(&mut compaction_algorithm);
patch
.compaction_shard_ancestor
.apply(&mut compaction_shard_ancestor);
patch.compaction_l0_first.apply(&mut compaction_l0_first);
patch
.compaction_l0_semaphore
@@ -835,6 +850,9 @@ impl TenantConfig {
patch
.gc_compaction_enabled
.apply(&mut gc_compaction_enabled);
patch
.gc_compaction_verification
.apply(&mut gc_compaction_verification);
patch
.gc_compaction_initial_threshold_kb
.apply(&mut gc_compaction_initial_threshold_kb);
@@ -851,6 +869,7 @@ impl TenantConfig {
compaction_threshold,
compaction_upper_limit,
compaction_algorithm,
compaction_shard_ancestor,
compaction_l0_first,
compaction_l0_semaphore,
l0_flush_delay_threshold,
@@ -876,6 +895,7 @@ impl TenantConfig {
wal_receiver_protocol_override,
rel_size_v2_enabled,
gc_compaction_enabled,
gc_compaction_verification,
gc_compaction_initial_threshold_kb,
gc_compaction_ratio_percent,
sampling_ratio,
@@ -910,6 +930,9 @@ impl TenantConfig {
.as_ref()
.unwrap_or(&global_conf.compaction_algorithm)
.clone(),
compaction_shard_ancestor: self
.compaction_shard_ancestor
.unwrap_or(global_conf.compaction_shard_ancestor),
compaction_l0_first: self
.compaction_l0_first
.unwrap_or(global_conf.compaction_l0_first),
@@ -974,6 +997,9 @@ impl TenantConfig {
gc_compaction_enabled: self
.gc_compaction_enabled
.unwrap_or(global_conf.gc_compaction_enabled),
gc_compaction_verification: self
.gc_compaction_verification
.unwrap_or(global_conf.gc_compaction_verification),
gc_compaction_initial_threshold_kb: self
.gc_compaction_initial_threshold_kb
.unwrap_or(global_conf.gc_compaction_initial_threshold_kb),
@@ -1791,8 +1817,34 @@ pub mod virtual_file {
}
impl IoMode {
pub const fn preferred() -> Self {
Self::Buffered
pub fn preferred() -> Self {
// The default behavior when running Rust unit tests without any further
// flags is to use the newest behavior if available on the platform (Direct).
// The CI uses the following environment variable to unit tests for all
// different modes.
// NB: the Python regression & perf tests have their own defaults management
// that writes pageserver.toml; they do not use this variable.
if cfg!(test) {
use once_cell::sync::Lazy;
static CACHED: Lazy<IoMode> = Lazy::new(|| {
utils::env::var_serde_json_string(
"NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IO_MODE",
)
.unwrap_or({
#[cfg(target_os = "linux")]
{
IoMode::Direct
}
#[cfg(not(target_os = "linux"))]
{
IoMode::Buffered
}
})
});
*CACHED
} else {
IoMode::Buffered
}
}
}

View File

@@ -29,6 +29,7 @@ futures = { workspace = true }
jsonwebtoken.workspace = true
nix = { workspace = true, features = ["ioctl"] }
once_cell.workspace = true
pem.workspace = true
pin-project-lite.workspace = true
regex.workspace = true
serde.workspace = true

View File

@@ -11,7 +11,8 @@ use camino::Utf8Path;
use jsonwebtoken::{
Algorithm, DecodingKey, EncodingKey, Header, TokenData, Validation, decode, encode,
};
use serde::{Deserialize, Serialize};
use pem::Pem;
use serde::{Deserialize, Serialize, de::DeserializeOwned};
use crate::id::TenantId;
@@ -73,7 +74,10 @@ impl SwappableJwtAuth {
pub fn swap(&self, jwt_auth: JwtAuth) {
self.0.swap(Arc::new(jwt_auth));
}
pub fn decode(&self, token: &str) -> std::result::Result<TokenData<Claims>, AuthError> {
pub fn decode<D: DeserializeOwned>(
&self,
token: &str,
) -> std::result::Result<TokenData<D>, AuthError> {
self.0.load().decode(token)
}
}
@@ -148,7 +152,10 @@ impl JwtAuth {
/// The function tries the stored decoding keys in succession,
/// and returns the first yielding a successful result.
/// If there is no working decoding key, it returns the last error.
pub fn decode(&self, token: &str) -> std::result::Result<TokenData<Claims>, AuthError> {
pub fn decode<D: DeserializeOwned>(
&self,
token: &str,
) -> std::result::Result<TokenData<D>, AuthError> {
let mut res = None;
for decoding_key in &self.decoding_keys {
res = Some(decode(token, decoding_key, &self.validation));
@@ -173,8 +180,8 @@ impl std::fmt::Debug for JwtAuth {
}
// this function is used only for testing purposes in CLI e g generate tokens during init
pub fn encode_from_key_file<S: Serialize>(claims: &S, key_data: &[u8]) -> Result<String> {
let key = EncodingKey::from_ed_pem(key_data)?;
pub fn encode_from_key_file<S: Serialize>(claims: &S, pem: &Pem) -> Result<String> {
let key = EncodingKey::from_ed_der(pem.contents());
Ok(encode(&Header::new(STORAGE_TOKEN_ALGORITHM), claims, &key)?)
}
@@ -188,13 +195,13 @@ mod tests {
//
// openssl genpkey -algorithm ed25519 -out ed25519-priv.pem
// openssl pkey -in ed25519-priv.pem -pubout -out ed25519-pub.pem
const TEST_PUB_KEY_ED25519: &[u8] = br#"
const TEST_PUB_KEY_ED25519: &str = r#"
-----BEGIN PUBLIC KEY-----
MCowBQYDK2VwAyEARYwaNBayR+eGI0iXB4s3QxE3Nl2g1iWbr6KtLWeVD/w=
-----END PUBLIC KEY-----
"#;
const TEST_PRIV_KEY_ED25519: &[u8] = br#"
const TEST_PRIV_KEY_ED25519: &str = r#"
-----BEGIN PRIVATE KEY-----
MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
-----END PRIVATE KEY-----
@@ -222,9 +229,9 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
// Check it can be validated with the public key
let auth = JwtAuth::new(vec![
DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519).unwrap(),
DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519.as_bytes()).unwrap(),
]);
let claims_from_token = auth.decode(encoded_eddsa).unwrap().claims;
let claims_from_token: Claims = auth.decode(encoded_eddsa).unwrap().claims;
assert_eq!(claims_from_token, expected_claims);
}
@@ -235,13 +242,14 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
scope: Scope::Tenant,
};
let encoded = encode_from_key_file(&claims, TEST_PRIV_KEY_ED25519).unwrap();
let pem = pem::parse(TEST_PRIV_KEY_ED25519).unwrap();
let encoded = encode_from_key_file(&claims, &pem).unwrap();
// decode it back
let auth = JwtAuth::new(vec![
DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519).unwrap(),
DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519.as_bytes()).unwrap(),
]);
let decoded = auth.decode(&encoded).unwrap();
let decoded: TokenData<Claims> = auth.decode(&encoded).unwrap();
assert_eq!(decoded.claims, claims);
}

View File

@@ -10,6 +10,8 @@ default = []
# which adds some runtime cost to run tests on outage conditions
testing = ["fail/failpoints", "pageserver_api/testing", "wal_decoder/testing", "pageserver_client/testing"]
fuzz-read-path = ["testing"]
[dependencies]
anyhow.workspace = true
arc-swap.workspace = true
@@ -33,6 +35,7 @@ humantime.workspace = true
humantime-serde.workspace = true
hyper0.workspace = true
itertools.workspace = true
jsonwebtoken.workspace = true
md5.workspace = true
nix.workspace = true
# hack to get the number of worker threads tokio uses
@@ -75,6 +78,7 @@ metrics.workspace = true
pageserver_api.workspace = true
pageserver_client.workspace = true # for ResponseErrorMessageExt TOOD refactor that
pageserver_compaction.workspace = true
pem.workspace = true
postgres_connection.workspace = true
postgres_ffi.workspace = true
pq_proto.workspace = true

View File

@@ -68,6 +68,13 @@ pub(crate) struct Args {
targets: Option<Vec<TenantTimelineId>>,
}
/// State shared by all clients
#[derive(Debug)]
struct SharedState {
start_work_barrier: tokio::sync::Barrier,
live_stats: LiveStats,
}
#[derive(Debug, Default)]
struct LiveStats {
completed_requests: AtomicU64,
@@ -240,24 +247,26 @@ async fn main_impl(
all_ranges
};
let live_stats = Arc::new(LiveStats::default());
let num_live_stats_dump = 1;
let num_work_sender_tasks = args.num_clients.get() * timelines.len();
let num_main_impl = 1;
let start_work_barrier = Arc::new(tokio::sync::Barrier::new(
num_live_stats_dump + num_work_sender_tasks + num_main_impl,
));
let shared_state = Arc::new(SharedState {
start_work_barrier: tokio::sync::Barrier::new(
num_live_stats_dump + num_work_sender_tasks + num_main_impl,
),
live_stats: LiveStats::default(),
});
let cancel = CancellationToken::new();
let ss = shared_state.clone();
tokio::spawn({
let stats = Arc::clone(&live_stats);
let start_work_barrier = Arc::clone(&start_work_barrier);
async move {
start_work_barrier.wait().await;
ss.start_work_barrier.wait().await;
loop {
let start = std::time::Instant::now();
tokio::time::sleep(std::time::Duration::from_secs(1)).await;
let stats = &ss.live_stats;
let completed_requests = stats.completed_requests.swap(0, Ordering::Relaxed);
let missed = stats.missed.swap(0, Ordering::Relaxed);
let elapsed = start.elapsed();
@@ -270,14 +279,12 @@ async fn main_impl(
}
});
let cancel = CancellationToken::new();
let rps_period = args
.per_client_rate
.map(|rps_limit| Duration::from_secs_f64(1.0 / (rps_limit as f64)));
let make_worker: &dyn Fn(WorkerId) -> Pin<Box<dyn Send + Future<Output = ()>>> = &|worker_id| {
let live_stats = live_stats.clone();
let start_work_barrier = start_work_barrier.clone();
let ss = shared_state.clone();
let cancel = cancel.clone();
let ranges: Vec<KeyRange> = all_ranges
.iter()
.filter(|r| r.timeline == worker_id.timeline)
@@ -287,85 +294,8 @@ async fn main_impl(
rand::distributions::weighted::WeightedIndex::new(ranges.iter().map(|v| v.len()))
.unwrap();
let cancel = cancel.clone();
Box::pin(async move {
let client =
pageserver_client::page_service::Client::new(args.page_service_connstring.clone())
.await
.unwrap();
let mut client = client
.pagestream(worker_id.timeline.tenant_id, worker_id.timeline.timeline_id)
.await
.unwrap();
start_work_barrier.wait().await;
let client_start = Instant::now();
let mut ticks_processed = 0;
let mut inflight = VecDeque::new();
while !cancel.is_cancelled() {
// Detect if a request took longer than the RPS rate
if let Some(period) = &rps_period {
let periods_passed_until_now =
usize::try_from(client_start.elapsed().as_micros() / period.as_micros())
.unwrap();
if periods_passed_until_now > ticks_processed {
live_stats.missed((periods_passed_until_now - ticks_processed) as u64);
}
ticks_processed = periods_passed_until_now;
}
while inflight.len() < args.queue_depth.get() {
let start = Instant::now();
let req = {
let mut rng = rand::thread_rng();
let r = &ranges[weights.sample(&mut rng)];
let key: i128 = rng.gen_range(r.start..r.end);
let key = Key::from_i128(key);
assert!(key.is_rel_block_key());
let (rel_tag, block_no) = key
.to_rel_block()
.expect("we filter non-rel-block keys out above");
PagestreamGetPageRequest {
hdr: PagestreamRequest {
reqid: 0,
request_lsn: if rng.gen_bool(args.req_latest_probability) {
Lsn::MAX
} else {
r.timeline_lsn
},
not_modified_since: r.timeline_lsn,
},
rel: rel_tag,
blkno: block_no,
}
};
client.getpage_send(req).await.unwrap();
inflight.push_back(start);
}
let start = inflight.pop_front().unwrap();
client.getpage_recv().await.unwrap();
let end = Instant::now();
live_stats.request_done();
ticks_processed += 1;
STATS.with(|stats| {
stats
.borrow()
.lock()
.unwrap()
.observe(end.duration_since(start))
.unwrap();
});
if let Some(period) = &rps_period {
let next_at = client_start
+ Duration::from_micros(
(ticks_processed) as u64 * u64::try_from(period.as_micros()).unwrap(),
);
tokio::time::sleep_until(next_at.into()).await;
}
}
client_libpq(args, worker_id, ss, cancel, rps_period, ranges, weights).await
})
};
@@ -387,7 +317,7 @@ async fn main_impl(
};
info!("waiting for everything to become ready");
start_work_barrier.wait().await;
shared_state.start_work_barrier.wait().await;
info!("work started");
if let Some(runtime) = args.runtime {
tokio::time::sleep(runtime.into()).await;
@@ -416,3 +346,91 @@ async fn main_impl(
anyhow::Ok(())
}
async fn client_libpq(
args: &Args,
worker_id: WorkerId,
shared_state: Arc<SharedState>,
cancel: CancellationToken,
rps_period: Option<Duration>,
ranges: Vec<KeyRange>,
weights: rand::distributions::weighted::WeightedIndex<i128>,
) {
let client = pageserver_client::page_service::Client::new(args.page_service_connstring.clone())
.await
.unwrap();
let mut client = client
.pagestream(worker_id.timeline.tenant_id, worker_id.timeline.timeline_id)
.await
.unwrap();
shared_state.start_work_barrier.wait().await;
let client_start = Instant::now();
let mut ticks_processed = 0;
let mut inflight = VecDeque::new();
while !cancel.is_cancelled() {
// Detect if a request took longer than the RPS rate
if let Some(period) = &rps_period {
let periods_passed_until_now =
usize::try_from(client_start.elapsed().as_micros() / period.as_micros()).unwrap();
if periods_passed_until_now > ticks_processed {
shared_state
.live_stats
.missed((periods_passed_until_now - ticks_processed) as u64);
}
ticks_processed = periods_passed_until_now;
}
while inflight.len() < args.queue_depth.get() {
let start = Instant::now();
let req = {
let mut rng = rand::thread_rng();
let r = &ranges[weights.sample(&mut rng)];
let key: i128 = rng.gen_range(r.start..r.end);
let key = Key::from_i128(key);
assert!(key.is_rel_block_key());
let (rel_tag, block_no) = key
.to_rel_block()
.expect("we filter non-rel-block keys out above");
PagestreamGetPageRequest {
hdr: PagestreamRequest {
reqid: 0,
request_lsn: if rng.gen_bool(args.req_latest_probability) {
Lsn::MAX
} else {
r.timeline_lsn
},
not_modified_since: r.timeline_lsn,
},
rel: rel_tag,
blkno: block_no,
}
};
client.getpage_send(req).await.unwrap();
inflight.push_back(start);
}
let start = inflight.pop_front().unwrap();
client.getpage_recv().await.unwrap();
let end = Instant::now();
shared_state.live_stats.request_done();
ticks_processed += 1;
STATS.with(|stats| {
stats
.borrow()
.lock()
.unwrap()
.observe(end.duration_since(start))
.unwrap();
});
if let Some(period) = &rps_period {
let next_at = client_start
+ Duration::from_micros(
(ticks_processed) as u64 * u64::try_from(period.as_micros()).unwrap(),
);
tokio::time::sleep_until(next_at.into()).await;
}
}
}

View File

@@ -34,7 +34,7 @@ use utils::lsn::Lsn;
use crate::context::RequestContext;
use crate::pgdatadir_mapping::Version;
use crate::tenant::storage_layer::IoConcurrency;
use crate::tenant::timeline::GetVectoredError;
use crate::tenant::timeline::{GetVectoredError, VersionedKeySpaceQuery};
use crate::tenant::{PageReconstructError, Timeline};
#[derive(Debug, thiserror::Error)]
@@ -353,9 +353,10 @@ where
let mut slru_builder = SlruSegmentsBuilder::new(&mut self.ar);
for part in slru_partitions.parts {
let query = VersionedKeySpaceQuery::uniform(part, self.lsn);
let blocks = self
.timeline
.get_vectored(part, self.lsn, self.io_concurrency.clone(), self.ctx)
.get_vectored(query, self.io_concurrency.clone(), self.ctx)
.await?;
for (key, block) in blocks {

View File

@@ -416,8 +416,18 @@ fn start_pageserver(
// The storage_broker::connect call needs to happen inside a tokio runtime thread.
let broker_client = WALRECEIVER_RUNTIME
.block_on(async {
let tls_config = storage_broker::ClientTlsConfig::new().ca_certificates(
conf.ssl_ca_certs
.iter()
.map(pem::encode)
.map(storage_broker::Certificate::from_pem),
);
// Note: we do not attempt connecting here (but validate endpoints sanity).
storage_broker::connect(conf.broker_endpoint.clone(), conf.broker_keepalive_interval)
storage_broker::connect(
conf.broker_endpoint.clone(),
conf.broker_keepalive_interval,
tls_config,
)
})
.with_context(|| {
format!(

View File

@@ -17,9 +17,10 @@ use once_cell::sync::OnceCell;
use pageserver_api::config::{DiskUsageEvictionTaskConfig, MaxVectoredReadBytes};
use pageserver_api::models::ImageCompressionAlgorithm;
use pageserver_api::shard::TenantShardId;
use pem::Pem;
use postgres_backend::AuthType;
use remote_storage::{RemotePath, RemoteStorageConfig};
use reqwest::{Certificate, Url};
use reqwest::Url;
use storage_broker::Uri;
use utils::id::{NodeId, TimelineId};
use utils::logging::{LogFormat, SecretString};
@@ -67,8 +68,8 @@ pub struct PageServerConf {
/// Period to reload certificate and private key from files.
/// Default: 60s.
pub ssl_cert_reload_period: Duration,
/// Trusted root CA certificates to use in https APIs.
pub ssl_ca_certs: Vec<Certificate>,
/// Trusted root CA certificates to use in https APIs in PEM format.
pub ssl_ca_certs: Vec<Pem>,
/// Current availability zone. Used for traffic metrics.
pub availability_zone: Option<String>,
@@ -118,13 +119,13 @@ pub struct PageServerConf {
/// A lower value implicitly deprioritizes loading such tenants, vs. other work in the system.
pub concurrent_tenant_warmup: ConfigurableSemaphore,
/// Number of concurrent [`Tenant::gather_size_inputs`](crate::tenant::Tenant::gather_size_inputs) allowed.
/// Number of concurrent [`TenantShard::gather_size_inputs`](crate::tenant::TenantShard::gather_size_inputs) allowed.
pub concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore,
/// Limit of concurrent [`Tenant::gather_size_inputs`] issued by module `eviction_task`.
/// Limit of concurrent [`TenantShard::gather_size_inputs`] issued by module `eviction_task`.
/// The number of permits is the same as `concurrent_tenant_size_logical_size_queries`.
/// See the comment in `eviction_task` for details.
///
/// [`Tenant::gather_size_inputs`]: crate::tenant::Tenant::gather_size_inputs
/// [`TenantShard::gather_size_inputs`]: crate::tenant::TenantShard::gather_size_inputs
pub eviction_task_immitated_concurrent_logical_size_queries: ConfigurableSemaphore,
// How often to collect metrics and send them to the metrics endpoint.
@@ -497,7 +498,10 @@ impl PageServerConf {
ssl_ca_certs: match ssl_ca_file {
Some(ssl_ca_file) => {
let buf = std::fs::read(ssl_ca_file)?;
Certificate::from_pem_bundle(&buf)?
pem::parse_many(&buf)?
.into_iter()
.filter(|pem| pem.tag() == "CERTIFICATE")
.collect()
}
None => Vec::new(),
},
@@ -588,10 +592,10 @@ impl ConfigurableSemaphore {
/// Initializse using a non-zero amount of permits.
///
/// Require a non-zero initial permits, because using permits == 0 is a crude way to disable a
/// feature such as [`Tenant::gather_size_inputs`]. Otherwise any semaphore using future will
/// feature such as [`TenantShard::gather_size_inputs`]. Otherwise any semaphore using future will
/// behave like [`futures::future::pending`], just waiting until new permits are added.
///
/// [`Tenant::gather_size_inputs`]: crate::tenant::Tenant::gather_size_inputs
/// [`TenantShard::gather_size_inputs`]: crate::tenant::TenantShard::gather_size_inputs
pub fn new(initial_permits: NonZeroUsize) -> Self {
ConfigurableSemaphore {
initial_permits,

View File

@@ -24,7 +24,7 @@ use crate::task_mgr::{self, BACKGROUND_RUNTIME, TaskKind};
use crate::tenant::mgr::TenantManager;
use crate::tenant::size::CalculateSyntheticSizeError;
use crate::tenant::tasks::BackgroundLoopKind;
use crate::tenant::{LogicalSizeCalculationCause, Tenant};
use crate::tenant::{LogicalSizeCalculationCause, TenantShard};
mod disk_cache;
mod metrics;
@@ -428,7 +428,7 @@ async fn calculate_synthetic_size_worker(
}
}
async fn calculate_and_log(tenant: &Tenant, cancel: &CancellationToken, ctx: &RequestContext) {
async fn calculate_and_log(tenant: &TenantShard, cancel: &CancellationToken, ctx: &RequestContext) {
const CAUSE: LogicalSizeCalculationCause =
LogicalSizeCalculationCause::ConsumptionMetricsSyntheticSize;

View File

@@ -175,9 +175,9 @@ impl MetricsKey {
.absolute_values()
}
/// [`Tenant::remote_size`]
/// [`TenantShard::remote_size`]
///
/// [`Tenant::remote_size`]: crate::tenant::Tenant::remote_size
/// [`TenantShard::remote_size`]: crate::tenant::TenantShard::remote_size
const fn remote_storage_size(tenant_id: TenantId) -> AbsoluteValueFactory {
MetricsKey {
tenant_id,
@@ -199,9 +199,9 @@ impl MetricsKey {
.absolute_values()
}
/// [`Tenant::cached_synthetic_size`] as refreshed by [`calculate_synthetic_size_worker`].
/// [`TenantShard::cached_synthetic_size`] as refreshed by [`calculate_synthetic_size_worker`].
///
/// [`Tenant::cached_synthetic_size`]: crate::tenant::Tenant::cached_synthetic_size
/// [`TenantShard::cached_synthetic_size`]: crate::tenant::TenantShard::cached_synthetic_size
/// [`calculate_synthetic_size_worker`]: super::calculate_synthetic_size_worker
const fn synthetic_size(tenant_id: TenantId) -> AbsoluteValueFactory {
MetricsKey {
@@ -254,7 +254,7 @@ pub(super) async fn collect_all_metrics(
async fn collect<S>(tenants: S, cache: &Cache, ctx: &RequestContext) -> Vec<NewRawMetric>
where
S: futures::stream::Stream<Item = (TenantId, Arc<crate::tenant::Tenant>)>,
S: futures::stream::Stream<Item = (TenantId, Arc<crate::tenant::TenantShard>)>,
{
let mut current_metrics: Vec<NewRawMetric> = Vec::new();
@@ -263,7 +263,9 @@ where
while let Some((tenant_id, tenant)) = tenants.next().await {
let mut tenant_resident_size = 0;
for timeline in tenant.list_timelines() {
let timelines = tenant.list_timelines();
let timelines_len = timelines.len();
for timeline in timelines {
let timeline_id = timeline.timeline_id;
match TimelineSnapshot::collect(&timeline, ctx) {
@@ -289,6 +291,11 @@ where
tenant_resident_size += timeline.resident_physical_size();
}
if timelines_len == 0 {
// Force set it to 1 byte to avoid not being reported -- all timelines are offloaded.
tenant_resident_size = 1;
}
let snap = TenantSnapshot::collect(&tenant, tenant_resident_size);
snap.to_metrics(tenant_id, Utc::now(), cache, &mut current_metrics);
}
@@ -308,7 +315,7 @@ impl TenantSnapshot {
///
/// `resident_size` is calculated of the timelines we had access to for other metrics, so we
/// cannot just list timelines here.
fn collect(t: &Arc<crate::tenant::Tenant>, resident_size: u64) -> Self {
fn collect(t: &Arc<crate::tenant::TenantShard>, resident_size: u64) -> Self {
TenantSnapshot {
resident_size,
remote_size: t.remote_size(),

View File

@@ -8,6 +8,7 @@ use pageserver_api::upcall_api::{
ReAttachRequest, ReAttachResponse, ReAttachResponseTenant, ValidateRequest,
ValidateRequestTenant, ValidateResponse,
};
use reqwest::Certificate;
use serde::Serialize;
use serde::de::DeserializeOwned;
use tokio_util::sync::CancellationToken;
@@ -76,8 +77,8 @@ impl StorageControllerUpcallClient {
client = client.default_headers(headers);
}
for ssl_ca_cert in &conf.ssl_ca_certs {
client = client.add_root_certificate(ssl_ca_cert.clone());
for cert in &conf.ssl_ca_certs {
client = client.add_root_certificate(Certificate::from_der(cert.contents())?);
}
Ok(Some(Self {

View File

@@ -1873,7 +1873,7 @@ async fn update_tenant_config_handler(
&ShardParameters::default(),
);
crate::tenant::Tenant::persist_tenant_config(state.conf, &tenant_shard_id, &location_conf)
crate::tenant::TenantShard::persist_tenant_config(state.conf, &tenant_shard_id, &location_conf)
.await
.map_err(|e| ApiError::InternalServerError(anyhow::anyhow!(e)))?;
@@ -1917,7 +1917,7 @@ async fn patch_tenant_config_handler(
&ShardParameters::default(),
);
crate::tenant::Tenant::persist_tenant_config(state.conf, &tenant_shard_id, &location_conf)
crate::tenant::TenantShard::persist_tenant_config(state.conf, &tenant_shard_id, &location_conf)
.await
.map_err(|e| ApiError::InternalServerError(anyhow::anyhow!(e)))?;

View File

@@ -49,7 +49,7 @@ use tracing::{info, info_span};
/// backwards-compatible changes to the metadata format.
pub const STORAGE_FORMAT_VERSION: u16 = 3;
pub const DEFAULT_PG_VERSION: u32 = 16;
pub const DEFAULT_PG_VERSION: u32 = 17;
// Magic constants used to identify different kinds of files
pub const IMAGE_FILE_MAGIC: u16 = 0x5A60;

View File

@@ -17,7 +17,7 @@ use metrics::{
use once_cell::sync::Lazy;
use pageserver_api::config::{
PageServicePipeliningConfig, PageServicePipeliningConfigPipelined,
PageServiceProtocolPipelinedExecutionStrategy,
PageServiceProtocolPipelinedBatchingStrategy, PageServiceProtocolPipelinedExecutionStrategy,
};
use pageserver_api::models::InMemoryLayerInfo;
use pageserver_api::shard::TenantShardId;
@@ -1086,7 +1086,7 @@ pub(crate) static TIMELINE_EPHEMERAL_BYTES: Lazy<UIntGauge> = Lazy::new(|| {
.expect("Failed to register metric")
});
/// Metrics related to the lifecycle of a [`crate::tenant::Tenant`] object: things
/// Metrics related to the lifecycle of a [`crate::tenant::TenantShard`] object: things
/// like how long it took to load.
///
/// Note that these are process-global metrics, _not_ per-tenant metrics. Per-tenant
@@ -1289,7 +1289,6 @@ pub(crate) enum StorageIoOperation {
Seek,
Fsync,
Metadata,
SetLen,
}
impl StorageIoOperation {
@@ -1304,7 +1303,6 @@ impl StorageIoOperation {
StorageIoOperation::Seek => "seek",
StorageIoOperation::Fsync => "fsync",
StorageIoOperation::Metadata => "metadata",
StorageIoOperation::SetLen => "set_len",
}
}
}
@@ -1716,6 +1714,28 @@ pub enum SmgrQueryType {
Test,
}
#[derive(
Debug,
Clone,
Copy,
IntoStaticStr,
strum_macros::EnumCount,
strum_macros::EnumIter,
strum_macros::FromRepr,
enum_map::Enum,
)]
#[strum(serialize_all = "snake_case")]
pub enum GetPageBatchBreakReason {
BatchFull,
NonBatchableRequest,
NonUniformLsn,
SamePageAtDifferentLsn,
NonUniformTimeline,
ExecutorSteal,
#[cfg(feature = "testing")]
NonUniformKey,
}
pub(crate) struct SmgrQueryTimePerTimeline {
global_started: [IntCounter; SmgrQueryType::COUNT],
global_latency: [Histogram; SmgrQueryType::COUNT],
@@ -1727,6 +1747,8 @@ pub(crate) struct SmgrQueryTimePerTimeline {
per_timeline_flush_in_progress_micros: IntCounter,
global_batch_wait_time: Histogram,
per_timeline_batch_wait_time: Histogram,
global_batch_break_reason: [IntCounter; GetPageBatchBreakReason::COUNT],
per_timeline_batch_break_reason: GetPageBatchBreakReasonTimelineMetrics,
throttling: Arc<tenant_throttling::Pagestream>,
}
@@ -1860,12 +1882,55 @@ static PAGE_SERVICE_BATCH_SIZE_PER_TENANT_TIMELINE: Lazy<HistogramVec> = Lazy::n
.expect("failed to define a metric")
});
static PAGE_SERVICE_BATCH_BREAK_REASON_GLOBAL: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
// it's a counter, but, name is prepared to extend it to a histogram of queue depth
"pageserver_page_service_batch_break_reason_global",
"Reason for breaking batches of get page requests",
&["reason"],
)
.expect("failed to define a metric")
});
struct GetPageBatchBreakReasonTimelineMetrics {
map: EnumMap<GetPageBatchBreakReason, IntCounter>,
}
impl GetPageBatchBreakReasonTimelineMetrics {
fn new(tenant_id: &str, shard_slug: &str, timeline_id: &str) -> Self {
GetPageBatchBreakReasonTimelineMetrics {
map: EnumMap::from_array(std::array::from_fn(|reason_idx| {
let reason = GetPageBatchBreakReason::from_usize(reason_idx);
PAGE_SERVICE_BATCH_BREAK_REASON_PER_TENANT_TIMELINE.with_label_values(&[
tenant_id,
shard_slug,
timeline_id,
reason.into(),
])
})),
}
}
fn inc(&self, reason: GetPageBatchBreakReason) {
self.map[reason].inc()
}
}
static PAGE_SERVICE_BATCH_BREAK_REASON_PER_TENANT_TIMELINE: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_page_service_batch_break_reason",
"Reason for breaking batches of get page requests",
&["tenant_id", "shard_id", "timeline_id", "reason"],
)
.expect("failed to define a metric")
});
pub(crate) static PAGE_SERVICE_CONFIG_MAX_BATCH_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
register_int_gauge_vec!(
"pageserver_page_service_config_max_batch_size",
"Configured maximum batch size for the server-side batching functionality of page_service. \
Labels expose more of the configuration parameters.",
&["mode", "execution"]
&["mode", "execution", "batching"]
)
.expect("failed to define a metric")
});
@@ -1873,10 +1938,11 @@ pub(crate) static PAGE_SERVICE_CONFIG_MAX_BATCH_SIZE: Lazy<IntGaugeVec> = Lazy::
fn set_page_service_config_max_batch_size(conf: &PageServicePipeliningConfig) {
PAGE_SERVICE_CONFIG_MAX_BATCH_SIZE.reset();
let (label_values, value) = match conf {
PageServicePipeliningConfig::Serial => (["serial", "-"], 1),
PageServicePipeliningConfig::Serial => (["serial", "-", "-"], 1),
PageServicePipeliningConfig::Pipelined(PageServicePipeliningConfigPipelined {
max_batch_size,
execution,
batching,
}) => {
let mode = "pipelined";
let execution = match execution {
@@ -1885,7 +1951,12 @@ fn set_page_service_config_max_batch_size(conf: &PageServicePipeliningConfig) {
}
PageServiceProtocolPipelinedExecutionStrategy::Tasks => "tasks",
};
([mode, execution], max_batch_size.get())
let batching = match batching {
PageServiceProtocolPipelinedBatchingStrategy::UniformLsn => "uniform-lsn",
PageServiceProtocolPipelinedBatchingStrategy::ScatteredLsn => "scattered-lsn",
};
([mode, execution, batching], max_batch_size.get())
}
};
PAGE_SERVICE_CONFIG_MAX_BATCH_SIZE
@@ -1981,6 +2052,15 @@ impl SmgrQueryTimePerTimeline {
.get_metric_with_label_values(&[&tenant_id, &shard_slug, &timeline_id])
.unwrap();
let global_batch_break_reason = std::array::from_fn(|i| {
let reason = GetPageBatchBreakReason::from_usize(i);
PAGE_SERVICE_BATCH_BREAK_REASON_GLOBAL
.get_metric_with_label_values(&[reason.into()])
.unwrap()
});
let per_timeline_batch_break_reason =
GetPageBatchBreakReasonTimelineMetrics::new(&tenant_id, &shard_slug, &timeline_id);
let global_flush_in_progress_micros =
PAGE_SERVICE_SMGR_FLUSH_INPROGRESS_MICROS_GLOBAL.clone();
let per_timeline_flush_in_progress_micros = PAGE_SERVICE_SMGR_FLUSH_INPROGRESS_MICROS
@@ -1998,6 +2078,8 @@ impl SmgrQueryTimePerTimeline {
per_timeline_flush_in_progress_micros,
global_batch_wait_time,
per_timeline_batch_wait_time,
global_batch_break_reason,
per_timeline_batch_break_reason,
throttling: pagestream_throttle_metrics,
}
}
@@ -2026,9 +2108,16 @@ impl SmgrQueryTimePerTimeline {
}
/// TODO: do something about this? seems odd, we have a similar call on SmgrOpTimer
pub(crate) fn observe_getpage_batch_start(&self, batch_size: usize) {
pub(crate) fn observe_getpage_batch_start(
&self,
batch_size: usize,
break_reason: GetPageBatchBreakReason,
) {
self.global_batch_size.observe(batch_size as f64);
self.per_timeline_batch_size.observe(batch_size as f64);
self.global_batch_break_reason[break_reason.into_usize()].inc();
self.per_timeline_batch_break_reason.inc(break_reason);
}
}
@@ -3394,6 +3483,15 @@ impl TimelineMetrics {
shard_id,
timeline_id,
]);
for reason in GetPageBatchBreakReason::iter() {
let _ = PAGE_SERVICE_BATCH_BREAK_REASON_PER_TENANT_TIMELINE.remove_label_values(&[
tenant_id,
shard_id,
timeline_id,
reason.into(),
]);
}
}
}
@@ -4272,6 +4370,7 @@ pub fn preinitialize_metrics(
[
&BACKGROUND_LOOP_PERIOD_OVERRUN_COUNT,
&SMGR_QUERY_STARTED_GLOBAL,
&PAGE_SERVICE_BATCH_BREAK_REASON_GLOBAL,
]
.into_iter()
.for_each(|c| {

View File

@@ -15,10 +15,11 @@ use async_compression::tokio::write::GzipEncoder;
use bytes::Buf;
use futures::FutureExt;
use itertools::Itertools;
use jsonwebtoken::TokenData;
use once_cell::sync::OnceCell;
use pageserver_api::config::{
PageServicePipeliningConfig, PageServicePipeliningConfigPipelined,
PageServiceProtocolPipelinedExecutionStrategy,
PageServiceProtocolPipelinedBatchingStrategy, PageServiceProtocolPipelinedExecutionStrategy,
};
use pageserver_api::key::rel_block_to_key;
use pageserver_api::models::{
@@ -58,8 +59,8 @@ use crate::context::{
DownloadBehavior, PerfInstrumentFutureExt, RequestContext, RequestContextBuilder,
};
use crate::metrics::{
self, COMPUTE_COMMANDS_COUNTERS, ComputeCommandKind, LIVE_CONNECTIONS, SmgrOpTimer,
TimelineMetrics,
self, COMPUTE_COMMANDS_COUNTERS, ComputeCommandKind, GetPageBatchBreakReason, LIVE_CONNECTIONS,
SmgrOpTimer, TimelineMetrics,
};
use crate::pgdatadir_mapping::Version;
use crate::span::{
@@ -75,7 +76,7 @@ use crate::tenant::timeline::{self, WaitLsnError};
use crate::tenant::{GetTimelineError, PageReconstructError, Timeline};
use crate::{basebackup, timed_after_cancellation};
/// How long we may wait for a [`crate::tenant::mgr::TenantSlot::InProgress`]` and/or a [`crate::tenant::Tenant`] which
/// How long we may wait for a [`crate::tenant::mgr::TenantSlot::InProgress`]` and/or a [`crate::tenant::TenantShard`] which
/// is not yet in state [`TenantState::Active`].
///
/// NB: this is a different value than [`crate::http::routes::ACTIVE_TENANT_TIMEOUT`].
@@ -641,6 +642,7 @@ impl std::fmt::Display for BatchedPageStreamError {
struct BatchedGetPageRequest {
req: PagestreamGetPageRequest,
timer: SmgrOpTimer,
effective_request_lsn: Lsn,
ctx: RequestContext,
}
@@ -670,8 +672,8 @@ enum BatchedFeMessage {
GetPage {
span: Span,
shard: timeline::handle::WeakHandle<TenantManagerTypes>,
effective_request_lsn: Lsn,
pages: smallvec::SmallVec<[BatchedGetPageRequest; 1]>,
batch_break_reason: GetPageBatchBreakReason,
},
DbSize {
span: Span,
@@ -724,6 +726,119 @@ impl BatchedFeMessage {
BatchedFeMessage::RespondError { .. } => {}
}
}
fn should_break_batch(
&self,
other: &BatchedFeMessage,
max_batch_size: NonZeroUsize,
batching_strategy: PageServiceProtocolPipelinedBatchingStrategy,
) -> Option<GetPageBatchBreakReason> {
match (self, other) {
(
BatchedFeMessage::GetPage {
shard: accum_shard,
pages: accum_pages,
..
},
BatchedFeMessage::GetPage {
shard: this_shard,
pages: this_pages,
..
},
) => {
assert_eq!(this_pages.len(), 1);
if accum_pages.len() >= max_batch_size.get() {
trace!(%max_batch_size, "stopping batching because of batch size");
assert_eq!(accum_pages.len(), max_batch_size.get());
return Some(GetPageBatchBreakReason::BatchFull);
}
if !accum_shard.is_same_handle_as(this_shard) {
trace!("stopping batching because timeline object mismatch");
// TODO: we _could_ batch & execute each shard seperately (and in parallel).
// But the current logic for keeping responses in order does not support that.
return Some(GetPageBatchBreakReason::NonUniformTimeline);
}
match batching_strategy {
PageServiceProtocolPipelinedBatchingStrategy::UniformLsn => {
if let Some(last_in_batch) = accum_pages.last() {
if last_in_batch.effective_request_lsn
!= this_pages[0].effective_request_lsn
{
trace!(
accum_lsn = %last_in_batch.effective_request_lsn,
this_lsn = %this_pages[0].effective_request_lsn,
"stopping batching because LSN changed"
);
return Some(GetPageBatchBreakReason::NonUniformLsn);
}
}
}
PageServiceProtocolPipelinedBatchingStrategy::ScatteredLsn => {
// The read path doesn't curently support serving the same page at different LSNs.
// While technically possible, it's uncertain if the complexity is worth it.
// Break the batch if such a case is encountered.
let same_page_different_lsn = accum_pages.iter().any(|batched| {
batched.req.rel == this_pages[0].req.rel
&& batched.req.blkno == this_pages[0].req.blkno
&& batched.effective_request_lsn
!= this_pages[0].effective_request_lsn
});
if same_page_different_lsn {
trace!(
rel=%this_pages[0].req.rel,
blkno=%this_pages[0].req.blkno,
lsn=%this_pages[0].effective_request_lsn,
"stopping batching because same page was requested at different LSNs"
);
return Some(GetPageBatchBreakReason::SamePageAtDifferentLsn);
}
}
}
None
}
#[cfg(feature = "testing")]
(
BatchedFeMessage::Test {
shard: accum_shard,
requests: accum_requests,
..
},
BatchedFeMessage::Test {
shard: this_shard,
requests: this_requests,
..
},
) => {
assert!(this_requests.len() == 1);
if accum_requests.len() >= max_batch_size.get() {
trace!(%max_batch_size, "stopping batching because of batch size");
assert_eq!(accum_requests.len(), max_batch_size.get());
return Some(GetPageBatchBreakReason::BatchFull);
}
if !accum_shard.is_same_handle_as(this_shard) {
trace!("stopping batching because timeline object mismatch");
// TODO: we _could_ batch & execute each shard seperately (and in parallel).
// But the current logic for keeping responses in order does not support that.
return Some(GetPageBatchBreakReason::NonUniformTimeline);
}
let this_batch_key = this_requests[0].req.batch_key;
let accum_batch_key = accum_requests[0].req.batch_key;
if this_requests[0].req.batch_key != accum_requests[0].req.batch_key {
trace!(%accum_batch_key, %this_batch_key, "stopping batching because batch key changed");
return Some(GetPageBatchBreakReason::NonUniformKey);
}
None
}
(_, _) => Some(GetPageBatchBreakReason::NonBatchableRequest),
}
}
}
impl PageServerHandler {
@@ -1025,34 +1140,32 @@ impl PageServerHandler {
.await?;
// We're holding the Handle
// TODO: if we actually need to wait for lsn here, it delays the entire batch which doesn't need to wait
let res = Self::wait_or_get_last_lsn(
let effective_request_lsn = match Self::effective_request_lsn(
&shard,
shard.get_last_record_lsn(),
req.hdr.request_lsn,
req.hdr.not_modified_since,
&shard.get_applied_gc_cutoff_lsn(),
&ctx,
)
.maybe_perf_instrument(&ctx, |current_perf_span| {
info_span!(
target: PERF_TRACE_TARGET,
parent: current_perf_span,
"WAIT_LSN",
)
})
.await;
let effective_request_lsn = match res {
) {
Ok(lsn) => lsn,
Err(e) => {
return respond_error!(span, e);
}
};
BatchedFeMessage::GetPage {
span,
shard: shard.downgrade(),
effective_request_lsn,
pages: smallvec::smallvec![BatchedGetPageRequest { req, timer, ctx }],
pages: smallvec::smallvec![BatchedGetPageRequest {
req,
timer,
effective_request_lsn,
ctx,
}],
// The executor grabs the batch when it becomes idle.
// Hence, [`GetPageBatchBreakReason::ExecutorSteal`] is the
// default reason for breaking the batch.
batch_break_reason: GetPageBatchBreakReason::ExecutorSteal,
}
}
#[cfg(feature = "testing")]
@@ -1078,6 +1191,7 @@ impl PageServerHandler {
#[instrument(skip_all, level = tracing::Level::TRACE)]
#[allow(clippy::boxed_local)]
fn pagestream_do_batch(
batching_strategy: PageServiceProtocolPipelinedBatchingStrategy,
max_batch_size: NonZeroUsize,
batch: &mut Result<BatchedFeMessage, QueryError>,
this_msg: Result<BatchedFeMessage, QueryError>,
@@ -1089,90 +1203,59 @@ impl PageServerHandler {
Err(e) => return Err(Err(e)),
};
match (&mut *batch, this_msg) {
// something batched already, let's see if we can add this message to the batch
(
Ok(BatchedFeMessage::GetPage {
span: _,
shard: accum_shard,
pages: accum_pages,
effective_request_lsn: accum_lsn,
}),
BatchedFeMessage::GetPage {
span: _,
shard: this_shard,
pages: this_pages,
effective_request_lsn: this_lsn,
},
) if (|| {
assert_eq!(this_pages.len(), 1);
if accum_pages.len() >= max_batch_size.get() {
trace!(%accum_lsn, %this_lsn, %max_batch_size, "stopping batching because of batch size");
assert_eq!(accum_pages.len(), max_batch_size.get());
return false;
}
if !accum_shard.is_same_handle_as(&this_shard) {
trace!(%accum_lsn, %this_lsn, "stopping batching because timeline object mismatch");
// TODO: we _could_ batch & execute each shard seperately (and in parallel).
// But the current logic for keeping responses in order does not support that.
return false;
}
// the vectored get currently only supports a single LSN, so, bounce as soon
// as the effective request_lsn changes
if *accum_lsn != this_lsn {
trace!(%accum_lsn, %this_lsn, "stopping batching because LSN changed");
return false;
}
true
})() =>
{
// ok to batch
accum_pages.extend(this_pages);
Ok(())
let eligible_batch = match batch {
Ok(b) => b,
Err(_) => {
return Err(Ok(this_msg));
}
#[cfg(feature = "testing")]
(
Ok(BatchedFeMessage::Test {
shard: accum_shard,
requests: accum_requests,
..
}),
BatchedFeMessage::Test {
shard: this_shard,
requests: this_requests,
..
},
) if (|| {
assert!(this_requests.len() == 1);
if accum_requests.len() >= max_batch_size.get() {
trace!(%max_batch_size, "stopping batching because of batch size");
assert_eq!(accum_requests.len(), max_batch_size.get());
return false;
};
let batch_break =
eligible_batch.should_break_batch(&this_msg, max_batch_size, batching_strategy);
match batch_break {
Some(reason) => {
if let BatchedFeMessage::GetPage {
batch_break_reason, ..
} = eligible_batch
{
*batch_break_reason = reason;
}
if !accum_shard.is_same_handle_as(&this_shard) {
trace!("stopping batching because timeline object mismatch");
// TODO: we _could_ batch & execute each shard seperately (and in parallel).
// But the current logic for keeping responses in order does not support that.
return false;
}
let this_batch_key = this_requests[0].req.batch_key;
let accum_batch_key = accum_requests[0].req.batch_key;
if this_requests[0].req.batch_key != accum_requests[0].req.batch_key {
trace!(%accum_batch_key, %this_batch_key, "stopping batching because batch key changed");
return false;
}
true
})() =>
{
// ok to batch
accum_requests.extend(this_requests);
Ok(())
}
// something batched already but this message is unbatchable
(_, this_msg) => {
// by default, don't continue batching
Err(Ok(this_msg))
}
None => {
// ok to batch
match (eligible_batch, this_msg) {
(
BatchedFeMessage::GetPage {
pages: accum_pages, ..
},
BatchedFeMessage::GetPage {
pages: this_pages, ..
},
) => {
accum_pages.extend(this_pages);
Ok(())
}
#[cfg(feature = "testing")]
(
BatchedFeMessage::Test {
requests: accum_requests,
..
},
BatchedFeMessage::Test {
requests: this_requests,
..
},
) => {
accum_requests.extend(this_requests);
Ok(())
}
// Shape guaranteed by [`BatchedFeMessage::should_break_batch`]
_ => unreachable!(),
}
}
}
}
@@ -1393,8 +1476,8 @@ impl PageServerHandler {
BatchedFeMessage::GetPage {
span,
shard,
effective_request_lsn,
pages,
batch_break_reason,
} => {
fail::fail_point!("ps::handle-pagerequest-message::getpage");
let (shard, ctx) = upgrade_handle_and_set_context!(shard);
@@ -1405,9 +1488,9 @@ impl PageServerHandler {
let res = self
.handle_get_page_at_lsn_request_batched(
&shard,
effective_request_lsn,
pages,
io_concurrency,
batch_break_reason,
&ctx,
)
.instrument(span.clone())
@@ -1724,6 +1807,7 @@ impl PageServerHandler {
let PageServicePipeliningConfigPipelined {
max_batch_size,
execution,
batching: batching_strategy,
} = pipelining_config;
// Macro to _define_ a pipeline stage.
@@ -1775,7 +1859,7 @@ impl PageServerHandler {
exit |= read_res.is_err();
let could_send = batch_tx
.send(read_res, |batch, res| {
Self::pagestream_do_batch(max_batch_size, batch, res)
Self::pagestream_do_batch(batching_strategy, max_batch_size, batch, res)
})
.await;
exit |= could_send.is_err();
@@ -1871,7 +1955,39 @@ impl PageServerHandler {
ctx: &RequestContext,
) -> Result<Lsn, PageStreamError> {
let last_record_lsn = timeline.get_last_record_lsn();
let effective_request_lsn = Self::effective_request_lsn(
timeline,
last_record_lsn,
request_lsn,
not_modified_since,
latest_gc_cutoff_lsn,
)?;
if effective_request_lsn > last_record_lsn {
timeline
.wait_lsn(
not_modified_since,
crate::tenant::timeline::WaitLsnWaiter::PageService,
timeline::WaitLsnTimeout::Default,
ctx,
)
.await?;
// Since we waited for 'effective_request_lsn' to arrive, that is now the last
// record LSN. (Or close enough for our purposes; the last-record LSN can
// advance immediately after we return anyway)
}
Ok(effective_request_lsn)
}
fn effective_request_lsn(
timeline: &Timeline,
last_record_lsn: Lsn,
request_lsn: Lsn,
not_modified_since: Lsn,
latest_gc_cutoff_lsn: &RcuReadGuard<Lsn>,
) -> Result<Lsn, PageStreamError> {
// Sanity check the request
if request_lsn < not_modified_since {
return Err(PageStreamError::BadRequest(
@@ -1906,19 +2022,7 @@ impl PageServerHandler {
}
}
// Wait for WAL up to 'not_modified_since' to arrive, if necessary
if not_modified_since > last_record_lsn {
timeline
.wait_lsn(
not_modified_since,
crate::tenant::timeline::WaitLsnWaiter::PageService,
timeline::WaitLsnTimeout::Default,
ctx,
)
.await?;
// Since we waited for 'not_modified_since' to arrive, that is now the last
// record LSN. (Or close enough for our purposes; the last-record LSN can
// advance immediately after we return anyway)
Ok(not_modified_since)
} else {
// It might be better to use max(not_modified_since, latest_gc_cutoff_lsn)
@@ -2073,16 +2177,16 @@ impl PageServerHandler {
async fn handle_get_page_at_lsn_request_batched(
&mut self,
timeline: &Timeline,
effective_lsn: Lsn,
requests: smallvec::SmallVec<[BatchedGetPageRequest; 1]>,
io_concurrency: IoConcurrency,
batch_break_reason: GetPageBatchBreakReason,
ctx: &RequestContext,
) -> Vec<Result<(PagestreamBeMessage, SmgrOpTimer), BatchedPageStreamError>> {
debug_assert_current_span_has_tenant_and_timeline_id();
timeline
.query_metrics
.observe_getpage_batch_start(requests.len());
.observe_getpage_batch_start(requests.len(), batch_break_reason);
// If a page trace is running, submit an event for this request.
if let Some(page_trace) = timeline.page_trace.load().as_ref() {
@@ -2092,20 +2196,81 @@ impl PageServerHandler {
// Ignore error (trace buffer may be full or tracer may have disconnected).
_ = page_trace.try_send(PageTraceEvent {
key,
effective_lsn,
effective_lsn: batch.effective_request_lsn,
time,
});
}
}
// If any request in the batch needs to wait for LSN, then do so now.
let mut perf_instrument = false;
let max_effective_lsn = requests
.iter()
.map(|req| {
if req.ctx.has_perf_span() {
perf_instrument = true;
}
req.effective_request_lsn
})
.max()
.expect("batch is never empty");
let ctx = match perf_instrument {
true => RequestContextBuilder::from(ctx)
.root_perf_span(|| {
info_span!(
target: PERF_TRACE_TARGET,
"GET_VECTORED",
tenant_id = %timeline.tenant_shard_id.tenant_id,
timeline_id = %timeline.timeline_id,
shard = %timeline.tenant_shard_id.shard_slug(),
%max_effective_lsn
)
})
.attached_child(),
false => ctx.attached_child(),
};
let last_record_lsn = timeline.get_last_record_lsn();
if max_effective_lsn > last_record_lsn {
if let Err(e) = timeline
.wait_lsn(
max_effective_lsn,
crate::tenant::timeline::WaitLsnWaiter::PageService,
timeline::WaitLsnTimeout::Default,
&ctx,
)
.maybe_perf_instrument(&ctx, |current_perf_span| {
info_span!(
target: PERF_TRACE_TARGET,
parent: current_perf_span,
"WAIT_LSN",
)
})
.await
{
return Vec::from_iter(requests.into_iter().map(|req| {
Err(BatchedPageStreamError {
err: PageStreamError::from(e.clone()),
req: req.req.hdr,
})
}));
}
}
let results = timeline
.get_rel_page_at_lsn_batched(
requests
.iter()
.map(|p| (&p.req.rel, &p.req.blkno, p.ctx.attached_child())),
effective_lsn,
requests.iter().map(|p| {
(
&p.req.rel,
&p.req.blkno,
p.effective_request_lsn,
p.ctx.attached_child(),
)
}),
io_concurrency,
ctx,
&ctx,
)
.await;
assert_eq!(results.len(), requests.len());
@@ -2673,7 +2838,7 @@ where
) -> Result<(), QueryError> {
// this unwrap is never triggered, because check_auth_jwt only called when auth_type is NeonJWT
// which requires auth to be present
let data = self
let data: TokenData<Claims> = self
.auth
.as_ref()
.unwrap()

View File

@@ -6,7 +6,7 @@
//! walingest.rs handles a few things like implicit relation creation and extension.
//! Clarify that)
//!
use std::collections::{BTreeMap, HashMap, HashSet, hash_map};
use std::collections::{HashMap, HashSet, hash_map};
use std::ops::{ControlFlow, Range};
use crate::walingest::{WalIngestError, WalIngestErrorKind};
@@ -14,7 +14,6 @@ use crate::{PERF_TRACE_TARGET, ensure_walingest};
use anyhow::Context;
use bytes::{Buf, Bytes, BytesMut};
use enum_map::Enum;
use itertools::Itertools;
use pageserver_api::key::{
AUX_FILES_KEY, CHECKPOINT_KEY, CONTROLFILE_KEY, CompactKey, DBDIR_KEY, Key, RelDirExists,
TWOPHASEDIR_KEY, dbdir_key_range, rel_block_to_key, rel_dir_to_key, rel_key_range,
@@ -22,7 +21,7 @@ use pageserver_api::key::{
repl_origin_key, repl_origin_key_range, slru_block_to_key, slru_dir_to_key,
slru_segment_key_range, slru_segment_size_to_key, twophase_file_key, twophase_key_range,
};
use pageserver_api::keyspace::SparseKeySpace;
use pageserver_api::keyspace::{KeySpaceRandomAccum, SparseKeySpace};
use pageserver_api::models::RelSizeMigration;
use pageserver_api::record::NeonWalRecord;
use pageserver_api::reltag::{BlockNumber, RelTag, SlruKind};
@@ -41,7 +40,7 @@ use wal_decoder::serialized_batch::{SerializedValueBatch, ValueMeta};
use super::tenant::{PageReconstructError, Timeline};
use crate::aux_file;
use crate::context::{PerfInstrumentFutureExt, RequestContext, RequestContextBuilder};
use crate::context::{PerfInstrumentFutureExt, RequestContext};
use crate::keyspace::{KeySpace, KeySpaceAccum};
use crate::metrics::{
RELSIZE_CACHE_ENTRIES, RELSIZE_CACHE_HITS, RELSIZE_CACHE_MISSES, RELSIZE_CACHE_MISSES_OLD,
@@ -51,7 +50,7 @@ use crate::span::{
debug_assert_current_span_has_tenant_and_timeline_id_no_shard_id,
};
use crate::tenant::storage_layer::IoConcurrency;
use crate::tenant::timeline::GetVectoredError;
use crate::tenant::timeline::{GetVectoredError, VersionedKeySpaceQuery};
/// Max delta records appended to the AUX_FILES_KEY (for aux v1). The write path will write a full image once this threshold is reached.
pub const MAX_AUX_FILE_DELTAS: usize = 1024;
@@ -207,10 +206,9 @@ impl Timeline {
let pages: smallvec::SmallVec<[_; 1]> = smallvec::smallvec![(tag, blknum)];
let res = self
.get_rel_page_at_lsn_batched(
pages
.iter()
.map(|(tag, blknum)| (tag, blknum, ctx.attached_child())),
effective_lsn,
pages.iter().map(|(tag, blknum)| {
(tag, blknum, effective_lsn, ctx.attached_child())
}),
io_concurrency.clone(),
ctx,
)
@@ -248,8 +246,7 @@ impl Timeline {
/// The ordering of the returned vec corresponds to the ordering of `pages`.
pub(crate) async fn get_rel_page_at_lsn_batched(
&self,
pages: impl ExactSizeIterator<Item = (&RelTag, &BlockNumber, RequestContext)>,
effective_lsn: Lsn,
pages: impl ExactSizeIterator<Item = (&RelTag, &BlockNumber, Lsn, RequestContext)>,
io_concurrency: IoConcurrency,
ctx: &RequestContext,
) -> Vec<Result<Bytes, PageReconstructError>> {
@@ -262,11 +259,13 @@ impl Timeline {
let mut result = Vec::with_capacity(pages.len());
let result_slots = result.spare_capacity_mut();
let mut keys_slots: BTreeMap<Key, smallvec::SmallVec<[(usize, RequestContext); 1]>> =
BTreeMap::default();
let mut keys_slots: HashMap<Key, smallvec::SmallVec<[(usize, RequestContext); 1]>> =
HashMap::with_capacity(pages.len());
let mut perf_instrument = false;
for (response_slot_idx, (tag, blknum, ctx)) in pages.enumerate() {
let mut req_keyspaces: HashMap<Lsn, KeySpaceRandomAccum> =
HashMap::with_capacity(pages.len());
for (response_slot_idx, (tag, blknum, lsn, ctx)) in pages.enumerate() {
if tag.relnode == 0 {
result_slots[response_slot_idx].write(Err(PageReconstructError::Other(
RelationError::InvalidRelnode.into(),
@@ -277,14 +276,14 @@ impl Timeline {
}
let nblocks = match self
.get_rel_size(*tag, Version::Lsn(effective_lsn), &ctx)
.get_rel_size(*tag, Version::Lsn(lsn), &ctx)
.maybe_perf_instrument(&ctx, |crnt_perf_span| {
info_span!(
target: PERF_TRACE_TARGET,
parent: crnt_perf_span,
"GET_REL_SIZE",
reltag=%tag,
lsn=%effective_lsn,
lsn=%lsn,
)
})
.await
@@ -300,7 +299,7 @@ impl Timeline {
if *blknum >= nblocks {
debug!(
"read beyond EOF at {} blk {} at {}, size is {}: returning all-zeros page",
tag, blknum, effective_lsn, nblocks
tag, blknum, lsn, nblocks
);
result_slots[response_slot_idx].write(Ok(ZERO_PAGE.clone()));
slots_filled += 1;
@@ -309,46 +308,29 @@ impl Timeline {
let key = rel_block_to_key(*tag, *blknum);
if ctx.has_perf_span() {
perf_instrument = true;
}
let key_slots = keys_slots.entry(key).or_default();
key_slots.push((response_slot_idx, ctx));
let acc = req_keyspaces.entry(lsn).or_default();
acc.add_key(key);
}
let keyspace = {
// add_key requires monotonicity
let mut acc = KeySpaceAccum::new();
for key in keys_slots
.keys()
// in fact it requires strong monotonicity
.dedup()
{
acc.add_key(*key);
}
acc.to_keyspace()
};
let ctx = match perf_instrument {
true => RequestContextBuilder::from(ctx)
.root_perf_span(|| {
info_span!(
target: PERF_TRACE_TARGET,
"GET_VECTORED",
tenant_id = %self.tenant_shard_id.tenant_id,
timeline_id = %self.timeline_id,
lsn = %effective_lsn,
shard = %self.tenant_shard_id.shard_slug(),
)
})
.attached_child(),
false => ctx.attached_child(),
};
let query: Vec<(Lsn, KeySpace)> = req_keyspaces
.into_iter()
.map(|(lsn, acc)| (lsn, acc.to_keyspace()))
.collect();
let query = VersionedKeySpaceQuery::scattered(query);
let res = self
.get_vectored(keyspace, effective_lsn, io_concurrency, &ctx)
.maybe_perf_instrument(&ctx, |current_perf_span| current_perf_span.clone())
.get_vectored(query, io_concurrency, ctx)
.maybe_perf_instrument(ctx, |current_perf_span| {
info_span!(
target: PERF_TRACE_TARGET,
parent: current_perf_span,
"GET_BATCH",
batch_size = %page_count,
)
})
.await;
match res {
@@ -378,12 +360,12 @@ impl Timeline {
// There is no standardized way to express that the batched span followed from N request spans.
// So, abuse the system and mark the request contexts as follows_from the batch span, so we get
// some linkage in our trace viewer. It allows us to answer: which GET_VECTORED did this GET_PAGE wait for.
req_ctx.perf_follows_from(&ctx);
req_ctx.perf_follows_from(ctx);
slots_filled += 1;
}
result_slots[first_slot].write(res);
first_req_ctx.perf_follows_from(&ctx);
first_req_ctx.perf_follows_from(ctx);
slots_filled += 1;
}
}
@@ -422,7 +404,7 @@ impl Timeline {
}
};
req_ctx.perf_follows_from(&ctx);
req_ctx.perf_follows_from(ctx);
result_slots[*slot].write(err);
}
@@ -661,8 +643,9 @@ impl Timeline {
let mut segment = BytesMut::with_capacity(n_blocks as usize * BLCKSZ as usize);
for batch in batches.parts {
let query = VersionedKeySpaceQuery::uniform(batch, lsn);
let blocks = self
.get_vectored(batch, lsn, io_concurrency.clone(), ctx)
.get_vectored(query, io_concurrency.clone(), ctx)
.await?;
for (_key, block) in blocks {
@@ -899,8 +882,9 @@ impl Timeline {
);
for batch in batches.parts.into_iter().rev() {
let query = VersionedKeySpaceQuery::uniform(batch, probe_lsn);
let blocks = self
.get_vectored(batch, probe_lsn, io_concurrency.clone(), ctx)
.get_vectored(query, io_concurrency.clone(), ctx)
.await?;
for (_key, clog_page) in blocks.into_iter().rev() {

File diff suppressed because it is too large Load Diff

View File

@@ -15,23 +15,21 @@
//! len >= 128: 1CCCXXXX XXXXXXXX XXXXXXXX XXXXXXXX
//!
use std::cmp::min;
use std::sync::Arc;
use std::io::Error;
use async_compression::Level;
use bytes::{BufMut, BytesMut};
use pageserver_api::models::ImageCompressionAlgorithm;
use tokio::io::AsyncWriteExt;
use tokio_epoll_uring::IoBuf;
use tokio_epoll_uring::{BoundedBuf, IoBuf, Slice};
use tokio_util::sync::CancellationToken;
use tracing::warn;
use crate::context::RequestContext;
use crate::page_cache::PAGE_SZ;
use crate::tenant::block_io::BlockCursor;
use crate::virtual_file::IoBufferMut;
use crate::virtual_file::VirtualFile;
use crate::virtual_file::owned_buffers_io::io_buf_ext::{FullSlice, IoBufExt};
use crate::virtual_file::owned_buffers_io::write::{BufferedWriter, FlushTaskError};
#[derive(Copy, Clone, Debug)]
pub struct CompressionInfo {
@@ -39,12 +37,61 @@ pub struct CompressionInfo {
pub compressed_size: Option<usize>,
}
#[derive(Debug, thiserror::Error)]
pub enum WriteBlobError {
#[error(transparent)]
Flush(FlushTaskError),
#[error("blob too large ({len} bytes)")]
BlobTooLarge { len: usize },
/// A blob header, with header+data length and compression info.
///
/// TODO: use this more widely, and add an encode() method too.
/// TODO: document the header format.
#[derive(Clone, Copy, Default)]
pub struct Header {
pub header_len: usize,
pub data_len: usize,
pub compression_bits: u8,
}
impl Header {
/// Decodes a header from a byte slice.
pub fn decode(bytes: &[u8]) -> Result<Self, std::io::Error> {
let Some(&first_header_byte) = bytes.first() else {
return Err(std::io::Error::new(
std::io::ErrorKind::InvalidData,
"zero-length blob header",
));
};
// If the first bit is 0, this is just a 1-byte length prefix up to 128 bytes.
if first_header_byte < 0x80 {
return Ok(Self {
header_len: 1, // by definition
data_len: first_header_byte as usize,
compression_bits: BYTE_UNCOMPRESSED,
});
}
// Otherwise, this is a 4-byte header containing compression information and length.
const HEADER_LEN: usize = 4;
let mut header_buf: [u8; HEADER_LEN] = bytes[0..HEADER_LEN].try_into().map_err(|_| {
std::io::Error::new(
std::io::ErrorKind::InvalidData,
format!("blob header too short: {bytes:?}"),
)
})?;
// TODO: verify the compression bits and convert to an enum.
let compression_bits = header_buf[0] & LEN_COMPRESSION_BIT_MASK;
header_buf[0] &= !LEN_COMPRESSION_BIT_MASK;
let data_len = u32::from_be_bytes(header_buf) as usize;
Ok(Self {
header_len: HEADER_LEN,
data_len,
compression_bits,
})
}
/// Returns the total header+data length.
pub fn total_len(&self) -> usize {
self.header_len + self.data_len
}
}
impl BlockCursor<'_> {
@@ -168,62 +215,141 @@ pub(super) const BYTE_ZSTD: u8 = BYTE_UNCOMPRESSED | 0x10;
/// A wrapper of `VirtualFile` that allows users to write blobs.
///
/// If a `BlobWriter` is dropped, the internal buffer will be
/// discarded. You need to call [`Self::into_inner`]
/// discarded. You need to call [`flush_buffer`](Self::flush_buffer)
/// manually before dropping.
pub struct BlobWriter {
pub struct BlobWriter<const BUFFERED: bool> {
inner: VirtualFile,
offset: u64,
/// A buffer to save on write calls, only used if BUFFERED=true
buf: Vec<u8>,
/// We do tiny writes for the length headers; they need to be in an owned buffer;
io_buf: Option<BytesMut>,
writer: BufferedWriter<IoBufferMut, VirtualFile>,
offset: u64,
}
impl BlobWriter {
impl<const BUFFERED: bool> BlobWriter<BUFFERED> {
pub fn new(
file: Arc<VirtualFile>,
inner: VirtualFile,
start_offset: u64,
gate: &utils::sync::gate::Gate,
cancel: CancellationToken,
ctx: &RequestContext,
flush_task_span: tracing::Span,
) -> anyhow::Result<Self> {
Ok(Self {
io_buf: Some(BytesMut::new()),
writer: BufferedWriter::new(
file,
start_offset,
|| IoBufferMut::with_capacity(Self::CAPACITY),
gate.enter()?,
cancel,
ctx,
flush_task_span,
),
_gate: &utils::sync::gate::Gate,
_cancel: CancellationToken,
_ctx: &RequestContext,
) -> Self {
Self {
inner,
offset: start_offset,
})
buf: Vec::with_capacity(Self::CAPACITY),
io_buf: Some(BytesMut::new()),
}
}
pub fn size(&self) -> u64 {
self.offset
}
const CAPACITY: usize = 64 * 1024;
const CAPACITY: usize = if BUFFERED { 64 * 1024 } else { 0 };
/// Writes `src_buf` to the file at the current offset.
/// Writes the given buffer directly to the underlying `VirtualFile`.
/// You need to make sure that the internal buffer is empty, otherwise
/// data will be written in wrong order.
#[inline(always)]
async fn write_all_unbuffered<Buf: IoBuf + Send>(
&mut self,
src_buf: FullSlice<Buf>,
ctx: &RequestContext,
) -> (FullSlice<Buf>, Result<(), Error>) {
let (src_buf, res) = self.inner.write_all(src_buf, ctx).await;
let nbytes = match res {
Ok(nbytes) => nbytes,
Err(e) => return (src_buf, Err(e)),
};
self.offset += nbytes as u64;
(src_buf, Ok(()))
}
#[inline(always)]
/// Flushes the internal buffer to the underlying `VirtualFile`.
pub async fn flush_buffer(&mut self, ctx: &RequestContext) -> Result<(), Error> {
let buf = std::mem::take(&mut self.buf);
let (slice, res) = self.inner.write_all(buf.slice_len(), ctx).await;
res?;
let mut buf = slice.into_raw_slice().into_inner();
buf.clear();
self.buf = buf;
Ok(())
}
#[inline(always)]
/// Writes as much of `src_buf` into the internal buffer as it fits
fn write_into_buffer(&mut self, src_buf: &[u8]) -> usize {
let remaining = Self::CAPACITY - self.buf.len();
let to_copy = src_buf.len().min(remaining);
self.buf.extend_from_slice(&src_buf[..to_copy]);
self.offset += to_copy as u64;
to_copy
}
/// Internal, possibly buffered, write function
async fn write_all<Buf: IoBuf + Send>(
&mut self,
src_buf: FullSlice<Buf>,
ctx: &RequestContext,
) -> (FullSlice<Buf>, Result<(), FlushTaskError>) {
let res = self
.writer
// TODO: why are we taking a FullSlice if we're going to pass a borrow downstack?
// Can remove all the complexity around owned buffers upstack
.write_buffered_borrowed(&src_buf, ctx)
.await
.map(|len| {
self.offset += len as u64;
});
) -> (FullSlice<Buf>, Result<(), Error>) {
let src_buf = src_buf.into_raw_slice();
let src_buf_bounds = src_buf.bounds();
let restore = move |src_buf_slice: Slice<_>| {
FullSlice::must_new(Slice::from_buf_bounds(
src_buf_slice.into_inner(),
src_buf_bounds,
))
};
(src_buf, res)
if !BUFFERED {
assert!(self.buf.is_empty());
return self
.write_all_unbuffered(FullSlice::must_new(src_buf), ctx)
.await;
}
let remaining = Self::CAPACITY - self.buf.len();
let src_buf_len = src_buf.bytes_init();
if src_buf_len == 0 {
return (restore(src_buf), Ok(()));
}
let mut src_buf = src_buf.slice(0..src_buf_len);
// First try to copy as much as we can into the buffer
if remaining > 0 {
let copied = self.write_into_buffer(&src_buf);
src_buf = src_buf.slice(copied..);
}
// Then, if the buffer is full, flush it out
if self.buf.len() == Self::CAPACITY {
if let Err(e) = self.flush_buffer(ctx).await {
return (restore(src_buf), Err(e));
}
}
// Finally, write the tail of src_buf:
// If it wholly fits into the buffer without
// completely filling it, then put it there.
// If not, write it out directly.
let src_buf = if !src_buf.is_empty() {
assert_eq!(self.buf.len(), 0);
if src_buf.len() < Self::CAPACITY {
let copied = self.write_into_buffer(&src_buf);
// We just verified above that src_buf fits into our internal buffer.
assert_eq!(copied, src_buf.len());
restore(src_buf)
} else {
let (src_buf, res) = self
.write_all_unbuffered(FullSlice::must_new(src_buf), ctx)
.await;
if let Err(e) = res {
return (src_buf, Err(e));
}
src_buf
}
} else {
restore(src_buf)
};
(src_buf, Ok(()))
}
/// Write a blob of data. Returns the offset that it was written to,
@@ -232,7 +358,7 @@ impl BlobWriter {
&mut self,
srcbuf: FullSlice<Buf>,
ctx: &RequestContext,
) -> (FullSlice<Buf>, Result<u64, WriteBlobError>) {
) -> (FullSlice<Buf>, Result<u64, Error>) {
let (buf, res) = self
.write_blob_maybe_compressed(srcbuf, ctx, ImageCompressionAlgorithm::Disabled)
.await;
@@ -246,10 +372,7 @@ impl BlobWriter {
srcbuf: FullSlice<Buf>,
ctx: &RequestContext,
algorithm: ImageCompressionAlgorithm,
) -> (
FullSlice<Buf>,
Result<(u64, CompressionInfo), WriteBlobError>,
) {
) -> (FullSlice<Buf>, Result<(u64, CompressionInfo), Error>) {
let offset = self.offset;
let mut compression_info = CompressionInfo {
written_compressed: false,
@@ -265,16 +388,14 @@ impl BlobWriter {
if len < 128 {
// Short blob. Write a 1-byte length header
io_buf.put_u8(len as u8);
let (slice, res) = self.write_all(io_buf.slice_len(), ctx).await;
let res = res.map_err(WriteBlobError::Flush);
((slice, res), srcbuf)
(self.write_all(io_buf.slice_len(), ctx).await, srcbuf)
} else {
// Write a 4-byte length header
if len > MAX_SUPPORTED_BLOB_LEN {
return (
(
io_buf.slice_len(),
Err(WriteBlobError::BlobTooLarge { len }),
Err(Error::other(format!("blob too large ({len} bytes)"))),
),
srcbuf,
);
@@ -308,9 +429,7 @@ impl BlobWriter {
assert_eq!(len_buf[0] & 0xf0, 0);
len_buf[0] |= high_bit_mask;
io_buf.extend_from_slice(&len_buf[..]);
let (slice, res) = self.write_all(io_buf.slice_len(), ctx).await;
let res = res.map_err(WriteBlobError::Flush);
((slice, res), srcbuf)
(self.write_all(io_buf.slice_len(), ctx).await, srcbuf)
}
}
.await;
@@ -325,23 +444,61 @@ impl BlobWriter {
} else {
self.write_all(srcbuf, ctx).await
};
let res = res.map_err(WriteBlobError::Flush);
(srcbuf, res.map(|_| (offset, compression_info)))
}
/// Writes a raw blob containing both header and data, returning its offset.
pub(crate) async fn write_blob_raw<Buf: IoBuf + Send>(
&mut self,
raw_with_header: FullSlice<Buf>,
ctx: &RequestContext,
) -> (FullSlice<Buf>, Result<u64, Error>) {
// Verify the header, to ensure we don't write invalid/corrupt data.
let header = match Header::decode(&raw_with_header) {
Ok(header) => header,
Err(err) => return (raw_with_header, Err(err)),
};
if raw_with_header.len() != header.total_len() {
let header_total_len = header.total_len();
let raw_len = raw_with_header.len();
return (
raw_with_header,
Err(std::io::Error::new(
std::io::ErrorKind::InvalidData,
format!("header length mismatch: {header_total_len} != {raw_len}"),
)),
);
}
let offset = self.offset;
let (raw_with_header, result) = self.write_all(raw_with_header, ctx).await;
(raw_with_header, result.map(|_| offset))
}
}
impl BlobWriter<true> {
/// Access the underlying `VirtualFile`.
///
/// This function flushes the internal buffer before giving access
/// to the underlying `VirtualFile`.
pub async fn into_inner(mut self, ctx: &RequestContext) -> Result<VirtualFile, Error> {
self.flush_buffer(ctx).await?;
Ok(self.inner)
}
/// Access the underlying `VirtualFile`.
///
/// The caller can use the `handle_tail` function to change the tail of the buffer before flushing it to disk.
/// The buffer will not be flushed to disk if handle_tail returns `None`.
pub async fn into_inner(
self,
handle_tail: impl FnMut(IoBufferMut) -> Option<IoBufferMut>,
) -> Result<VirtualFile, FlushTaskError> {
let (_, file) = self.writer.shutdown(handle_tail).await?;
Ok(file)
/// Unlike [`into_inner`](Self::into_inner), this doesn't flush
/// the internal buffer before giving access.
pub fn into_inner_no_flush(self) -> VirtualFile {
self.inner
}
}
impl BlobWriter<false> {
/// Access the underlying `VirtualFile`.
pub fn into_inner(self) -> VirtualFile {
self.inner
}
}
@@ -350,22 +507,21 @@ pub(crate) mod tests {
use camino::Utf8PathBuf;
use camino_tempfile::Utf8TempDir;
use rand::{Rng, SeedableRng};
use tracing::info_span;
use super::*;
use crate::context::DownloadBehavior;
use crate::task_mgr::TaskKind;
use crate::tenant::block_io::BlockReaderRef;
async fn round_trip_test(blobs: &[Vec<u8>]) -> anyhow::Result<()> {
round_trip_test_compressed(blobs, false).await
async fn round_trip_test<const BUFFERED: bool>(blobs: &[Vec<u8>]) -> Result<(), Error> {
round_trip_test_compressed::<BUFFERED>(blobs, false).await
}
pub(crate) async fn write_maybe_compressed(
pub(crate) async fn write_maybe_compressed<const BUFFERED: bool>(
blobs: &[Vec<u8>],
compression: bool,
ctx: &RequestContext,
) -> anyhow::Result<(Utf8TempDir, Utf8PathBuf, Vec<u64>)> {
) -> Result<(Utf8TempDir, Utf8PathBuf, Vec<u64>), Error> {
let temp_dir = camino_tempfile::tempdir()?;
let pathbuf = temp_dir.path().join("file");
let gate = utils::sync::gate::Gate::default();
@@ -374,9 +530,8 @@ pub(crate) mod tests {
// Write part (in block to drop the file)
let mut offsets = Vec::new();
{
let file = Arc::new(VirtualFile::create_v2(pathbuf.as_path(), ctx).await?);
let mut wtr =
BlobWriter::new(file, 0, &gate, cancel.clone(), ctx, info_span!("test")).unwrap();
let file = VirtualFile::create(pathbuf.as_path(), ctx).await?;
let mut wtr = BlobWriter::<BUFFERED>::new(file, 0, &gate, cancel.clone(), ctx);
for blob in blobs.iter() {
let (_, res) = if compression {
let res = wtr
@@ -393,37 +548,26 @@ pub(crate) mod tests {
let offs = res?;
offsets.push(offs);
}
wtr.into_inner(|mut buf| {
use crate::virtual_file::owned_buffers_io::write::Buffer;
let len = buf.pending();
let cap = buf.cap();
// pad zeros to the next io alignment requirement.
// TODO: this is actually padding to next PAGE_SZ multiple, but only if the buffer capacity is larger than that.
// We can't let the fact that we do direct IO, or the buffer capacity, dictate the on-disk format we write here.
// Need to find a better API that allows writing the format we intend to.
let count = len.next_multiple_of(PAGE_SZ).min(cap) - len;
buf.extend_with(0, count);
Some(buf)
})
.await?; // TODO: this here is the problem with the tests: we're dropping the tail end
// Write out one page worth of zeros so that we can
// read again with read_blk
let (_, res) = wtr.write_blob(vec![0; PAGE_SZ].slice_len(), ctx).await;
let offs = res?;
println!("Writing final blob at offs={offs}");
wtr.flush_buffer(ctx).await?;
}
Ok((temp_dir, pathbuf, offsets))
}
async fn round_trip_test_compressed(
async fn round_trip_test_compressed<const BUFFERED: bool>(
blobs: &[Vec<u8>],
compression: bool,
) -> anyhow::Result<()> {
) -> Result<(), Error> {
let ctx =
RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error).with_scope_unit_test();
let (_temp_dir, pathbuf, offsets) =
write_maybe_compressed(blobs, compression, &ctx).await?;
write_maybe_compressed::<BUFFERED>(blobs, compression, &ctx).await?;
println!("Done writing!");
let file = VirtualFile::open_v2(pathbuf, &ctx).await?;
let file = VirtualFile::open(pathbuf, &ctx).await?;
let rdr = BlockReaderRef::VirtualFile(&file);
let rdr = BlockCursor::new_with_compression(rdr, compression);
for (idx, (blob, offset)) in blobs.iter().zip(offsets.iter()).enumerate() {
@@ -442,27 +586,30 @@ pub(crate) mod tests {
}
#[tokio::test]
async fn test_one() -> anyhow::Result<()> {
async fn test_one() -> Result<(), Error> {
let blobs = &[vec![12, 21, 22]];
round_trip_test(blobs).await?;
round_trip_test::<false>(blobs).await?;
round_trip_test::<true>(blobs).await?;
Ok(())
}
#[tokio::test]
async fn test_hello_simple() -> anyhow::Result<()> {
async fn test_hello_simple() -> Result<(), Error> {
let blobs = &[
vec![0, 1, 2, 3],
b"Hello, World!".to_vec(),
Vec::new(),
b"foobar".to_vec(),
];
round_trip_test(blobs).await?;
round_trip_test_compressed(blobs, true).await?;
round_trip_test::<false>(blobs).await?;
round_trip_test::<true>(blobs).await?;
round_trip_test_compressed::<false>(blobs, true).await?;
round_trip_test_compressed::<true>(blobs, true).await?;
Ok(())
}
#[tokio::test]
async fn test_really_big_array() -> anyhow::Result<()> {
async fn test_really_big_array() -> Result<(), Error> {
let blobs = &[
b"test".to_vec(),
random_array(10 * PAGE_SZ),
@@ -471,22 +618,25 @@ pub(crate) mod tests {
vec![0xf3; 24 * PAGE_SZ],
b"foobar".to_vec(),
];
round_trip_test(blobs).await?;
round_trip_test_compressed(blobs, true).await?;
round_trip_test::<false>(blobs).await?;
round_trip_test::<true>(blobs).await?;
round_trip_test_compressed::<false>(blobs, true).await?;
round_trip_test_compressed::<true>(blobs, true).await?;
Ok(())
}
#[tokio::test]
async fn test_arrays_inc() -> anyhow::Result<()> {
async fn test_arrays_inc() -> Result<(), Error> {
let blobs = (0..PAGE_SZ / 8)
.map(|v| random_array(v * 16))
.collect::<Vec<_>>();
round_trip_test(&blobs).await?;
round_trip_test::<false>(&blobs).await?;
round_trip_test::<true>(&blobs).await?;
Ok(())
}
#[tokio::test]
async fn test_arrays_random_size() -> anyhow::Result<()> {
async fn test_arrays_random_size() -> Result<(), Error> {
let mut rng = rand::rngs::StdRng::seed_from_u64(42);
let blobs = (0..1024)
.map(|_| {
@@ -498,18 +648,20 @@ pub(crate) mod tests {
random_array(sz.into())
})
.collect::<Vec<_>>();
round_trip_test(&blobs).await?;
round_trip_test::<false>(&blobs).await?;
round_trip_test::<true>(&blobs).await?;
Ok(())
}
#[tokio::test]
async fn test_arrays_page_boundary() -> anyhow::Result<()> {
async fn test_arrays_page_boundary() -> Result<(), Error> {
let blobs = &[
random_array(PAGE_SZ - 4),
random_array(PAGE_SZ - 4),
random_array(PAGE_SZ - 4),
];
round_trip_test(blobs).await?;
round_trip_test::<false>(blobs).await?;
round_trip_test::<true>(blobs).await?;
Ok(())
}
}

View File

@@ -4,12 +4,14 @@
use std::ops::Deref;
use bytes::Bytes;
use super::storage_layer::delta_layer::{Adapter, DeltaLayerInner};
use crate::context::RequestContext;
use crate::page_cache::{self, FileId, PAGE_SZ, PageReadGuard, PageWriteGuard, ReadBufResult};
#[cfg(test)]
use crate::virtual_file::IoBufferMut;
use crate::virtual_file::{IoBuffer, VirtualFile};
use crate::virtual_file::VirtualFile;
/// This is implemented by anything that can read 8 kB (PAGE_SZ)
/// blocks, using the page cache
@@ -245,17 +247,17 @@ pub trait BlockWriter {
/// 'buf' must be of size PAGE_SZ. Returns the block number the page was
/// written to.
///
fn write_blk(&mut self, buf: IoBuffer) -> Result<u32, std::io::Error>;
fn write_blk(&mut self, buf: Bytes) -> Result<u32, std::io::Error>;
}
///
/// A simple in-memory buffer of blocks.
///
pub struct BlockBuf {
pub blocks: Vec<IoBuffer>,
pub blocks: Vec<Bytes>,
}
impl BlockWriter for BlockBuf {
fn write_blk(&mut self, buf: IoBuffer) -> Result<u32, std::io::Error> {
fn write_blk(&mut self, buf: Bytes) -> Result<u32, std::io::Error> {
assert!(buf.len() == PAGE_SZ);
let blknum = self.blocks.len();
self.blocks.push(buf);

View File

@@ -25,7 +25,7 @@ use std::{io, result};
use async_stream::try_stream;
use byteorder::{BE, ReadBytesExt};
use bytes::BufMut;
use bytes::{BufMut, Bytes, BytesMut};
use either::Either;
use futures::{Stream, StreamExt};
use hex;
@@ -34,7 +34,6 @@ use tracing::error;
use crate::context::RequestContext;
use crate::tenant::block_io::{BlockReader, BlockWriter};
use crate::virtual_file::{IoBuffer, IoBufferMut, owned_buffers_io::write::Buffer};
// The maximum size of a value stored in the B-tree. 5 bytes is enough currently.
pub const VALUE_SZ: usize = 5;
@@ -788,12 +787,12 @@ impl<const L: usize> BuildNode<L> {
///
/// Serialize the node to on-disk format.
///
fn pack(&self) -> IoBuffer {
fn pack(&self) -> Bytes {
assert!(self.keys.len() == self.num_children as usize * self.suffix_len);
assert!(self.values.len() == self.num_children as usize * VALUE_SZ);
assert!(self.num_children > 0);
let mut buf = IoBufferMut::with_capacity(PAGE_SZ);
let mut buf = BytesMut::new();
buf.put_u16(self.num_children);
buf.put_u8(self.level);
@@ -806,7 +805,7 @@ impl<const L: usize> BuildNode<L> {
assert!(buf.len() == self.size);
assert!(buf.len() <= PAGE_SZ);
buf.extend_with(0, PAGE_SZ - buf.len());
buf.resize(PAGE_SZ, 0);
buf.freeze()
}
@@ -840,7 +839,7 @@ pub(crate) mod tests {
#[derive(Clone, Default)]
pub(crate) struct TestDisk {
blocks: Vec<IoBuffer>,
blocks: Vec<Bytes>,
}
impl TestDisk {
fn new() -> Self {
@@ -858,7 +857,7 @@ pub(crate) mod tests {
}
}
impl BlockWriter for &mut TestDisk {
fn write_blk(&mut self, buf: IoBuffer) -> io::Result<u32> {
fn write_blk(&mut self, buf: Bytes) -> io::Result<u32> {
let blknum = self.blocks.len();
self.blocks.push(buf);
Ok(blknum as u32)

View File

@@ -75,7 +75,6 @@ impl EphemeralFile {
bytes_written: 0,
buffered_writer: owned_buffers_io::write::BufferedWriter::new(
file,
0,
|| IoBufferMut::with_capacity(TAIL_SZ),
gate.enter()?,
cancel.child_token(),

View File

@@ -714,7 +714,7 @@ impl LayerMap {
true
}
pub fn iter_historic_layers(&self) -> impl '_ + Iterator<Item = Arc<PersistentLayerDesc>> {
pub fn iter_historic_layers(&self) -> impl ExactSizeIterator<Item = Arc<PersistentLayerDesc>> {
self.historic.iter()
}

View File

@@ -504,7 +504,7 @@ impl<Value: Clone> BufferedHistoricLayerCoverage<Value> {
}
/// Iterate all the layers
pub fn iter(&self) -> impl '_ + Iterator<Item = Value> {
pub fn iter(&self) -> impl ExactSizeIterator<Item = Value> {
// NOTE we can actually perform this without rebuilding,
// but it's not necessary for now.
if !self.buffer.is_empty() {

View File

@@ -564,8 +564,9 @@ mod tests {
Lsn(0),
Lsn(0),
Lsn(0),
// Any version will do here, so use the default
crate::DEFAULT_PG_VERSION,
// Updating this version to 17 will cause the test to fail at the
// next assert_eq!().
16,
);
let expected_bytes = vec![
/* TimelineMetadataHeader */

View File

@@ -52,7 +52,9 @@ use crate::tenant::config::{
use crate::tenant::span::debug_assert_current_span_has_tenant_id;
use crate::tenant::storage_layer::inmemory_layer;
use crate::tenant::timeline::ShutdownMode;
use crate::tenant::{AttachedTenantConf, GcError, LoadConfigError, SpawnMode, Tenant, TenantState};
use crate::tenant::{
AttachedTenantConf, GcError, LoadConfigError, SpawnMode, TenantShard, TenantState,
};
use crate::virtual_file::MaybeFatalIo;
use crate::{InitializationOrder, TEMP_FILE_SUFFIX};
@@ -67,7 +69,7 @@ use crate::{InitializationOrder, TEMP_FILE_SUFFIX};
/// having a properly acquired generation (Secondary doesn't need a generation)
#[derive(Clone)]
pub(crate) enum TenantSlot {
Attached(Arc<Tenant>),
Attached(Arc<TenantShard>),
Secondary(Arc<SecondaryTenant>),
/// In this state, other administrative operations acting on the TenantId should
/// block, or return a retry indicator equivalent to HTTP 503.
@@ -86,7 +88,7 @@ impl std::fmt::Debug for TenantSlot {
impl TenantSlot {
/// Return the `Tenant` in this slot if attached, else None
fn get_attached(&self) -> Option<&Arc<Tenant>> {
fn get_attached(&self) -> Option<&Arc<TenantShard>> {
match self {
Self::Attached(t) => Some(t),
Self::Secondary(_) => None,
@@ -164,7 +166,7 @@ impl TenantStartupMode {
/// Result type for looking up a TenantId to a specific shard
pub(crate) enum ShardResolveResult {
NotFound,
Found(Arc<Tenant>),
Found(Arc<TenantShard>),
// Wait for this barrrier, then query again
InProgress(utils::completion::Barrier),
}
@@ -173,7 +175,7 @@ impl TenantsMap {
/// Convenience function for typical usage, where we want to get a `Tenant` object, for
/// working with attached tenants. If the TenantId is in the map but in Secondary state,
/// None is returned.
pub(crate) fn get(&self, tenant_shard_id: &TenantShardId) -> Option<&Arc<Tenant>> {
pub(crate) fn get(&self, tenant_shard_id: &TenantShardId) -> Option<&Arc<TenantShard>> {
match self {
TenantsMap::Initializing => None,
TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => {
@@ -410,7 +412,7 @@ fn load_tenant_config(
return None;
}
Some(Tenant::load_tenant_config(conf, &tenant_shard_id))
Some(TenantShard::load_tenant_config(conf, &tenant_shard_id))
}
/// Initial stage of load: walk the local tenants directory, clean up any temp files,
@@ -606,7 +608,8 @@ pub async fn init_tenant_mgr(
// Presence of a generation number implies attachment: attach the tenant
// if it wasn't already, and apply the generation number.
config_write_futs.push(async move {
let r = Tenant::persist_tenant_config(conf, &tenant_shard_id, &location_conf).await;
let r =
TenantShard::persist_tenant_config(conf, &tenant_shard_id, &location_conf).await;
(tenant_shard_id, location_conf, r)
});
}
@@ -694,7 +697,7 @@ fn tenant_spawn(
init_order: Option<InitializationOrder>,
mode: SpawnMode,
ctx: &RequestContext,
) -> Result<Arc<Tenant>, GlobalShutDown> {
) -> Result<Arc<TenantShard>, GlobalShutDown> {
// All these conditions should have been satisfied by our caller: the tenant dir exists, is a well formed
// path, and contains a configuration file. Assertions that do synchronous I/O are limited to debug mode
// to avoid impacting prod runtime performance.
@@ -706,7 +709,7 @@ fn tenant_spawn(
.unwrap()
);
Tenant::spawn(
TenantShard::spawn(
conf,
tenant_shard_id,
resources,
@@ -883,12 +886,12 @@ impl TenantManager {
/// Gets the attached tenant from the in-memory data, erroring if it's absent, in secondary mode, or currently
/// undergoing a state change (i.e. slot is InProgress).
///
/// The return Tenant is not guaranteed to be active: check its status after obtaing it, or
/// use [`Tenant::wait_to_become_active`] before using it if you will do I/O on it.
/// The return TenantShard is not guaranteed to be active: check its status after obtaing it, or
/// use [`TenantShard::wait_to_become_active`] before using it if you will do I/O on it.
pub(crate) fn get_attached_tenant_shard(
&self,
tenant_shard_id: TenantShardId,
) -> Result<Arc<Tenant>, GetTenantError> {
) -> Result<Arc<TenantShard>, GetTenantError> {
let locked = self.tenants.read().unwrap();
let peek_slot = tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Read)?;
@@ -937,12 +940,12 @@ impl TenantManager {
flush: Option<Duration>,
mut spawn_mode: SpawnMode,
ctx: &RequestContext,
) -> Result<Option<Arc<Tenant>>, UpsertLocationError> {
) -> Result<Option<Arc<TenantShard>>, UpsertLocationError> {
debug_assert_current_span_has_tenant_id();
info!("configuring tenant location to state {new_location_config:?}");
enum FastPathModified {
Attached(Arc<Tenant>),
Attached(Arc<TenantShard>),
Secondary(Arc<SecondaryTenant>),
}
@@ -999,9 +1002,13 @@ impl TenantManager {
// phase of writing config and/or waiting for flush, before returning.
match fast_path_taken {
Some(FastPathModified::Attached(tenant)) => {
Tenant::persist_tenant_config(self.conf, &tenant_shard_id, &new_location_config)
.await
.fatal_err("write tenant shard config");
TenantShard::persist_tenant_config(
self.conf,
&tenant_shard_id,
&new_location_config,
)
.await
.fatal_err("write tenant shard config");
// Transition to AttachedStale means we may well hold a valid generation
// still, and have been requested to go stale as part of a migration. If
@@ -1030,9 +1037,13 @@ impl TenantManager {
return Ok(Some(tenant));
}
Some(FastPathModified::Secondary(_secondary_tenant)) => {
Tenant::persist_tenant_config(self.conf, &tenant_shard_id, &new_location_config)
.await
.fatal_err("write tenant shard config");
TenantShard::persist_tenant_config(
self.conf,
&tenant_shard_id,
&new_location_config,
)
.await
.fatal_err("write tenant shard config");
return Ok(None);
}
@@ -1122,7 +1133,7 @@ impl TenantManager {
// Before activating either secondary or attached mode, persist the
// configuration, so that on restart we will re-attach (or re-start
// secondary) on the tenant.
Tenant::persist_tenant_config(self.conf, &tenant_shard_id, &new_location_config)
TenantShard::persist_tenant_config(self.conf, &tenant_shard_id, &new_location_config)
.await
.fatal_err("write tenant shard config");
@@ -1262,7 +1273,7 @@ impl TenantManager {
let tenant_path = self.conf.tenant_path(&tenant_shard_id);
let timelines_path = self.conf.timelines_path(&tenant_shard_id);
let config = Tenant::load_tenant_config(self.conf, &tenant_shard_id)?;
let config = TenantShard::load_tenant_config(self.conf, &tenant_shard_id)?;
if drop_cache {
tracing::info!("Dropping local file cache");
@@ -1297,7 +1308,7 @@ impl TenantManager {
Ok(())
}
pub(crate) fn get_attached_active_tenant_shards(&self) -> Vec<Arc<Tenant>> {
pub(crate) fn get_attached_active_tenant_shards(&self) -> Vec<Arc<TenantShard>> {
let locked = self.tenants.read().unwrap();
match &*locked {
TenantsMap::Initializing => Vec::new(),
@@ -1446,7 +1457,7 @@ impl TenantManager {
#[instrument(skip_all, fields(tenant_id=%tenant.get_tenant_shard_id().tenant_id, shard_id=%tenant.get_tenant_shard_id().shard_slug(), new_shard_count=%new_shard_count.literal()))]
pub(crate) async fn shard_split(
&self,
tenant: Arc<Tenant>,
tenant: Arc<TenantShard>,
new_shard_count: ShardCount,
new_stripe_size: Option<ShardStripeSize>,
ctx: &RequestContext,
@@ -1476,7 +1487,7 @@ impl TenantManager {
pub(crate) async fn do_shard_split(
&self,
tenant: Arc<Tenant>,
tenant: Arc<TenantShard>,
new_shard_count: ShardCount,
new_stripe_size: Option<ShardStripeSize>,
ctx: &RequestContext,
@@ -1703,7 +1714,7 @@ impl TenantManager {
/// For each resident layer in the parent shard, we will hard link it into all of the child shards.
async fn shard_split_hardlink(
&self,
parent_shard: &Tenant,
parent_shard: &TenantShard,
child_shards: Vec<TenantShardId>,
) -> anyhow::Result<()> {
debug_assert_current_span_has_tenant_id();
@@ -1988,7 +1999,7 @@ impl TenantManager {
}
let tenant_path = self.conf.tenant_path(&tenant_shard_id);
let config = Tenant::load_tenant_config(self.conf, &tenant_shard_id)
let config = TenantShard::load_tenant_config(self.conf, &tenant_shard_id)
.map_err(|e| Error::DetachReparent(e.into()))?;
let shard_identity = config.shard;

View File

@@ -133,7 +133,7 @@
//! - Initiate upload queue with that [`IndexPart`].
//! - Reschedule all lost operations by comparing the local filesystem state
//! and remote state as per [`IndexPart`]. This is done in
//! [`Tenant::timeline_init_and_sync`].
//! [`TenantShard::timeline_init_and_sync`].
//!
//! Note that if we crash during file deletion between the index update
//! that removes the file from the list of files, and deleting the remote file,
@@ -171,7 +171,7 @@
//! If no remote storage configuration is provided, the [`RemoteTimelineClient`] is
//! not created and the uploads are skipped.
//!
//! [`Tenant::timeline_init_and_sync`]: super::Tenant::timeline_init_and_sync
//! [`TenantShard::timeline_init_and_sync`]: super::TenantShard::timeline_init_and_sync
//! [`Timeline::load_layer_map`]: super::Timeline::load_layer_map
pub(crate) mod download;
@@ -2743,7 +2743,7 @@ mod tests {
use crate::tenant::config::AttachmentMode;
use crate::tenant::harness::{TIMELINE_ID, TenantHarness};
use crate::tenant::storage_layer::layer::local_layer_path;
use crate::tenant::{Tenant, Timeline};
use crate::tenant::{TenantShard, Timeline};
pub(super) fn dummy_contents(name: &str) -> Vec<u8> {
format!("contents for {name}").into()
@@ -2796,7 +2796,7 @@ mod tests {
struct TestSetup {
harness: TenantHarness,
tenant: Arc<Tenant>,
tenant: Arc<TenantShard>,
timeline: Arc<Timeline>,
tenant_ctx: RequestContext,
}

View File

@@ -32,14 +32,12 @@ use super::{
remote_tenant_manifest_prefix, remote_tenant_path,
};
use crate::TEMP_FILE_SUFFIX;
use crate::assert_u64_eq_usize::UsizeIsU64;
use crate::config::PageServerConf;
use crate::context::RequestContext;
use crate::span::{
debug_assert_current_span_has_tenant_and_timeline_id, debug_assert_current_span_has_tenant_id,
};
use crate::tenant::Generation;
use crate::tenant::disk_btree::PAGE_SZ;
use crate::tenant::remote_timeline_client::{remote_layer_path, remote_timelines_path};
use crate::tenant::storage_layer::LayerName;
use crate::virtual_file::{MaybeFatalIo, VirtualFile, on_fatal_io_error};
@@ -229,7 +227,6 @@ async fn download_object(
let mut buffered = owned_buffers_io::write::BufferedWriter::<IoBufferMut, _>::new(
destination_file,
0,
|| IoBufferMut::with_capacity(super::BUFFER_SIZE),
gate.enter().map_err(|_| DownloadError::Cancelled)?,
cancel.child_token(),
@@ -254,41 +251,13 @@ async fn download_object(
FlushTaskError::Cancelled => DownloadError::Cancelled,
})?;
}
let mut pad_amount = None;
let (bytes_amount, destination_file) = buffered
.shutdown(|mut buf| {
use crate::virtual_file::owned_buffers_io::write::Buffer;
let len = buf.pending();
let cap = buf.cap();
// pad zeros to the next io alignment requirement.
// TODO: this is actually padding to next PAGE_SZ multiple, but only if the buffer capacity is larger than that.
// We can't let the fact that we do direct IO, or the buffer capacity, dictate the on-disk format we write here.
// Need to find a better API that allows writing the format we intend to.
let count = len.next_multiple_of(PAGE_SZ).min(cap) - len;
pad_amount = Some(count);
buf.extend_with(0, count);
Some(buf)
})
let inner = buffered
.flush_and_into_inner(ctx)
.await
.map_err(|e| match e {
FlushTaskError::Cancelled => DownloadError::Cancelled,
})?;
let pad_amount = pad_amount.expect("shutdown always invokes the closure").into_u64();
let set_len_arg = bytes_amount - pad_amount;
destination_file
.set_len(set_len_arg)
.await
.maybe_fatal_err("download_object set_len")
.with_context(|| {
format!("set len for file at {dst_path}: 0x{set_len_arg:x} = 0x{bytes_amount:x} - 0x{pad_amount:x}")
})
.map_err(DownloadError::Other)?;
Ok((set_len_arg, destination_file))
Ok(inner)
}
.await?;
@@ -483,7 +452,7 @@ async fn do_download_index_part(
/// generation (normal case when migrating/restarting). Only if both of these return 404 do we fall back
/// to listing objects.
///
/// * `my_generation`: the value of `[crate::tenant::Tenant::generation]`
/// * `my_generation`: the value of `[crate::tenant::TenantShard::generation]`
/// * `what`: for logging, what object are we downloading
/// * `prefix`: when listing objects, use this prefix (i.e. the part of the object path before the generation)
/// * `do_download`: a GET of the object in a particular generation, which should **retry indefinitely** unless

View File

@@ -1521,11 +1521,12 @@ async fn load_heatmap(
path: &Utf8PathBuf,
ctx: &RequestContext,
) -> Result<Option<HeatMapTenant>, anyhow::Error> {
let st = match VirtualFile::read_to_string(path, ctx).await {
let mut file = match VirtualFile::open(path, ctx).await {
Ok(file) => file,
Err(e) if e.kind() == std::io::ErrorKind::NotFound => return Ok(None),
Err(e) => Err(e)?,
};
let st = file.read_to_string(ctx).await?;
let htm = serde_json::from_str(&st)?;
Ok(Some(htm))
}

View File

@@ -21,7 +21,7 @@ use super::scheduler::{
use super::{CommandRequest, SecondaryTenantError, UploadCommand};
use crate::TEMP_FILE_SUFFIX;
use crate::metrics::SECONDARY_MODE;
use crate::tenant::Tenant;
use crate::tenant::TenantShard;
use crate::tenant::config::AttachmentMode;
use crate::tenant::mgr::{GetTenantError, TenantManager};
use crate::tenant::remote_timeline_client::remote_heatmap_path;
@@ -74,7 +74,7 @@ impl RunningJob for WriteInProgress {
}
struct UploadPending {
tenant: Arc<Tenant>,
tenant: Arc<TenantShard>,
last_upload: Option<LastUploadState>,
target_time: Option<Instant>,
period: Option<Duration>,
@@ -106,7 +106,7 @@ impl scheduler::Completion for WriteComplete {
struct UploaderTenantState {
// This Weak only exists to enable culling idle instances of this type
// when the Tenant has been deallocated.
tenant: Weak<Tenant>,
tenant: Weak<TenantShard>,
/// Digest of the serialized heatmap that we last successfully uploaded
last_upload_state: Option<LastUploadState>,
@@ -357,7 +357,7 @@ struct LastUploadState {
/// of the object we would have uploaded.
async fn upload_tenant_heatmap(
remote_storage: GenericRemoteStorage,
tenant: &Arc<Tenant>,
tenant: &Arc<TenantShard>,
last_upload: Option<LastUploadState>,
) -> Result<UploadHeatmapOutcome, UploadHeatmapError> {
debug_assert_current_span_has_tenant_id();

View File

@@ -360,7 +360,7 @@ where
/// Periodic execution phase: inspect all attached tenants and schedule any work they require.
///
/// The type in `tenants` should be a tenant-like structure, e.g. [`crate::tenant::Tenant`] or [`crate::tenant::secondary::SecondaryTenant`]
/// The type in `tenants` should be a tenant-like structure, e.g. [`crate::tenant::TenantShard`] or [`crate::tenant::secondary::SecondaryTenant`]
///
/// This function resets the pending list: it is assumed that the caller may change their mind about
/// which tenants need work between calls to schedule_iteration.

View File

@@ -12,7 +12,7 @@ use tracing::*;
use utils::id::TimelineId;
use utils::lsn::Lsn;
use super::{GcError, LogicalSizeCalculationCause, Tenant};
use super::{GcError, LogicalSizeCalculationCause, TenantShard};
use crate::context::RequestContext;
use crate::pgdatadir_mapping::CalculateLogicalSizeError;
use crate::tenant::{MaybeOffloaded, Timeline};
@@ -156,7 +156,7 @@ pub struct TimelineInputs {
/// initdb_lsn branchpoints* next_pitr_cutoff latest
/// ```
pub(super) async fn gather_inputs(
tenant: &Tenant,
tenant: &TenantShard,
limit: &Arc<Semaphore>,
max_retention_period: Option<u64>,
logical_size_cache: &mut HashMap<(TimelineId, Lsn), u64>,

View File

@@ -715,13 +715,34 @@ pub(crate) enum LayerId {
}
/// Uniquely identify a layer visit by the layer
/// and LSN floor (or start LSN) of the reads.
/// The layer itself is not enough since we may
/// have different LSN lower bounds for delta layer reads.
/// and LSN range of the reads. Note that the end of the range is exclusive.
///
/// The layer itself is not enough since we may have different LSN lower
/// bounds for delta layer reads. Scenarios where this can happen are:
///
/// 1. Layer overlaps: imagine an image layer inside and in-memory layer
/// and a query that only partially hits the image layer. Part of the query
/// needs to read the whole in-memory layer and the other part needs to read
/// only up to the image layer. Hence, they'll have different LSN floor values
/// for the read.
///
/// 2. Scattered reads: the read path supports starting at different LSNs. Imagine
/// The start LSN for one range is inside a layer and the start LSN for another range
/// Is above the layer (includes all of it). Both ranges need to read the layer all the
/// Way to the end but starting at different points. Hence, they'll have different LSN
/// Ceil values.
///
/// The implication is that we might visit the same layer multiple times
/// in order to read different LSN ranges from it. In practice, this isn't very concerning
/// because:
/// 1. Layer overlaps are rare and generally not intended
/// 2. Scattered reads will stabilise after the first few layers provided their starting LSNs
/// are grouped tightly enough (likely the case).
#[derive(Debug, PartialEq, Eq, Clone, Hash)]
struct LayerToVisitId {
layer_id: LayerId,
lsn_floor: Lsn,
lsn_ceil: Lsn,
}
#[derive(Debug, PartialEq, Eq, Hash)]
@@ -805,6 +826,7 @@ impl LayerFringe {
let layer_to_visit_id = LayerToVisitId {
layer_id: layer.id(),
lsn_floor: lsn_range.start,
lsn_ceil: lsn_range.end,
};
let entry = self.visit_reads.entry(layer_to_visit_id.clone());

View File

@@ -29,11 +29,11 @@
//!
use std::collections::{HashMap, VecDeque};
use std::fs::File;
use std::io::SeekFrom;
use std::ops::Range;
use std::os::unix::fs::FileExt;
use std::str::FromStr;
use std::sync::Arc;
use std::sync::atomic::AtomicU64;
use anyhow::{Context, Result, bail, ensure};
use camino::{Utf8Path, Utf8PathBuf};
@@ -45,13 +45,14 @@ use pageserver_api::keyspace::KeySpace;
use pageserver_api::models::ImageCompressionAlgorithm;
use pageserver_api::shard::TenantShardId;
use pageserver_api::value::Value;
use rand::Rng;
use rand::distributions::Alphanumeric;
use serde::{Deserialize, Serialize};
use tokio::sync::OnceCell;
use tokio_epoll_uring::IoBuf;
use tokio_util::sync::CancellationToken;
use tracing::*;
use utils::bin_ser::BeSer;
use utils::bin_ser::SerializeError;
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
@@ -74,8 +75,7 @@ use crate::tenant::vectored_blob_io::{
VectoredReadPlanner,
};
use crate::virtual_file::owned_buffers_io::io_buf_ext::{FullSlice, IoBufExt};
use crate::virtual_file::owned_buffers_io::write::Buffer;
use crate::virtual_file::{self, IoBuffer, IoBufferMut, MaybeFatalIo, VirtualFile};
use crate::virtual_file::{self, IoBufferMut, MaybeFatalIo, VirtualFile};
use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION, TEMP_FILE_SUFFIX};
///
@@ -113,15 +113,6 @@ impl From<&DeltaLayer> for Summary {
}
impl Summary {
/// Serializes the summary header into an aligned buffer of lenth `PAGE_SZ`.
pub fn ser_into_page(&self) -> Result<IoBuffer, SerializeError> {
let mut buf = IoBufferMut::with_capacity(PAGE_SZ);
Self::ser_into(self, &mut buf)?;
// Pad zeroes to the buffer so the length is a multiple of the alignment.
buf.extend_with(0, buf.capacity() - buf.len());
Ok(buf.freeze())
}
pub(super) fn expected(
tenant_id: TenantId,
timeline_id: TimelineId,
@@ -297,19 +288,19 @@ impl DeltaLayer {
key_start: Key,
lsn_range: &Range<Lsn>,
) -> Utf8PathBuf {
// Never reuse a filename in the lifetime of a pageserver process so that we need
// not worry about laggard Drop impl's async unlink hitting an already reused filename.
static NEXT_TEMP_DISAMBIGUATOR: AtomicU64 = AtomicU64::new(1);
let filename_disambiguator =
NEXT_TEMP_DISAMBIGUATOR.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
let rand_string: String = rand::thread_rng()
.sample_iter(&Alphanumeric)
.take(8)
.map(char::from)
.collect();
conf.timeline_path(tenant_shard_id, timeline_id)
.join(format!(
"{}-XXX__{:016X}-{:016X}.{:x}.{}",
"{}-XXX__{:016X}-{:016X}.{}.{}",
key_start,
u64::from(lsn_range.start),
u64::from(lsn_range.end),
filename_disambiguator,
rand_string,
TEMP_FILE_SUFFIX,
))
}
@@ -400,12 +391,10 @@ struct DeltaLayerWriterInner {
tree: DiskBtreeBuilder<BlockBuf, DELTA_KEY_SIZE>,
blob_writer: BlobWriter,
blob_writer: BlobWriter<true>,
// Number of key-lsns in the layer.
num_keys: usize,
_gate_guard: utils::sync::gate::GateGuard,
}
impl DeltaLayerWriterInner {
@@ -432,17 +421,10 @@ impl DeltaLayerWriterInner {
let path =
DeltaLayer::temp_path_for(conf, &tenant_shard_id, &timeline_id, key_start, &lsn_range);
let file = Arc::new(VirtualFile::create_v2(&path, ctx).await?);
// Start at PAGE_SZ, make room for the header block
let blob_writer = BlobWriter::new(
file,
PAGE_SZ as u64,
gate,
cancel,
ctx,
info_span!(parent: None, "delta_layer_writer_flush_task", tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), timeline_id=%timeline_id, path = %path),
)?;
let mut file = VirtualFile::create(&path, ctx).await?;
// make room for the header block
file.seek(SeekFrom::Start(PAGE_SZ as u64)).await?;
let blob_writer = BlobWriter::new(file, PAGE_SZ as u64, gate, cancel, ctx);
// Initialize the b-tree index builder
let block_buf = BlockBuf::new();
@@ -457,7 +439,6 @@ impl DeltaLayerWriterInner {
tree: tree_builder,
blob_writer,
num_keys: 0,
_gate_guard: gate.enter()?,
})
}
@@ -553,33 +534,15 @@ impl DeltaLayerWriterInner {
) -> anyhow::Result<(PersistentLayerDesc, Utf8PathBuf)> {
let index_start_blk = self.blob_writer.size().div_ceil(PAGE_SZ as u64) as u32;
let file = self
.blob_writer
.into_inner(|mut buf| {
let len = buf.pending();
let cap = buf.cap();
// pad zeros to the next io alignment requirement.
// TODO: this is actually padding to next PAGE_SZ multiple, but only if the buffer capacity is larger than that.
// We can't let the fact that we do direct IO, or the buffer capacity, dictate the on-disk format we write here.
// Need to find a better API that allows writing the format we intend to.
let count = len.next_multiple_of(PAGE_SZ).min(cap) - len;
buf.extend_with(0, count);
Some(buf)
})
.await?;
let mut file = self.blob_writer.into_inner(ctx).await?;
// Write out the index
let (index_root_blk, block_buf) = self.tree.finish()?;
let mut offset = index_start_blk as u64 * PAGE_SZ as u64;
// TODO(yuchen): https://github.com/neondatabase/neon/issues/10092
// Should we just replace BlockBuf::blocks with one big buffer
file.seek(SeekFrom::Start(index_start_blk as u64 * PAGE_SZ as u64))
.await?;
for buf in block_buf.blocks {
let (_buf, res) = file.write_all_at(buf.slice_len(), offset, ctx).await;
let (_buf, res) = file.write_all(buf.slice_len(), ctx).await;
res?;
offset += PAGE_SZ as u64;
}
assert!(self.lsn_range.start < self.lsn_range.end);
// Fill in the summary on blk 0
@@ -594,9 +557,11 @@ impl DeltaLayerWriterInner {
index_root_blk,
};
// Writes summary at the first block (offset 0).
let buf = summary.ser_into_page()?;
let (_buf, res) = file.write_all_at(buf.slice_len(), 0, ctx).await;
let mut buf = Vec::with_capacity(PAGE_SZ);
// TODO: could use smallvec here but it's a pain with Slice<T>
Summary::ser_into(&summary, &mut buf)?;
file.seek(SeekFrom::Start(0)).await?;
let (_buf, res) = file.write_all(buf.slice_len(), ctx).await;
res?;
let metadata = file
@@ -763,33 +728,12 @@ impl DeltaLayerWriter {
impl Drop for DeltaLayerWriter {
fn drop(&mut self) {
let Some(inner) = self.inner.take() else {
return;
};
tokio::spawn(async move {
let DeltaLayerWriterInner {
blob_writer,
_gate_guard,
..
} = inner;
let vfile = match blob_writer.into_inner(|_| None).await {
Ok(vfile) => vfile,
Err(e) => {
error!(err=%e, "failed to remove delta layer writer file");
drop(_gate_guard);
return;
}
};
if let Err(e) = std::fs::remove_file(vfile.path())
.maybe_fatal_err("failed to remove the virtual file")
{
error!(err=%e, path=%vfile.path(), "failed to remove delta layer writer file");
}
drop(_gate_guard);
});
if let Some(inner) = self.inner.take() {
// We want to remove the virtual file here, so it's fine to not
// having completely flushed unwritten data.
let vfile = inner.blob_writer.into_inner_no_flush();
vfile.remove();
}
}
}
@@ -816,7 +760,7 @@ impl DeltaLayer {
where
F: Fn(Summary) -> Summary,
{
let file = VirtualFile::open_with_options_v2(
let mut file = VirtualFile::open_with_options(
path,
virtual_file::OpenOptions::new().read(true).write(true),
ctx,
@@ -833,8 +777,11 @@ impl DeltaLayer {
let new_summary = rewrite(actual_summary);
let buf = new_summary.ser_into_page().context("serialize")?;
let (_buf, res) = file.write_all_at(buf.slice_len(), 0, ctx).await;
let mut buf = Vec::with_capacity(PAGE_SZ);
// TODO: could use smallvec here, but it's a pain with Slice<T>
Summary::ser_into(&new_summary, &mut buf).context("serialize")?;
file.seek(SeekFrom::Start(0)).await?;
let (_buf, res) = file.write_all(buf.slice_len(), ctx).await;
res?;
Ok(())
}
@@ -1662,8 +1609,8 @@ pub(crate) mod test {
use bytes::Bytes;
use itertools::MinMaxResult;
use pageserver_api::value::Value;
use rand::RngCore;
use rand::prelude::{SeedableRng, SliceRandom, StdRng};
use rand::{Rng, RngCore};
use super::*;
use crate::DEFAULT_PG_VERSION;
@@ -1673,7 +1620,7 @@ pub(crate) mod test {
use crate::tenant::harness::{TIMELINE_ID, TenantHarness};
use crate::tenant::storage_layer::{Layer, ResidentLayer};
use crate::tenant::vectored_blob_io::StreamingVectoredReadPlanner;
use crate::tenant::{Tenant, Timeline};
use crate::tenant::{TenantShard, Timeline};
/// Construct an index for a fictional delta layer and and then
/// traverse in order to plan vectored reads for a query. Finally,
@@ -2262,7 +2209,7 @@ pub(crate) mod test {
}
pub(crate) async fn produce_delta_layer(
tenant: &Tenant,
tenant: &TenantShard,
tline: &Arc<Timeline>,
mut deltas: Vec<(Key, Lsn, Value)>,
ctx: &RequestContext,

View File

@@ -27,11 +27,11 @@
//! actual page images are stored in the "values" part.
use std::collections::{HashMap, VecDeque};
use std::fs::File;
use std::io::SeekFrom;
use std::ops::Range;
use std::os::unix::prelude::FileExt;
use std::str::FromStr;
use std::sync::Arc;
use std::sync::atomic::AtomicU64;
use anyhow::{Context, Result, bail, ensure};
use bytes::Bytes;
@@ -43,13 +43,14 @@ use pageserver_api::key::{DBDIR_KEY, KEY_SIZE, Key};
use pageserver_api::keyspace::KeySpace;
use pageserver_api::shard::{ShardIdentity, TenantShardId};
use pageserver_api::value::Value;
use rand::Rng;
use rand::distributions::Alphanumeric;
use serde::{Deserialize, Serialize};
use tokio::sync::OnceCell;
use tokio_stream::StreamExt;
use tokio_util::sync::CancellationToken;
use tracing::*;
use utils::bin_ser::BeSer;
use utils::bin_ser::SerializeError;
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
@@ -72,8 +73,7 @@ use crate::tenant::vectored_blob_io::{
VectoredReadPlanner,
};
use crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt;
use crate::virtual_file::owned_buffers_io::write::Buffer;
use crate::virtual_file::{self, IoBuffer, IoBufferMut, MaybeFatalIo, VirtualFile};
use crate::virtual_file::{self, IoBufferMut, MaybeFatalIo, VirtualFile};
use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION, TEMP_FILE_SUFFIX};
///
@@ -112,15 +112,6 @@ impl From<&ImageLayer> for Summary {
}
impl Summary {
/// Serializes the summary header into an aligned buffer of lenth `PAGE_SZ`.
pub fn ser_into_page(&self) -> Result<IoBuffer, SerializeError> {
let mut buf = IoBufferMut::with_capacity(PAGE_SZ);
Self::ser_into(self, &mut buf)?;
// Pad zeroes to the buffer so the length is a multiple of the alignment.
buf.extend_with(0, buf.capacity() - buf.len());
Ok(buf.freeze())
}
pub(super) fn expected(
tenant_id: TenantId,
timeline_id: TimelineId,
@@ -261,17 +252,14 @@ impl ImageLayer {
tenant_shard_id: TenantShardId,
fname: &ImageLayerName,
) -> Utf8PathBuf {
// Never reuse a filename in the lifetime of a pageserver process so that we need
// not worry about laggard Drop impl's async unlink hitting an already reused filename.
static NEXT_TEMP_DISAMBIGUATOR: AtomicU64 = AtomicU64::new(1);
let filename_disambiguator =
NEXT_TEMP_DISAMBIGUATOR.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
let rand_string: String = rand::thread_rng()
.sample_iter(&Alphanumeric)
.take(8)
.map(char::from)
.collect();
conf.timeline_path(&tenant_shard_id, &timeline_id)
.join(format!(
"{fname}.{:x}.{TEMP_FILE_SUFFIX}",
filename_disambiguator
))
.join(format!("{fname}.{rand_string}.{TEMP_FILE_SUFFIX}"))
}
///
@@ -361,7 +349,7 @@ impl ImageLayer {
where
F: Fn(Summary) -> Summary,
{
let file = VirtualFile::open_with_options_v2(
let mut file = VirtualFile::open_with_options(
path,
virtual_file::OpenOptions::new().read(true).write(true),
ctx,
@@ -378,8 +366,11 @@ impl ImageLayer {
let new_summary = rewrite(actual_summary);
let buf = new_summary.ser_into_page().context("serialize")?;
let (_buf, res) = file.write_all_at(buf.slice_len(), 0, ctx).await;
let mut buf = Vec::with_capacity(PAGE_SZ);
// TODO: could use smallvec here but it's a pain with Slice<T>
Summary::ser_into(&new_summary, &mut buf).context("serialize")?;
file.seek(SeekFrom::Start(0)).await?;
let (_buf, res) = file.write_all(buf.slice_len(), ctx).await;
res?;
Ok(())
}
@@ -568,11 +559,12 @@ impl ImageLayerInner {
let view = BufView::new_slice(&blobs_buf.buf);
for meta in blobs_buf.blobs.iter() {
let img_buf = meta.read(&view).await?;
// Just read the raw header+data and pass it through to the target layer, without
// decoding and recompressing it.
let raw = meta.raw_with_header(&view);
key_count += 1;
writer
.put_image(meta.meta.key, img_buf.into_bytes(), ctx)
.put_image_raw(meta.meta.key, raw.into_bytes(), ctx)
.await
.context(format!("Storing key {}", meta.meta.key))?;
}
@@ -747,13 +739,11 @@ struct ImageLayerWriterInner {
// Number of keys in the layer.
num_keys: usize,
blob_writer: BlobWriter,
blob_writer: BlobWriter<false>,
tree: DiskBtreeBuilder<BlockBuf, KEY_SIZE>,
#[cfg(feature = "testing")]
last_written_key: Key,
_gate_guard: utils::sync::gate::GateGuard,
}
impl ImageLayerWriterInner {
@@ -783,28 +773,19 @@ impl ImageLayerWriterInner {
},
);
trace!("creating image layer {}", path);
let file = {
Arc::new(
VirtualFile::open_with_options_v2(
&path,
virtual_file::OpenOptions::new()
.write(true)
.create_new(true),
ctx,
)
.await?,
let mut file = {
VirtualFile::open_with_options(
&path,
virtual_file::OpenOptions::new()
.write(true)
.create_new(true),
ctx,
)
.await?
};
// Start at `PAGE_SZ` to make room for the header block.
let blob_writer = BlobWriter::new(
file,
PAGE_SZ as u64,
gate,
cancel,
ctx,
info_span!(parent: None, "image_layer_writer_flush_task", tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), timeline_id=%timeline_id, path = %path),
)?;
// make room for the header block
file.seek(SeekFrom::Start(PAGE_SZ as u64)).await?;
let blob_writer = BlobWriter::new(file, PAGE_SZ as u64, gate, cancel, ctx);
// Initialize the b-tree index builder
let block_buf = BlockBuf::new();
@@ -825,7 +806,6 @@ impl ImageLayerWriterInner {
num_keys: 0,
#[cfg(feature = "testing")]
last_written_key: Key::MIN,
_gate_guard: gate.enter()?,
};
Ok(writer)
@@ -874,6 +854,41 @@ impl ImageLayerWriterInner {
Ok(())
}
///
/// Write the next image to the file, as a raw blob header and data.
///
/// The page versions must be appended in blknum order.
///
async fn put_image_raw(
&mut self,
key: Key,
raw_with_header: Bytes,
ctx: &RequestContext,
) -> anyhow::Result<()> {
ensure!(self.key_range.contains(&key));
// NB: we don't update the (un)compressed metrics, since we can't determine them without
// decompressing the image. This seems okay.
self.num_keys += 1;
let (_, res) = self
.blob_writer
.write_blob_raw(raw_with_header.slice_len(), ctx)
.await;
let offset = res?;
let mut keybuf: [u8; KEY_SIZE] = [0u8; KEY_SIZE];
key.write_to_byte_slice(&mut keybuf);
self.tree.append(&keybuf, offset)?;
#[cfg(feature = "testing")]
{
self.last_written_key = key;
}
Ok(())
}
///
/// Finish writing the image layer.
///
@@ -909,32 +924,23 @@ impl ImageLayerWriterInner {
crate::metrics::COMPRESSION_IMAGE_INPUT_BYTES_CONSIDERED
.inc_by(self.uncompressed_bytes_eligible);
crate::metrics::COMPRESSION_IMAGE_INPUT_BYTES_CHOSEN.inc_by(self.uncompressed_bytes_chosen);
crate::metrics::COMPRESSION_IMAGE_OUTPUT_BYTES.inc_by(compressed_size);
let file = self
.blob_writer
.into_inner(|mut buf| {
let len = buf.pending();
let cap = buf.cap();
// NB: filter() may pass through raw pages from a different layer, without looking at
// whether these are compressed or not. We don't track metrics for these, so avoid
// increasing `COMPRESSION_IMAGE_OUTPUT_BYTES` in this case too.
if self.uncompressed_bytes > 0 {
crate::metrics::COMPRESSION_IMAGE_OUTPUT_BYTES.inc_by(compressed_size);
};
// pad zeros to the next io alignment requirement.
let count = len.next_multiple_of(PAGE_SZ).min(cap) - len;
buf.extend_with(0, count);
Some(buf)
})
.await?;
let mut file = self.blob_writer.into_inner();
// Write out the index
let mut offset = index_start_blk as u64 * PAGE_SZ as u64;
file.seek(SeekFrom::Start(index_start_blk as u64 * PAGE_SZ as u64))
.await?;
let (index_root_blk, block_buf) = self.tree.finish()?;
// TODO(yuchen): https://github.com/neondatabase/neon/issues/10092
// Should we just replace BlockBuf::blocks with one big buffer?
for buf in block_buf.blocks {
let (_buf, res) = file.write_all_at(buf.slice_len(), offset, ctx).await;
let (_buf, res) = file.write_all(buf.slice_len(), ctx).await;
res?;
offset += PAGE_SZ as u64;
}
let final_key_range = if let Some(end_key) = end_key {
@@ -955,9 +961,11 @@ impl ImageLayerWriterInner {
index_root_blk,
};
// Writes summary at the first block (offset 0).
let buf = summary.ser_into_page()?;
let (_buf, res) = file.write_all_at(buf.slice_len(), 0, ctx).await;
let mut buf = Vec::with_capacity(PAGE_SZ);
// TODO: could use smallvec here but it's a pain with Slice<T>
Summary::ser_into(&summary, &mut buf)?;
file.seek(SeekFrom::Start(0)).await?;
let (_buf, res) = file.write_all(buf.slice_len(), ctx).await;
res?;
let metadata = file
@@ -1068,6 +1076,25 @@ impl ImageLayerWriter {
self.inner.as_mut().unwrap().put_image(key, img, ctx).await
}
///
/// Write the next value to the file, as a raw header and data. This allows passing through a
/// raw, potentially compressed image from a different layer file without recompressing it.
///
/// The page versions must be appended in blknum order.
///
pub async fn put_image_raw(
&mut self,
key: Key,
raw_with_header: Bytes,
ctx: &RequestContext,
) -> anyhow::Result<()> {
self.inner
.as_mut()
.unwrap()
.put_image_raw(key, raw_with_header, ctx)
.await
}
/// Estimated size of the image layer.
pub(crate) fn estimated_size(&self) -> u64 {
let inner = self.inner.as_ref().unwrap();
@@ -1100,33 +1127,9 @@ impl ImageLayerWriter {
impl Drop for ImageLayerWriter {
fn drop(&mut self) {
let Some(inner) = self.inner.take() else {
return;
};
tokio::spawn(async move {
let ImageLayerWriterInner {
blob_writer,
_gate_guard,
..
} = inner;
let vfile = match blob_writer.into_inner(|_| None).await {
Ok(vfile) => vfile,
Err(e) => {
error!(err=%e, "failed to remove image layer writer file");
drop(_gate_guard);
return;
}
};
if let Err(e) = std::fs::remove_file(vfile.path())
.maybe_fatal_err("failed to remove the virtual file")
{
error!(err=%e, path=%vfile.path(), "failed to remove image layer writer file");
}
drop(_gate_guard);
});
if let Some(inner) = self.inner.take() {
inner.blob_writer.into_inner().remove();
}
}
}
@@ -1225,7 +1228,7 @@ mod test {
use crate::tenant::harness::{TIMELINE_ID, TenantHarness};
use crate::tenant::storage_layer::{Layer, ResidentLayer};
use crate::tenant::vectored_blob_io::StreamingVectoredReadPlanner;
use crate::tenant::{Tenant, Timeline};
use crate::tenant::{TenantShard, Timeline};
#[tokio::test]
async fn image_layer_rewrite() {
@@ -1407,7 +1410,7 @@ mod test {
}
async fn produce_image_layer(
tenant: &Tenant,
tenant: &TenantShard,
tline: &Arc<Timeline>,
mut images: Vec<(Key, Bytes)>,
lsn: Lsn,

View File

@@ -24,7 +24,7 @@ use crate::task_mgr::{self, BACKGROUND_RUNTIME, TOKIO_WORKER_THREADS, TaskKind};
use crate::tenant::throttle::Stats;
use crate::tenant::timeline::CompactionError;
use crate::tenant::timeline::compaction::CompactionOutcome;
use crate::tenant::{Tenant, TenantState};
use crate::tenant::{TenantShard, TenantState};
/// Semaphore limiting concurrent background tasks (across all tenants).
///
@@ -117,7 +117,7 @@ pub(crate) async fn acquire_concurrency_permit(
}
/// Start per tenant background loops: compaction, GC, and ingest housekeeping.
pub fn start_background_loops(tenant: &Arc<Tenant>, can_start: Option<&Barrier>) {
pub fn start_background_loops(tenant: &Arc<TenantShard>, can_start: Option<&Barrier>) {
let tenant_shard_id = tenant.tenant_shard_id;
task_mgr::spawn(
@@ -198,7 +198,7 @@ pub fn start_background_loops(tenant: &Arc<Tenant>, can_start: Option<&Barrier>)
}
/// Compaction task's main loop.
async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
async fn compaction_loop(tenant: Arc<TenantShard>, cancel: CancellationToken) {
const BASE_BACKOFF_SECS: f64 = 1.0;
const MAX_BACKOFF_SECS: f64 = 300.0;
const RECHECK_CONFIG_INTERVAL: Duration = Duration::from_secs(10);
@@ -348,7 +348,7 @@ pub(crate) fn log_compaction_error(
}
/// GC task's main loop.
async fn gc_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
async fn gc_loop(tenant: Arc<TenantShard>, cancel: CancellationToken) {
const MAX_BACKOFF_SECS: f64 = 300.0;
let mut error_run = 0; // consecutive errors
@@ -432,7 +432,7 @@ async fn gc_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
}
/// Tenant housekeeping's main loop.
async fn tenant_housekeeping_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
async fn tenant_housekeeping_loop(tenant: Arc<TenantShard>, cancel: CancellationToken) {
let mut last_throttle_flag_reset_at = Instant::now();
loop {
if wait_for_active_tenant(&tenant, &cancel).await.is_break() {
@@ -483,7 +483,7 @@ async fn tenant_housekeeping_loop(tenant: Arc<Tenant>, cancel: CancellationToken
/// Waits until the tenant becomes active, or returns `ControlFlow::Break()` to shut down.
async fn wait_for_active_tenant(
tenant: &Arc<Tenant>,
tenant: &Arc<TenantShard>,
cancel: &CancellationToken,
) -> ControlFlow<()> {
if tenant.current_state() == TenantState::Active {

Some files were not shown because too many files have changed in this diff Show More