Compare commits

...

43 Commits

Author SHA1 Message Date
Arseny Sher
22e9702525 migrate script 2023-12-25 23:04:34 +03:00
John Spray
ac38d3a88c remote_storage: don't count 404s as errors (#6201)
## Problem

Currently a chart of S3 error rate is misleading: it can show errors any
time we are attaching a tenant (probing for index_part generation,
checking for remote delete marker).

Considering 404 successful isn't perfectly elegant, but it enables the
error rate to be used as a more meaningful alert signal: it would
indicate if we were having auth issues, sending bad requests, getting
throttled, etc.

## Summary of changes

Track 404 requests in the AttemptOutcome::Ok bucket instead of the
AttemptOutcome::Err bucket.
2023-12-20 17:00:29 +00:00
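A minimal sketch of the bucketing rule described in the commit above, assuming the caller has a raw HTTP status code at hand. `classify_status` is a hypothetical helper for illustration; only the `AttemptOutcome::Ok` / `AttemptOutcome::Err` names come from the commit message.

```rust
/// Outcome buckets for a remote storage request metric
/// (variant names follow the commit message).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AttemptOutcome {
    Ok,
    Err,
}

/// Hypothetical helper: map an HTTP status code to a metric bucket.
pub fn classify_status(status: u16) -> AttemptOutcome {
    match status {
        200..=299 => AttemptOutcome::Ok,
        // "Not found" is an expected answer when probing for objects,
        // so it is counted as a success rather than an error.
        404 => AttemptOutcome::Ok,
        _ => AttemptOutcome::Err,
    }
}
```

With this rule, auth failures (403) and throttling (503) still land in the error bucket, which is what makes the error rate usable as an alert signal.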
Arthur Petukhovsky
0f56104a61 Make sk_collect_dumps also possible with teleport (#4739)
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-12-20 15:06:55 +00:00
John Spray
f260f1565e pageserver: fixes + test updates for sharding (#6186)
This is a precursor to:
- https://github.com/neondatabase/neon/pull/6185

While that PR contains big changes to neon_local and attachment_service,
this PR contains a few unrelated standalone changes generated while
working on that branch:
- Fix restarting a pageserver when it contains multiple shards for the
same tenant
- When using location_config api to attach a tenant, create its
timelines dir
- Update test paths where generations were previously optional to make
them always-on: this avoids tests having to spuriously assert that
attachment_service is not None in order to make the linter happy.
- Add a TenantShardId python implementation for subsequent use in test
helpers that will be made shard-aware
- Teach scrubber to read across shards when checking for layer
existence: this is a refactor to track the list of existent layers at
tenant-level rather than locally to each timeline. This is a precursor
to testing shard splitting.
2023-12-20 12:26:20 +00:00
Joonas Koivunen
c29df80634 fix(layer): move backoff to spawned task (#5746)
Move the backoff to spawned task as it can still be useful; make the
sleep cancellable.
2023-12-20 10:26:06 +02:00
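The idea of a backoff whose sleep can be cancelled can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code from the commit: the real change presumably lives in async code, and `backoff_delay` and `cancellable_sleep` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Exponential backoff delay for the given attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // 2^attempt, saturating so that large attempt numbers cannot overflow.
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Sleep for `delay` unless a cancellation message arrives first.
/// Returns `true` if the full delay elapsed, `false` if cancelled.
pub fn cancellable_sleep(delay: Duration, cancel: &Receiver<()>) -> bool {
    match cancel.recv_timeout(delay) {
        Err(RecvTimeoutError::Timeout) => true,
        // A message or a dropped sender both count as cancellation.
        Ok(()) | Err(RecvTimeoutError::Disconnected) => false,
    }
}
```

A retry loop would call `cancellable_sleep(backoff_delay(attempt, base, max), &rx)` between attempts and stop as soon as it returns `false`.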
Em Sharnoff
58dbca6ce3 Bump vm-builder v0.19.0 -> v0.21.0 (#6197)
Only applicable change was neondatabase/autoscaling#650, reducing the
vector scrape interval (inside the VM) from 15 seconds to 1 second.
2023-12-19 23:48:41 +00:00
Arthur Petukhovsky
613906acea Support custom types in broker (#5761)
Old methods are unchanged for backwards compatibility. Added
`SafekeeperDiscoveryRequest` and `SafekeeperDiscoveryResponse` types to
serve as an example, and also as a prerequisite for
https://github.com/neondatabase/neon/issues/5471
2023-12-19 17:06:43 +00:00
Christian Schwarz
82809d2ec2 fix metric pageserver_initial_logical_size_start_calculation (#6191)
It wasn't being incremented.

Fixup of

    commit 1c88824ed0
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Fri Dec 1 12:52:59 2023 +0100

        initial logical size calculation: add a bunch of metrics (#5995)
2023-12-19 17:44:49 +01:00
Anastasia Lubennikova
0bd79eb063 Handle role deletion when project has no databases. (#6170)
There is still the default 'postgres' database, which may contain objects
owned by the role or some ACLs. We need to reassign objects in this
database too.

## Problem
If a customer deletes all databases and then tries to delete a role that
has some non-standard ACLs, the
`apply_config` operation gets stuck because the role deletion fails.
2023-12-19 16:27:47 +00:00
Konstantin Knizhnik
8ff5387da1 eliminate GCC warning for unchecked result of fread (#6167)
## Problem


GCC produces a warning that the fread result is not checked. It doesn't affect
program logic, but it is better to live without warnings.

## Summary of changes

Check read result.

2023-12-19 18:17:11 +02:00
Arpad Müller
8b91bbc38e Update jsonwebtoken to 9 and sct to 0.7.1 (#6189)
This extends the list of crates that are based on `ring` 0.17.
2023-12-19 15:45:17 +00:00
Christian Schwarz
e6bf6952b8 higher resolution histograms for getpage@lsn (#6177)
part of https://github.com/neondatabase/cloud/issues/7811
2023-12-19 14:46:17 +01:00
Arpad Müller
a2fab34371 Update zstd to 0.13 (#6187)
This updates the `zstd` crate to 0.13, and `zstd-sys` with it (it
contains C so we should always run the newest version of that).
2023-12-19 13:16:53 +00:00
Vadim Kharitonov
c52384752e Compile pg_semver extension (#6184)
Closes #6183
2023-12-19 15:10:07 +02:00
Bodobolero
73d247c464 Analyze clickbench performance with explain plans and pg_stat_statements (#6161)
## Problem

To understand differences in performance between Neon, Aurora and RDS, we
want to collect EXPLAIN ANALYZE plans and pg_stat_statements for
selected benchmarking runs.

## Summary of changes

Add workflow input options to collect EXPLAIN plans and pg_stat_statements
in the benchmarking workflow.

Co-authored-by: BodoBolero <bodobolero@gmail.com>
2023-12-19 11:44:25 +00:00
Arseny Sher
b701394d7a Fix WAL waiting in walproposer for v16.
Just preparing the cv right before waiting is not enough, as we might have already
missed the flushptr change and wakeup, so re-check before sleeping.

https://neondb.slack.com/archives/C03QLRH7PPD/p1702830965396619?thread_ts=1702756761.836649&cid=C03QLRH7PPD
2023-12-19 15:34:14 +04:00
John Spray
d89af4cf8e pageserver: downgrade 'connection reset' WAL errors (#6181)
This squashes a particularly noisy warn-level log that occurs when
safekeepers are restarted.

Unfortunately, the error type from `tonic` doesn't provide a neat way of
matching this, so we use a string comparison.
2023-12-19 10:38:00 +00:00
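A rough sketch of matching an error by its message, as the commit above describes. The exact string the pageserver compares against is not shown in this log, so the `"connection reset"` needle and the `is_connection_reset` helper are assumptions for illustration only.

```rust
use std::error::Error;

/// Hypothetical helper: return true if any error in the `source()` chain
/// mentions a connection reset in its display text.
pub fn is_connection_reset(err: &(dyn Error + 'static)) -> bool {
    let mut current: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = current {
        // String matching is brittle, but it is the fallback when the
        // error type exposes no structured kind to match on.
        if e.to_string().to_lowercase().contains("connection reset") {
            return true;
        }
        current = e.source();
    }
    false
}
```

A caller could use the result to pick the log level, for example logging at info instead of warn when the helper returns `true`.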
Christian Schwarz
6ffbbb2e02 include timeline ids in tenant details response (#6166)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771

This allows getting the list of tenants and timelines without triggering
initial logical size calculation by requesting the timeline details API
response, which would skew our results.
2023-12-19 10:32:51 +00:00
Arpad Müller
fbb979d5e3 remote_storage: move shared utilities for S3 and Azure into common module (#6176)
The PR does two things:

* move the util functions present in the remote_storage Azure and S3
test files into a shared one, deduplicating them.
* add a `s3_upload_download_works` test as a copy of the Azure test

The goal is mainly to fight duplication and make the code a little bit
more generic (like removing mentions of s3 and azure from function
names).

This is a first step towards #6146.
2023-12-19 11:29:50 +01:00
Arpad Müller
a89d6dc76e Always send a json response for timeline_get_lsn_by_timestamp (#6178)
As part of the transition laid out in
[this](https://github.com/neondatabase/cloud/pull/7553#discussion_r1370473911)
comment, don't read the `version` query parameter in
`timeline_get_lsn_by_timestamp`, but always return the structured json
response.

Follow-up of https://github.com/neondatabase/neon/pull/5608
2023-12-19 11:29:16 +01:00
Christian Schwarz
c272c68e5c RFC: Per-Tenant GetPage@LSN Throttling (#5648)
Implementation epic: https://github.com/neondatabase/neon/issues/5899
2023-12-19 11:20:56 +01:00
Anna Khanova
6e6e40dd7f Invalidate credentials on auth failure (#6171)
## Problem

If the user resets their password, the cache could learn about it only
after `ttl` minutes.

## Summary of changes

Invalidate password on auth failure.
2023-12-18 23:24:22 +01:00
Heikki Linnakangas
6939fc3db6 Remove declarations of non-existent global variables and functions
FileCacheMonitorMain was removed in commit b497d0094e.
2023-12-18 21:05:31 +02:00
Heikki Linnakangas
c4c48cfd63 Clean up #includes
- No need to include c.h, port.h or pg_config.h, they are included in
  postgres.h
- No need to include postgres.h in header files. Instead, the
  assumption in PostgreSQL is that all .c files include postgres.h.
- Reorder includes to alphabetical order, and system headers before
  pgsql headers
- Remove a bunch of other unnecessary includes that got copy-pasted from
  one source file to another
2023-12-18 21:05:29 +02:00
Heikki Linnakangas
82215d20b0 Mark some variables 'static'
Move initialization of neon_redo_read_buffer_filter. This allows
marking it 'static', too.
2023-12-18 21:05:24 +02:00
Sasha Krassovsky
62737f3776 Grant BYPASSRLS and REPLICATION explicitly to neon_superuser roles 2023-12-18 10:54:14 -08:00
Christian Schwarz
1f9a7d1cd0 add a Rust client for Pageserver page_service (#6128)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771

Stacked atop https://github.com/neondatabase/neon/pull/6145
2023-12-18 18:17:19 +00:00
John Spray
4ea4812ab2 tests: update python dependencies (#6164)
## Problem

Existing dependencies didn't work on Fedora 39 (python 3.12)

## Summary of changes

- Update pyyaml 6.0 -> 6.0.1
- Update yarl 1.8.2->1.9.4
- Update the `dnf install` line in README to include dependencies of
python packages (unrelated to upgrades, just noticed absences while
doing fresh pysync run)
2023-12-18 15:47:09 +00:00
Anna Khanova
00d90ce76a Added cache for get role secret (#6165)
## Problem

Currently if we are getting many consecutive connections to the same
user/ep we will send a lot of traffic to the console.

## Summary of changes

Cache the proxy_get_role_secret response with ttl=4min.

Note: this is a temporary hack; the notifier listener is WIP.
2023-12-18 16:04:47 +01:00
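The caching behaviour described in the commit above can be sketched as a small time-to-live map. This is an illustrative, std-only sketch; `TtlCache` and its methods are hypothetical names, not the proxy's actual types. The explicit `invalidate` hook is the kind of call an auth-failure path could use to drop a stale secret early.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical TTL cache keyed by a string such as "user@endpoint".
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Store a value together with the time it was fetched.
    pub fn insert(&mut self, key: &str, value: V, now: Instant) {
        self.entries.insert(key.to_string(), (now, value));
    }

    /// Return the cached value if it is younger than the TTL;
    /// expired entries are removed on access.
    pub fn get(&mut self, key: &str, now: Instant) -> Option<V> {
        let fresh = match self.entries.get(key) {
            Some((stored_at, value)) if now.duration_since(*stored_at) < self.ttl => {
                Some(value.clone())
            }
            _ => None,
        };
        if fresh.is_none() {
            self.entries.remove(key);
        }
        fresh
    }

    /// Drop an entry immediately, regardless of its age.
    pub fn invalidate(&mut self, key: &str) {
        self.entries.remove(key);
    }
}
```

Passing `now` explicitly keeps the sketch deterministic; a production cache would more likely read the clock internally.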
John Khvatov
33cb9a68f7 pageserver: Reduce tracing overhead in timeline::get (#6115)
## Problem

The compaction process (specifically the image layer reconstruction part)
is lagging behind WAL ingest (at a speed of ~10-15MB/s) for medium-sized
tenants (30-50GB). The CPU profile shows that a significant amount of time
(see flamegraph) is being spent in `tracing::span::Span::new`.

mainline (commit: 0ba4cae491):

![reconstruct-mainline-0ba4cae491c2](https://github.com/neondatabase/neon/assets/289788/ebfd262e-5c97-4858-80c7-664a1dbcc59d)

## Summary of changes

By lowering the tracing level in get_value_reconstruct_data and
get_or_maybe_download from info to debug, we can reduce the overhead of
span creation in prod environments. On my system, this sped up the image
reconstruction process by 60% (from 14500 to 23160 page reconstructions
per second).

pr:

![reconstruct-opt-2](https://github.com/neondatabase/neon/assets/289788/563a159b-8f2f-4300-b0a1-6cd66e7df769)


`create_image_layers()` (it's 1 CPU bound here) mainline vs pr:

![image](https://github.com/neondatabase/neon/assets/289788/a981e3cb-6df9-4882-8a94-95e99c35aa83)
2023-12-18 13:33:23 +00:00
Conrad Ludgate
17bde7eda5 proxy refactor large files (#6153)
## Problem

The `src/proxy.rs` file is far too large

## Summary of changes

Creates 3 new files:
```
src/metrics.rs
src/proxy/retry.rs
src/proxy/connect_compute.rs
```
2023-12-18 10:59:49 +00:00
John Spray
dbdb1d21f2 pageserver: on-demand activation cleanups (#6157)
## Problem

#6112 added some logs and metrics; clean these up a bit:
- Avoid counting startup completions for tenants launched after startup
- Exclude no-op cases from timing histograms
- Remove a rogue log message
2023-12-18 10:29:19 +00:00
Arseny Sher
e1935f42a1 Don't generate core dump when walproposer intentionally panics.
Walproposer sometimes intentionally PANICs when its term is defeated as the
basebackup is likely spoiled by that time. We don't want core dumped in this
case.
2023-12-18 11:03:34 +04:00
Alexander Bayandin
9bdc25f0af Revert "CI: build build-tools image" (#6156)
It turns out the issue with skipped jobs is not so trivial (because
GitHub checks jobs transitively); a possible workaround with `if:
always() && contains(fromJSON('["success", "skipped"]'),
needs.build-buildtools-image.result)` will tangle the workflow really
badly. We'll need to come up with a better solution.

To unblock main, I'm going to revert
https://github.com/neondatabase/neon/pull/6082.
2023-12-16 12:32:00 +00:00
Christian Schwarz
47873470db pageserver: add method to dump keyspace in mgmt api client (#6145)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771
2023-12-16 10:52:48 +00:00
Abhijeet Patil
8619e6295a CI: build build-tools image (#6082)
## Problem
Currently our build Dockerfile is located in the build repo; it makes
sense to have it as part of our neon repo.

## Summary of changes
- The Dockerfile that we use to build our binary and other tools
resided in the build repo. It made sense to bring it into the repo
where it is used, so that contributors can also view it and amend it if
required.
- This reduces maintenance: Dockerfile changes and code changes can be
accommodated in the same PR.
- Building the image and pushing it to ECR is abstracted into a
reusable workflow. Ideally, other jobs should use it too.

## Checklist before requesting a review

- [x] Moved the docker file used to build the binary from the build repo
to the neon repo
- [x] adding gh workflow to build and push the image
- [x] adding gh workflow to tag the pushed image
- [x] update readMe file

---------

Co-authored-by: Abhijeet Patil <abhijeet@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-12-16 10:33:52 +00:00
Conrad Ludgate
83811491da update zerocopy (#6148)
## Problem

https://github.com/neondatabase/neon/security/dependabot/48

```
$ cargo tree -i zerocopy
zerocopy v0.7.3
└── ahash v0.8.5
    └── hashbrown v0.13.2
```

ahash doesn't use the affected APIs, so we are not vulnerable, but it is best to
update to silence the alert anyway.

## Summary of changes

```
$ cargo update -p zerocopy --precise 0.7.31
    Updating crates.io index
    Updating syn v2.0.28 -> v2.0.32
    Updating zerocopy v0.7.3 -> v0.7.31
    Updating zerocopy-derive v0.7.3 -> v0.7.31
```
2023-12-16 09:06:00 +00:00
John Spray
d066dad84b pageserver: prioritize activation of tenants with client requests (#6112)
## Problem

During startup, a client request might have to wait a long time while
the system is busy initializing all the attached tenants, even though
most of the attached tenants probably don't have any client requests to
service, and could wait a bit.

## Summary of changes

- Add a semaphore to limit how many Tenant::spawn()s may concurrently do
I/O to attach their tenant (i.e. read indices from remote storage, scan
local layer files, etc).
- Add Tenant::activate_now, a hook for kicking a tenant in its spawn()
method to skip waiting for the warmup semaphore
- For tenants that attached via warmup semaphore units, wait for logical
size calculation to complete before dropping the warmup units
- Set Tenant::activate_now in `get_active_tenant_with_timeout` (the page
service's path for getting a reference to a tenant).
- Wait for tenant activation in HTTP handlers for timeline creation and
deletion: like page service requests, these require an active tenant and
should prioritize activation if called.
2023-12-15 20:37:47 +00:00
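The admission rule from the commit above (a bounded number of warmup permits, with on-demand activation skipping the queue) can be sketched as follows. This is a simplified, synchronous illustration: the real code presumably uses an async semaphore and waits rather than returning `MustWait`, and `WarmupGate`, `Admission` and `admit` are hypothetical names.

```rust
use std::sync::Mutex;

/// How a tenant attach was allowed to start.
#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    /// Took one of the limited warmup permits.
    WithPermit,
    /// A client request is waiting, so the limit was skipped.
    Bypassed,
    /// No permit is free; the attach has to wait its turn.
    MustWait,
}

/// Hypothetical stand-in for the warmup semaphore.
pub struct WarmupGate {
    free_permits: Mutex<usize>,
}

impl WarmupGate {
    pub fn new(permits: usize) -> Self {
        Self { free_permits: Mutex::new(permits) }
    }

    /// Decide how an attach may proceed. `activate_now` models the hook
    /// that is set when a client request needs this tenant immediately.
    pub fn admit(&self, activate_now: bool) -> Admission {
        if activate_now {
            return Admission::Bypassed;
        }
        let mut free = self.free_permits.lock().unwrap();
        if *free > 0 {
            *free -= 1;
            Admission::WithPermit
        } else {
            Admission::MustWait
        }
    }

    /// Return a permit, for example once logical size calculation is done.
    pub fn release(&self) {
        *self.free_permits.lock().unwrap() += 1;
    }
}
```

The point of the design is that background warmup is throttled, while a tenant with a waiting client never queues behind tenants nobody is asking for.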
John Spray
56f7d55ba7 pageserver: basic cancel/timeout for remote storage operations (#6097)
## Problem

Various places in remote storage were not subject to a timeout (thereby
stuck TCP connections could hold things up), and did not respect a
cancellation token (so things like timeline deletion or tenant detach
would have to wait arbitrarily long).



## Summary of changes

- Add download_cancellable and upload_cancellable helpers, and use them
in all the places we wait for remote storage operations (with the
exception of initdb downloads, where it would not have been safe).
- Add a cancellation token arg to `download_retry`.
- Use cancellation token args in various places that were missing one
per #5066

Closes: #5066 

Why is this only "basic" handling?
- Doesn't express difference between shutdown and errors in return
types, to avoid refactoring all the places that use an anyhow::Error
(these should all eventually return a more structured error type)
- Implements timeouts on top of remote storage, rather than within it:
this means that operations hitting their timeout will lose their
semaphore permit and thereby go to the back of the queue for their
retry.
- Doing a nicer job is tracked in
https://github.com/neondatabase/neon/issues/6096
2023-12-15 17:43:02 +00:00
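A std-only sketch of wrapping an operation with both a timeout and a cancellation flag, in the spirit of the helpers named in the commit above. It is an assumption-laden illustration: the real `download_cancellable` and `upload_cancellable` are async and use a cancellation token, while `run_cancellable` and `OpError` here are hypothetical and use a thread plus an `AtomicBool`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum OpError {
    Timeout,
    Cancelled,
    /// The worker ended without producing a result (for example, it panicked).
    Failed,
}

/// Run `op` on a worker thread and wait for it, giving up when `timeout`
/// elapses or when `cancel` becomes true. The worker is not killed; its
/// result is simply discarded if the caller gave up.
pub fn run_cancellable<T, F>(op: F, timeout: Duration, cancel: &AtomicBool) -> Result<T, OpError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may be gone if the caller gave up; ignore that.
        let _ = tx.send(op());
    });

    let deadline = Instant::now() + timeout;
    let poll = Duration::from_millis(10);
    loop {
        if cancel.load(Ordering::Relaxed) {
            return Err(OpError::Cancelled);
        }
        let now = Instant::now();
        if now >= deadline {
            return Err(OpError::Timeout);
        }
        // Wait in short slices so cancellation is noticed promptly.
        match rx.recv_timeout(poll.min(deadline - now)) {
            Ok(value) => return Ok(value),
            Err(RecvTimeoutError::Timeout) => continue,
            Err(RecvTimeoutError::Disconnected) => return Err(OpError::Failed),
        }
    }
}
```

Like the "basic" handling in the commit, this layers the timeout on top of the operation rather than inside it, so an operation that times out has to start over on retry.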
Christian Schwarz
1a9854bfb7 add a Rust client for Pageserver management API (#6127)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771

This PR moves the control plane's spread-all-over-the-place client for
the pageserver management API into a separate module within the
pageserver crate.

I need that client to be async in my benchmarking work, so, this PR
switches to the async version of `reqwest`.
That is also the right direction generally IMO.

The switch to async in turn mandated converting most of the
`control_plane/` code to async.

Note that some of the client methods should be taking `TenantShardId`
instead of `TenantId`, but, none of the callers seem to be
sharding-aware.
Leaving that for another time:
https://github.com/neondatabase/neon/issues/6154
2023-12-15 18:33:45 +01:00
John Spray
de1a9c6e3b s3_scrubber: basic support for sharding (#6119)
This doesn't make the scrubber smart enough to understand that many
shards are part of the same tenants, but it makes it understand paths
well enough to scrub the individual shards without thinking they're
malformed.

This is a prerequisite to being able to run tests with sharding enabled.

Related: #5929
2023-12-15 15:48:55 +00:00
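To illustrate what "understanding paths well enough" might involve, here is a sketch that parses a tenant path component that may carry a shard suffix. The format is an assumption not stated in this log: a 32-character hex tenant ID, optionally followed by `-` and four hex digits (two for the shard number, two for the shard count). `ShardedTenant` and `parse_tenant_component` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct ShardedTenant {
    pub tenant_id: String,
    pub shard_number: u8,
    /// 0 is used here to mean "unsharded" (an assumption of this sketch).
    pub shard_count: u8,
}

/// Parse `<tenant_id>` or `<tenant_id>-<NN><CC>` (assumed format).
pub fn parse_tenant_component(s: &str) -> Option<ShardedTenant> {
    let is_hex = |t: &str| !t.is_empty() && t.chars().all(|c| c.is_ascii_hexdigit());
    match s.split_once('-') {
        // Legacy, unsharded form: just the tenant ID.
        None if s.len() == 32 && is_hex(s) => Some(ShardedTenant {
            tenant_id: s.to_string(),
            shard_number: 0,
            shard_count: 0,
        }),
        // Sharded form: tenant ID plus a 4-hex-digit shard suffix.
        Some((tenant, shard))
            if tenant.len() == 32 && is_hex(tenant) && shard.len() == 4 && is_hex(shard) =>
        {
            let shard_number = u8::from_str_radix(&shard[0..2], 16).ok()?;
            let shard_count = u8::from_str_radix(&shard[2..4], 16).ok()?;
            Some(ShardedTenant {
                tenant_id: tenant.to_string(),
                shard_number,
                shard_count,
            })
        }
        _ => None,
    }
}
```

Accepting both forms is what lets a scrubber walk sharded and unsharded tenants without flagging the suffixed paths as malformed.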
Arseny Sher
e62569a878 A few comments on rust walproposer build. 2023-12-15 19:31:51 +04:00
John Spray
bd1cb1b217 tests: update allow list for negative_env (#6144)
Tests attaching the tenant immediately after the fixture detaches it
could result in LSN updates failing validation

e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6142/7211196140/index.html#suites/7745dadbd815ab87f5798aa881796f47/32b12ccc0b01b122
2023-12-15 15:08:28 +00:00
132 changed files with 4504 additions and 2313 deletions

View File

@@ -11,7 +11,7 @@ on:
 # │ │ ┌───────────── day of the month (1 - 31)
 # │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
 # │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
 - cron: '0 3 * * *' # run once a day, timezone is utc
 workflow_dispatch: # adds ability to run this manually
 inputs:
@@ -23,6 +23,21 @@ on:
 type: boolean
 description: 'Publish perf report. If not set, the report will be published only for the main branch'
 required: false
+collect_olap_explain:
+type: boolean
+description: 'Collect EXPLAIN ANALYZE for OLAP queries. If not set, EXPLAIN ANALYZE will not be collected'
+required: false
+default: false
+collect_pg_stat_statements:
+type: boolean
+description: 'Collect pg_stat_statements for OLAP queries. If not set, pg_stat_statements will not be collected'
+required: false
+default: false
+run_AWS_RDS_AND_AURORA:
+type: boolean
+description: 'AWS-RDS and AWS-AURORA normally only run on Saturday. Set this to true to run them on every workflow_dispatch'
+required: false
+default: false
 defaults:
 run:
@@ -113,6 +128,8 @@ jobs:
 # - neon-captest-reuse: Reusing existing project
 # - rds-aurora: Aurora Postgres Serverless v2 with autoscaling from 0.5 to 2 ACUs
 # - rds-postgres: RDS Postgres db.m5.large instance (2 vCPU, 8 GiB) with gp3 EBS storage
+env:
+RUN_AWS_RDS_AND_AURORA: ${{ github.event.inputs.run_AWS_RDS_AND_AURORA || 'false' }}
 runs-on: ubuntu-latest
 outputs:
 pgbench-compare-matrix: ${{ steps.pgbench-compare-matrix.outputs.matrix }}
@@ -152,7 +169,7 @@ jobs:
 ]
 }'
-if [ "$(date +%A)" = "Saturday" ]; then
+if [ "$(date +%A)" = "Saturday" ] || [ ${RUN_AWS_RDS_AND_AURORA} = "true" ]; then
 matrix=$(echo "$matrix" | jq '.include += [{ "platform": "rds-postgres" },
 { "platform": "rds-aurora" }]')
 fi
@@ -171,9 +188,9 @@ jobs:
 ]
 }'
-if [ "$(date +%A)" = "Saturday" ]; then
+if [ "$(date +%A)" = "Saturday" ] || [ ${RUN_AWS_RDS_AND_AURORA} = "true" ]; then
 matrix=$(echo "$matrix" | jq '.include += [{ "platform": "rds-postgres", "scale": "10" },
 { "platform": "rds-aurora", "scale": "10" }]')
 fi
 echo "matrix=$(echo "$matrix" | jq --compact-output '.')" >> $GITHUB_OUTPUT
@@ -337,6 +354,8 @@ jobs:
 POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
 DEFAULT_PG_VERSION: 14
 TEST_OUTPUT: /tmp/test_output
+TEST_OLAP_COLLECT_EXPLAIN: ${{ github.event.inputs.collect_olap_explain }}
+TEST_OLAP_COLLECT_PG_STAT_STATEMENTS: ${{ github.event.inputs.collect_pg_stat_statements }}
 BUILD_TYPE: remote
 SAVE_PERF_REPORT: ${{ github.event.inputs.save_perf_report || ( github.ref_name == 'main' ) }}
 PLATFORM: ${{ matrix.platform }}
@@ -399,6 +418,8 @@ jobs:
 env:
 VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
 PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
+TEST_OLAP_COLLECT_EXPLAIN: ${{ github.event.inputs.collect_olap_explain || 'false' }}
+TEST_OLAP_COLLECT_PG_STAT_STATEMENTS: ${{ github.event.inputs.collect_pg_stat_statements || 'false' }}
 BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
 TEST_OLAP_SCALE: 10

View File

@@ -857,7 +857,7 @@ jobs:
 run:
 shell: sh -eu {0}
 env:
-VM_BUILDER_VERSION: v0.19.0
+VM_BUILDER_VERSION: v0.21.0
 steps:
 - name: Checkout

131
Cargo.lock generated
View File

@@ -190,9 +190,9 @@ dependencies = [
 [[package]]
 name = "async-compression"
-version = "0.4.0"
+version = "0.4.5"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "5b0122885821398cc923ece939e24d1056a2384ee719432397fa9db87230ff11"
+checksum = "bc2d0cfb2a7388d34f590e76686704c494ed7aaceed62ee1ba35cbf363abc2a5"
 dependencies = [
 "flate2",
 "futures-core",
@@ -233,7 +233,7 @@ checksum = "16e62a023e7c117e27523144c5d2459f4397fcc3cab0085af8e2224f643a0193"
 dependencies = [
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -244,7 +244,7 @@ checksum = "b9ccdd8f2a161be9bd5c023df56f1b2a0bd1d83872ae53b71a84a12c9bf6e842"
 dependencies = [
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -881,7 +881,7 @@ dependencies = [
 "regex",
 "rustc-hash",
 "shlex",
-"syn 2.0.28",
+"syn 2.0.32",
 "which",
 ]
@@ -1095,7 +1095,7 @@ dependencies = [
 "heck",
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -1245,16 +1245,19 @@ name = "control_plane"
 version = "0.1.0"
 dependencies = [
 "anyhow",
+"async-trait",
 "camino",
 "clap",
 "comfy-table",
 "compute_api",
+"futures",
 "git-version",
 "hex",
 "hyper",
 "nix 0.26.2",
 "once_cell",
 "pageserver_api",
+"pageserver_client",
 "postgres",
 "postgres_backend",
 "postgres_connection",
@@ -1268,6 +1271,8 @@ dependencies = [
 "tar",
 "thiserror",
 "tokio",
+"tokio-postgres",
+"tokio-util",
 "toml",
 "tracing",
 "url",
@@ -1481,7 +1486,7 @@ dependencies = [
 "proc-macro2",
 "quote",
 "strsim",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -1492,7 +1497,7 @@ checksum = "29a358ff9f12ec09c3e61fef9b5a9902623a695a46a917b07f269bff1445611a"
 dependencies = [
 "darling_core",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -1567,7 +1572,7 @@ checksum = "487585f4d0c6655fe74905e2504d8ad6908e4db67f744eb140876906c2f3175d"
 dependencies = [
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -1661,7 +1666,7 @@ dependencies = [
 "darling",
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -1915,7 +1920,7 @@ checksum = "89ca545a94061b6365f2c7355b4b32bd20df3ff95f02da9329b34ccc3bd6ee72"
 dependencies = [
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -2482,13 +2487,14 @@ dependencies = [
 [[package]]
 name = "jsonwebtoken"
-version = "8.3.0"
+version = "9.2.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "6971da4d9c3aa03c3d8f3ff0f4155b534aad021292003895a469716b2a230378"
+checksum = "5c7ea04a7c5c055c175f189b6dc6ba036fd62306b58c66c9f6389036c503a3f4"
 dependencies = [
 "base64 0.21.1",
-"pem 1.1.1",
-"ring 0.16.20",
+"js-sys",
+"pem 3.0.3",
+"ring 0.17.6",
 "serde",
 "serde_json",
 "simple_asn1",
@@ -2901,7 +2907,7 @@ checksum = "a948666b637a0f465e8564c73e89d4dde00d72d4d473cc972f390fc3dcee7d9c"
 dependencies = [
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -3140,6 +3146,7 @@ dependencies = [
 "tokio",
 "tokio-io-timeout",
 "tokio-postgres",
+"tokio-stream",
 "tokio-tar",
 "tokio-util",
 "toml_edit",
@@ -3162,6 +3169,7 @@ dependencies = [
 "enum-map",
 "hex",
 "postgres_ffi",
+"rand 0.8.5",
 "serde",
 "serde_json",
 "serde_with",
@@ -3172,6 +3180,27 @@ dependencies = [
 "workspace_hack",
 ]
+[[package]]
+name = "pageserver_client"
+version = "0.1.0"
+dependencies = [
+"anyhow",
+"async-trait",
+"bytes",
+"futures",
+"pageserver_api",
+"postgres",
+"reqwest",
+"serde",
+"thiserror",
+"tokio",
+"tokio-postgres",
+"tokio-stream",
+"tokio-util",
+"utils",
+"workspace_hack",
+]
 [[package]]
 name = "parking"
 version = "2.1.1"
@@ -3263,18 +3292,19 @@ checksum = "19b17cddbe7ec3f8bc800887bab5e717348c95ea2ca0b1bf0837fb964dc67099"
 [[package]]
 name = "pem"
-version = "1.1.1"
+version = "2.0.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "a8835c273a76a90455d7344889b0964598e3316e2a79ede8e36f16bdcf2228b8"
+checksum = "6b13fe415cdf3c8e44518e18a7c95a13431d9bdf6d15367d82b23c377fdd441a"
 dependencies = [
-"base64 0.13.1",
+"base64 0.21.1",
+"serde",
 ]
 [[package]]
 name = "pem"
-version = "2.0.1"
+version = "3.0.3"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "6b13fe415cdf3c8e44518e18a7c95a13431d9bdf6d15367d82b23c377fdd441a"
+checksum = "1b8fcc794035347fb64beda2d3b462595dd2753e3f268d89c5aae77e8cf2c310"
 dependencies = [
 "base64 0.21.1",
 "serde",
@@ -3331,7 +3361,7 @@ checksum = "39407670928234ebc5e6e580247dd567ad73a3578460c5990f9503df207e8f07"
 dependencies = [
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -3538,7 +3568,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "3b69d39aab54d069e7f2fe8cb970493e7834601ca2d8c65fd7bbd183578080d1"
 dependencies = [
 "proc-macro2",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -4146,7 +4176,7 @@ dependencies = [
 "regex",
 "relative-path",
 "rustc_version",
-"syn 2.0.28",
+"syn 2.0.32",
 "unicode-ident",
 ]
@@ -4292,6 +4322,7 @@ dependencies = [
 "histogram",
 "itertools",
 "pageserver",
+"pageserver_api",
 "rand 0.8.5",
 "remote_storage",
 "reqwest",
@@ -4399,12 +4430,12 @@ checksum = "d29ab0c6d3fc0ee92fe66e2d99f700eab17a8d57d1c1d3b748380fb20baa78cd"
 [[package]]
 name = "sct"
-version = "0.7.0"
+version = "0.7.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "d53dcdb7c9f8158937a7981b48accfd39a43af418591a5d008c7b22b5e1b7ca4"
+checksum = "da046153aa2352493d6cb7da4b6e5c0c057d8a1d0a9aa8560baffdd945acd414"
 dependencies = [
-"ring 0.16.20",
-"untrusted 0.7.1",
+"ring 0.17.6",
+"untrusted 0.9.0",
 ]
 [[package]]
@@ -4580,7 +4611,7 @@ checksum = "aafe972d60b0b9bee71a91b92fee2d4fb3c9d7e8f6b179aa99f27203d99a4816"
 dependencies = [
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -4661,7 +4692,7 @@ dependencies = [
 "darling",
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -4928,9 +4959,9 @@ dependencies = [
 [[package]]
 name = "syn"
-version = "2.0.28"
+version = "2.0.32"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "04361975b3f5e348b2189d8dc55bc942f278b2d482a6a0365de5bdd62d351567"
+checksum = "239814284fd6f1a4ffe4ca893952cdd93c224b6a1571c9a9eadd670295c0c9e2"
 dependencies = [
 "proc-macro2",
 "quote",
@@ -5060,7 +5091,7 @@ checksum = "f9456a42c5b0d803c8cd86e73dd7cc9edd429499f37a3550d286d5e86720569f"
 dependencies = [
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -5178,7 +5209,7 @@ checksum = "5b8a1e28f2deaa14e508979454cb3a223b10b938b45af148bc0986de36f1923b"
 dependencies = [
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -5479,7 +5510,7 @@ checksum = "0f57e3ca2a01450b1a921183a9c9cbfda207fd822cef4ccb00a65402cbba7a74"
 dependencies = [
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -5924,7 +5955,7 @@ dependencies = [
 "once_cell",
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 "wasm-bindgen-shared",
 ]
@@ -5958,7 +5989,7 @@ checksum = "e128beba882dd1eb6200e1dc92ae6c5dbaa4311aa7bb211ca035779e5efc39f8"
 dependencies = [
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 "wasm-bindgen-backend",
 "wasm-bindgen-shared",
 ]
@@ -6295,7 +6326,7 @@ dependencies = [
 "smallvec",
 "subtle",
 "syn 1.0.109",
-"syn 2.0.28",
+"syn 2.0.32",
 "time",
 "time-macros",
 "tokio",
@@ -6357,22 +6388,22 @@ dependencies = [
 [[package]]
 name = "zerocopy"
-version = "0.7.3"
+version = "0.7.31"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "7a7af71d8643341260a65f89fa60c0eeaa907f34544d8f6d9b0df72f069b5e74"
+checksum = "1c4061bedbb353041c12f413700357bec76df2c7e2ca8e4df8bac24c6bf68e3d"
 dependencies = [
 "zerocopy-derive",
 ]
 [[package]]
 name = "zerocopy-derive"
-version = "0.7.3"
+version = "0.7.31"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "9731702e2f0617ad526794ae28fbc6f6ca8849b5ba729666c2a5bc4b6ddee2cd"
+checksum = "b3c129550b3e6de3fd0ba67ba5c81818f9805e58b8d7fee80a3a59d2c9fc601a"
 dependencies = [
 "proc-macro2",
 "quote",
-"syn 2.0.28",
+"syn 2.0.32",
 ]
 [[package]]
@@ -6383,30 +6414,28 @@ checksum = "2a0956f1ba7c7909bfb66c2e9e4124ab6f6482560f6628b5aaeba39207c9aad9"
 [[package]]
name = "zstd" name = "zstd"
version = "0.12.4" version = "0.13.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1a27595e173641171fc74a1232b7b1c7a7cb6e18222c11e9dfb9888fa424c53c" checksum = "bffb3309596d527cfcba7dfc6ed6052f1d39dfbd7c867aa2e865e4a449c10110"
dependencies = [ dependencies = [
"zstd-safe", "zstd-safe",
] ]
[[package]] [[package]]
name = "zstd-safe" name = "zstd-safe"
version = "6.0.6" version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ee98ffd0b48ee95e6c5168188e44a54550b1564d9d530ee21d5f0eaed1069581" checksum = "43747c7422e2924c11144d5229878b98180ef8b06cca4ab5af37afc8a8d8ea3e"
dependencies = [ dependencies = [
"libc",
"zstd-sys", "zstd-sys",
] ]
[[package]] [[package]]
name = "zstd-sys" name = "zstd-sys"
version = "2.0.8+zstd.1.5.5" version = "2.0.9+zstd.1.5.5"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5556e6ee25d32df2586c098bbfa278803692a20d0ab9565e049480d52707ec8c" checksum = "9e16efa8a874a0481a574084d34cc26fdb3b99627480f785888deb6386506656"
dependencies = [ dependencies = [
"cc", "cc",
"libc",
"pkg-config", "pkg-config",
] ]


@@ -5,6 +5,7 @@ members = [
"control_plane", "control_plane",
"pageserver", "pageserver",
"pageserver/ctl", "pageserver/ctl",
"pageserver/client",
"proxy", "proxy",
"safekeeper", "safekeeper",
"storage_broker", "storage_broker",
@@ -90,7 +91,7 @@ hyper-tungstenite = "0.11"
inotify = "0.10.2" inotify = "0.10.2"
ipnet = "2.9.0" ipnet = "2.9.0"
itertools = "0.10" itertools = "0.10"
jsonwebtoken = "8" jsonwebtoken = "9"
libc = "0.2" libc = "0.2"
md5 = "0.7.0" md5 = "0.7.0"
memoffset = "0.8" memoffset = "0.8"
@@ -182,6 +183,7 @@ compute_api = { version = "0.1", path = "./libs/compute_api/" }
consumption_metrics = { version = "0.1", path = "./libs/consumption_metrics/" } consumption_metrics = { version = "0.1", path = "./libs/consumption_metrics/" }
metrics = { version = "0.1", path = "./libs/metrics/" } metrics = { version = "0.1", path = "./libs/metrics/" }
pageserver_api = { version = "0.1", path = "./libs/pageserver_api/" } pageserver_api = { version = "0.1", path = "./libs/pageserver_api/" }
pageserver_client = { path = "./pageserver/client" }
postgres_backend = { version = "0.1", path = "./libs/postgres_backend/" } postgres_backend = { version = "0.1", path = "./libs/postgres_backend/" }
postgres_connection = { version = "0.1", path = "./libs/postgres_connection/" } postgres_connection = { version = "0.1", path = "./libs/postgres_connection/" }
postgres_ffi = { version = "0.1", path = "./libs/postgres_ffi/" } postgres_ffi = { version = "0.1", path = "./libs/postgres_ffi/" }


@@ -569,6 +569,23 @@ RUN wget https://github.com/ChenHuajun/pg_roaringbitmap/archive/refs/tags/v0.5.4
make -j $(getconf _NPROCESSORS_ONLN) install && \ make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/roaringbitmap.control echo 'trusted = true' >> /usr/local/pgsql/share/extension/roaringbitmap.control
#########################################################################################
#
# Layer "pg-semver-pg-build"
# compile pg_semver extension
#
#########################################################################################
FROM build-deps AS pg-semver-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/theory/pg-semver/archive/refs/tags/v0.32.1.tar.gz -O pg_semver.tar.gz && \
echo "fbdaf7512026d62eec03fad8687c15ed509b6ba395bff140acd63d2e4fbe25d7 pg_semver.tar.gz" | sha256sum --check && \
mkdir pg_semver-src && cd pg_semver-src && tar xvzf ../pg_semver.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/semver.control
######################################################################################### #########################################################################################
# #
# Layer "pg-embedding-pg-build" # Layer "pg-embedding-pg-build"
@@ -768,6 +785,7 @@ COPY --from=pg-pgx-ulid-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=rdkit-pg-build /usr/local/pgsql/ /usr/local/pgsql/ COPY --from=rdkit-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-uuidv7-pg-build /usr/local/pgsql/ /usr/local/pgsql/ COPY --from=pg-uuidv7-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-roaringbitmap-pg-build /usr/local/pgsql/ /usr/local/pgsql/ COPY --from=pg-roaringbitmap-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-semver-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-embedding-pg-build /usr/local/pgsql/ /usr/local/pgsql/ COPY --from=pg-embedding-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=wal2json-pg-build /usr/local/pgsql /usr/local/pgsql COPY --from=wal2json-pg-build /usr/local/pgsql /usr/local/pgsql
COPY pgxn/ pgxn/ COPY pgxn/ pgxn/


@@ -29,13 +29,14 @@ See developer documentation in [SUMMARY.md](/docs/SUMMARY.md) for more informati
```bash ```bash
apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \ apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev cmake postgresql-client protobuf-compiler \ libssl-dev clang pkg-config libpq-dev cmake postgresql-client protobuf-compiler \
libcurl4-openssl-dev openssl python-poetry lsof libicu-dev libcurl4-openssl-dev openssl python3-poetry lsof libicu-dev
``` ```
* On Fedora, these packages are needed: * On Fedora, these packages are needed:
```bash ```bash
dnf install flex bison readline-devel zlib-devel openssl-devel \ dnf install flex bison readline-devel zlib-devel openssl-devel \
libseccomp-devel perl clang cmake postgresql postgresql-contrib protobuf-compiler \ libseccomp-devel perl clang cmake postgresql postgresql-contrib protobuf-compiler \
protobuf-devel libcurl-devel openssl poetry lsof libicu-devel protobuf-devel libcurl-devel openssl poetry lsof libicu-devel libpq-devel python3-devel \
libffi-devel
``` ```
* On Arch based systems, these packages are needed: * On Arch based systems, these packages are needed:
```bash ```bash


@@ -37,5 +37,5 @@ workspace_hack.workspace = true
toml_edit.workspace = true toml_edit.workspace = true
remote_storage = { version = "0.1", path = "../libs/remote_storage/" } remote_storage = { version = "0.1", path = "../libs/remote_storage/" }
vm_monitor = { version = "0.1", path = "../libs/vm_monitor/" } vm_monitor = { version = "0.1", path = "../libs/vm_monitor/" }
zstd = "0.12.4" zstd = "0.13"
bytes = "1.0" bytes = "1.0"


@@ -298,7 +298,7 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
// safe to add more permissions here. BYPASSRLS and REPLICATION are inherited // safe to add more permissions here. BYPASSRLS and REPLICATION are inherited
// from neon_superuser. // from neon_superuser.
let mut query: String = format!( let mut query: String = format!(
"CREATE ROLE {} INHERIT CREATEROLE CREATEDB IN ROLE neon_superuser", "CREATE ROLE {} INHERIT CREATEROLE CREATEDB BYPASSRLS REPLICATION IN ROLE neon_superuser",
name.pg_quote() name.pg_quote()
); );
info!("role create query: '{}'", &query); info!("role create query: '{}'", &query);
@@ -370,33 +370,49 @@ pub fn handle_role_deletions(spec: &ComputeSpec, connstr: &str, client: &mut Cli
Ok(()) Ok(())
} }
fn reassign_owned_objects_in_one_db(
conf: Config,
role_name: &PgIdent,
db_owner: &PgIdent,
) -> Result<()> {
let mut client = conf.connect(NoTls)?;
// This will reassign all dependent objects to the db owner
let reassign_query = format!(
"REASSIGN OWNED BY {} TO {}",
role_name.pg_quote(),
db_owner.pg_quote()
);
info!(
"reassigning objects owned by '{}' in db '{}' to '{}'",
role_name,
conf.get_dbname().unwrap_or(""),
db_owner
);
client.simple_query(&reassign_query)?;
// This now will only drop privileges of the role
let drop_query = format!("DROP OWNED BY {}", role_name.pg_quote());
client.simple_query(&drop_query)?;
Ok(())
}
// Reassign all owned objects in all databases to the owner of the database. // Reassign all owned objects in all databases to the owner of the database.
fn reassign_owned_objects(spec: &ComputeSpec, connstr: &str, role_name: &PgIdent) -> Result<()> { fn reassign_owned_objects(spec: &ComputeSpec, connstr: &str, role_name: &PgIdent) -> Result<()> {
for db in &spec.cluster.databases { for db in &spec.cluster.databases {
if db.owner != *role_name { if db.owner != *role_name {
let mut conf = Config::from_str(connstr)?; let mut conf = Config::from_str(connstr)?;
conf.dbname(&db.name); conf.dbname(&db.name);
reassign_owned_objects_in_one_db(conf, role_name, &db.owner)?;
let mut client = conf.connect(NoTls)?;
// This will reassign all dependent objects to the db owner
let reassign_query = format!(
"REASSIGN OWNED BY {} TO {}",
role_name.pg_quote(),
db.owner.pg_quote()
);
info!(
"reassigning objects owned by '{}' in db '{}' to '{}'",
role_name, &db.name, &db.owner
);
client.simple_query(&reassign_query)?;
// This now will only drop privileges of the role
let drop_query = format!("DROP OWNED BY {}", role_name.pg_quote());
client.simple_query(&drop_query)?;
} }
} }
// Also handle case when there are no databases in the spec.
// In this case we need to reassign objects in the default database.
let conf = Config::from_str(connstr)?;
let db_owner = PgIdent::from_str("cloud_admin")?;
reassign_owned_objects_in_one_db(conf, role_name, &db_owner)?;
Ok(()) Ok(())
} }
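
Review note: the reassignment flow in this hunk can be sketched without a database connection. The sketch below is a minimal, hedged illustration, not the code from the change. `quote_ident`, `reassign_statements` and `reassign_targets` are hypothetical names, and `quote_ident` uses standard SQL identifier quoting, which may differ from what `pg_quote()` does. It shows the two statements issued per database and the target list: every spec database not owned by the role, followed by the default database with `cloud_admin` as the new owner.

```rust
/// Hypothetical stand-in for `pg_quote()`: wrap the identifier in double
/// quotes and double any embedded quote (standard SQL identifier quoting).
fn quote_ident(ident: &str) -> String {
    format!("\"{}\"", ident.replace('"', "\"\""))
}

/// Build the two statements run in each database, in execution order:
/// first hand over owned objects, then drop the remaining privileges.
fn reassign_statements(role: &str, db_owner: &str) -> [String; 2] {
    [
        format!(
            "REASSIGN OWNED BY {} TO {}",
            quote_ident(role),
            quote_ident(db_owner)
        ),
        format!("DROP OWNED BY {}", quote_ident(role)),
    ]
}

/// List the (database, new owner) pairs to process.
/// `None` as the database means "whatever the connection string points at",
/// which mirrors the extra pass added for specs that list no databases.
fn reassign_targets<'a>(
    databases: &[(&'a str, &'a str)], // (db name, db owner) pairs from the spec
    role: &str,
    default_owner: &'a str,
) -> Vec<(Option<&'a str>, &'a str)> {
    let mut targets: Vec<(Option<&'a str>, &'a str)> = databases
        .iter()
        // Skip databases owned by the role itself, as the diff does.
        .filter(|(_, owner)| *owner != role)
        .map(|(name, owner)| (Some(*name), *owner))
        .collect();
    // Always finish with the default database.
    targets.push((None, default_owner));
    targets
}
```

With an empty database list, `reassign_targets` still returns one entry, which is the case the new trailing block in `reassign_owned_objects` is meant to cover.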


@@ -6,9 +6,11 @@ license.workspace = true
[dependencies] [dependencies]
anyhow.workspace = true anyhow.workspace = true
async-trait.workspace = true
camino.workspace = true camino.workspace = true
clap.workspace = true clap.workspace = true
comfy-table.workspace = true comfy-table.workspace = true
futures.workspace = true
git-version.workspace = true git-version.workspace = true
nix.workspace = true nix.workspace = true
once_cell.workspace = true once_cell.workspace = true
@@ -24,10 +26,11 @@ tar.workspace = true
thiserror.workspace = true thiserror.workspace = true
toml.workspace = true toml.workspace = true
tokio.workspace = true tokio.workspace = true
tokio-postgres.workspace = true
tokio-util.workspace = true
url.workspace = true url.workspace = true
# Note: Do not directly depend on pageserver or safekeeper; use pageserver_api or safekeeper_api
# instead, so that recompile times are better.
pageserver_api.workspace = true pageserver_api.workspace = true
pageserver_client.workspace = true
postgres_backend.workspace = true postgres_backend.workspace = true
safekeeper_api.workspace = true safekeeper_api.workspace = true
postgres_connection.workspace = true postgres_connection.workspace = true


@@ -9,7 +9,7 @@ pub struct AttachmentService {
env: LocalEnv, env: LocalEnv,
listen: String, listen: String,
path: PathBuf, path: PathBuf,
client: reqwest::blocking::Client, client: reqwest::Client,
} }
const COMMAND: &str = "attachment_service"; const COMMAND: &str = "attachment_service";
@@ -53,7 +53,7 @@ impl AttachmentService {
env: env.clone(), env: env.clone(),
path, path,
listen, listen,
client: reqwest::blocking::ClientBuilder::new() client: reqwest::ClientBuilder::new()
.build() .build()
.expect("Failed to construct http client"), .expect("Failed to construct http client"),
} }
@@ -64,7 +64,7 @@ impl AttachmentService {
.expect("non-Unicode path") .expect("non-Unicode path")
} }
pub fn start(&self) -> anyhow::Result<Child> { pub async fn start(&self) -> anyhow::Result<Child> {
let path_str = self.path.to_string_lossy(); let path_str = self.path.to_string_lossy();
background_process::start_process( background_process::start_process(
@@ -73,10 +73,11 @@ impl AttachmentService {
&self.env.attachment_service_bin(), &self.env.attachment_service_bin(),
["-l", &self.listen, "-p", &path_str], ["-l", &self.listen, "-p", &path_str],
[], [],
background_process::InitialPidFile::Create(&self.pid_file()), background_process::InitialPidFile::Create(self.pid_file()),
// TODO: a real status check // TODO: a real status check
|| Ok(true), || async move { anyhow::Ok(true) },
) )
.await
} }
pub fn stop(&self, immediate: bool) -> anyhow::Result<()> { pub fn stop(&self, immediate: bool) -> anyhow::Result<()> {
@@ -84,7 +85,7 @@ impl AttachmentService {
} }
/// Call into the attach_hook API, for use before handing out attachments to pageservers /// Call into the attach_hook API, for use before handing out attachments to pageservers
pub fn attach_hook( pub async fn attach_hook(
&self, &self,
tenant_id: TenantId, tenant_id: TenantId,
pageserver_id: NodeId, pageserver_id: NodeId,
@@ -104,16 +105,16 @@ impl AttachmentService {
node_id: Some(pageserver_id), node_id: Some(pageserver_id),
}; };
let response = self.client.post(url).json(&request).send()?; let response = self.client.post(url).json(&request).send().await?;
if response.status() != StatusCode::OK { if response.status() != StatusCode::OK {
return Err(anyhow!("Unexpected status {}", response.status())); return Err(anyhow!("Unexpected status {}", response.status()));
} }
let response = response.json::<AttachHookResponse>()?; let response = response.json::<AttachHookResponse>().await?;
Ok(response.gen) Ok(response.gen)
} }
pub fn inspect(&self, tenant_id: TenantId) -> anyhow::Result<Option<(u32, NodeId)>> { pub async fn inspect(&self, tenant_id: TenantId) -> anyhow::Result<Option<(u32, NodeId)>> {
use hyper::StatusCode; use hyper::StatusCode;
let url = self let url = self
@@ -126,12 +127,12 @@ impl AttachmentService {
let request = InspectRequest { tenant_id }; let request = InspectRequest { tenant_id };
let response = self.client.post(url).json(&request).send()?; let response = self.client.post(url).json(&request).send().await?;
if response.status() != StatusCode::OK { if response.status() != StatusCode::OK {
return Err(anyhow!("Unexpected status {}", response.status())); return Err(anyhow!("Unexpected status {}", response.status()));
} }
let response = response.json::<InspectResponse>()?; let response = response.json::<InspectResponse>().await?;
Ok(response.attachment) Ok(response.attachment)
} }
} }


@@ -44,15 +44,15 @@ const NOTICE_AFTER_RETRIES: u64 = 50;
/// Argument to `start_process`, to indicate whether it should create pidfile or if the process creates /// Argument to `start_process`, to indicate whether it should create pidfile or if the process creates
/// it itself. /// it itself.
pub enum InitialPidFile<'t> { pub enum InitialPidFile {
/// Create a pidfile, to allow future CLI invocations to manipulate the process. /// Create a pidfile, to allow future CLI invocations to manipulate the process.
Create(&'t Utf8Path), Create(Utf8PathBuf),
/// The process will create the pidfile itself, need to wait for that event. /// The process will create the pidfile itself, need to wait for that event.
Expect(&'t Utf8Path), Expect(Utf8PathBuf),
} }
/// Start a background child process using the parameters given. /// Start a background child process using the parameters given.
pub fn start_process<F, AI, A, EI>( pub async fn start_process<F, Fut, AI, A, EI>(
process_name: &str, process_name: &str,
datadir: &Path, datadir: &Path,
command: &Path, command: &Path,
@@ -62,7 +62,8 @@ pub fn start_process<F, AI, A, EI>(
process_status_check: F, process_status_check: F,
) -> anyhow::Result<Child> ) -> anyhow::Result<Child>
where where
F: Fn() -> anyhow::Result<bool>, F: Fn() -> Fut,
Fut: std::future::Future<Output = anyhow::Result<bool>>,
AI: IntoIterator<Item = A>, AI: IntoIterator<Item = A>,
A: AsRef<OsStr>, A: AsRef<OsStr>,
// Not generic AsRef<OsStr>, otherwise empty `envs` prevents type inference // Not generic AsRef<OsStr>, otherwise empty `envs` prevents type inference
@@ -89,7 +90,7 @@ where
let filled_cmd = fill_remote_storage_secrets_vars(fill_rust_env_vars(background_command)); let filled_cmd = fill_remote_storage_secrets_vars(fill_rust_env_vars(background_command));
filled_cmd.envs(envs); filled_cmd.envs(envs);
let pid_file_to_check = match initial_pid_file { let pid_file_to_check = match &initial_pid_file {
InitialPidFile::Create(path) => { InitialPidFile::Create(path) => {
pre_exec_create_pidfile(filled_cmd, path); pre_exec_create_pidfile(filled_cmd, path);
path path
@@ -107,7 +108,7 @@ where
); );
for retries in 0..RETRIES { for retries in 0..RETRIES {
match process_started(pid, Some(pid_file_to_check), &process_status_check) { match process_started(pid, pid_file_to_check, &process_status_check).await {
Ok(true) => { Ok(true) => {
println!("\n{process_name} started, pid: {pid}"); println!("\n{process_name} started, pid: {pid}");
return Ok(spawned_process); return Ok(spawned_process);
@@ -316,22 +317,20 @@ where
cmd cmd
} }
fn process_started<F>( async fn process_started<F, Fut>(
pid: Pid, pid: Pid,
pid_file_to_check: Option<&Utf8Path>, pid_file_to_check: &Utf8Path,
status_check: &F, status_check: &F,
) -> anyhow::Result<bool> ) -> anyhow::Result<bool>
where where
F: Fn() -> anyhow::Result<bool>, F: Fn() -> Fut,
Fut: std::future::Future<Output = anyhow::Result<bool>>,
{ {
match status_check() { match status_check().await {
Ok(true) => match pid_file_to_check { Ok(true) => match pid_file::read(pid_file_to_check)? {
Some(pid_file_path) => match pid_file::read(pid_file_path)? { PidFileRead::NotExist => Ok(false),
PidFileRead::NotExist => Ok(false), PidFileRead::LockedByOtherProcess(pid_in_file) => Ok(pid_in_file == pid),
PidFileRead::LockedByOtherProcess(pid_in_file) => Ok(pid_in_file == pid), PidFileRead::NotHeldByAnyProcess(_) => Ok(false),
PidFileRead::NotHeldByAnyProcess(_) => Ok(false),
},
None => Ok(true),
}, },
Ok(false) => Ok(false), Ok(false) => Ok(false),
Err(e) => anyhow::bail!("process failed to start: {e}"), Err(e) => anyhow::bail!("process failed to start: {e}"),
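
Review note: the new `F: Fn() -> Fut` bound lets callers pass a closure that returns a future, so the status check can be awaited on every retry. The sketch below is a simplified illustration under stated assumptions, not the function from this change: `wait_until_ready` and `block_on` are hypothetical names, the error type is `String` instead of `anyhow::Error`, and the pid-file check and the delay between retries are left out. The tiny `block_on` only exists so the example needs nothing beyond the standard library; the real CLI uses a tokio runtime.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough for futures that never truly suspend.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor (assumption: for demonstration only).
fn block_on<Fut: Future>(fut: Fut) -> Fut::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
        std::thread::yield_now();
    }
}

/// Call the async status check up to `max_retries` times.
/// Returns Ok(true) as soon as a check succeeds, Ok(false) if none did,
/// and propagates the first error from the check.
async fn wait_until_ready<F, Fut>(max_retries: u32, status_check: F) -> Result<bool, String>
where
    F: Fn() -> Fut,
    Fut: Future<Output = Result<bool, String>>,
{
    for _ in 0..max_retries {
        // A fresh future is created by each call, so it can be awaited
        // once per retry.
        if status_check().await? {
            return Ok(true);
        }
    }
    Ok(false)
}
```

A caller passes something of the shape `|| async move { Ok(true) }`, which matches the `|| async move { anyhow::Ok(true) }` placeholder used for the attachment service in this change.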


@@ -120,15 +120,20 @@ fn main() -> Result<()> {
let mut env = LocalEnv::load_config().context("Error loading config")?; let mut env = LocalEnv::load_config().context("Error loading config")?;
let original_env = env.clone(); let original_env = env.clone();
let rt = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()
.unwrap();
let subcommand_result = match sub_name { let subcommand_result = match sub_name {
"tenant" => handle_tenant(sub_args, &mut env), "tenant" => rt.block_on(handle_tenant(sub_args, &mut env)),
"timeline" => handle_timeline(sub_args, &mut env), "timeline" => rt.block_on(handle_timeline(sub_args, &mut env)),
"start" => handle_start_all(sub_args, &env), "start" => rt.block_on(handle_start_all(sub_args, &env)),
"stop" => handle_stop_all(sub_args, &env), "stop" => handle_stop_all(sub_args, &env),
"pageserver" => handle_pageserver(sub_args, &env), "pageserver" => rt.block_on(handle_pageserver(sub_args, &env)),
"attachment_service" => handle_attachment_service(sub_args, &env), "attachment_service" => rt.block_on(handle_attachment_service(sub_args, &env)),
"safekeeper" => handle_safekeeper(sub_args, &env), "safekeeper" => rt.block_on(handle_safekeeper(sub_args, &env)),
"endpoint" => handle_endpoint(sub_args, &env), "endpoint" => rt.block_on(handle_endpoint(sub_args, &env)),
"mappings" => handle_mappings(sub_args, &mut env), "mappings" => handle_mappings(sub_args, &mut env),
"pg" => bail!("'pg' subcommand has been renamed to 'endpoint'"), "pg" => bail!("'pg' subcommand has been renamed to 'endpoint'"),
_ => bail!("unexpected subcommand {sub_name}"), _ => bail!("unexpected subcommand {sub_name}"),
@@ -269,12 +274,13 @@ fn print_timeline(
/// Returns a map of timeline IDs to timeline_id@lsn strings. /// Returns a map of timeline IDs to timeline_id@lsn strings.
/// Connects to the pageserver to query this information. /// Connects to the pageserver to query this information.
fn get_timeline_infos( async fn get_timeline_infos(
env: &local_env::LocalEnv, env: &local_env::LocalEnv,
tenant_id: &TenantId, tenant_id: &TenantId,
) -> Result<HashMap<TimelineId, TimelineInfo>> { ) -> Result<HashMap<TimelineId, TimelineInfo>> {
Ok(get_default_pageserver(env) Ok(get_default_pageserver(env)
.timeline_list(tenant_id)? .timeline_list(tenant_id)
.await?
.into_iter() .into_iter()
.map(|timeline_info| (timeline_info.timeline_id, timeline_info)) .map(|timeline_info| (timeline_info.timeline_id, timeline_info))
.collect()) .collect())
@@ -373,11 +379,14 @@ fn pageserver_config_overrides(init_match: &ArgMatches) -> Vec<&str> {
.collect() .collect()
} }
fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> anyhow::Result<()> { async fn handle_tenant(
tenant_match: &ArgMatches,
env: &mut local_env::LocalEnv,
) -> anyhow::Result<()> {
let pageserver = get_default_pageserver(env); let pageserver = get_default_pageserver(env);
match tenant_match.subcommand() { match tenant_match.subcommand() {
Some(("list", _)) => { Some(("list", _)) => {
for t in pageserver.tenant_list()? { for t in pageserver.tenant_list().await? {
println!("{} {:?}", t.id, t.state); println!("{} {:?}", t.id, t.state);
} }
} }
@@ -394,12 +403,16 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
// We must register the tenant with the attachment service, so // We must register the tenant with the attachment service, so
// that when the pageserver restarts, it will be re-attached. // that when the pageserver restarts, it will be re-attached.
let attachment_service = AttachmentService::from_env(env); let attachment_service = AttachmentService::from_env(env);
attachment_service.attach_hook(tenant_id, pageserver.conf.id)? attachment_service
.attach_hook(tenant_id, pageserver.conf.id)
.await?
} else { } else {
None None
}; };
pageserver.tenant_create(tenant_id, generation, tenant_conf)?; pageserver
.tenant_create(tenant_id, generation, tenant_conf)
.await?;
println!("tenant {tenant_id} successfully created on the pageserver"); println!("tenant {tenant_id} successfully created on the pageserver");
// Create an initial timeline for the new tenant // Create an initial timeline for the new tenant
@@ -409,14 +422,16 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
.copied() .copied()
.context("Failed to parse postgres version from the argument string")?; .context("Failed to parse postgres version from the argument string")?;
let timeline_info = pageserver.timeline_create( let timeline_info = pageserver
tenant_id, .timeline_create(
new_timeline_id, tenant_id,
None, new_timeline_id,
None, None,
Some(pg_version), None,
None, Some(pg_version),
)?; None,
)
.await?;
let new_timeline_id = timeline_info.timeline_id; let new_timeline_id = timeline_info.timeline_id;
let last_record_lsn = timeline_info.last_record_lsn; let last_record_lsn = timeline_info.last_record_lsn;
@@ -450,6 +465,7 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
pageserver pageserver
.tenant_config(tenant_id, tenant_conf) .tenant_config(tenant_id, tenant_conf)
.await
.with_context(|| format!("Tenant config failed for tenant with id {tenant_id}"))?; .with_context(|| format!("Tenant config failed for tenant with id {tenant_id}"))?;
println!("tenant {tenant_id} successfully configured on the pageserver"); println!("tenant {tenant_id} successfully configured on the pageserver");
} }
@@ -458,7 +474,7 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
let new_pageserver = get_pageserver(env, matches)?; let new_pageserver = get_pageserver(env, matches)?;
let new_pageserver_id = new_pageserver.conf.id; let new_pageserver_id = new_pageserver.conf.id;
migrate_tenant(env, tenant_id, new_pageserver)?; migrate_tenant(env, tenant_id, new_pageserver).await?;
println!("tenant {tenant_id} migrated to {}", new_pageserver_id); println!("tenant {tenant_id} migrated to {}", new_pageserver_id);
} }
@@ -468,13 +484,13 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
Ok(()) Ok(())
} }
fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Result<()> { async fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Result<()> {
let pageserver = get_default_pageserver(env); let pageserver = get_default_pageserver(env);
match timeline_match.subcommand() { match timeline_match.subcommand() {
Some(("list", list_match)) => { Some(("list", list_match)) => {
let tenant_id = get_tenant_id(list_match, env)?; let tenant_id = get_tenant_id(list_match, env)?;
let timelines = pageserver.timeline_list(&tenant_id)?; let timelines = pageserver.timeline_list(&tenant_id).await?;
print_timelines_tree(timelines, env.timeline_name_mappings())?; print_timelines_tree(timelines, env.timeline_name_mappings())?;
} }
Some(("create", create_match)) => { Some(("create", create_match)) => {
@@ -490,14 +506,16 @@ fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -
let new_timeline_id_opt = parse_timeline_id(create_match)?; let new_timeline_id_opt = parse_timeline_id(create_match)?;
let timeline_info = pageserver.timeline_create( let timeline_info = pageserver
tenant_id, .timeline_create(
new_timeline_id_opt, tenant_id,
None, new_timeline_id_opt,
None, None,
Some(pg_version), None,
None, Some(pg_version),
)?; None,
)
.await?;
let new_timeline_id = timeline_info.timeline_id; let new_timeline_id = timeline_info.timeline_id;
let last_record_lsn = timeline_info.last_record_lsn; let last_record_lsn = timeline_info.last_record_lsn;
@@ -542,7 +560,9 @@ fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -
let mut cplane = ComputeControlPlane::load(env.clone())?; let mut cplane = ComputeControlPlane::load(env.clone())?;
println!("Importing timeline into pageserver ..."); println!("Importing timeline into pageserver ...");
pageserver.timeline_import(tenant_id, timeline_id, base, pg_wal, pg_version)?; pageserver
.timeline_import(tenant_id, timeline_id, base, pg_wal, pg_version)
.await?;
env.register_branch_mapping(name.to_string(), tenant_id, timeline_id)?; env.register_branch_mapping(name.to_string(), tenant_id, timeline_id)?;
println!("Creating endpoint for imported timeline ..."); println!("Creating endpoint for imported timeline ...");
@@ -578,14 +598,16 @@ fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -
.map(|lsn_str| Lsn::from_str(lsn_str)) .map(|lsn_str| Lsn::from_str(lsn_str))
.transpose() .transpose()
.context("Failed to parse ancestor start Lsn from the request")?; .context("Failed to parse ancestor start Lsn from the request")?;
let timeline_info = pageserver.timeline_create( let timeline_info = pageserver
tenant_id, .timeline_create(
None, tenant_id,
start_lsn, None,
Some(ancestor_timeline_id), start_lsn,
None, Some(ancestor_timeline_id),
None, None,
)?; None,
)
.await?;
let new_timeline_id = timeline_info.timeline_id; let new_timeline_id = timeline_info.timeline_id;
let last_record_lsn = timeline_info.last_record_lsn; let last_record_lsn = timeline_info.last_record_lsn;
@@ -604,7 +626,7 @@ fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -
Ok(()) Ok(())
} }
fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> { async fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
let (sub_name, sub_args) = match ep_match.subcommand() { let (sub_name, sub_args) = match ep_match.subcommand() {
Some(ep_subcommand_data) => ep_subcommand_data, Some(ep_subcommand_data) => ep_subcommand_data,
None => bail!("no endpoint subcommand provided"), None => bail!("no endpoint subcommand provided"),
@@ -614,10 +636,12 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
match sub_name { match sub_name {
"list" => { "list" => {
let tenant_id = get_tenant_id(sub_args, env)?; let tenant_id = get_tenant_id(sub_args, env)?;
let timeline_infos = get_timeline_infos(env, &tenant_id).unwrap_or_else(|e| { let timeline_infos = get_timeline_infos(env, &tenant_id)
eprintln!("Failed to load timeline info: {}", e); .await
HashMap::new() .unwrap_or_else(|e| {
}); eprintln!("Failed to load timeline info: {}", e);
HashMap::new()
});
let timeline_name_mappings = env.timeline_name_mappings(); let timeline_name_mappings = env.timeline_name_mappings();
@@ -791,7 +815,9 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
}; };
println!("Starting existing endpoint {endpoint_id}..."); println!("Starting existing endpoint {endpoint_id}...");
endpoint.start(&auth_token, safekeepers, remote_ext_config)?; endpoint
.start(&auth_token, safekeepers, remote_ext_config)
.await?;
} }
"reconfigure" => { "reconfigure" => {
let endpoint_id = sub_args let endpoint_id = sub_args
@@ -809,7 +835,7 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
} else { } else {
None None
}; };
endpoint.reconfigure(pageserver_id)?; endpoint.reconfigure(pageserver_id).await?;
} }
"stop" => { "stop" => {
let endpoint_id = sub_args let endpoint_id = sub_args
@@ -875,11 +901,12 @@ fn get_pageserver(env: &local_env::LocalEnv, args: &ArgMatches) -> Result<PageSe
)) ))
} }
fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> { async fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
match sub_match.subcommand() { match sub_match.subcommand() {
Some(("start", subcommand_args)) => { Some(("start", subcommand_args)) => {
if let Err(e) = get_pageserver(env, subcommand_args)? if let Err(e) = get_pageserver(env, subcommand_args)?
.start(&pageserver_config_overrides(subcommand_args)) .start(&pageserver_config_overrides(subcommand_args))
.await
{ {
eprintln!("pageserver start failed: {e}"); eprintln!("pageserver start failed: {e}");
exit(1); exit(1);
@@ -906,7 +933,10 @@ fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
exit(1); exit(1);
} }
if let Err(e) = pageserver.start(&pageserver_config_overrides(subcommand_args)) { if let Err(e) = pageserver
.start(&pageserver_config_overrides(subcommand_args))
.await
{
eprintln!("pageserver start failed: {e}"); eprintln!("pageserver start failed: {e}");
exit(1); exit(1);
} }
@@ -920,14 +950,17 @@ fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
exit(1); exit(1);
} }
if let Err(e) = pageserver.start(&pageserver_config_overrides(subcommand_args)) { if let Err(e) = pageserver
.start(&pageserver_config_overrides(subcommand_args))
.await
{
eprintln!("pageserver start failed: {e}"); eprintln!("pageserver start failed: {e}");
exit(1); exit(1);
} }
} }
Some(("status", subcommand_args)) => { Some(("status", subcommand_args)) => {
match get_pageserver(env, subcommand_args)?.check_status() { match get_pageserver(env, subcommand_args)?.check_status().await {
Ok(_) => println!("Page server is up and running"), Ok(_) => println!("Page server is up and running"),
Err(err) => { Err(err) => {
eprintln!("Page server is not available: {}", err); eprintln!("Page server is not available: {}", err);
@@ -942,11 +975,14 @@ fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
Ok(()) Ok(())
} }
fn handle_attachment_service(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> { async fn handle_attachment_service(
sub_match: &ArgMatches,
env: &local_env::LocalEnv,
) -> Result<()> {
let svc = AttachmentService::from_env(env); let svc = AttachmentService::from_env(env);
match sub_match.subcommand() { match sub_match.subcommand() {
Some(("start", _start_match)) => { Some(("start", _start_match)) => {
if let Err(e) = svc.start() { if let Err(e) = svc.start().await {
eprintln!("start failed: {e}"); eprintln!("start failed: {e}");
exit(1); exit(1);
} }
@@ -987,7 +1023,7 @@ fn safekeeper_extra_opts(init_match: &ArgMatches) -> Vec<String> {
.collect() .collect()
} }
fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> { async fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
let (sub_name, sub_args) = match sub_match.subcommand() { let (sub_name, sub_args) = match sub_match.subcommand() {
Some(safekeeper_command_data) => safekeeper_command_data, Some(safekeeper_command_data) => safekeeper_command_data,
None => bail!("no safekeeper subcommand provided"), None => bail!("no safekeeper subcommand provided"),
@@ -1005,7 +1041,7 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
"start" => { "start" => {
let extra_opts = safekeeper_extra_opts(sub_args); let extra_opts = safekeeper_extra_opts(sub_args);
if let Err(e) = safekeeper.start(extra_opts) { if let Err(e) = safekeeper.start(extra_opts).await {
eprintln!("safekeeper start failed: {}", e); eprintln!("safekeeper start failed: {}", e);
exit(1); exit(1);
} }
@@ -1031,7 +1067,7 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
} }
let extra_opts = safekeeper_extra_opts(sub_args); let extra_opts = safekeeper_extra_opts(sub_args);
if let Err(e) = safekeeper.start(extra_opts) { if let Err(e) = safekeeper.start(extra_opts).await {
eprintln!("safekeeper start failed: {}", e); eprintln!("safekeeper start failed: {}", e);
exit(1); exit(1);
} }
@@ -1044,15 +1080,15 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
Ok(()) Ok(())
} }
fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow::Result<()> { async fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow::Result<()> {
// Endpoints are not started automatically // Endpoints are not started automatically
broker::start_broker_process(env)?; broker::start_broker_process(env).await?;
// Only start the attachment service if the pageserver is configured to need it // Only start the attachment service if the pageserver is configured to need it
if env.control_plane_api.is_some() { if env.control_plane_api.is_some() {
let attachment_service = AttachmentService::from_env(env); let attachment_service = AttachmentService::from_env(env);
if let Err(e) = attachment_service.start() { if let Err(e) = attachment_service.start().await {
eprintln!("attachment_service start failed: {:#}", e); eprintln!("attachment_service start failed: {:#}", e);
try_stop_all(env, true); try_stop_all(env, true);
exit(1); exit(1);
@@ -1061,7 +1097,10 @@ fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow
for ps_conf in &env.pageservers { for ps_conf in &env.pageservers {
let pageserver = PageServerNode::from_env(env, ps_conf); let pageserver = PageServerNode::from_env(env, ps_conf);
if let Err(e) = pageserver.start(&pageserver_config_overrides(sub_match)) { if let Err(e) = pageserver
.start(&pageserver_config_overrides(sub_match))
.await
{
eprintln!("pageserver {} start failed: {:#}", ps_conf.id, e); eprintln!("pageserver {} start failed: {:#}", ps_conf.id, e);
try_stop_all(env, true); try_stop_all(env, true);
exit(1); exit(1);
@@ -1070,7 +1109,7 @@ fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow
for node in env.safekeepers.iter() { for node in env.safekeepers.iter() {
let safekeeper = SafekeeperNode::from_env(env, node); let safekeeper = SafekeeperNode::from_env(env, node);
if let Err(e) = safekeeper.start(vec![]) { if let Err(e) = safekeeper.start(vec![]).await {
eprintln!("safekeeper {} start failed: {:#}", safekeeper.id, e); eprintln!("safekeeper {} start failed: {:#}", safekeeper.id, e);
try_stop_all(env, false); try_stop_all(env, false);
exit(1); exit(1);

@@ -11,7 +11,7 @@ use camino::Utf8PathBuf;
use crate::{background_process, local_env}; use crate::{background_process, local_env};
pub fn start_broker_process(env: &local_env::LocalEnv) -> anyhow::Result<()> { pub async fn start_broker_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
let broker = &env.broker; let broker = &env.broker;
let listen_addr = &broker.listen_addr; let listen_addr = &broker.listen_addr;
@@ -19,15 +19,15 @@ pub fn start_broker_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
let args = [format!("--listen-addr={listen_addr}")]; let args = [format!("--listen-addr={listen_addr}")];
let client = reqwest::blocking::Client::new(); let client = reqwest::Client::new();
background_process::start_process( background_process::start_process(
"storage_broker", "storage_broker",
&env.base_data_dir, &env.base_data_dir,
&env.storage_broker_bin(), &env.storage_broker_bin(),
args, args,
[], [],
background_process::InitialPidFile::Create(&storage_broker_pid_file_path(env)), background_process::InitialPidFile::Create(storage_broker_pid_file_path(env)),
|| { || async {
let url = broker.client_url(); let url = broker.client_url();
let status_url = url.join("status").with_context(|| { let status_url = url.join("status").with_context(|| {
format!("Failed to append /status path to broker endpoint {url}") format!("Failed to append /status path to broker endpoint {url}")
@@ -36,12 +36,13 @@ pub fn start_broker_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
.get(status_url) .get(status_url)
.build() .build()
.with_context(|| format!("Failed to construct request to broker endpoint {url}"))?; .with_context(|| format!("Failed to construct request to broker endpoint {url}"))?;
match client.execute(request) { match client.execute(request).await {
Ok(resp) => Ok(resp.status().is_success()), Ok(resp) => Ok(resp.status().is_success()),
Err(_) => Ok(false), Err(_) => Ok(false),
} }
}, },
) )
.await
.context("Failed to spawn storage_broker subprocess")?; .context("Failed to spawn storage_broker subprocess")?;
Ok(()) Ok(())
} }
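
Note on the readiness closure passed to `background_process::start_process` above: it returns `Ok(true)` when the broker answers `/status` successfully, `Ok(false)` when it is not reachable yet, and would return `Err` for a failure that should abort startup. The body of `start_process` is not part of this diff, so the following is only a minimal sketch of how such a three-way result could drive a polling loop. It is synchronous and uses only the standard library, whereas the version in the diff is `async`. The names `wait_until_ready` and `WaitOutcome`, and the attempt/delay parameters, are hypothetical.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Result of waiting for a spawned process (hypothetical type, not from the diff).
#[derive(Debug, PartialEq)]
pub enum WaitOutcome {
    /// The probe reported readiness on the given (1-based) attempt.
    Ready { attempts: u32 },
    /// The probe never reported readiness within the attempt budget.
    TimedOut,
}

/// Poll `probe` until it reports readiness.
///
/// * `Ok(true)`  -> the process is up, stop polling
/// * `Ok(false)` -> not up yet, sleep and retry
/// * `Err(_)`    -> unrecoverable, propagate immediately
pub fn wait_until_ready<F>(
    mut probe: F,
    max_attempts: u32,
    delay: Duration,
) -> Result<WaitOutcome, String>
where
    F: FnMut() -> Result<bool, String>,
{
    for attempt in 1..=max_attempts {
        if probe()? {
            return Ok(WaitOutcome::Ready { attempts: attempt });
        }
        sleep(delay);
    }
    Ok(WaitOutcome::TimedOut)
}
```

Keeping "not ready yet" separate from "failed" lets a probe treat a refused connection as a reason to retry while still surfacing real errors at once.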

@@ -464,7 +464,7 @@ impl Endpoint {
} }
} }
pub fn start( pub async fn start(
&self, &self,
auth_token: &Option<String>, auth_token: &Option<String>,
safekeepers: Vec<NodeId>, safekeepers: Vec<NodeId>,
@@ -587,7 +587,7 @@ impl Endpoint {
const MAX_ATTEMPTS: u32 = 10 * 30; // Wait up to 30 s const MAX_ATTEMPTS: u32 = 10 * 30; // Wait up to 30 s
loop { loop {
attempt += 1; attempt += 1;
match self.get_status() { match self.get_status().await {
Ok(state) => { Ok(state) => {
match state.status { match state.status {
ComputeStatus::Init => { ComputeStatus::Init => {
@@ -629,8 +629,8 @@ impl Endpoint {
} }
// Call the /status HTTP API // Call the /status HTTP API
pub fn get_status(&self) -> Result<ComputeState> { pub async fn get_status(&self) -> Result<ComputeState> {
let client = reqwest::blocking::Client::new(); let client = reqwest::Client::new();
let response = client let response = client
.request( .request(
@@ -641,16 +641,17 @@ impl Endpoint {
self.http_address.port() self.http_address.port()
), ),
) )
.send()?; .send()
.await?;
// Interpret the response // Interpret the response
let status = response.status(); let status = response.status();
if !(status.is_client_error() || status.is_server_error()) { if !(status.is_client_error() || status.is_server_error()) {
Ok(response.json()?) Ok(response.json().await?)
} else { } else {
// reqwest does not export its error construction utility functions, so let's craft the message ourselves // reqwest does not export its error construction utility functions, so let's craft the message ourselves
let url = response.url().to_owned(); let url = response.url().to_owned();
let msg = match response.text() { let msg = match response.text().await {
Ok(err_body) => format!("Error: {}", err_body), Ok(err_body) => format!("Error: {}", err_body),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url), Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
}; };
@@ -658,7 +659,7 @@ impl Endpoint {
} }
} }
pub fn reconfigure(&self, pageserver_id: Option<NodeId>) -> Result<()> { pub async fn reconfigure(&self, pageserver_id: Option<NodeId>) -> Result<()> {
let mut spec: ComputeSpec = { let mut spec: ComputeSpec = {
let spec_path = self.endpoint_path().join("spec.json"); let spec_path = self.endpoint_path().join("spec.json");
let file = std::fs::File::open(spec_path)?; let file = std::fs::File::open(spec_path)?;
@@ -687,7 +688,7 @@ impl Endpoint {
spec.pageserver_connstring = Some(format!("postgresql://no_user@{host}:{port}")); spec.pageserver_connstring = Some(format!("postgresql://no_user@{host}:{port}"));
} }
let client = reqwest::blocking::Client::new(); let client = reqwest::Client::new();
let response = client let response = client
.post(format!( .post(format!(
"http://{}:{}/configure", "http://{}:{}/configure",
@@ -698,14 +699,15 @@ impl Endpoint {
"{{\"spec\":{}}}", "{{\"spec\":{}}}",
serde_json::to_string_pretty(&spec)? serde_json::to_string_pretty(&spec)?
)) ))
.send()?; .send()
.await?;
let status = response.status(); let status = response.status();
if !(status.is_client_error() || status.is_server_error()) { if !(status.is_client_error() || status.is_server_error()) {
Ok(()) Ok(())
} else { } else {
let url = response.url().to_owned(); let url = response.url().to_owned();
let msg = match response.text() { let msg = match response.text().await {
Ok(err_body) => format!("Error: {}", err_body), Ok(err_body) => format!("Error: {}", err_body),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url), Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
}; };
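
Note on the response handling in `get_status` and `reconfigure` above: both treat any status that is neither a client error nor a server error as success, and otherwise build a message from the response body, falling back to a generic "Http error" string when the body cannot be read. A minimal standalone sketch of that decision follows. It works on a bare `u16` status code and an `Option<&str>` body instead of `reqwest::Response`, and the function names are hypothetical.

```rust
/// Equivalent of `status.is_client_error() || status.is_server_error()`
/// for a bare status code: 4xx and 5xx are errors.
fn is_error_status(status: u16) -> bool {
    (400..600).contains(&status)
}

/// Returns `None` for a non-error status, otherwise the message to report.
/// `body` is `Some(text)` when the response body could be read and `None`
/// when reading it failed (hypothetical simplification of `response.text()`).
fn describe_failure(status: u16, url: &str, body: Option<&str>) -> Option<String> {
    if !is_error_status(status) {
        return None;
    }
    Some(match body {
        Some(err_body) => format!("Error: {}", err_body),
        None => format!("Http error ({}) at {}.", status, url),
    })
}
```

Under these assumptions, `describe_failure(503, url, None)` produces the fallback message, while a readable body is passed through with an `Error:` prefix.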

@@ -6,28 +6,24 @@
//! //!
use std::borrow::Cow; use std::borrow::Cow;
use std::collections::HashMap; use std::collections::HashMap;
use std::fs::File;
use std::io::{BufReader, Write}; use std::io;
use std::io::Write;
use std::num::NonZeroU64; use std::num::NonZeroU64;
use std::path::PathBuf; use std::path::PathBuf;
use std::process::{Child, Command}; use std::process::{Child, Command};
use std::time::Duration; use std::time::Duration;
use std::{io, result};
use anyhow::{bail, Context}; use anyhow::{bail, Context};
use camino::Utf8PathBuf; use camino::Utf8PathBuf;
use pageserver_api::models::{ use futures::SinkExt;
self, LocationConfig, TenantInfo, TenantLocationConfigRequest, TimelineInfo, use pageserver_api::models::{self, LocationConfig, TenantInfo, TimelineInfo};
};
use pageserver_api::shard::TenantShardId; use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api;
use postgres_backend::AuthType; use postgres_backend::AuthType;
use postgres_connection::{parse_host_port, PgConnectionConfig}; use postgres_connection::{parse_host_port, PgConnectionConfig};
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use thiserror::Error;
use utils::auth::{Claims, Scope}; use utils::auth::{Claims, Scope};
use utils::{ use utils::{
http::error::HttpErrorBody,
id::{TenantId, TimelineId}, id::{TenantId, TimelineId},
lsn::Lsn, lsn::Lsn,
}; };
@@ -38,45 +34,6 @@ use crate::{background_process, local_env::LocalEnv};
/// Directory within .neon which will be used by default for LocalFs remote storage. /// Directory within .neon which will be used by default for LocalFs remote storage.
pub const PAGESERVER_REMOTE_STORAGE_DIR: &str = "local_fs_remote_storage/pageserver"; pub const PAGESERVER_REMOTE_STORAGE_DIR: &str = "local_fs_remote_storage/pageserver";
#[derive(Error, Debug)]
pub enum PageserverHttpError {
#[error("Reqwest error: {0}")]
Transport(#[from] reqwest::Error),
#[error("Error: {0}")]
Response(String),
}
impl From<anyhow::Error> for PageserverHttpError {
fn from(e: anyhow::Error) -> Self {
Self::Response(e.to_string())
}
}
type Result<T> = result::Result<T, PageserverHttpError>;
pub trait ResponseErrorMessageExt: Sized {
fn error_from_body(self) -> Result<Self>;
}
impl ResponseErrorMessageExt for Response {
fn error_from_body(self) -> Result<Self> {
let status = self.status();
if !(status.is_client_error() || status.is_server_error()) {
return Ok(self);
}
// reqwest does not export its error construction utility functions, so let's craft the message ourselves
let url = self.url().to_owned();
Err(PageserverHttpError::Response(
match self.json::<HttpErrorBody>() {
Ok(err_body) => format!("Error: {}", err_body.msg),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
},
))
}
}
// //
// Control routines for pageserver. // Control routines for pageserver.
// //
@@ -87,8 +44,7 @@ pub struct PageServerNode {
pub pg_connection_config: PgConnectionConfig, pub pg_connection_config: PgConnectionConfig,
pub conf: PageServerConf, pub conf: PageServerConf,
pub env: LocalEnv, pub env: LocalEnv,
pub http_client: Client, pub http_client: mgmt_api::Client,
pub http_base_url: String,
} }
impl PageServerNode { impl PageServerNode {
@@ -100,8 +56,19 @@ impl PageServerNode {
pg_connection_config: PgConnectionConfig::new_host_port(host, port), pg_connection_config: PgConnectionConfig::new_host_port(host, port),
conf: conf.clone(), conf: conf.clone(),
env: env.clone(), env: env.clone(),
http_client: Client::new(), http_client: mgmt_api::Client::new(
http_base_url: format!("http://{}/v1", conf.listen_http_addr), format!("http://{}", conf.listen_http_addr),
{
match conf.http_auth_type {
AuthType::Trust => None,
AuthType::NeonJWT => Some(
env.generate_auth_token(&Claims::new(None, Scope::PageServerApi))
.unwrap(),
),
}
}
.as_deref(),
),
} }
} }
@@ -182,8 +149,8 @@ impl PageServerNode {
.expect("non-Unicode path") .expect("non-Unicode path")
} }
pub fn start(&self, config_overrides: &[&str]) -> anyhow::Result<Child> { pub async fn start(&self, config_overrides: &[&str]) -> anyhow::Result<Child> {
self.start_node(config_overrides, false) self.start_node(config_overrides, false).await
} }
fn pageserver_init(&self, config_overrides: &[&str]) -> anyhow::Result<()> { fn pageserver_init(&self, config_overrides: &[&str]) -> anyhow::Result<()> {
@@ -224,7 +191,12 @@ impl PageServerNode {
Ok(()) Ok(())
} }
fn start_node(&self, config_overrides: &[&str], update_config: bool) -> anyhow::Result<Child> { async fn start_node(
&self,
config_overrides: &[&str],
update_config: bool,
) -> anyhow::Result<Child> {
// TODO: using a thread here because start_process() is not async but we need to call check_status()
let datadir = self.repo_path(); let datadir = self.repo_path();
print!( print!(
"Starting pageserver node {} at '{}' in {:?}", "Starting pageserver node {} at '{}' in {:?}",
@@ -232,7 +204,7 @@ impl PageServerNode {
self.pg_connection_config.raw_address(), self.pg_connection_config.raw_address(),
datadir datadir
); );
io::stdout().flush()?; io::stdout().flush().context("flush stdout")?;
let datadir_path_str = datadir.to_str().with_context(|| { let datadir_path_str = datadir.to_str().with_context(|| {
format!( format!(
@@ -244,20 +216,23 @@ impl PageServerNode {
if update_config { if update_config {
args.push(Cow::Borrowed("--update-config")); args.push(Cow::Borrowed("--update-config"));
} }
background_process::start_process( background_process::start_process(
"pageserver", "pageserver",
&datadir, &datadir,
&self.env.pageserver_bin(), &self.env.pageserver_bin(),
args.iter().map(Cow::as_ref), args.iter().map(Cow::as_ref),
self.pageserver_env_variables()?, self.pageserver_env_variables()?,
background_process::InitialPidFile::Expect(&self.pid_file()), background_process::InitialPidFile::Expect(self.pid_file()),
|| match self.check_status() { || async {
Ok(()) => Ok(true), let st = self.check_status().await;
Err(PageserverHttpError::Transport(_)) => Ok(false), match st {
Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")), Ok(()) => Ok(true),
Err(mgmt_api::Error::ReceiveBody(_)) => Ok(false),
Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
}
}, },
) )
.await
} }
fn pageserver_basic_args<'a>( fn pageserver_basic_args<'a>(
@@ -303,7 +278,12 @@ impl PageServerNode {
background_process::stop_process(immediate, "pageserver", &self.pid_file()) background_process::stop_process(immediate, "pageserver", &self.pid_file())
} }
pub fn page_server_psql_client(&self) -> anyhow::Result<postgres::Client> { pub async fn page_server_psql_client(
&self,
) -> anyhow::Result<(
tokio_postgres::Client,
tokio_postgres::Connection<tokio_postgres::Socket, tokio_postgres::tls::NoTlsStream>,
)> {
let mut config = self.pg_connection_config.clone(); let mut config = self.pg_connection_config.clone();
if self.conf.pg_auth_type == AuthType::NeonJWT { if self.conf.pg_auth_type == AuthType::NeonJWT {
let token = self let token = self
@@ -311,36 +291,18 @@ impl PageServerNode {
.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?; .generate_auth_token(&Claims::new(None, Scope::PageServerApi))?;
config = config.set_password(Some(token)); config = config.set_password(Some(token));
} }
Ok(config.connect_no_tls()?) Ok(config.connect_no_tls().await?)
} }
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> anyhow::Result<RequestBuilder> { pub async fn check_status(&self) -> mgmt_api::Result<()> {
let mut builder = self.http_client.request(method, url); self.http_client.status().await
if self.conf.http_auth_type == AuthType::NeonJWT {
let token = self
.env
.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?;
builder = builder.bearer_auth(token)
}
Ok(builder)
} }
pub fn check_status(&self) -> Result<()> { pub async fn tenant_list(&self) -> mgmt_api::Result<Vec<TenantInfo>> {
self.http_request(Method::GET, format!("{}/status", self.http_base_url))? self.http_client.list_tenants().await
.send()?
.error_from_body()?;
Ok(())
} }
pub fn tenant_list(&self) -> Result<Vec<TenantInfo>> { pub async fn tenant_create(
Ok(self
.http_request(Method::GET, format!("{}/tenant", self.http_base_url))?
.send()?
.error_from_body()?
.json()?)
}
pub fn tenant_create(
&self, &self,
new_tenant_id: TenantId, new_tenant_id: TenantId,
generation: Option<u32>, generation: Option<u32>,
@@ -418,23 +380,10 @@ impl PageServerNode {
if !settings.is_empty() { if !settings.is_empty() {
bail!("Unrecognized tenant settings: {settings:?}") bail!("Unrecognized tenant settings: {settings:?}")
} }
self.http_request(Method::POST, format!("{}/tenant", self.http_base_url))? Ok(self.http_client.tenant_create(&request).await?)
.json(&request)
.send()?
.error_from_body()?
.json::<Option<String>>()
.with_context(|| {
format!("Failed to parse tenant creation response for tenant id: {new_tenant_id:?}")
})?
.context("No tenant id was found in the tenant creation response")
.and_then(|tenant_id_string| {
tenant_id_string.parse().with_context(|| {
format!("Failed to parse response string as tenant id: '{tenant_id_string}'")
})
})
} }
pub fn tenant_config( pub async fn tenant_config(
&self, &self,
tenant_id: TenantId, tenant_id: TenantId,
mut settings: HashMap<&str, &str>, mut settings: HashMap<&str, &str>,
@@ -513,54 +462,30 @@ impl PageServerNode {
bail!("Unrecognized tenant settings: {settings:?}") bail!("Unrecognized tenant settings: {settings:?}")
} }
self.http_request(Method::PUT, format!("{}/tenant/config", self.http_base_url))? self.http_client
.json(&models::TenantConfigRequest { tenant_id, config }) .tenant_config(&models::TenantConfigRequest { tenant_id, config })
.send()? .await?;
.error_from_body()?;
Ok(()) Ok(())
} }
pub fn location_config( pub async fn location_config(
&self, &self,
tenant_id: TenantId, tenant_id: TenantId,
config: LocationConfig, config: LocationConfig,
flush_ms: Option<Duration>, flush_ms: Option<Duration>,
) -> anyhow::Result<()> { ) -> anyhow::Result<()> {
let req_body = TenantLocationConfigRequest { tenant_id, config }; Ok(self
.http_client
let path = format!( .location_config(tenant_id, config, flush_ms)
"{}/tenant/{}/location_config", .await?)
self.http_base_url, tenant_id
);
let path = if let Some(flush_ms) = flush_ms {
format!("{}?flush_ms={}", path, flush_ms.as_millis())
} else {
path
};
self.http_request(Method::PUT, path)?
.json(&req_body)
.send()?
.error_from_body()?;
Ok(())
} }
pub fn timeline_list(&self, tenant_id: &TenantId) -> anyhow::Result<Vec<TimelineInfo>> { pub async fn timeline_list(&self, tenant_id: &TenantId) -> anyhow::Result<Vec<TimelineInfo>> {
let timeline_infos: Vec<TimelineInfo> = self Ok(self.http_client.list_timelines(*tenant_id).await?)
.http_request(
Method::GET,
format!("{}/tenant/{}/timeline", self.http_base_url, tenant_id),
)?
.send()?
.error_from_body()?
.json()?;
Ok(timeline_infos)
} }
pub fn timeline_create( pub async fn timeline_create(
&self, &self,
tenant_id: TenantId, tenant_id: TenantId,
new_timeline_id: Option<TimelineId>, new_timeline_id: Option<TimelineId>,
@@ -571,29 +496,14 @@ impl PageServerNode {
) -> anyhow::Result<TimelineInfo> { ) -> anyhow::Result<TimelineInfo> {
// If timeline ID was not specified, generate one // If timeline ID was not specified, generate one
let new_timeline_id = new_timeline_id.unwrap_or(TimelineId::generate()); let new_timeline_id = new_timeline_id.unwrap_or(TimelineId::generate());
let req = models::TimelineCreateRequest {
self.http_request(
Method::POST,
format!("{}/tenant/{}/timeline", self.http_base_url, tenant_id),
)?
.json(&models::TimelineCreateRequest {
new_timeline_id, new_timeline_id,
ancestor_start_lsn, ancestor_start_lsn,
ancestor_timeline_id, ancestor_timeline_id,
pg_version, pg_version,
existing_initdb_timeline_id, existing_initdb_timeline_id,
}) };
.send()? Ok(self.http_client.timeline_create(tenant_id, &req).await?)
.error_from_body()?
.json::<Option<TimelineInfo>>()
.with_context(|| {
format!("Failed to parse timeline creation response for tenant id: {tenant_id}")
})?
.with_context(|| {
format!(
"No timeline id was found in the timeline creation response for tenant {tenant_id}"
)
})
} }
/// Import a basebackup prepared using either: /// Import a basebackup prepared using either:
@@ -605,7 +515,7 @@ impl PageServerNode {
/// * `timeline_id` - id to assign to imported timeline /// * `timeline_id` - id to assign to imported timeline
/// * `base` - (start lsn of basebackup, path to `base.tar` file) /// * `base` - (start lsn of basebackup, path to `base.tar` file)
/// * `pg_wal` - if there's any wal to import: (end lsn, path to `pg_wal.tar`) /// * `pg_wal` - if there's any wal to import: (end lsn, path to `pg_wal.tar`)
pub fn timeline_import( pub async fn timeline_import(
&self, &self,
tenant_id: TenantId, tenant_id: TenantId,
timeline_id: TimelineId, timeline_id: TimelineId,
@@ -613,36 +523,60 @@ impl PageServerNode {
pg_wal: Option<(Lsn, PathBuf)>, pg_wal: Option<(Lsn, PathBuf)>,
pg_version: u32, pg_version: u32,
) -> anyhow::Result<()> { ) -> anyhow::Result<()> {
let mut client = self.page_server_psql_client()?; let (client, conn) = self.page_server_psql_client().await?;
// The connection object performs the actual communication with the database,
// so spawn it off to run on its own.
tokio::spawn(async move {
if let Err(e) = conn.await {
eprintln!("connection error: {}", e);
}
});
tokio::pin!(client);
// Init base reader // Init base reader
let (start_lsn, base_tarfile_path) = base; let (start_lsn, base_tarfile_path) = base;
let base_tarfile = File::open(base_tarfile_path)?; let base_tarfile = tokio::fs::File::open(base_tarfile_path).await?;
let mut base_reader = BufReader::new(base_tarfile); let base_tarfile = tokio_util::io::ReaderStream::new(base_tarfile);
// Init wal reader if necessary // Init wal reader if necessary
let (end_lsn, wal_reader) = if let Some((end_lsn, wal_tarfile_path)) = pg_wal { let (end_lsn, wal_reader) = if let Some((end_lsn, wal_tarfile_path)) = pg_wal {
let wal_tarfile = File::open(wal_tarfile_path)?; let wal_tarfile = tokio::fs::File::open(wal_tarfile_path).await?;
let wal_reader = BufReader::new(wal_tarfile); let wal_reader = tokio_util::io::ReaderStream::new(wal_tarfile);
(end_lsn, Some(wal_reader)) (end_lsn, Some(wal_reader))
} else { } else {
(start_lsn, None) (start_lsn, None)
}; };
// Import base let copy_in = |reader, cmd| {
let import_cmd = format!( let client = &client;
"import basebackup {tenant_id} {timeline_id} {start_lsn} {end_lsn} {pg_version}" async move {
); let writer = client.copy_in(&cmd).await?;
let mut writer = client.copy_in(&import_cmd)?; let writer = std::pin::pin!(writer);
io::copy(&mut base_reader, &mut writer)?; let mut writer = writer.sink_map_err(|e| {
writer.finish()?; std::io::Error::new(std::io::ErrorKind::Other, format!("{e}"))
});
let mut reader = std::pin::pin!(reader);
writer.send_all(&mut reader).await?;
writer.into_inner().finish().await?;
anyhow::Ok(())
}
};
// Import base
copy_in(
base_tarfile,
format!(
"import basebackup {tenant_id} {timeline_id} {start_lsn} {end_lsn} {pg_version}"
),
)
.await?;
// Import wal if necessary // Import wal if necessary
if let Some(mut wal_reader) = wal_reader { if let Some(wal_reader) = wal_reader {
let import_cmd = format!("import wal {tenant_id} {timeline_id} {start_lsn} {end_lsn}"); copy_in(
let mut writer = client.copy_in(&import_cmd)?; wal_reader,
io::copy(&mut wal_reader, &mut writer)?; format!("import wal {tenant_id} {timeline_id} {start_lsn} {end_lsn}"),
writer.finish()?; )
.await?;
} }
Ok(()) Ok(())
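
Note on `location_config` in the file above: the removed code built the request path by hand and appended `?flush_ms=<milliseconds>` only when a flush timeout was given. The new code delegates this to `mgmt_api::Client::location_config`, whose body is not shown in this diff. The sketch below reproduces only the removed path construction as a standalone function. Taking the tenant ID as `&str` and the name `location_config_path` are assumptions made for illustration.

```rust
use std::time::Duration;

/// Build the location_config path the way the removed code did:
/// `{base}/tenant/{tenant_id}/location_config`, plus an optional
/// `flush_ms` query parameter expressed in whole milliseconds.
fn location_config_path(base_url: &str, tenant_id: &str, flush_ms: Option<Duration>) -> String {
    let path = format!("{}/tenant/{}/location_config", base_url, tenant_id);
    match flush_ms {
        Some(timeout) => format!("{}?flush_ms={}", path, timeout.as_millis()),
        None => path,
    }
}
```

Moving this construction behind a shared client keeps the path format in one place instead of repeating it in each caller.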

@@ -13,7 +13,6 @@ use std::{io, result};
use anyhow::Context; use anyhow::Context;
use camino::Utf8PathBuf; use camino::Utf8PathBuf;
use postgres_connection::PgConnectionConfig; use postgres_connection::PgConnectionConfig;
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method}; use reqwest::{IntoUrl, Method};
use thiserror::Error; use thiserror::Error;
use utils::{http::error::HttpErrorBody, id::NodeId}; use utils::{http::error::HttpErrorBody, id::NodeId};
@@ -34,12 +33,14 @@ pub enum SafekeeperHttpError {
type Result<T> = result::Result<T, SafekeeperHttpError>; type Result<T> = result::Result<T, SafekeeperHttpError>;
#[async_trait::async_trait]
pub trait ResponseErrorMessageExt: Sized { pub trait ResponseErrorMessageExt: Sized {
fn error_from_body(self) -> Result<Self>; async fn error_from_body(self) -> Result<Self>;
} }
impl ResponseErrorMessageExt for Response { #[async_trait::async_trait]
fn error_from_body(self) -> Result<Self> { impl ResponseErrorMessageExt for reqwest::Response {
async fn error_from_body(self) -> Result<Self> {
let status = self.status(); let status = self.status();
if !(status.is_client_error() || status.is_server_error()) { if !(status.is_client_error() || status.is_server_error()) {
return Ok(self); return Ok(self);
@@ -48,7 +49,7 @@ impl ResponseErrorMessageExt for Response {
// reqwest does not export its error construction utility functions, so let's craft the message ourselves // reqwest does not export its error construction utility functions, so let's craft the message ourselves
let url = self.url().to_owned(); let url = self.url().to_owned();
Err(SafekeeperHttpError::Response( Err(SafekeeperHttpError::Response(
match self.json::<HttpErrorBody>() { match self.json::<HttpErrorBody>().await {
Ok(err_body) => format!("Error: {}", err_body.msg), Ok(err_body) => format!("Error: {}", err_body.msg),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url), Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
}, },
@@ -69,7 +70,7 @@ pub struct SafekeeperNode {
pub pg_connection_config: PgConnectionConfig, pub pg_connection_config: PgConnectionConfig,
pub env: LocalEnv, pub env: LocalEnv,
pub http_client: Client, pub http_client: reqwest::Client,
pub http_base_url: String, pub http_base_url: String,
} }
@@ -80,7 +81,7 @@ impl SafekeeperNode {
conf: conf.clone(), conf: conf.clone(),
pg_connection_config: Self::safekeeper_connection_config(conf.pg_port), pg_connection_config: Self::safekeeper_connection_config(conf.pg_port),
env: env.clone(), env: env.clone(),
http_client: Client::new(), http_client: reqwest::Client::new(),
http_base_url: format!("http://127.0.0.1:{}/v1", conf.http_port), http_base_url: format!("http://127.0.0.1:{}/v1", conf.http_port),
} }
} }
@@ -103,7 +104,7 @@ impl SafekeeperNode {
.expect("non-Unicode path") .expect("non-Unicode path")
} }
pub fn start(&self, extra_opts: Vec<String>) -> anyhow::Result<Child> { pub async fn start(&self, extra_opts: Vec<String>) -> anyhow::Result<Child> {
print!( print!(
"Starting safekeeper at '{}' in '{}'", "Starting safekeeper at '{}' in '{}'",
self.pg_connection_config.raw_address(), self.pg_connection_config.raw_address(),
@@ -191,13 +192,16 @@ impl SafekeeperNode {
&self.env.safekeeper_bin(), &self.env.safekeeper_bin(),
&args, &args,
[], [],
background_process::InitialPidFile::Expect(&self.pid_file()), background_process::InitialPidFile::Expect(self.pid_file()),
|| match self.check_status() { || async {
Ok(()) => Ok(true), match self.check_status().await {
Err(SafekeeperHttpError::Transport(_)) => Ok(false), Ok(()) => Ok(true),
Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")), Err(SafekeeperHttpError::Transport(_)) => Ok(false),
Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
}
}, },
) )
.await
} }
/// ///
@@ -216,7 +220,7 @@ impl SafekeeperNode {
) )
} }
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> RequestBuilder { fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> reqwest::RequestBuilder {
// TODO: authentication // TODO: authentication
//if self.env.auth_type == AuthType::NeonJWT { //if self.env.auth_type == AuthType::NeonJWT {
// builder = builder.bearer_auth(&self.env.safekeeper_auth_token) // builder = builder.bearer_auth(&self.env.safekeeper_auth_token)
@@ -224,10 +228,12 @@ impl SafekeeperNode {
self.http_client.request(method, url) self.http_client.request(method, url)
} }
pub fn check_status(&self) -> Result<()> { pub async fn check_status(&self) -> Result<()> {
self.http_request(Method::GET, format!("{}/{}", self.http_base_url, "status")) self.http_request(Method::GET, format!("{}/{}", self.http_base_url, "status"))
.send()? .send()
.error_from_body()?; .await?
.error_from_body()
.await?;
Ok(()) Ok(())
} }
} }
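
Note on the readiness closure in `SafekeeperNode::start` above: a transport-level failure means the process is not listening yet and maps to `Ok(false)`, while any other error aborts startup. A minimal standalone sketch of that mapping follows. `ProbeError` is a hypothetical stand-in for `SafekeeperHttpError` with `String` payloads, and the error is returned as a plain `String` rather than `anyhow::Error`.

```rust
/// Hypothetical stand-in for `SafekeeperHttpError`.
#[allow(dead_code)] // the Transport payload is carried but never inspected here
enum ProbeError {
    /// The request never reached the server (for example, connection refused).
    Transport(String),
    /// The server answered, but with an error.
    Response(String),
}

/// Map one status-check result to the tri-state readiness value:
/// ready, not ready yet, or a hard failure.
fn readiness(result: Result<(), ProbeError>) -> Result<bool, String> {
    match result {
        Ok(()) => Ok(true),
        Err(ProbeError::Transport(_)) => Ok(false),
        Err(ProbeError::Response(msg)) => Err(format!("Failed to check node status: {msg}")),
    }
}
```

The distinction matters during startup: a refused connection is expected for the first few polls, whereas an error response suggests the process is running but misbehaving.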

@@ -19,11 +19,11 @@ use utils::{
}; };
/// Given an attached pageserver, retrieve the LSN for all timelines /// Given an attached pageserver, retrieve the LSN for all timelines
fn get_lsns( async fn get_lsns(
tenant_id: TenantId, tenant_id: TenantId,
pageserver: &PageServerNode, pageserver: &PageServerNode,
) -> anyhow::Result<HashMap<TimelineId, Lsn>> { ) -> anyhow::Result<HashMap<TimelineId, Lsn>> {
let timelines = pageserver.timeline_list(&tenant_id)?; let timelines = pageserver.timeline_list(&tenant_id).await?;
Ok(timelines Ok(timelines
.into_iter() .into_iter()
.map(|t| (t.timeline_id, t.last_record_lsn)) .map(|t| (t.timeline_id, t.last_record_lsn))
@@ -32,13 +32,13 @@ fn get_lsns(
/// Wait for the timeline LSNs on `pageserver` to catch up with or overtake /// Wait for the timeline LSNs on `pageserver` to catch up with or overtake
/// `baseline`. /// `baseline`.
fn await_lsn( async fn await_lsn(
tenant_id: TenantId, tenant_id: TenantId,
pageserver: &PageServerNode, pageserver: &PageServerNode,
baseline: HashMap<TimelineId, Lsn>, baseline: HashMap<TimelineId, Lsn>,
) -> anyhow::Result<()> { ) -> anyhow::Result<()> {
loop { loop {
let latest = match get_lsns(tenant_id, pageserver) { let latest = match get_lsns(tenant_id, pageserver).await {
Ok(l) => l, Ok(l) => l,
Err(e) => { Err(e) => {
println!( println!(
@@ -84,7 +84,7 @@ fn await_lsn(
/// - Coordinate attach/secondary/detach on pageservers /// - Coordinate attach/secondary/detach on pageservers
/// - call into attachment_service for generations /// - call into attachment_service for generations
/// - reconfigure compute endpoints to point to new attached pageserver /// - reconfigure compute endpoints to point to new attached pageserver
pub fn migrate_tenant( pub async fn migrate_tenant(
env: &LocalEnv, env: &LocalEnv,
tenant_id: TenantId, tenant_id: TenantId,
dest_ps: PageServerNode, dest_ps: PageServerNode,
@@ -108,16 +108,18 @@ pub fn migrate_tenant(
} }
} }
let previous = attachment_service.inspect(tenant_id)?; let previous = attachment_service.inspect(tenant_id).await?;
let mut baseline_lsns = None; let mut baseline_lsns = None;
if let Some((generation, origin_ps_id)) = &previous { if let Some((generation, origin_ps_id)) = &previous {
let origin_ps = PageServerNode::from_env(env, env.get_pageserver_conf(*origin_ps_id)?); let origin_ps = PageServerNode::from_env(env, env.get_pageserver_conf(*origin_ps_id)?);
if origin_ps_id == &dest_ps.conf.id { if origin_ps_id == &dest_ps.conf.id {
println!("🔁 Already attached to {origin_ps_id}, freshening..."); println!("🔁 Already attached to {origin_ps_id}, freshening...");
let gen = attachment_service.attach_hook(tenant_id, dest_ps.conf.id)?; let gen = attachment_service
.attach_hook(tenant_id, dest_ps.conf.id)
.await?;
let dest_conf = build_location_config(LocationConfigMode::AttachedSingle, gen, None); let dest_conf = build_location_config(LocationConfigMode::AttachedSingle, gen, None);
dest_ps.location_config(tenant_id, dest_conf, None)?; dest_ps.location_config(tenant_id, dest_conf, None).await?;
println!("✅ Migration complete"); println!("✅ Migration complete");
return Ok(()); return Ok(());
} }
@@ -126,20 +128,24 @@ pub fn migrate_tenant(
let stale_conf = let stale_conf =
build_location_config(LocationConfigMode::AttachedStale, Some(*generation), None); build_location_config(LocationConfigMode::AttachedStale, Some(*generation), None);
origin_ps.location_config(tenant_id, stale_conf, Some(Duration::from_secs(10)))?; origin_ps
.location_config(tenant_id, stale_conf, Some(Duration::from_secs(10)))
.await?;
baseline_lsns = Some(get_lsns(tenant_id, &origin_ps)?); baseline_lsns = Some(get_lsns(tenant_id, &origin_ps).await?);
} }
let gen = attachment_service.attach_hook(tenant_id, dest_ps.conf.id)?; let gen = attachment_service
.attach_hook(tenant_id, dest_ps.conf.id)
.await?;
let dest_conf = build_location_config(LocationConfigMode::AttachedMulti, gen, None); let dest_conf = build_location_config(LocationConfigMode::AttachedMulti, gen, None);
println!("🔁 Attaching to pageserver {}", dest_ps.conf.id); println!("🔁 Attaching to pageserver {}", dest_ps.conf.id);
dest_ps.location_config(tenant_id, dest_conf, None)?; dest_ps.location_config(tenant_id, dest_conf, None).await?;
if let Some(baseline) = baseline_lsns { if let Some(baseline) = baseline_lsns {
println!("🕑 Waiting for LSN to catch up..."); println!("🕑 Waiting for LSN to catch up...");
await_lsn(tenant_id, &dest_ps, baseline)?; await_lsn(tenant_id, &dest_ps, baseline).await?;
} }
let cplane = ComputeControlPlane::load(env.clone())?; let cplane = ComputeControlPlane::load(env.clone())?;
@@ -149,7 +155,7 @@ pub fn migrate_tenant(
"🔁 Reconfiguring endpoint {} to use pageserver {}", "🔁 Reconfiguring endpoint {} to use pageserver {}",
endpoint_name, dest_ps.conf.id endpoint_name, dest_ps.conf.id
); );
endpoint.reconfigure(Some(dest_ps.conf.id))?; endpoint.reconfigure(Some(dest_ps.conf.id)).await?;
} }
} }
@@ -159,7 +165,7 @@ pub fn migrate_tenant(
} }
let other_ps = PageServerNode::from_env(env, other_ps_conf); let other_ps = PageServerNode::from_env(env, other_ps_conf);
let other_ps_tenants = other_ps.tenant_list()?; let other_ps_tenants = other_ps.tenant_list().await?;
// Check if this tenant is attached // Check if this tenant is attached
let found = other_ps_tenants let found = other_ps_tenants
@@ -181,7 +187,9 @@ pub fn migrate_tenant(
"💤 Switching to secondary mode on pageserver {}", "💤 Switching to secondary mode on pageserver {}",
other_ps.conf.id other_ps.conf.id
); );
other_ps.location_config(tenant_id, secondary_conf, None)?; other_ps
.location_config(tenant_id, secondary_conf, None)
.await?;
} }
println!( println!(
@@ -189,7 +197,7 @@ pub fn migrate_tenant(
dest_ps.conf.id dest_ps.conf.id
); );
let dest_conf = build_location_config(LocationConfigMode::AttachedSingle, gen, None); let dest_conf = build_location_config(LocationConfigMode::AttachedSingle, gen, None);
dest_ps.location_config(tenant_id, dest_conf, None)?; dest_ps.location_config(tenant_id, dest_conf, None).await?;
println!("✅ Migration complete"); println!("✅ Migration complete");

View File

@@ -0,0 +1,197 @@
# Per-Tenant GetPage@LSN Throttling
Author: Christian Schwarz
Date: Oct 24, 2023
## Summary
This RFC proposes per-tenant throttling of GetPage@LSN requests inside Pageserver
and the interactions with its client, i.e., the neon_smgr component in Compute.
The result of implementing & executing this RFC will be a fleet-wide upper limit for
**"the highest GetPage/second that Pageserver can support for a single tenant/shard"**.
## Background
### GetPage@LSN Request Flow
Pageserver exposes its `page_service.rs` as a libpq listener.
The Computes' `neon_smgr` module connects to that libpq listener.
Once a connection is established, the protocol allows Compute to request page images at a given LSN.
We call these requests GetPage@LSN requests, or GetPage requests for short.
Other request types can be sent, but these are low traffic compared to GetPage requests
and are not the concern of this RFC.
Pageserver associates one libpq connection with one tokio task.
Per connection/task, the pq protocol is handled by the common `postgres_backend` crate.
Its `run_message_loop` function invokes the `page_service` specific `impl<IO> postgres_backend::Handler<IO> for PageServerHandler`.
Requests are processed in the order in which they arrive via the TCP-based pq protocol.
So, there is no concurrent request processing within one connection/task.
There is a degree of natural pipelining:
Compute can "fill the pipe" by sending more than one GetPage request into the libpq TCP stream.
And Pageserver can fill the pipe with responses in the other direction.
Both directions are subject to the limit of tx/rx buffers, nodelay, TCP flow control, etc.
### GetPage@LSN Access Pattern
The Compute has its own hierarchy of caches, specifically `shared_buffers` and the `local file cache` (LFC).
Compute only issues GetPage requests to Pageserver if it encounters a miss in these caches.
If the working set stops fitting into Compute's caches, requests to Pageserver increase sharply -- the Compute starts *thrashing*.
## Motivation
In INC-69, a tenant issued 155k GetPage/second for a period of 10 minutes and 60k GetPage/second for a period of 3h,
then dropping to ca 18k GetPage/second for a period of 9h.
We noticed this because of an internal GetPage latency SLO burn rate alert, i.e.,
the request latency profile during this period significantly exceeded what was acceptable according to the internal SLO.
Sadly, we do not have the observability data to determine the impact of this tenant on other tenants on the same Pageserver.
However, here are some illustrative data points for the 155k period:
The tenant was responsible for >= 99% of the GetPage traffic and, frankly, the overall activity on this Pageserver instance.
We were serving pages at ca. 10 Gb/s (`155k x 8 KiB (PAGE_SZ) per second is 1.18GiB/s = 10.2Gb/s.`)
The CPU utilization of the instance was 75% user+system.
Pageserver page cache served 1.75M accesses/second at a hit rate of ca 90%.
The hit rate for materialized pages was ca. 40%.
Curiously, IOPS to the Instance Store NVMe were very low, rarely exceeding 100.
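The bandwidth data point above can be rechecked with a few lines of arithmetic. This is a minimal sketch; it assumes that "8 kbyte" means the Postgres default block size of 8192 bytes, and `bytes_per_second` is a helper name made up for the illustration.

```rust
/// Bytes served per second for a given GetPage rate and page size.
fn bytes_per_second(getpage_per_second: u64, page_size_bytes: u64) -> u64 {
    getpage_per_second * page_size_bytes
}

fn main() {
    // Assumption: one GetPage response carries one 8192-byte page image.
    let bps = bytes_per_second(155_000, 8192);
    // GiB/s uses 2^30 bytes; Gb/s uses 10^9 bits.
    let gib_per_s = bps as f64 / (1u64 << 30) as f64;
    let gbit_per_s = (bps * 8) as f64 / 1e9;
    println!("{bps} B/s = {gib_per_s:.2} GiB/s = {gbit_per_s:.2} Gb/s");
}
```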
The fact that the IOPS were so low / the materialized page cache hit rate was so high suggests that **this tenant's compute's caches were thrashing**.
The compute was of type `k8s-pod`; hence, auto-scaling could/would not have helped remediate the thrashing by provisioning more RAM.
The consequence was that the **thrashing translated into excessive GetPage requests against Pageserver**.
My claim is that it was **unhealthy to serve this workload at the pace we did**:
* it is likely that other tenants experienced, or would have experienced, high latencies (again, we sadly don't have per-tenant latency data to confirm this)
* more importantly, it was **unsustainable** to serve traffic at this pace for multiple reasons:
* **predictability of performance**: when the working set grows, the pageserver materialized page cache hit rate drops.
At some point, we're bound by the EC2 Instance Store NVMe drive's IOPS limit.
The result is an **uneven** performance profile from the Compute perspective.
* **economics**: Neon currently does not charge for IOPS, only capacity.
**We cannot afford to undercut the market in IOPS/$ this drastically; it leads to adverse selection and perverse incentives.**
For example, the 155k IOPS, which we served for 10min, would cost ca. 6.5k$/month when provisioned as an io2 EBS volume.
Even the 18k IOPS, which we served for 9h, would cost ca. 1.1k$/month when provisioned as an io2 EBS volume.
We charge 0$.
It could be economically advantageous to keep using a low-DRAM compute because Pageserver IOPS are fast enough and free.
Note: It is helpful to think of Pageserver as a disk, because it's precisely where `neon_smgr` sits:
vanilla Postgres gets its pages from disk, Neon Postgres gets them from Pageserver.
So, regarding the above performance & economic arguments, it is fair to say that we currently provide an "as-fast-as-possible-IOPS" disk that we charge for only by capacity.
## Solution: Throttling GetPage Requests
**The consequence of the above analysis must be that Pageserver throttles GetPage@LSN requests**.
That is, unless we want to start charging for provisioned GetPage@LSN/second.
Throttling sets the correct incentive for a thrashing Compute to scale up its DRAM to the working set size.
Neon Autoscaling will make this easy, [eventually](https://github.com/neondatabase/neon/pull/3913).
## The Design Space
What remains is the question of *policy* and *mechanism*:
**Policy** concerns itself with the question of what limit applies to a given connection|timeline|tenant.
Candidates are:
* hard limit, same limit value per connection|timeline|tenant
* Per-tenant will provide an upper bound for the impact of a tenant on a given Pageserver instance.
This is a major operational pain point / risk right now.
* hard limit, configurable per connection|timeline|tenant
* This outsources policy to console/control plane, with obvious advantages for flexible structuring of what service we offer to customers.
* Note that this is not a mechanism to guarantee a minimum provisioned rate, i.e., this is not a mechanism to guarantee a certain QoS for a tenant.
* fair share among active connections|timelines|tenants per instance
* example: each connection|timeline|tenant gets a fair fraction of the machine's GetPage/second capacity
* NB: needs definition of "active", and knowledge of available GetPage/second capacity in advance
* ...
Regarding **mechanism**, it's clear that **backpressure** is the way to go.
However, we must choose between
* **implicit** backpressure through pq/TCP and
* **explicit** rejection of requests + retries with exponential backoff
Further, there is the question of how throttling GetPage@LSN will affect the **internal GetPage latency SLO**:
where do we measure the SLI for Pageserver's internal getpage latency SLO? Before or after the throttling?
And when we eventually move the measurement point into the Computes (to avoid coordinated omission),
how do we avoid counting throttling-induced latency toward the internal getpage latency SLI/SLO?
## Scope Of This RFC
**This RFC proposes introducing a hard GetPage@LSN/second limit per tenant, with the same value applying to each tenant on a Pageserver**.
This proposal is easy to implement and significantly de-risks operating large Pageservers,
based on the assumption that extremely-high-GetPage-rate-episodes like the one from the "Motivation" section are uncorrelated between tenants.
For example, suppose we pick a limit that allows up to 10 tenants to go at limit rate.
Suppose our Pageserver can serve 100k GetPage/second total at a 100% page cache miss rate.
If each tenant gets a hard limit of 10k GetPage/second, we can serve up to 10 tenants at limit speed without latency degradation.
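The sizing example can be written down as a small calculation. The sketch below uses the illustrative numbers from the text (100k GetPage/second capacity, 10k limit); the per-tenant demand values and the function names are invented for the example.

```rust
/// Load (in GetPage/second) that reaches the Pageserver once every tenant
/// is clamped to the same hard limit.
fn throttled_load(demands: &[u64], per_tenant_limit: u64) -> u64 {
    demands.iter().map(|d| (*d).min(per_tenant_limit)).sum()
}

/// How many tenants can run at the limit without exceeding instance capacity.
fn tenants_at_limit(capacity: u64, per_tenant_limit: u64) -> u64 {
    capacity / per_tenant_limit
}

fn main() {
    let capacity = 100_000; // GetPage/second at 100% page cache miss rate
    let limit = 10_000; // hard per-tenant limit
    // Hypothetical demands; the first one mimics the tenant from "Motivation".
    let demands = [155_000, 2_000, 500, 12_000];

    let load = throttled_load(&demands, limit);
    println!("tenants at limit: {}", tenants_at_limit(capacity, limit));
    println!("throttled load: {load} of {capacity} GetPage/second");
}
```

Without the limit, the first tenant alone would exceed the assumed instance capacity; with it, the total stays well below.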
The mechanism for backpressure will be TCP-based implicit backpressure.
The compute team isn't concerned about prefetch queue depth.
Pageserver will implement it by delaying the reading of requests from the libpq connection(s).
The rate limit will be implemented using a per-tenant token bucket.
The bucket will be shared among all connections to the tenant.
The bucket implementation supports starvation-preventing `await`ing.
The current candidate for the implementation is [`leaky_bucket`](https://docs.rs/leaky-bucket/).
The getpage@lsn benchmark that's being added in https://github.com/neondatabase/neon/issues/5771
can be used to evaluate the overhead of sharing the bucket among connections of a tenant.
A possible technique to mitigate the impact of sharing the bucket would be to maintain a buffer of a few tokens per connection handler.
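To make the mechanism concrete, here is a minimal token bucket sketch using only the standard library. It is not the `leaky_bucket` API and not the proposed implementation: `TokenBucket`, `try_acquire`, and the numbers are hypothetical, threads stand in for tokio tasks, and the sketch makes no attempt at the starvation prevention mentioned above. It assumes a refill rate greater than zero.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical per-tenant token bucket, shared by all connection handlers.
struct TokenBucket {
    capacity: f64,
    refill_per_sec: f64,
    tokens: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: u32, refill_per_sec: u32, now: Instant) -> Self {
        Self {
            capacity: f64::from(capacity),
            refill_per_sec: f64::from(refill_per_sec),
            tokens: f64::from(capacity), // start with a full bucket
            last_refill: now,
        }
    }

    /// Takes one token if available; otherwise returns how long to wait.
    fn try_acquire(&mut self, now: Instant) -> Result<(), Duration> {
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            Err(Duration::from_secs_f64((1.0 - self.tokens) / self.refill_per_sec))
        }
    }
}

fn main() {
    // One bucket per tenant: burst of 2, refilled at 10 tokens/second.
    let bucket = Arc::new(Mutex::new(TokenBucket::new(2, 10, Instant::now())));
    let handlers: Vec<_> = (0..2)
        .map(|conn| {
            let bucket = Arc::clone(&bucket);
            std::thread::spawn(move || {
                for req in 0..3 {
                    // Delay *reading* the next request until a token is free;
                    // the unread bytes back up into the TCP buffers.
                    loop {
                        let outcome = bucket.lock().unwrap().try_acquire(Instant::now());
                        match outcome {
                            Ok(()) => break,
                            Err(wait) => std::thread::sleep(wait),
                        }
                    }
                    println!("connection {conn}: serving request {req}");
                }
            })
        })
        .collect();
    for handler in handlers {
        handler.join().unwrap();
    }
}
```

The lock is released before sleeping, so a waiting handler does not block the others; the contention on the shared mutex is the overhead that the per-connection token buffer would reduce.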
Regarding metrics / the internal GetPage latency SLO:
we will measure the GetPage latency SLO _after_ the throttler and introduce a new metric to measure the amount of throttling, quantified by:
- histogram that records the tenants' observations of queue depth before they start waiting (one such histogram per pageserver)
- histogram that records the tenants' observations of time spent waiting (one such histogram per pageserver)
Further observability measures:
- an INFO log message at frequency 1/min if the tenant/timeline/connection was throttled in that last minute.
The message will identify the tenant/timeline/connection to allow correlation with compute logs/stats.
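One way to keep the log volume at roughly one message per minute is to count throttle events and emit a summary when the interval has elapsed. The sketch below is an assumption about how this could look, not part of the proposal; `ThrottleLogGate` and `record` are hypothetical names. Note that it only emits when a later throttle event arrives, so a strict 1/min cadence would additionally need a timer-driven flush.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: collapses throttle events into at most one
/// log line per interval.
struct ThrottleLogGate {
    interval: Duration,
    window_start: Option<Instant>,
    throttled_in_window: u64,
}

impl ThrottleLogGate {
    fn new(interval: Duration) -> Self {
        Self {
            interval,
            window_start: None,
            throttled_in_window: 0,
        }
    }

    /// Records one throttled request. Returns `Some(count)` when the
    /// interval has elapsed and a log line covering `count` events is due.
    fn record(&mut self, now: Instant) -> Option<u64> {
        let start = *self.window_start.get_or_insert(now);
        self.throttled_in_window += 1;
        if now.saturating_duration_since(start) >= self.interval {
            let count = self.throttled_in_window;
            self.throttled_in_window = 0;
            self.window_start = Some(now);
            Some(count)
        } else {
            None
        }
    }
}

fn main() {
    let t0 = Instant::now();
    let mut gate = ThrottleLogGate::new(Duration::from_secs(60));
    // Synthetic timestamps instead of real waiting.
    for offset_secs in [0, 30, 61] {
        let now = t0 + Duration::from_secs(offset_secs);
        if let Some(count) = gate.record(now) {
            // A real implementation would include tenant/timeline/connection ids.
            println!("INFO throttled {count} requests in the last interval");
        }
    }
}
```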
Rollout will happen as follows:
- deploy 1: implementation + config: disabled by default, ability to enable it per tenant through tenant_conf
- experimentation in staging and later production to study impact & interaction with auto-scaling
- determination of a sensible global default value
- the value will be chosen as high as possible ...
- ... but low enough to work towards this RFC's goal that one tenant should not be able to dominate a pageserver instance.
- deploy 2: implementation fixes if any + config: enabled by default with the aforementioned global default
- reset of the experimental per-tenant overrides
- gain experience & lower the limit over time
- we stop lowering the limit as soon as this RFC's goal is achieved, i.e.,
once we decide that in practice the chosen value sufficiently de-risks operating large pageservers
The per-tenant override will remain for emergencies and testing.
But since Console doesn't preserve it during tenant migrations, it isn't durably configurable for the tenant.
Toward the upper layers of the Neon stack, the resulting limit will be
**"the highest GetPage/second that Pageserver can support for a single tenant"**.
### Rationale
We decided against error + retry because of worries about starvation.
## Future Work
Enable per-tenant emergency override of the limit via Console.
Should be part of a more general framework to specify tenant config overrides.
**NB:** this is **not** the right mechanism to _sell_ different max GetPage/second levels to users,
or _auto-scale_ the GetPage/second levels. Such functionality will require a separate RFC that
concerns itself with GetPage/second capacity planning.
Compute-side metrics for GetPage latency.
Back-channel to inform Compute/Autoscaling/ControlPlane that the project is being throttled.
Compute-side neon_smgr improvements to avoid sending the same GetPage request multiple times if multiple backends experience a cache miss.
Dealing with read-only endpoints: users use read-only endpoints to scale reads for a single tenant.
Possibly there are also assumptions around read-only endpoints not affecting the primary read-write endpoint's performance.
With per-tenant rate limiting, we will not meet that expectation.
However, we can currently only scale per tenant.
Soon, we will have sharding (#5505), which will apply the throttling on a per-shard basis.
But, that's orthogonal to scaling reads: if many endpoints hit one shard, they share the same throttling limit.
To solve this properly, I think we'll need replicas for tenants/shards.
To performance-isolate a tenant's endpoints from each other, we'd then route them to different replicas.

View File

@@ -24,3 +24,4 @@ workspace_hack.workspace = true
[dev-dependencies] [dev-dependencies]
bincode.workspace = true bincode.workspace = true
rand.workspace = true

View File

@@ -144,3 +144,37 @@ impl Key {
pub fn is_rel_block_key(key: &Key) -> bool { pub fn is_rel_block_key(key: &Key) -> bool {
key.field1 == 0x00 && key.field4 != 0 key.field1 == 0x00 && key.field4 != 0
} }
impl std::str::FromStr for Key {
type Err = anyhow::Error;
fn from_str(s: &str) -> std::result::Result<Self, Self::Err> {
Self::from_hex(s)
}
}
#[cfg(test)]
mod tests {
use std::str::FromStr;
use crate::key::Key;
use rand::Rng;
use rand::SeedableRng;
#[test]
fn display_fromstr_bijection() {
let mut rng = rand::rngs::StdRng::seed_from_u64(42);
let key = Key {
field1: rng.gen(),
field2: rng.gen(),
field3: rng.gen(),
field4: rng.gen(),
field5: rng.gen(),
field6: rng.gen(),
};
assert_eq!(key, Key::from_str(&format!("{key}")).unwrap());
}
}

View File

@@ -1,11 +1,12 @@
use crate::repository::{key_range_size, singleton_range, Key};
use postgres_ffi::BLCKSZ; use postgres_ffi::BLCKSZ;
use std::ops::Range; use std::ops::Range;
use crate::key::Key;
/// ///
/// Represents a set of Keys, in a compact form. /// Represents a set of Keys, in a compact form.
/// ///
#[derive(Clone, Debug, Default)] #[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct KeySpace { pub struct KeySpace {
/// Contiguous ranges of keys that belong to the key space. In key order, /// Contiguous ranges of keys that belong to the key space. In key order,
/// and with no overlap. /// and with no overlap.
@@ -186,6 +187,33 @@ impl KeySpaceRandomAccum {
} }
} }
pub fn key_range_size(key_range: &Range<Key>) -> u32 {
let start = key_range.start;
let end = key_range.end;
if end.field1 != start.field1
|| end.field2 != start.field2
|| end.field3 != start.field3
|| end.field4 != start.field4
{
return u32::MAX;
}
let start = (start.field5 as u64) << 32 | start.field6 as u64;
let end = (end.field5 as u64) << 32 | end.field6 as u64;
let diff = end - start;
if diff > u32::MAX as u64 {
u32::MAX
} else {
diff as u32
}
}
pub fn singleton_range(key: Key) -> Range<Key> {
key..key.next()
}
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use super::*; use super::*;
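Reviewer note on `key_range_size`: a minimal sketch of its behavior, with the function body copied from the hunk above. The `Key` struct here is a simplified stand-in and its field widths are an assumption, not taken from this diff. The function only yields an exact size when the two keys differ in `field5`/`field6` alone and the distance fits into a `u32`; everything else saturates to `u32::MAX`. It also assumes `end >= start`.

```rust
use std::ops::Range;

/// Simplified stand-in for the pageserver `Key` (field widths assumed).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Key {
    field1: u8,
    field2: u32,
    field3: u32,
    field4: u32,
    field5: u8,
    field6: u32,
}

fn key_range_size(key_range: &Range<Key>) -> u32 {
    let start = key_range.start;
    let end = key_range.end;
    // Ranges that span more than the last two fields are treated as "huge".
    if end.field1 != start.field1
        || end.field2 != start.field2
        || end.field3 != start.field3
        || end.field4 != start.field4
    {
        return u32::MAX;
    }
    let start = (start.field5 as u64) << 32 | start.field6 as u64;
    let end = (end.field5 as u64) << 32 | end.field6 as u64;
    let diff = end - start;
    if diff > u32::MAX as u64 {
        u32::MAX
    } else {
        diff as u32
    }
}

fn main() {
    let base = Key { field1: 0, field2: 1663, field3: 1, field4: 16384, field5: 0, field6: 0 };
    let ten_blocks = base..Key { field6: 10, ..base };
    let other_rel = base..Key { field4: 16385, ..base };
    println!("ten blocks: {}", key_range_size(&ten_blocks));
    println!("different field4: {}", key_range_size(&other_rel));
}
```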

View File

@@ -5,6 +5,7 @@ use const_format::formatcp;
/// Public API types /// Public API types
pub mod control_api; pub mod control_api;
pub mod key; pub mod key;
pub mod keyspace;
pub mod models; pub mod models;
pub mod reltag; pub mod reltag;
pub mod shard; pub mod shard;

View File

@@ -1,5 +1,8 @@
pub mod partitioning;
use std::{ use std::{
collections::HashMap, collections::HashMap,
io::Read,
num::{NonZeroU64, NonZeroUsize}, num::{NonZeroU64, NonZeroUsize},
time::SystemTime, time::SystemTime,
}; };
@@ -17,7 +20,7 @@ use utils::{
use crate::{reltag::RelTag, shard::TenantShardId}; use crate::{reltag::RelTag, shard::TenantShardId};
use anyhow::bail; use anyhow::bail;
use bytes::{BufMut, Bytes, BytesMut}; use bytes::{Buf, BufMut, Bytes, BytesMut};
/// The state of a tenant in this pageserver. /// The state of a tenant in this pageserver.
/// ///
@@ -367,6 +370,14 @@ pub struct TenantInfo {
pub attachment_status: TenantAttachmentStatus, pub attachment_status: TenantAttachmentStatus,
} }
#[derive(Serialize, Deserialize, Clone)]
pub struct TenantDetails {
#[serde(flatten)]
pub tenant_info: TenantInfo,
pub timelines: Vec<TimelineId>,
}
/// This represents the output of the "timeline_detail" and "timeline_list" API calls. /// This represents the output of the "timeline_detail" and "timeline_list" API calls.
#[derive(Debug, Serialize, Deserialize, Clone)] #[derive(Debug, Serialize, Deserialize, Clone)]
pub struct TimelineInfo { pub struct TimelineInfo {
@@ -574,6 +585,7 @@ pub enum PagestreamFeMessage {
} }
// Wrapped in libpq CopyData // Wrapped in libpq CopyData
#[derive(strum_macros::EnumProperty)]
pub enum PagestreamBeMessage { pub enum PagestreamBeMessage {
Exists(PagestreamExistsResponse), Exists(PagestreamExistsResponse),
Nblocks(PagestreamNblocksResponse), Nblocks(PagestreamNblocksResponse),
@@ -582,6 +594,29 @@ pub enum PagestreamBeMessage {
DbSize(PagestreamDbSizeResponse), DbSize(PagestreamDbSizeResponse),
} }
// Keep in sync with `pagestore_client.h`
#[repr(u8)]
enum PagestreamBeMessageTag {
Exists = 100,
Nblocks = 101,
GetPage = 102,
Error = 103,
DbSize = 104,
}
impl TryFrom<u8> for PagestreamBeMessageTag {
type Error = u8;
fn try_from(value: u8) -> Result<Self, u8> {
match value {
100 => Ok(PagestreamBeMessageTag::Exists),
101 => Ok(PagestreamBeMessageTag::Nblocks),
102 => Ok(PagestreamBeMessageTag::GetPage),
103 => Ok(PagestreamBeMessageTag::Error),
104 => Ok(PagestreamBeMessageTag::DbSize),
_ => Err(value),
}
}
}
#[derive(Debug, PartialEq, Eq)] #[derive(Debug, PartialEq, Eq)]
pub struct PagestreamExistsRequest { pub struct PagestreamExistsRequest {
pub latest: bool, pub latest: bool,
@@ -737,35 +772,91 @@ impl PagestreamBeMessage {
pub fn serialize(&self) -> Bytes { pub fn serialize(&self) -> Bytes {
let mut bytes = BytesMut::new(); let mut bytes = BytesMut::new();
use PagestreamBeMessageTag as Tag;
match self { match self {
Self::Exists(resp) => { Self::Exists(resp) => {
bytes.put_u8(100); /* tag from pagestore_client.h */ bytes.put_u8(Tag::Exists as u8);
bytes.put_u8(resp.exists as u8); bytes.put_u8(resp.exists as u8);
} }
Self::Nblocks(resp) => { Self::Nblocks(resp) => {
bytes.put_u8(101); /* tag from pagestore_client.h */ bytes.put_u8(Tag::Nblocks as u8);
bytes.put_u32(resp.n_blocks); bytes.put_u32(resp.n_blocks);
} }
Self::GetPage(resp) => { Self::GetPage(resp) => {
bytes.put_u8(102); /* tag from pagestore_client.h */ bytes.put_u8(Tag::GetPage as u8);
bytes.put(&resp.page[..]); bytes.put(&resp.page[..]);
} }
Self::Error(resp) => { Self::Error(resp) => {
bytes.put_u8(103); /* tag from pagestore_client.h */ bytes.put_u8(Tag::Error as u8);
bytes.put(resp.message.as_bytes()); bytes.put(resp.message.as_bytes());
bytes.put_u8(0); // null terminator bytes.put_u8(0); // null terminator
} }
Self::DbSize(resp) => { Self::DbSize(resp) => {
bytes.put_u8(104); /* tag from pagestore_client.h */ bytes.put_u8(Tag::DbSize as u8);
bytes.put_i64(resp.db_size); bytes.put_i64(resp.db_size);
} }
} }
bytes.into() bytes.into()
} }
pub fn deserialize(buf: Bytes) -> anyhow::Result<Self> {
let mut buf = buf.reader();
let msg_tag = buf.read_u8()?;
use PagestreamBeMessageTag as Tag;
let ok =
match Tag::try_from(msg_tag).map_err(|tag: u8| anyhow::anyhow!("invalid tag {tag}"))? {
Tag::Exists => {
let exists = buf.read_u8()?;
Self::Exists(PagestreamExistsResponse {
exists: exists != 0,
})
}
Tag::Nblocks => {
let n_blocks = buf.read_u32::<BigEndian>()?;
Self::Nblocks(PagestreamNblocksResponse { n_blocks })
}
Tag::GetPage => {
let mut page = vec![0; 8192]; // TODO: use MaybeUninit
buf.read_exact(&mut page)?;
PagestreamBeMessage::GetPage(PagestreamGetPageResponse { page: page.into() })
}
Tag::Error => {
let buf = buf.get_ref();
let cstr = std::ffi::CStr::from_bytes_until_nul(buf)?;
let rust_str = cstr.to_str()?;
PagestreamBeMessage::Error(PagestreamErrorResponse {
message: rust_str.to_owned(),
})
}
Tag::DbSize => {
let db_size = buf.read_i64::<BigEndian>()?;
Self::DbSize(PagestreamDbSizeResponse { db_size })
}
};
let remaining = buf.into_inner();
if !remaining.is_empty() {
anyhow::bail!(
"remaining bytes in msg with tag={msg_tag}: {}",
remaining.len()
);
}
Ok(ok)
}
pub fn kind(&self) -> &'static str {
match self {
Self::Exists(_) => "Exists",
Self::Nblocks(_) => "Nblocks",
Self::GetPage(_) => "GetPage",
Self::Error(_) => "Error",
Self::DbSize(_) => "DbSize",
}
}
} }
#[cfg(test)] #[cfg(test)]
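Reviewer note on the tag enum: the point of `PagestreamBeMessageTag` is that `serialize` and `deserialize` share one mapping between wire bytes and variants. The standalone sketch below repeats the enum and `TryFrom` impl from the hunk (the derives are added only for the demo) and shows the round trip; it does not touch the message payloads.

```rust
use std::convert::TryFrom;

// Values mirror the tags in the diff, which in turn mirror `pagestore_client.h`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum PagestreamBeMessageTag {
    Exists = 100,
    Nblocks = 101,
    GetPage = 102,
    Error = 103,
    DbSize = 104,
}

impl TryFrom<u8> for PagestreamBeMessageTag {
    type Error = u8;
    // The return type spells out `u8` because `Self::Error` would be
    // ambiguous with the `Error` variant.
    fn try_from(value: u8) -> Result<Self, u8> {
        match value {
            100 => Ok(PagestreamBeMessageTag::Exists),
            101 => Ok(PagestreamBeMessageTag::Nblocks),
            102 => Ok(PagestreamBeMessageTag::GetPage),
            103 => Ok(PagestreamBeMessageTag::Error),
            104 => Ok(PagestreamBeMessageTag::DbSize),
            _ => Err(value),
        }
    }
}

fn main() {
    for byte in [100u8, 103, 104, 42] {
        match PagestreamBeMessageTag::try_from(byte) {
            // Casting back with `as u8` yields the original wire byte.
            Ok(tag) => println!("{byte} -> {tag:?} -> {}", tag as u8),
            Err(unknown) => println!("{unknown} is not a known tag"),
        }
    }
}
```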

View File

@@ -0,0 +1,151 @@
use utils::lsn::Lsn;
#[derive(Debug, PartialEq, Eq)]
pub struct Partitioning {
pub keys: crate::keyspace::KeySpace,
pub at_lsn: Lsn,
}
impl serde::Serialize for Partitioning {
fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
pub struct KeySpace<'a>(&'a crate::keyspace::KeySpace);
impl<'a> serde::Serialize for KeySpace<'a> {
fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
use serde::ser::SerializeSeq;
let mut seq = serializer.serialize_seq(Some(self.0.ranges.len()))?;
for kr in &self.0.ranges {
seq.serialize_element(&KeyRange(kr))?;
}
seq.end()
}
}
use serde::ser::SerializeMap;
let mut map = serializer.serialize_map(Some(2))?;
map.serialize_key("keys")?;
map.serialize_value(&KeySpace(&self.keys))?;
map.serialize_key("at_lsn")?;
map.serialize_value(&WithDisplay(&self.at_lsn))?;
map.end()
}
}
pub struct WithDisplay<'a, T>(&'a T);
impl<'a, T: std::fmt::Display> serde::Serialize for WithDisplay<'a, T> {
fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
serializer.collect_str(&self.0)
}
}
pub struct KeyRange<'a>(&'a std::ops::Range<crate::key::Key>);
impl<'a> serde::Serialize for KeyRange<'a> {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
use serde::ser::SerializeTuple;
let mut t = serializer.serialize_tuple(2)?;
t.serialize_element(&WithDisplay(&self.0.start))?;
t.serialize_element(&WithDisplay(&self.0.end))?;
t.end()
}
}
impl<'a> serde::Deserialize<'a> for Partitioning {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'a>,
{
pub struct KeySpace(crate::keyspace::KeySpace);
impl<'de> serde::Deserialize<'de> for KeySpace {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
#[serde_with::serde_as]
#[derive(serde::Deserialize)]
#[serde(transparent)]
struct Key(#[serde_as(as = "serde_with::DisplayFromStr")] crate::key::Key);
#[serde_with::serde_as]
#[derive(serde::Deserialize)]
struct Range(Key, Key);
let ranges: Vec<Range> = serde::Deserialize::deserialize(deserializer)?;
Ok(Self(crate::keyspace::KeySpace {
ranges: ranges
.into_iter()
.map(|Range(start, end)| (start.0..end.0))
.collect(),
}))
}
}
#[serde_with::serde_as]
#[derive(serde::Deserialize)]
struct De {
keys: KeySpace,
#[serde_as(as = "serde_with::DisplayFromStr")]
at_lsn: Lsn,
}
let de: De = serde::Deserialize::deserialize(deserializer)?;
Ok(Self {
at_lsn: de.at_lsn,
keys: de.keys.0,
})
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_serialization_roundtrip() {
let reference = r#"
{
"keys": [
[
"000000000000000000000000000000000000",
"000000000000000000000000000000000001"
],
[
"000000067F00000001000000000000000000",
"000000067F00000001000000000000000002"
],
[
"030000000000000000000000000000000000",
"030000000000000000000000000000000003"
]
],
"at_lsn": "0/2240160"
}
"#;
let de: Partitioning = serde_json::from_str(reference).unwrap();
let ser = serde_json::to_string(&de).unwrap();
let ser_de: serde_json::Value = serde_json::from_str(&ser).unwrap();
assert_eq!(
ser_de,
serde_json::from_str::<'_, serde_json::Value>(reference).unwrap()
);
}
}

View File

@@ -159,7 +159,7 @@ impl From<[u8; 18]> for TenantShardId {
/// shard we're dealing with, but do not need to know the full ShardIdentity (because /// shard we're dealing with, but do not need to know the full ShardIdentity (because
/// we won't be doing any page->shard mapping), and do not need to know the fully qualified /// we won't be doing any page->shard mapping), and do not need to know the fully qualified
/// TenantShardId. /// TenantShardId.
#[derive(Eq, PartialEq, PartialOrd, Ord, Clone, Copy)] #[derive(Eq, PartialEq, PartialOrd, Ord, Clone, Copy, Hash)]
pub struct ShardIndex { pub struct ShardIndex {
pub shard_number: ShardNumber, pub shard_number: ShardNumber,
pub shard_count: ShardCount, pub shard_count: ShardCount,

View File

@@ -163,8 +163,18 @@ impl PgConnectionConfig {
} }
/// Connect using postgres protocol with TLS disabled. /// Connect using postgres protocol with TLS disabled.
pub fn connect_no_tls(&self) -> Result<postgres::Client, postgres::Error> { pub async fn connect_no_tls(
postgres::Config::from(self.to_tokio_postgres_config()).connect(postgres::NoTls) &self,
) -> Result<
(
tokio_postgres::Client,
tokio_postgres::Connection<tokio_postgres::Socket, tokio_postgres::tls::NoTlsStream>,
),
postgres::Error,
> {
self.to_tokio_postgres_config()
.connect(postgres::NoTls)
.await
} }
} }

View File

@@ -218,14 +218,6 @@ impl S3Bucket {
let started_at = ScopeGuard::into_inner(started_at); let started_at = ScopeGuard::into_inner(started_at);
if get_object.is_err() {
metrics::BUCKET_METRICS.req_seconds.observe_elapsed(
kind,
AttemptOutcome::Err,
started_at,
);
}
match get_object { match get_object {
Ok(object_output) => { Ok(object_output) => {
let metadata = object_output.metadata().cloned().map(StorageMetadata); let metadata = object_output.metadata().cloned().map(StorageMetadata);
@@ -241,11 +233,27 @@ impl S3Bucket {
}) })
} }
Err(SdkError::ServiceError(e)) if matches!(e.err(), GetObjectError::NoSuchKey(_)) => { Err(SdkError::ServiceError(e)) if matches!(e.err(), GetObjectError::NoSuchKey(_)) => {
// Count this in the AttemptOutcome::Ok bucket, because 404 is not
// an error: we expect to sometimes fetch an object and find it missing,
// e.g. when probing for timeline indices.
metrics::BUCKET_METRICS.req_seconds.observe_elapsed(
kind,
AttemptOutcome::Ok,
started_at,
);
Err(DownloadError::NotFound) Err(DownloadError::NotFound)
} }
Err(e) => Err(DownloadError::Other( Err(e) => {
anyhow::Error::new(e).context("download s3 object"), metrics::BUCKET_METRICS.req_seconds.observe_elapsed(
)), kind,
AttemptOutcome::Err,
started_at,
);
Err(DownloadError::Other(
anyhow::Error::new(e).context("download s3 object"),
))
}
} }
} }
} }
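Reviewer note on the metrics change: the classification rule introduced here can be isolated as a small function. This is a sketch with simplified stand-in types (`DownloadError` carries a `String` instead of `anyhow::Error`, and `outcome_for_metrics` is a hypothetical name); the real code records the outcome inline via `observe_elapsed`.

```rust
/// Stand-in for the metrics label used in the diff.
#[derive(Debug, PartialEq, Eq)]
enum AttemptOutcome {
    Ok,
    Err,
}

/// Simplified stand-in for the download error type.
#[derive(Debug)]
enum DownloadError {
    NotFound,
    Other(String),
}

/// A missing object (404) is an expected answer, e.g. when probing for an
/// index, so it is counted as a successful attempt. Everything else that
/// fails counts as an error.
fn outcome_for_metrics<T>(result: &Result<T, DownloadError>) -> AttemptOutcome {
    match result {
        Ok(_) | Err(DownloadError::NotFound) => AttemptOutcome::Ok,
        Err(DownloadError::Other(_)) => AttemptOutcome::Err,
    }
}

fn main() {
    let results: Vec<Result<(), DownloadError>> = vec![
        Ok(()),
        Err(DownloadError::NotFound),
        Err(DownloadError::Other("throttled".to_string())),
    ];
    for r in &results {
        println!("{r:?} is counted as {:?}", outcome_for_metrics(r));
    }
}
```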

View File

@@ -0,0 +1,200 @@
use std::collections::HashSet;
use std::ops::ControlFlow;
use std::path::PathBuf;
use std::sync::Arc;
use anyhow::Context;
use bytes::Bytes;
use camino::Utf8Path;
use futures::stream::Stream;
use once_cell::sync::OnceCell;
use remote_storage::{Download, GenericRemoteStorage, RemotePath};
use tokio::task::JoinSet;
use tracing::{debug, error, info};
static LOGGING_DONE: OnceCell<()> = OnceCell::new();
pub(crate) fn upload_stream(
content: std::borrow::Cow<'static, [u8]>,
) -> (
impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
usize,
) {
use std::borrow::Cow;
let content = match content {
Cow::Borrowed(x) => Bytes::from_static(x),
Cow::Owned(vec) => Bytes::from(vec),
};
wrap_stream(content)
}
pub(crate) fn wrap_stream(
content: bytes::Bytes,
) -> (
impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
usize,
) {
let len = content.len();
let content = futures::future::ready(Ok(content));
(futures::stream::once(content), len)
}
pub(crate) async fn download_to_vec(dl: Download) -> anyhow::Result<Vec<u8>> {
let mut buf = Vec::new();
tokio::io::copy_buf(
&mut tokio_util::io::StreamReader::new(dl.download_stream),
&mut buf,
)
.await?;
Ok(buf)
}
// Uploads files `folder{j}/blob_{i}.txt`. See test description for more details.
pub(crate) async fn upload_simple_remote_data(
client: &Arc<GenericRemoteStorage>,
upload_tasks_count: usize,
) -> ControlFlow<HashSet<RemotePath>, HashSet<RemotePath>> {
info!("Creating {upload_tasks_count} remote files");
let mut upload_tasks = JoinSet::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
upload_tasks.spawn(async move {
let blob_path = PathBuf::from(format!("folder{}/blob_{}.txt", i / 7, i));
let blob_path = RemotePath::new(
Utf8Path::from_path(blob_path.as_path()).expect("must be valid blob path"),
)
.with_context(|| format!("{blob_path:?} to RemotePath conversion"))?;
debug!("Creating remote item {i} at path {blob_path:?}");
let (data, len) = upload_stream(format!("remote blob data {i}").into_bytes().into());
task_client.upload(data, len, &blob_path, None).await?;
Ok::<_, anyhow::Error>(blob_path)
});
}
let mut upload_tasks_failed = false;
let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
while let Some(task_run_result) = upload_tasks.join_next().await {
match task_run_result
.context("task join failed")
.and_then(|task_result| task_result.context("upload task failed"))
{
Ok(upload_path) => {
uploaded_blobs.insert(upload_path);
}
Err(e) => {
error!("Upload task failed: {e:?}");
upload_tasks_failed = true;
}
}
}
if upload_tasks_failed {
ControlFlow::Break(uploaded_blobs)
} else {
ControlFlow::Continue(uploaded_blobs)
}
}
pub(crate) async fn cleanup(
client: &Arc<GenericRemoteStorage>,
objects_to_delete: HashSet<RemotePath>,
) {
info!(
"Removing {} objects from the remote storage during cleanup",
objects_to_delete.len()
);
let mut delete_tasks = JoinSet::new();
for object_to_delete in objects_to_delete {
let task_client = Arc::clone(client);
delete_tasks.spawn(async move {
debug!("Deleting remote item at path {object_to_delete:?}");
task_client
.delete(&object_to_delete)
.await
.with_context(|| format!("{object_to_delete:?} removal"))
});
}
while let Some(task_run_result) = delete_tasks.join_next().await {
match task_run_result {
Ok(task_result) => match task_result {
Ok(()) => {}
Err(e) => error!("Delete task failed: {e:?}"),
},
Err(join_err) => error!("Delete task did not finish correctly: {join_err}"),
}
}
}
pub(crate) struct Uploads {
pub(crate) prefixes: HashSet<RemotePath>,
pub(crate) blobs: HashSet<RemotePath>,
}
pub(crate) async fn upload_remote_data(
client: &Arc<GenericRemoteStorage>,
base_prefix_str: &'static str,
upload_tasks_count: usize,
) -> ControlFlow<Uploads, Uploads> {
info!("Creating {upload_tasks_count} remote files");
let mut upload_tasks = JoinSet::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
upload_tasks.spawn(async move {
let prefix = format!("{base_prefix_str}/sub_prefix_{i}/");
let blob_prefix = RemotePath::new(Utf8Path::new(&prefix))
.with_context(|| format!("{prefix:?} to RemotePath conversion"))?;
let blob_path = blob_prefix.join(Utf8Path::new(&format!("blob_{i}")));
debug!("Creating remote item {i} at path {blob_path:?}");
let (data, data_len) =
upload_stream(format!("remote blob data {i}").into_bytes().into());
task_client.upload(data, data_len, &blob_path, None).await?;
Ok::<_, anyhow::Error>((blob_prefix, blob_path))
});
}
let mut upload_tasks_failed = false;
let mut uploaded_prefixes = HashSet::with_capacity(upload_tasks_count);
let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
while let Some(task_run_result) = upload_tasks.join_next().await {
match task_run_result
.context("task join failed")
.and_then(|task_result| task_result.context("upload task failed"))
{
Ok((upload_prefix, upload_path)) => {
uploaded_prefixes.insert(upload_prefix);
uploaded_blobs.insert(upload_path);
}
Err(e) => {
error!("Upload task failed: {e:?}");
upload_tasks_failed = true;
}
}
}
let uploads = Uploads {
prefixes: uploaded_prefixes,
blobs: uploaded_blobs,
};
if upload_tasks_failed {
ControlFlow::Break(uploads)
} else {
ControlFlow::Continue(uploads)
}
}
pub(crate) fn ensure_logging_ready() {
LOGGING_DONE.get_or_init(|| {
utils::logging::init(
utils::logging::LogFormat::Test,
utils::logging::TracingErrorLayerEnablement::Disabled,
utils::logging::Output::Stdout,
)
.expect("logging init failed");
});
}
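Both upload helpers in this file return `ControlFlow<T, T>` with the same payload type on either branch, so the caller always learns which objects were created and can clean them up even after a partial failure. A small std-only sketch of the pattern follows; `collect_uploads` is a hypothetical function, and plain strings stand in for `RemotePath`.

```rust
use std::collections::HashSet;
use std::ops::ControlFlow;

/// Hypothetical stand-in for joining upload tasks: each item is the outcome of
/// one upload, either the created path or an error message.
fn collect_uploads(
    results: Vec<Result<String, String>>,
) -> ControlFlow<HashSet<String>, HashSet<String>> {
    let mut uploaded = HashSet::with_capacity(results.len());
    let mut any_failed = false;
    for result in results {
        match result {
            Ok(path) => {
                uploaded.insert(path);
            }
            Err(_) => any_failed = true,
        }
    }
    // Both branches carry the successfully uploaded paths, so they can be
    // deleted regardless of the overall outcome.
    if any_failed {
        ControlFlow::Break(uploaded)
    } else {
        ControlFlow::Continue(uploaded)
    }
}
```

A caller would match on the result, proceed with the test on `Continue`, and pass the set from `Break` to a cleanup routine before failing.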


@@ -2,23 +2,23 @@ use std::collections::HashSet;
use std::env; use std::env;
use std::num::NonZeroUsize; use std::num::NonZeroUsize;
use std::ops::ControlFlow; use std::ops::ControlFlow;
use std::path::PathBuf;
use std::sync::Arc; use std::sync::Arc;
use std::time::UNIX_EPOCH; use std::time::UNIX_EPOCH;
use anyhow::Context; use anyhow::Context;
use bytes::Bytes;
use camino::Utf8Path; use camino::Utf8Path;
use futures::stream::Stream;
use once_cell::sync::OnceCell;
use remote_storage::{ use remote_storage::{
AzureConfig, Download, GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind, AzureConfig, GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind,
}; };
use test_context::{test_context, AsyncTestContext}; use test_context::{test_context, AsyncTestContext};
use tokio::task::JoinSet; use tracing::{debug, info};
use tracing::{debug, error, info};
static LOGGING_DONE: OnceCell<()> = OnceCell::new(); mod common;
use common::{
cleanup, download_to_vec, ensure_logging_ready, upload_remote_data, upload_simple_remote_data,
upload_stream, wrap_stream,
};
const ENABLE_REAL_AZURE_REMOTE_STORAGE_ENV_VAR_NAME: &str = "ENABLE_REAL_AZURE_REMOTE_STORAGE"; const ENABLE_REAL_AZURE_REMOTE_STORAGE_ENV_VAR_NAME: &str = "ENABLE_REAL_AZURE_REMOTE_STORAGE";
@@ -30,7 +30,7 @@ const BASE_PREFIX: &str = "test";
/// If real Azure tests are disabled, the test passes, skipping any real test run: currently, there's no way to mark the test ignored in runtime with the /// If real Azure tests are disabled, the test passes, skipping any real test run: currently, there's no way to mark the test ignored in runtime with the
/// default test framework, see https://github.com/rust-lang/rust/issues/68007 for details. /// default test framework, see https://github.com/rust-lang/rust/issues/68007 for details.
/// ///
/// First, the test creates a set of Azure blobs with keys `/${random_prefix_part}/${base_prefix_str}/sub_prefix_${i}/blob_${i}` in [`upload_azure_data`] /// First, the test creates a set of Azure blobs with keys `/${random_prefix_part}/${base_prefix_str}/sub_prefix_${i}/blob_${i}` in [`upload_remote_data`]
/// where /// where
/// * `random_prefix_part` is set for the entire Azure client during the Azure client creation in [`create_azure_client`], to avoid multiple test runs interference /// * `random_prefix_part` is set for the entire Azure client during the Azure client creation in [`create_azure_client`], to avoid multiple test runs interference
/// * `base_prefix_str` is a common prefix to use in the client requests: we would want to ensure that the client is able to list nested prefixes inside the bucket /// * `base_prefix_str` is a common prefix to use in the client requests: we would want to ensure that the client is able to list nested prefixes inside the bucket
@@ -97,7 +97,7 @@ async fn azure_pagination_should_work(
/// Uses real Azure and requires [`ENABLE_REAL_AZURE_REMOTE_STORAGE_ENV_VAR_NAME`] and related Azure cred env vars specified. Test will skip real code and pass if env vars not set. /// Uses real Azure and requires [`ENABLE_REAL_AZURE_REMOTE_STORAGE_ENV_VAR_NAME`] and related Azure cred env vars specified. Test will skip real code and pass if env vars not set.
/// See `azure_pagination_should_work` for more information. /// See `azure_pagination_should_work` for more information.
/// ///
/// First, create a set of Azure objects with keys `random_prefix/folder{j}/blob_{i}.txt` in [`upload_azure_data`] /// First, create a set of Azure objects with keys `random_prefix/folder{j}/blob_{i}.txt` in [`upload_simple_remote_data`]
/// Then performs the following queries: /// Then performs the following queries:
/// 1. `list_files(None)`. This should return all files `random_prefix/folder{j}/blob_{i}.txt` /// 1. `list_files(None)`. This should return all files `random_prefix/folder{j}/blob_{i}.txt`
/// 2. `list_files("folder1")`. This should return all files `random_prefix/folder1/blob_{i}.txt` /// 2. `list_files("folder1")`. This should return all files `random_prefix/folder1/blob_{i}.txt`
@@ -218,18 +218,9 @@ async fn azure_upload_download_works(ctx: &mut MaybeEnabledAzure) -> anyhow::Res
ctx.client.upload(data, len, &path, None).await?; ctx.client.upload(data, len, &path, None).await?;
async fn download_and_compare(dl: Download) -> anyhow::Result<Vec<u8>> {
let mut buf = Vec::new();
tokio::io::copy_buf(
&mut tokio_util::io::StreamReader::new(dl.download_stream),
&mut buf,
)
.await?;
Ok(buf)
}
// Normal download request // Normal download request
let dl = ctx.client.download(&path).await?; let dl = ctx.client.download(&path).await?;
let buf = download_and_compare(dl).await?; let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig); assert_eq!(&buf, &orig);
// Full range (end specified) // Full range (end specified)
@@ -237,12 +228,12 @@ async fn azure_upload_download_works(ctx: &mut MaybeEnabledAzure) -> anyhow::Res
.client .client
.download_byte_range(&path, 0, Some(len as u64)) .download_byte_range(&path, 0, Some(len as u64))
.await?; .await?;
let buf = download_and_compare(dl).await?; let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig); assert_eq!(&buf, &orig);
// partial range (end specified) // partial range (end specified)
let dl = ctx.client.download_byte_range(&path, 4, Some(10)).await?; let dl = ctx.client.download_byte_range(&path, 4, Some(10)).await?;
let buf = download_and_compare(dl).await?; let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[4..10]); assert_eq!(&buf, &orig[4..10]);
// partial range (end beyond real end) // partial range (end beyond real end)
@@ -250,17 +241,17 @@ async fn azure_upload_download_works(ctx: &mut MaybeEnabledAzure) -> anyhow::Res
.client .client
.download_byte_range(&path, 8, Some(len as u64 * 100)) .download_byte_range(&path, 8, Some(len as u64 * 100))
.await?; .await?;
let buf = download_and_compare(dl).await?; let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[8..]); assert_eq!(&buf, &orig[8..]);
// Partial range (end unspecified) // Partial range (end unspecified)
let dl = ctx.client.download_byte_range(&path, 4, None).await?; let dl = ctx.client.download_byte_range(&path, 4, None).await?;
let buf = download_and_compare(dl).await?; let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[4..]); assert_eq!(&buf, &orig[4..]);
// Full range (end unspecified) // Full range (end unspecified)
let dl = ctx.client.download_byte_range(&path, 0, None).await?; let dl = ctx.client.download_byte_range(&path, 0, None).await?;
let buf = download_and_compare(dl).await?; let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig); assert_eq!(&buf, &orig);
debug!("Cleanup: deleting file at path {path:?}"); debug!("Cleanup: deleting file at path {path:?}");
@@ -272,17 +263,6 @@ async fn azure_upload_download_works(ctx: &mut MaybeEnabledAzure) -> anyhow::Res
Ok(()) Ok(())
} }
fn ensure_logging_ready() {
LOGGING_DONE.get_or_init(|| {
utils::logging::init(
utils::logging::LogFormat::Test,
utils::logging::TracingErrorLayerEnablement::Disabled,
utils::logging::Output::Stdout,
)
.expect("logging init failed");
});
}
struct EnabledAzure { struct EnabledAzure {
client: Arc<GenericRemoteStorage>, client: Arc<GenericRemoteStorage>,
base_prefix: &'static str, base_prefix: &'static str,
@@ -352,7 +332,7 @@ impl AsyncTestContext for MaybeEnabledAzureWithTestBlobs {
let enabled = EnabledAzure::setup(Some(max_keys_in_list_response)).await; let enabled = EnabledAzure::setup(Some(max_keys_in_list_response)).await;
match upload_azure_data(&enabled.client, enabled.base_prefix, upload_tasks_count).await { match upload_remote_data(&enabled.client, enabled.base_prefix, upload_tasks_count).await {
ControlFlow::Continue(uploads) => { ControlFlow::Continue(uploads) => {
info!("Remote objects created successfully"); info!("Remote objects created successfully");
@@ -414,7 +394,7 @@ impl AsyncTestContext for MaybeEnabledAzureWithSimpleTestBlobs {
let enabled = EnabledAzure::setup(Some(max_keys_in_list_response)).await; let enabled = EnabledAzure::setup(Some(max_keys_in_list_response)).await;
match upload_simple_azure_data(&enabled.client, upload_tasks_count).await { match upload_simple_remote_data(&enabled.client, upload_tasks_count).await {
ControlFlow::Continue(uploads) => { ControlFlow::Continue(uploads) => {
info!("Remote objects created successfully"); info!("Remote objects created successfully");
@@ -478,166 +458,3 @@ fn create_azure_client(
GenericRemoteStorage::from_config(&remote_storage_config).context("remote storage init")?, GenericRemoteStorage::from_config(&remote_storage_config).context("remote storage init")?,
)) ))
} }
struct Uploads {
prefixes: HashSet<RemotePath>,
blobs: HashSet<RemotePath>,
}
async fn upload_azure_data(
client: &Arc<GenericRemoteStorage>,
base_prefix_str: &'static str,
upload_tasks_count: usize,
) -> ControlFlow<Uploads, Uploads> {
info!("Creating {upload_tasks_count} Azure files");
let mut upload_tasks = JoinSet::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
upload_tasks.spawn(async move {
let prefix = format!("{base_prefix_str}/sub_prefix_{i}/");
let blob_prefix = RemotePath::new(Utf8Path::new(&prefix))
.with_context(|| format!("{prefix:?} to RemotePath conversion"))?;
let blob_path = blob_prefix.join(Utf8Path::new(&format!("blob_{i}")));
debug!("Creating remote item {i} at path {blob_path:?}");
let (data, len) = upload_stream(format!("remote blob data {i}").into_bytes().into());
task_client.upload(data, len, &blob_path, None).await?;
Ok::<_, anyhow::Error>((blob_prefix, blob_path))
});
}
let mut upload_tasks_failed = false;
let mut uploaded_prefixes = HashSet::with_capacity(upload_tasks_count);
let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
while let Some(task_run_result) = upload_tasks.join_next().await {
match task_run_result
.context("task join failed")
.and_then(|task_result| task_result.context("upload task failed"))
{
Ok((upload_prefix, upload_path)) => {
uploaded_prefixes.insert(upload_prefix);
uploaded_blobs.insert(upload_path);
}
Err(e) => {
error!("Upload task failed: {e:?}");
upload_tasks_failed = true;
}
}
}
let uploads = Uploads {
prefixes: uploaded_prefixes,
blobs: uploaded_blobs,
};
if upload_tasks_failed {
ControlFlow::Break(uploads)
} else {
ControlFlow::Continue(uploads)
}
}
async fn cleanup(client: &Arc<GenericRemoteStorage>, objects_to_delete: HashSet<RemotePath>) {
info!(
"Removing {} objects from the remote storage during cleanup",
objects_to_delete.len()
);
let mut delete_tasks = JoinSet::new();
for object_to_delete in objects_to_delete {
let task_client = Arc::clone(client);
delete_tasks.spawn(async move {
debug!("Deleting remote item at path {object_to_delete:?}");
task_client
.delete(&object_to_delete)
.await
.with_context(|| format!("{object_to_delete:?} removal"))
});
}
while let Some(task_run_result) = delete_tasks.join_next().await {
match task_run_result {
Ok(task_result) => match task_result {
Ok(()) => {}
Err(e) => error!("Delete task failed: {e:?}"),
},
Err(join_err) => error!("Delete task did not finish correctly: {join_err}"),
}
}
}
// Uploads files `folder{j}/blob{i}.txt`. See test description for more details.
async fn upload_simple_azure_data(
client: &Arc<GenericRemoteStorage>,
upload_tasks_count: usize,
) -> ControlFlow<HashSet<RemotePath>, HashSet<RemotePath>> {
info!("Creating {upload_tasks_count} Azure files");
let mut upload_tasks = JoinSet::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
upload_tasks.spawn(async move {
let blob_path = PathBuf::from(format!("folder{}/blob_{}.txt", i / 7, i));
let blob_path = RemotePath::new(
Utf8Path::from_path(blob_path.as_path()).expect("must be valid blob path"),
)
.with_context(|| format!("{blob_path:?} to RemotePath conversion"))?;
debug!("Creating remote item {i} at path {blob_path:?}");
let (data, len) = upload_stream(format!("remote blob data {i}").into_bytes().into());
task_client.upload(data, len, &blob_path, None).await?;
Ok::<_, anyhow::Error>(blob_path)
});
}
let mut upload_tasks_failed = false;
let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
while let Some(task_run_result) = upload_tasks.join_next().await {
match task_run_result
.context("task join failed")
.and_then(|task_result| task_result.context("upload task failed"))
{
Ok(upload_path) => {
uploaded_blobs.insert(upload_path);
}
Err(e) => {
error!("Upload task failed: {e:?}");
upload_tasks_failed = true;
}
}
}
if upload_tasks_failed {
ControlFlow::Break(uploaded_blobs)
} else {
ControlFlow::Continue(uploaded_blobs)
}
}
// FIXME: copypasted from test_real_s3, can't remember how to share a module which is not compiled
// to binary
fn upload_stream(
content: std::borrow::Cow<'static, [u8]>,
) -> (
impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
usize,
) {
use std::borrow::Cow;
let content = match content {
Cow::Borrowed(x) => Bytes::from_static(x),
Cow::Owned(vec) => Bytes::from(vec),
};
wrap_stream(content)
}
fn wrap_stream(
content: bytes::Bytes,
) -> (
impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
usize,
) {
let len = content.len();
let content = futures::future::ready(Ok(content));
(futures::stream::once(content), len)
}


@@ -2,23 +2,23 @@ use std::collections::HashSet;
use std::env; use std::env;
use std::num::NonZeroUsize; use std::num::NonZeroUsize;
use std::ops::ControlFlow; use std::ops::ControlFlow;
use std::path::PathBuf;
use std::sync::Arc; use std::sync::Arc;
use std::time::UNIX_EPOCH; use std::time::UNIX_EPOCH;
use anyhow::Context; use anyhow::Context;
use bytes::Bytes;
use camino::Utf8Path; use camino::Utf8Path;
use futures::stream::Stream;
use once_cell::sync::OnceCell;
use remote_storage::{ use remote_storage::{
GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind, S3Config, GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind, S3Config,
}; };
use test_context::{test_context, AsyncTestContext}; use test_context::{test_context, AsyncTestContext};
use tokio::task::JoinSet; use tracing::{debug, info};
use tracing::{debug, error, info};
static LOGGING_DONE: OnceCell<()> = OnceCell::new(); mod common;
use common::{
cleanup, download_to_vec, ensure_logging_ready, upload_remote_data, upload_simple_remote_data,
upload_stream, wrap_stream,
};
const ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME: &str = "ENABLE_REAL_S3_REMOTE_STORAGE"; const ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME: &str = "ENABLE_REAL_S3_REMOTE_STORAGE";
@@ -30,7 +30,7 @@ const BASE_PREFIX: &str = "test";
/// If real S3 tests are disabled, the test passes, skipping any real test run: currently, there's no way to mark the test ignored in runtime with the /// If real S3 tests are disabled, the test passes, skipping any real test run: currently, there's no way to mark the test ignored in runtime with the
/// default test framework, see https://github.com/rust-lang/rust/issues/68007 for details. /// default test framework, see https://github.com/rust-lang/rust/issues/68007 for details.
/// ///
/// First, the test creates a set of S3 objects with keys `/${random_prefix_part}/${base_prefix_str}/sub_prefix_${i}/blob_${i}` in [`upload_s3_data`] /// First, the test creates a set of S3 objects with keys `/${random_prefix_part}/${base_prefix_str}/sub_prefix_${i}/blob_${i}` in [`upload_remote_data`]
/// where /// where
/// * `random_prefix_part` is set for the entire S3 client during the S3 client creation in [`create_s3_client`], to avoid multiple test runs interference /// * `random_prefix_part` is set for the entire S3 client during the S3 client creation in [`create_s3_client`], to avoid multiple test runs interference
/// * `base_prefix_str` is a common prefix to use in the client requests: we would want to ensure that the client is able to list nested prefixes inside the bucket /// * `base_prefix_str` is a common prefix to use in the client requests: we would want to ensure that the client is able to list nested prefixes inside the bucket
@@ -95,7 +95,7 @@ async fn s3_pagination_should_work(ctx: &mut MaybeEnabledS3WithTestBlobs) -> any
/// Uses real S3 and requires [`ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME`] and related S3 cred env vars specified. Test will skip real code and pass if env vars not set. /// Uses real S3 and requires [`ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME`] and related S3 cred env vars specified. Test will skip real code and pass if env vars not set.
/// See `s3_pagination_should_work` for more information. /// See `s3_pagination_should_work` for more information.
/// ///
/// First, create a set of S3 objects with keys `random_prefix/folder{j}/blob_{i}.txt` in [`upload_s3_data`] /// First, create a set of S3 objects with keys `random_prefix/folder{j}/blob_{i}.txt` in [`upload_simple_remote_data`]
/// Then performs the following queries: /// Then performs the following queries:
/// 1. `list_files(None)`. This should return all files `random_prefix/folder{j}/blob_{i}.txt` /// 1. `list_files(None)`. This should return all files `random_prefix/folder{j}/blob_{i}.txt`
/// 2. `list_files("folder1")`. This should return all files `random_prefix/folder1/blob_{i}.txt` /// 2. `list_files("folder1")`. This should return all files `random_prefix/folder1/blob_{i}.txt`
@@ -198,15 +198,65 @@ async fn s3_delete_objects_works(ctx: &mut MaybeEnabledS3) -> anyhow::Result<()>
Ok(()) Ok(())
} }
fn ensure_logging_ready() { #[test_context(MaybeEnabledS3)]
LOGGING_DONE.get_or_init(|| { #[tokio::test]
utils::logging::init( async fn s3_upload_download_works(ctx: &mut MaybeEnabledS3) -> anyhow::Result<()> {
utils::logging::LogFormat::Test, let MaybeEnabledS3::Enabled(ctx) = ctx else {
utils::logging::TracingErrorLayerEnablement::Disabled, return Ok(());
utils::logging::Output::Stdout, };
)
.expect("logging init failed"); let path = RemotePath::new(Utf8Path::new(format!("{}/file", ctx.base_prefix).as_str()))
}); .with_context(|| "RemotePath conversion")?;
let orig = bytes::Bytes::from_static("remote blob data here".as_bytes());
let (data, len) = wrap_stream(orig.clone());
ctx.client.upload(data, len, &path, None).await?;
// Normal download request
let dl = ctx.client.download(&path).await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
// Full range (end specified)
let dl = ctx
.client
.download_byte_range(&path, 0, Some(len as u64))
.await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
// partial range (end specified)
let dl = ctx.client.download_byte_range(&path, 4, Some(10)).await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[4..10]);
// partial range (end beyond real end)
let dl = ctx
.client
.download_byte_range(&path, 8, Some(len as u64 * 100))
.await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[8..]);
// Partial range (end unspecified)
let dl = ctx.client.download_byte_range(&path, 4, None).await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[4..]);
// Full range (end unspecified)
let dl = ctx.client.download_byte_range(&path, 0, None).await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
debug!("Cleanup: deleting file at path {path:?}");
ctx.client
.delete(&path)
.await
.with_context(|| format!("{path:?} removal"))?;
Ok(())
} }
struct EnabledS3 { struct EnabledS3 {
@@ -278,7 +328,7 @@ impl AsyncTestContext for MaybeEnabledS3WithTestBlobs {
let enabled = EnabledS3::setup(Some(max_keys_in_list_response)).await; let enabled = EnabledS3::setup(Some(max_keys_in_list_response)).await;
match upload_s3_data(&enabled.client, enabled.base_prefix, upload_tasks_count).await { match upload_remote_data(&enabled.client, enabled.base_prefix, upload_tasks_count).await {
ControlFlow::Continue(uploads) => { ControlFlow::Continue(uploads) => {
info!("Remote objects created successfully"); info!("Remote objects created successfully");
@@ -340,7 +390,7 @@ impl AsyncTestContext for MaybeEnabledS3WithSimpleTestBlobs {
let enabled = EnabledS3::setup(Some(max_keys_in_list_response)).await; let enabled = EnabledS3::setup(Some(max_keys_in_list_response)).await;
match upload_simple_s3_data(&enabled.client, upload_tasks_count).await { match upload_simple_remote_data(&enabled.client, upload_tasks_count).await {
ControlFlow::Continue(uploads) => { ControlFlow::Continue(uploads) => {
info!("Remote objects created successfully"); info!("Remote objects created successfully");
@@ -403,166 +453,3 @@ fn create_s3_client(
GenericRemoteStorage::from_config(&remote_storage_config).context("remote storage init")?, GenericRemoteStorage::from_config(&remote_storage_config).context("remote storage init")?,
)) ))
} }
struct Uploads {
prefixes: HashSet<RemotePath>,
blobs: HashSet<RemotePath>,
}
async fn upload_s3_data(
client: &Arc<GenericRemoteStorage>,
base_prefix_str: &'static str,
upload_tasks_count: usize,
) -> ControlFlow<Uploads, Uploads> {
info!("Creating {upload_tasks_count} S3 files");
let mut upload_tasks = JoinSet::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
upload_tasks.spawn(async move {
let prefix = format!("{base_prefix_str}/sub_prefix_{i}/");
let blob_prefix = RemotePath::new(Utf8Path::new(&prefix))
.with_context(|| format!("{prefix:?} to RemotePath conversion"))?;
let blob_path = blob_prefix.join(Utf8Path::new(&format!("blob_{i}")));
debug!("Creating remote item {i} at path {blob_path:?}");
let (data, data_len) =
upload_stream(format!("remote blob data {i}").into_bytes().into());
task_client.upload(data, data_len, &blob_path, None).await?;
Ok::<_, anyhow::Error>((blob_prefix, blob_path))
});
}
let mut upload_tasks_failed = false;
let mut uploaded_prefixes = HashSet::with_capacity(upload_tasks_count);
let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
while let Some(task_run_result) = upload_tasks.join_next().await {
match task_run_result
.context("task join failed")
.and_then(|task_result| task_result.context("upload task failed"))
{
Ok((upload_prefix, upload_path)) => {
uploaded_prefixes.insert(upload_prefix);
uploaded_blobs.insert(upload_path);
}
Err(e) => {
error!("Upload task failed: {e:?}");
upload_tasks_failed = true;
}
}
}
let uploads = Uploads {
prefixes: uploaded_prefixes,
blobs: uploaded_blobs,
};
if upload_tasks_failed {
ControlFlow::Break(uploads)
} else {
ControlFlow::Continue(uploads)
}
}
async fn cleanup(client: &Arc<GenericRemoteStorage>, objects_to_delete: HashSet<RemotePath>) {
info!(
"Removing {} objects from the remote storage during cleanup",
objects_to_delete.len()
);
let mut delete_tasks = JoinSet::new();
for object_to_delete in objects_to_delete {
let task_client = Arc::clone(client);
delete_tasks.spawn(async move {
debug!("Deleting remote item at path {object_to_delete:?}");
task_client
.delete(&object_to_delete)
.await
.with_context(|| format!("{object_to_delete:?} removal"))
});
}
while let Some(task_run_result) = delete_tasks.join_next().await {
match task_run_result {
Ok(task_result) => match task_result {
Ok(()) => {}
Err(e) => error!("Delete task failed: {e:?}"),
},
Err(join_err) => error!("Delete task did not finish correctly: {join_err}"),
}
}
}
// Uploads files `folder{j}/blob{i}.txt`. See test description for more details.
async fn upload_simple_s3_data(
client: &Arc<GenericRemoteStorage>,
upload_tasks_count: usize,
) -> ControlFlow<HashSet<RemotePath>, HashSet<RemotePath>> {
info!("Creating {upload_tasks_count} S3 files");
let mut upload_tasks = JoinSet::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
upload_tasks.spawn(async move {
let blob_path = PathBuf::from(format!("folder{}/blob_{}.txt", i / 7, i));
let blob_path = RemotePath::new(
Utf8Path::from_path(blob_path.as_path()).expect("must be valid blob path"),
)
.with_context(|| format!("{blob_path:?} to RemotePath conversion"))?;
debug!("Creating remote item {i} at path {blob_path:?}");
let (data, data_len) =
upload_stream(format!("remote blob data {i}").into_bytes().into());
task_client.upload(data, data_len, &blob_path, None).await?;
Ok::<_, anyhow::Error>(blob_path)
});
}
let mut upload_tasks_failed = false;
let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
while let Some(task_run_result) = upload_tasks.join_next().await {
match task_run_result
.context("task join failed")
.and_then(|task_result| task_result.context("upload task failed"))
{
Ok(upload_path) => {
uploaded_blobs.insert(upload_path);
}
Err(e) => {
error!("Upload task failed: {e:?}");
upload_tasks_failed = true;
}
}
}
if upload_tasks_failed {
ControlFlow::Break(uploaded_blobs)
} else {
ControlFlow::Continue(uploaded_blobs)
}
}
fn upload_stream(
content: std::borrow::Cow<'static, [u8]>,
) -> (
impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
usize,
) {
use std::borrow::Cow;
let content = match content {
Cow::Borrowed(x) => Bytes::from_static(x),
Cow::Owned(vec) => Bytes::from(vec),
};
wrap_stream(content)
}
fn wrap_stream(
content: bytes::Bytes,
) -> (
impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
usize,
) {
let len = content.len();
let content = futures::future::ready(Ok(content));
(futures::stream::once(content), len)
}
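The assertions in `s3_upload_download_works` in this file imply specific range semantics for `download_byte_range`: the start offset is inclusive, the end offset is exclusive, and an end past the object's length (or `None`) reads to the end. A std-only, in-memory model of those semantics is sketched below. `byte_range` is a hypothetical helper for illustration and says nothing about how the storage backends translate the range into a request.

```rust
/// Hypothetical in-memory model of the range semantics the test asserts:
/// `start` is inclusive, `end` is exclusive, and an `end` beyond the data
/// length (or `None`) means "read to the end".
///
/// Panics if `start` exceeds the clamped end; the test does not exercise
/// that case, so no behaviour is assumed for it here.
fn byte_range(data: &[u8], start: u64, end: Option<u64>) -> &[u8] {
    let len = data.len() as u64;
    // Clamp the exclusive end to the real length of the data.
    let end = end.map_or(len, |e| e.min(len));
    &data[start as usize..end as usize]
}
```

For the 21-byte payload used in the test, `byte_range(orig, 4, Some(10))` yields the six bytes at offsets 4 through 9, matching the `orig[4..10]` assertion.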


@@ -2,8 +2,11 @@ use std::time::Duration;
use tokio_util::sync::CancellationToken; use tokio_util::sync::CancellationToken;
#[derive(thiserror::Error, Debug)]
pub enum TimeoutCancellableError { pub enum TimeoutCancellableError {
#[error("Timed out")]
Timeout, Timeout,
#[error("Cancelled")]
Cancelled, Cancelled,
} }
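The added `#[derive(thiserror::Error)]` and `#[error(...)]` attributes give the enum `Display` and `std::error::Error` implementations, so it can be printed and converted into other error types. A hand-written, std-only approximation of what the derive provides is sketched below; the macro's real expansion may differ in detail, and `run` is a hypothetical caller added for illustration.

```rust
use std::fmt;

#[derive(Debug)]
pub enum TimeoutCancellableError {
    Timeout,
    Cancelled,
}

// Roughly what `#[error("...")]` on each variant generates.
impl fmt::Display for TimeoutCancellableError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Timeout => f.write_str("Timed out"),
            Self::Cancelled => f.write_str("Cancelled"),
        }
    }
}

// Roughly what `#[derive(thiserror::Error)]` adds on top of `Display`.
impl std::error::Error for TimeoutCancellableError {}

/// Hypothetical caller: once `Error` is implemented, the value converts into a
/// boxed error (and likewise into other general-purpose error types).
fn run(cancelled: bool) -> Result<(), Box<dyn std::error::Error>> {
    if cancelled {
        return Err(TimeoutCancellableError::Cancelled.into());
    }
    Ok(())
}
```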


@@ -1 +1,2 @@
#include "postgres.h"
#include "walproposer.h" #include "walproposer.h"


@@ -1,3 +1,6 @@
//! Links with walproposer, pgcommon, pgport and runs bindgen on walproposer.h
//! to generate Rust bindings for it.
use std::{env, path::PathBuf, process::Command}; use std::{env, path::PathBuf, process::Command};
use anyhow::{anyhow, Context}; use anyhow::{anyhow, Context};


@@ -1,3 +1,6 @@
//! A C-Rust shim: implements the C walproposer API, assuming the wp
//! callback_data stores a Box pointing to some Rust implementation.
#![allow(dead_code)] #![allow(dead_code)]
use std::ffi::CStr; use std::ffi::CStr;


@@ -63,6 +63,7 @@ thiserror.workspace = true
tokio = { workspace = true, features = ["process", "sync", "fs", "rt", "io-util", "time"] } tokio = { workspace = true, features = ["process", "sync", "fs", "rt", "io-util", "time"] }
tokio-io-timeout.workspace = true tokio-io-timeout.workspace = true
tokio-postgres.workspace = true tokio-postgres.workspace = true
tokio-stream.workspace = true
tokio-util.workspace = true tokio-util.workspace = true
toml_edit = { workspace = true, features = [ "serde" ] } toml_edit = { workspace = true, features = [ "serde" ] }
tracing.workspace = true tracing.workspace = true


@@ -0,0 +1,22 @@
[package]
name = "pageserver_client"
version = "0.1.0"
edition.workspace = true
license.workspace = true
[dependencies]
pageserver_api.workspace = true
thiserror.workspace = true
async-trait.workspace = true
reqwest.workspace = true
utils.workspace = true
serde.workspace = true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
tokio-postgres.workspace = true
tokio-stream.workspace = true
tokio.workspace = true
futures.workspace = true
tokio-util.workspace = true
anyhow.workspace = true
postgres.workspace = true
bytes.workspace = true

@@ -0,0 +1,2 @@
pub mod mgmt_api;
pub mod page_service;

@@ -0,0 +1,200 @@
use pageserver_api::models::*;
use reqwest::{IntoUrl, Method};
use utils::{
http::error::HttpErrorBody,
id::{TenantId, TimelineId},
};
#[derive(Debug)]
pub struct Client {
mgmt_api_endpoint: String,
authorization_header: Option<String>,
client: reqwest::Client,
}
#[derive(thiserror::Error, Debug)]
pub enum Error {
#[error("receive body: {0}")]
ReceiveBody(reqwest::Error),
#[error("receive error body: {0}")]
ReceiveErrorBody(String),
#[error("pageserver API: {0}")]
ApiError(String),
}
pub type Result<T> = std::result::Result<T, Error>;
#[async_trait::async_trait]
pub trait ResponseErrorMessageExt: Sized {
async fn error_from_body(self) -> Result<Self>;
}
#[async_trait::async_trait]
impl ResponseErrorMessageExt for reqwest::Response {
async fn error_from_body(mut self) -> Result<Self> {
let status = self.status();
if !(status.is_client_error() || status.is_server_error()) {
return Ok(self);
}
let url = self.url().to_owned();
Err(match self.json::<HttpErrorBody>().await {
Ok(HttpErrorBody { msg }) => Error::ApiError(msg),
Err(_) => {
Error::ReceiveErrorBody(format!("Http error ({}) at {}.", status.as_u16(), url))
}
})
}
}
impl Client {
pub fn new(mgmt_api_endpoint: String, jwt: Option<&str>) -> Self {
Self {
mgmt_api_endpoint,
authorization_header: jwt.map(|jwt| format!("Bearer {jwt}")),
client: reqwest::Client::new(),
}
}
pub async fn list_tenants(&self) -> Result<Vec<pageserver_api::models::TenantInfo>> {
let uri = format!("{}/v1/tenant", self.mgmt_api_endpoint);
let resp = self.get(&uri).await?;
resp.json().await.map_err(Error::ReceiveBody)
}
pub async fn tenant_details(
&self,
tenant_id: TenantId,
) -> Result<pageserver_api::models::TenantDetails> {
let uri = format!("{}/v1/tenant/{tenant_id}", self.mgmt_api_endpoint);
self.get(uri)
.await?
.json()
.await
.map_err(Error::ReceiveBody)
}
pub async fn list_timelines(
&self,
tenant_id: TenantId,
) -> Result<Vec<pageserver_api::models::TimelineInfo>> {
let uri = format!("{}/v1/tenant/{tenant_id}/timeline", self.mgmt_api_endpoint);
self.get(&uri)
.await?
.json()
.await
.map_err(Error::ReceiveBody)
}
pub async fn timeline_info(
&self,
tenant_id: TenantId,
timeline_id: TimelineId,
) -> Result<pageserver_api::models::TimelineInfo> {
let uri = format!(
"{}/v1/tenant/{tenant_id}/timeline/{timeline_id}",
self.mgmt_api_endpoint
);
self.get(&uri)
.await?
.json()
.await
.map_err(Error::ReceiveBody)
}
pub async fn keyspace(
&self,
tenant_id: TenantId,
timeline_id: TimelineId,
) -> Result<pageserver_api::models::partitioning::Partitioning> {
let uri = format!(
"{}/v1/tenant/{tenant_id}/timeline/{timeline_id}/keyspace",
self.mgmt_api_endpoint
);
self.get(&uri)
.await?
.json()
.await
.map_err(Error::ReceiveBody)
}
async fn get<U: IntoUrl>(&self, uri: U) -> Result<reqwest::Response> {
self.request(Method::GET, uri, ()).await
}
async fn request<B: serde::Serialize, U: reqwest::IntoUrl>(
&self,
method: Method,
uri: U,
body: B,
) -> Result<reqwest::Response> {
let req = self.client.request(method, uri);
let req = if let Some(value) = &self.authorization_header {
req.header(reqwest::header::AUTHORIZATION, value)
} else {
req
};
let res = req.json(&body).send().await.map_err(Error::ReceiveBody)?;
let response = res.error_from_body().await?;
Ok(response)
}
pub async fn status(&self) -> Result<()> {
let uri = format!("{}/v1/status", self.mgmt_api_endpoint);
self.get(&uri).await?;
Ok(())
}
pub async fn tenant_create(&self, req: &TenantCreateRequest) -> Result<TenantId> {
let uri = format!("{}/v1/tenant", self.mgmt_api_endpoint);
self.request(Method::POST, &uri, req)
.await?
.json()
.await
.map_err(Error::ReceiveBody)
}
pub async fn tenant_config(&self, req: &TenantConfigRequest) -> Result<()> {
let uri = format!("{}/v1/tenant/config", self.mgmt_api_endpoint);
self.request(Method::PUT, &uri, req).await?;
Ok(())
}
pub async fn location_config(
&self,
tenant_id: TenantId,
config: LocationConfig,
flush_ms: Option<std::time::Duration>,
) -> Result<()> {
let req_body = TenantLocationConfigRequest { tenant_id, config };
let path = format!(
"{}/v1/tenant/{}/location_config",
self.mgmt_api_endpoint, tenant_id
);
let path = if let Some(flush_ms) = flush_ms {
format!("{}?flush_ms={}", path, flush_ms.as_millis())
} else {
path
};
self.request(Method::PUT, &path, &req_body).await?;
Ok(())
}
pub async fn timeline_create(
&self,
tenant_id: TenantId,
req: &TimelineCreateRequest,
) -> Result<TimelineInfo> {
let uri = format!(
"{}/v1/tenant/{}/timeline",
self.mgmt_api_endpoint, tenant_id
);
self.request(Method::POST, &uri, req)
.await?
.json()
.await
.map_err(Error::ReceiveBody)
}
}
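
Note on `location_config` above: the request URL gains a `flush_ms` query parameter only when a flush timeout is supplied. A minimal std-only sketch of that URL construction follows. The function name `location_config_url` and the use of plain `&str` for the endpoint and tenant id are assumptions made for illustration; the real client formats a `TenantId` and sends the request through `reqwest`.

```rust
use std::time::Duration;

/// Hypothetical helper mirroring how the client above assembles the
/// `location_config` URL. `endpoint` and `tenant_id` are plain strings here
/// so the sketch needs nothing beyond the standard library.
pub fn location_config_url(
    endpoint: &str,
    tenant_id: &str,
    flush_ms: Option<Duration>,
) -> String {
    let path = format!("{endpoint}/v1/tenant/{tenant_id}/location_config");
    match flush_ms {
        // The duration is sent as whole milliseconds.
        Some(timeout) => format!("{path}?flush_ms={}", timeout.as_millis()),
        // Without a timeout the path is used unchanged.
        None => path,
    }
}
```

Passing `None` leaves the path untouched, so callers that do not care about flushing produce exactly the same URL as before the parameter existed.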

@@ -0,0 +1,151 @@
use std::pin::Pin;
use futures::SinkExt;
use pageserver_api::{
models::{
PagestreamBeMessage, PagestreamFeMessage, PagestreamGetPageRequest,
PagestreamGetPageResponse,
},
reltag::RelTag,
};
use tokio::task::JoinHandle;
use tokio_postgres::CopyOutStream;
use tokio_stream::StreamExt;
use tokio_util::sync::CancellationToken;
use utils::{
id::{TenantId, TimelineId},
lsn::Lsn,
};
pub struct Client {
client: tokio_postgres::Client,
cancel_on_client_drop: Option<tokio_util::sync::DropGuard>,
conn_task: JoinHandle<()>,
}
pub struct BasebackupRequest {
pub tenant_id: TenantId,
pub timeline_id: TimelineId,
pub lsn: Option<Lsn>,
pub gzip: bool,
}
impl Client {
pub async fn new(connstring: String) -> anyhow::Result<Self> {
let (client, connection) = tokio_postgres::connect(&connstring, postgres::NoTls).await?;
let conn_task_cancel = CancellationToken::new();
let conn_task = tokio::spawn({
let conn_task_cancel = conn_task_cancel.clone();
async move {
tokio::select! {
_ = conn_task_cancel.cancelled() => { }
res = connection => {
res.unwrap();
}
}
}
});
Ok(Self {
cancel_on_client_drop: Some(conn_task_cancel.drop_guard()),
conn_task,
client,
})
}
pub async fn pagestream(
self,
tenant_id: TenantId,
timeline_id: TimelineId,
) -> anyhow::Result<PagestreamClient> {
let copy_both: tokio_postgres::CopyBothDuplex<bytes::Bytes> = self
.client
.copy_both_simple(&format!("pagestream {tenant_id} {timeline_id}"))
.await?;
let Client {
cancel_on_client_drop,
conn_task,
client: _,
} = self;
Ok(PagestreamClient {
copy_both: Box::pin(copy_both),
conn_task,
cancel_on_client_drop,
})
}
pub async fn basebackup(&self, req: &BasebackupRequest) -> anyhow::Result<CopyOutStream> {
let BasebackupRequest {
tenant_id,
timeline_id,
lsn,
gzip,
} = req;
let mut args = Vec::with_capacity(5);
args.push("basebackup".to_string());
args.push(format!("{tenant_id}"));
args.push(format!("{timeline_id}"));
if let Some(lsn) = lsn {
args.push(format!("{lsn}"));
}
if *gzip {
args.push("--gzip".to_string())
}
Ok(self.client.copy_out(&args.join(" ")).await?)
}
}
/// Create using [`Client::pagestream`].
pub struct PagestreamClient {
copy_both: Pin<Box<tokio_postgres::CopyBothDuplex<bytes::Bytes>>>,
cancel_on_client_drop: Option<tokio_util::sync::DropGuard>,
conn_task: JoinHandle<()>,
}
pub struct RelTagBlockNo {
pub rel_tag: RelTag,
pub block_no: u32,
}
impl PagestreamClient {
pub async fn shutdown(mut self) {
let _ = self.cancel_on_client_drop.take();
self.conn_task.await.unwrap();
}
pub async fn getpage(
&mut self,
key: RelTagBlockNo,
lsn: Lsn,
) -> anyhow::Result<PagestreamGetPageResponse> {
let req = PagestreamGetPageRequest {
latest: false,
rel: key.rel_tag,
blkno: key.block_no,
lsn,
};
let req = PagestreamFeMessage::GetPage(req);
let req: bytes::Bytes = req.serialize();
// let mut req = tokio_util::io::ReaderStream::new(&req);
let mut req = tokio_stream::once(Ok(req));
self.copy_both.send_all(&mut req).await?;
let next: Option<Result<bytes::Bytes, _>> = self.copy_both.next().await;
let next: bytes::Bytes = next.unwrap()?;
let msg = PagestreamBeMessage::deserialize(next)?;
match msg {
PagestreamBeMessage::GetPage(p) => Ok(p),
PagestreamBeMessage::Error(e) => anyhow::bail!("Error: {:?}", e),
PagestreamBeMessage::Exists(_)
| PagestreamBeMessage::Nblocks(_)
| PagestreamBeMessage::DbSize(_) => {
anyhow::bail!(
"unexpected be message kind in response to getpage request: {}",
msg.kind()
)
}
}
}
}
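
Note on `basebackup` above: the command sent over the Postgres connection is a space-joined argument list in which the LSN and the `--gzip` flag are optional. The sketch below reproduces only that string assembly, using the standard library alone. The function name `basebackup_command` and the string-typed ids are assumptions for illustration; the real code formats `TenantId`, `TimelineId` and `Lsn` values and passes the result to `copy_out`.

```rust
/// Hypothetical helper mirroring the argument assembly in `Client::basebackup`.
/// Ids and the LSN are plain strings so the sketch stays std-only.
pub fn basebackup_command(
    tenant_id: &str,
    timeline_id: &str,
    lsn: Option<&str>,
    gzip: bool,
) -> String {
    let mut args = vec![
        "basebackup".to_string(),
        tenant_id.to_string(),
        timeline_id.to_string(),
    ];
    // The LSN is positional and only present when the caller pins one.
    if let Some(lsn) = lsn {
        args.push(lsn.to_string());
    }
    // The flag always comes last.
    if gzip {
        args.push("--gzip".to_string());
    }
    args.join(" ")
}
```

Because the LSN is positional rather than named, a request without an LSN but with gzip puts `--gzip` directly after the timeline id.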

@@ -41,6 +41,8 @@ use crate::{
TIMELINE_DELETE_MARK_SUFFIX, TIMELINE_UNINIT_MARK_SUFFIX, TIMELINE_DELETE_MARK_SUFFIX, TIMELINE_UNINIT_MARK_SUFFIX,
}; };
use self::defaults::DEFAULT_CONCURRENT_TENANT_WARMUP;
pub mod defaults { pub mod defaults {
use crate::tenant::config::defaults::*; use crate::tenant::config::defaults::*;
use const_format::formatcp; use const_format::formatcp;
@@ -61,6 +63,8 @@ pub mod defaults {
pub const DEFAULT_LOG_FORMAT: &str = "plain"; pub const DEFAULT_LOG_FORMAT: &str = "plain";
pub const DEFAULT_CONCURRENT_TENANT_WARMUP: usize = 8;
pub const DEFAULT_CONCURRENT_TENANT_SIZE_LOGICAL_SIZE_QUERIES: usize = pub const DEFAULT_CONCURRENT_TENANT_SIZE_LOGICAL_SIZE_QUERIES: usize =
super::ConfigurableSemaphore::DEFAULT_INITIAL.get(); super::ConfigurableSemaphore::DEFAULT_INITIAL.get();
@@ -94,6 +98,7 @@ pub mod defaults {
#log_format = '{DEFAULT_LOG_FORMAT}' #log_format = '{DEFAULT_LOG_FORMAT}'
#concurrent_tenant_size_logical_size_queries = '{DEFAULT_CONCURRENT_TENANT_SIZE_LOGICAL_SIZE_QUERIES}' #concurrent_tenant_size_logical_size_queries = '{DEFAULT_CONCURRENT_TENANT_SIZE_LOGICAL_SIZE_QUERIES}'
#concurrent_tenant_warmup = '{DEFAULT_CONCURRENT_TENANT_WARMUP}'
#metric_collection_interval = '{DEFAULT_METRIC_COLLECTION_INTERVAL}' #metric_collection_interval = '{DEFAULT_METRIC_COLLECTION_INTERVAL}'
#cached_metric_collection_interval = '{DEFAULT_CACHED_METRIC_COLLECTION_INTERVAL}' #cached_metric_collection_interval = '{DEFAULT_CACHED_METRIC_COLLECTION_INTERVAL}'
@@ -180,6 +185,11 @@ pub struct PageServerConf {
pub log_format: LogFormat, pub log_format: LogFormat,
/// Number of tenants which will be concurrently loaded from remote storage proactively on startup,
/// does not limit tenants loaded in response to client I/O. A lower value implicitly deprioritizes
/// loading such tenants, vs. other work in the system.
pub concurrent_tenant_warmup: ConfigurableSemaphore,
/// Number of concurrent [`Tenant::gather_size_inputs`](crate::tenant::Tenant::gather_size_inputs) allowed. /// Number of concurrent [`Tenant::gather_size_inputs`](crate::tenant::Tenant::gather_size_inputs) allowed.
pub concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore, pub concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore,
/// Limit of concurrent [`Tenant::gather_size_inputs`] issued by module `eviction_task`. /// Limit of concurrent [`Tenant::gather_size_inputs`] issued by module `eviction_task`.
@@ -283,6 +293,7 @@ struct PageServerConfigBuilder {
log_format: BuilderValue<LogFormat>, log_format: BuilderValue<LogFormat>,
concurrent_tenant_warmup: BuilderValue<NonZeroUsize>,
concurrent_tenant_size_logical_size_queries: BuilderValue<NonZeroUsize>, concurrent_tenant_size_logical_size_queries: BuilderValue<NonZeroUsize>,
metric_collection_interval: BuilderValue<Duration>, metric_collection_interval: BuilderValue<Duration>,
@@ -340,6 +351,8 @@ impl Default for PageServerConfigBuilder {
.expect("cannot parse default keepalive interval")), .expect("cannot parse default keepalive interval")),
log_format: Set(LogFormat::from_str(DEFAULT_LOG_FORMAT).unwrap()), log_format: Set(LogFormat::from_str(DEFAULT_LOG_FORMAT).unwrap()),
concurrent_tenant_warmup: Set(NonZeroUsize::new(DEFAULT_CONCURRENT_TENANT_WARMUP)
.expect("Invalid default constant")),
concurrent_tenant_size_logical_size_queries: Set( concurrent_tenant_size_logical_size_queries: Set(
ConfigurableSemaphore::DEFAULT_INITIAL, ConfigurableSemaphore::DEFAULT_INITIAL,
), ),
@@ -453,6 +466,10 @@ impl PageServerConfigBuilder {
self.log_format = BuilderValue::Set(log_format) self.log_format = BuilderValue::Set(log_format)
} }
pub fn concurrent_tenant_warmup(&mut self, u: NonZeroUsize) {
self.concurrent_tenant_warmup = BuilderValue::Set(u);
}
pub fn concurrent_tenant_size_logical_size_queries(&mut self, u: NonZeroUsize) { pub fn concurrent_tenant_size_logical_size_queries(&mut self, u: NonZeroUsize) {
self.concurrent_tenant_size_logical_size_queries = BuilderValue::Set(u); self.concurrent_tenant_size_logical_size_queries = BuilderValue::Set(u);
} }
@@ -518,6 +535,9 @@ impl PageServerConfigBuilder {
} }
pub fn build(self) -> anyhow::Result<PageServerConf> { pub fn build(self) -> anyhow::Result<PageServerConf> {
let concurrent_tenant_warmup = self
.concurrent_tenant_warmup
.ok_or(anyhow!("missing concurrent_tenant_warmup"))?;
let concurrent_tenant_size_logical_size_queries = self let concurrent_tenant_size_logical_size_queries = self
.concurrent_tenant_size_logical_size_queries .concurrent_tenant_size_logical_size_queries
.ok_or(anyhow!( .ok_or(anyhow!(
@@ -570,6 +590,7 @@ impl PageServerConfigBuilder {
.broker_keepalive_interval .broker_keepalive_interval
.ok_or(anyhow!("No broker keepalive interval provided"))?, .ok_or(anyhow!("No broker keepalive interval provided"))?,
log_format: self.log_format.ok_or(anyhow!("missing log_format"))?, log_format: self.log_format.ok_or(anyhow!("missing log_format"))?,
concurrent_tenant_warmup: ConfigurableSemaphore::new(concurrent_tenant_warmup),
concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::new( concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::new(
concurrent_tenant_size_logical_size_queries, concurrent_tenant_size_logical_size_queries,
), ),
@@ -807,6 +828,11 @@ impl PageServerConf {
"log_format" => builder.log_format( "log_format" => builder.log_format(
LogFormat::from_config(&parse_toml_string(key, item)?)? LogFormat::from_config(&parse_toml_string(key, item)?)?
), ),
"concurrent_tenant_warmup" => builder.concurrent_tenant_warmup({
let input = parse_toml_string(key, item)?;
let permits = input.parse::<usize>().context("expected a number of initial permits, not {s:?}")?;
NonZeroUsize::new(permits).context("initial semaphore permits out of range: 0, use other configuration to disable a feature")?
}),
"concurrent_tenant_size_logical_size_queries" => builder.concurrent_tenant_size_logical_size_queries({ "concurrent_tenant_size_logical_size_queries" => builder.concurrent_tenant_size_logical_size_queries({
let input = parse_toml_string(key, item)?; let input = parse_toml_string(key, item)?;
let permits = input.parse::<usize>().context("expected a number of initial permits, not {s:?}")?; let permits = input.parse::<usize>().context("expected a number of initial permits, not {s:?}")?;
@@ -904,6 +930,10 @@ impl PageServerConf {
broker_endpoint: storage_broker::DEFAULT_ENDPOINT.parse().unwrap(), broker_endpoint: storage_broker::DEFAULT_ENDPOINT.parse().unwrap(),
broker_keepalive_interval: Duration::from_secs(5000), broker_keepalive_interval: Duration::from_secs(5000),
log_format: LogFormat::from_str(defaults::DEFAULT_LOG_FORMAT).unwrap(), log_format: LogFormat::from_str(defaults::DEFAULT_LOG_FORMAT).unwrap(),
concurrent_tenant_warmup: ConfigurableSemaphore::new(
NonZeroUsize::new(DEFAULT_CONCURRENT_TENANT_WARMUP)
.expect("Invalid default constant"),
),
concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(), concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
eviction_task_immitated_concurrent_logical_size_queries: ConfigurableSemaphore::default( eviction_task_immitated_concurrent_logical_size_queries: ConfigurableSemaphore::default(
), ),
@@ -1122,6 +1152,9 @@ background_task_maximum_delay = '334 s'
storage_broker::DEFAULT_KEEPALIVE_INTERVAL storage_broker::DEFAULT_KEEPALIVE_INTERVAL
)?, )?,
log_format: LogFormat::from_str(defaults::DEFAULT_LOG_FORMAT).unwrap(), log_format: LogFormat::from_str(defaults::DEFAULT_LOG_FORMAT).unwrap(),
concurrent_tenant_warmup: ConfigurableSemaphore::new(
NonZeroUsize::new(DEFAULT_CONCURRENT_TENANT_WARMUP).unwrap()
),
concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(), concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
eviction_task_immitated_concurrent_logical_size_queries: eviction_task_immitated_concurrent_logical_size_queries:
ConfigurableSemaphore::default(), ConfigurableSemaphore::default(),
@@ -1188,6 +1221,9 @@ background_task_maximum_delay = '334 s'
broker_endpoint: storage_broker::DEFAULT_ENDPOINT.parse().unwrap(), broker_endpoint: storage_broker::DEFAULT_ENDPOINT.parse().unwrap(),
broker_keepalive_interval: Duration::from_secs(5), broker_keepalive_interval: Duration::from_secs(5),
log_format: LogFormat::Json, log_format: LogFormat::Json,
concurrent_tenant_warmup: ConfigurableSemaphore::new(
NonZeroUsize::new(DEFAULT_CONCURRENT_TENANT_WARMUP).unwrap()
),
concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(), concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
eviction_task_immitated_concurrent_logical_size_queries: eviction_task_immitated_concurrent_logical_size_queries:
ConfigurableSemaphore::default(), ConfigurableSemaphore::default(),
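
Note on the `concurrent_tenant_warmup` parsing above: the value arrives as a string, is parsed as `usize`, and is then rejected if it is zero. A minimal std-only sketch of that validation is shown below. The function name `parse_permits` and the `String` error type are assumptions for illustration (the real code uses `anyhow` contexts). The sketch also interpolates the offending input into the message, which a plain string literal passed to `.context(...)` does not do on its own.

```rust
use std::num::NonZeroUsize;

/// Hypothetical helper: parse a permit count and reject zero,
/// mirroring the two validation steps in the config parser above.
pub fn parse_permits(input: &str) -> Result<NonZeroUsize, String> {
    let permits: usize = input
        .parse()
        .map_err(|e| format!("expected a number of initial permits, not {input:?}: {e}"))?;
    // A semaphore with zero permits would block every tenant warm-up forever.
    NonZeroUsize::new(permits)
        .ok_or_else(|| "initial semaphore permits out of range: 0".to_string())
}
```

Using `NonZeroUsize` as the builder's value type means the zero case only has to be checked once, at parse time.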

@@ -1,4 +1,2 @@
 pub mod routes;
 pub use routes::make_router;
-pub use pageserver_api::models;

@@ -14,6 +14,7 @@ use hyper::header;
 use hyper::StatusCode;
 use hyper::{Body, Request, Response, Uri};
 use metrics::launch_timestamp::LaunchTimestamp;
+use pageserver_api::models::TenantDetails;
 use pageserver_api::models::{
 DownloadRemoteLayersTaskSpawnRequest, LocationConfigMode, TenantAttachRequest,
 TenantLoadRequest, TenantLocationConfigRequest,
@@ -28,16 +29,13 @@ use utils::http::endpoint::request_span;
 use utils::http::json::json_request_or_empty_body;
 use utils::http::request::{get_request_param, must_get_query_param, parse_query_param};
-use super::models::{
-StatusResponse, TenantConfigRequest, TenantCreateRequest, TenantCreateResponse, TenantInfo,
-TimelineCreateRequest, TimelineGcRequest, TimelineInfo,
-};
 use crate::context::{DownloadBehavior, RequestContext};
 use crate::deletion_queue::DeletionQueueClient;
 use crate::metrics::{StorageTimeOperation, STORAGE_TIME_GLOBAL};
 use crate::pgdatadir_mapping::LsnForTimestamp;
 use crate::task_mgr::TaskKind;
 use crate::tenant::config::{LocationConf, TenantConfOpt};
+use crate::tenant::mgr::GetActiveTenantError;
 use crate::tenant::mgr::{
 GetTenantError, SetNewTenantConfigError, TenantManager, TenantMapError, TenantMapInsertError,
 TenantSlotError, TenantSlotUpsertError, TenantStateError,
@@ -50,6 +48,10 @@ use crate::tenant::timeline::Timeline;
 use crate::tenant::{LogicalSizeCalculationCause, PageReconstructError, TenantSharedResources};
 use crate::{config::PageServerConf, tenant::mgr};
 use crate::{disk_usage_eviction_task, tenant};
+use pageserver_api::models::{
+StatusResponse, TenantConfigRequest, TenantCreateRequest, TenantCreateResponse, TenantInfo,
+TimelineCreateRequest, TimelineGcRequest, TimelineInfo,
+};
 use utils::{
 auth::SwappableJwtAuth,
 generation::Generation,
@@ -65,7 +67,12 @@ use utils::{
 };
 // Imports only used for testing APIs
-use super::models::ConfigureFailpointsRequest;
+use pageserver_api::models::ConfigureFailpointsRequest;
+// For APIs that require an Active tenant, how long should we block waiting for that state?
+// This is not functionally necessary (clients will retry), but avoids generating a lot of
+// failed API calls while tenants are activating.
+const ACTIVE_TENANT_TIMEOUT: Duration = Duration::from_millis(5000);
 pub struct State {
 conf: &'static PageServerConf,
@@ -233,6 +240,19 @@ impl From<GetTenantError> for ApiError {
 }
 }
+impl From<GetActiveTenantError> for ApiError {
+fn from(e: GetActiveTenantError) -> ApiError {
+match e {
+GetActiveTenantError::WillNotBecomeActive(_) => ApiError::Conflict(format!("{}", e)),
+GetActiveTenantError::Cancelled => ApiError::ShuttingDown,
+GetActiveTenantError::NotFound(gte) => gte.into(),
+GetActiveTenantError::WaitForActiveTimeout { .. } => {
+ApiError::ResourceUnavailable(format!("{}", e).into())
+}
+}
+}
+}
 impl From<SetNewTenantConfigError> for ApiError {
 fn from(e: SetNewTenantConfigError) -> ApiError {
 match e {
@@ -435,7 +455,10 @@ async fn timeline_create_handler(
 let state = get_state(&request);
 async {
-let tenant = state.tenant_manager.get_attached_tenant_shard(tenant_shard_id, true)?;
+let tenant = state.tenant_manager.get_attached_tenant_shard(tenant_shard_id, false)?;
+tenant.wait_to_become_active(ACTIVE_TENANT_TIMEOUT).await?;
 match tenant.create_timeline(
 new_timeline_id,
 request_data.ancestor_timeline_id.map(TimelineId::from),
@@ -570,8 +593,6 @@ async fn get_lsn_by_timestamp_handler(
 )));
 }
-let version: Option<u8> = parse_query_param(&request, "version")?;
 let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
 let timestamp_raw = must_get_query_param(&request, "timestamp")?;
 let timestamp = humantime::parse_rfc3339(&timestamp_raw)
@@ -584,31 +605,18 @@ async fn get_lsn_by_timestamp_handler(
 let result = timeline
 .find_lsn_for_timestamp(timestamp_pg, &cancel, &ctx)
 .await?;
-if version.unwrap_or(0) > 1 {
-#[derive(serde::Serialize)]
-struct Result {
-lsn: Lsn,
-kind: &'static str,
-}
-let (lsn, kind) = match result {
-LsnForTimestamp::Present(lsn) => (lsn, "present"),
-LsnForTimestamp::Future(lsn) => (lsn, "future"),
-LsnForTimestamp::Past(lsn) => (lsn, "past"),
-LsnForTimestamp::NoData(lsn) => (lsn, "nodata"),
-};
-json_response(StatusCode::OK, Result { lsn, kind })
-} else {
-// FIXME: this is a temporary crutch not to break backwards compatibility
-// See https://github.com/neondatabase/neon/pull/5608
-let result = match result {
-LsnForTimestamp::Present(lsn) => format!("{lsn}"),
-LsnForTimestamp::Future(_lsn) => "future".into(),
-LsnForTimestamp::Past(_lsn) => "past".into(),
-LsnForTimestamp::NoData(_lsn) => "nodata".into(),
-};
-json_response(StatusCode::OK, result)
-}
+#[derive(serde::Serialize)]
+struct Result {
+lsn: Lsn,
+kind: &'static str,
+}
+let (lsn, kind) = match result {
+LsnForTimestamp::Present(lsn) => (lsn, "present"),
+LsnForTimestamp::Future(lsn) => (lsn, "future"),
+LsnForTimestamp::Past(lsn) => (lsn, "past"),
+LsnForTimestamp::NoData(lsn) => (lsn, "nodata"),
+};
+json_response(StatusCode::OK, Result { lsn, kind })
 }
 async fn get_timestamp_of_lsn_handler(
@@ -694,11 +702,23 @@ async fn timeline_delete_handler(
 let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
 check_permission(&request, Some(tenant_shard_id.tenant_id))?;
-let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Warn);
 let state = get_state(&request);
-state.tenant_manager.delete_timeline(tenant_shard_id, timeline_id, &ctx)
-.instrument(info_span!("timeline_delete", tenant_id=%tenant_shard_id.tenant_id, shard=%tenant_shard_id.shard_slug(), %timeline_id))
+let tenant = state
+.tenant_manager
+.get_attached_tenant_shard(tenant_shard_id, false)
+.map_err(|e| {
+match e {
+// GetTenantError has a built-in conversion to ApiError, but in this context we don't
+// want to treat missing tenants as 404, to avoid ambiguity with successful deletions.
+GetTenantError::NotFound(_) => ApiError::PreconditionFailed(
+"Requested tenant is missing".to_string().into_boxed_str(),
+),
+e => e.into(),
+}
+})?;
+tenant.wait_to_become_active(ACTIVE_TENANT_TIMEOUT).await?;
+tenant.delete_timeline(timeline_id).instrument(info_span!("timeline_delete", tenant_id=%tenant_shard_id.tenant_id, shard=%tenant_shard_id.shard_slug(), %timeline_id))
 .await?;
 json_response(StatusCode::ACCEPTED, ())
@@ -838,11 +858,14 @@ async fn tenant_status(
 }
 let state = tenant.current_state();
-Result::<_, ApiError>::Ok(TenantInfo {
-id: tenant_shard_id,
-state: state.clone(),
-current_physical_size: Some(current_physical_size),
-attachment_status: state.attachment_status(),
+Result::<_, ApiError>::Ok(TenantDetails {
+tenant_info: TenantInfo {
+id: tenant_shard_id,
+state: state.clone(),
+current_physical_size: Some(current_physical_size),
+attachment_status: state.attachment_status(),
+},
+timelines: tenant.list_timeline_ids(),
 })
 }
 .instrument(info_span!("tenant_status_handler",
@@ -1136,7 +1159,10 @@ async fn tenant_create_handler(
 // We created the tenant. Existing API semantics are that the tenant
 // is Active when this function returns.
-if let res @ Err(_) = new_tenant.wait_to_become_active().await {
+if let res @ Err(_) = new_tenant
+.wait_to_become_active(ACTIVE_TENANT_TIMEOUT)
+.await
+{
 // This shouldn't happen because we just created the tenant directory
 // in tenant::mgr::create_tenant, and there aren't any remote timelines
 // to load, so, nothing can really fail during load.
@@ -1487,69 +1513,6 @@ async fn timeline_collect_keyspace(
 let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
 check_permission(&request, Some(tenant_shard_id.tenant_id))?;
-struct Partitioning {
-keys: crate::keyspace::KeySpace,
-at_lsn: Lsn,
-}
-impl serde::Serialize for Partitioning {
-fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
-where
-S: serde::Serializer,
-{
-use serde::ser::SerializeMap;
-let mut map = serializer.serialize_map(Some(2))?;
-map.serialize_key("keys")?;
-map.serialize_value(&KeySpace(&self.keys))?;
-map.serialize_key("at_lsn")?;
-map.serialize_value(&WithDisplay(&self.at_lsn))?;
-map.end()
-}
-}
-struct WithDisplay<'a, T>(&'a T);
-impl<'a, T: std::fmt::Display> serde::Serialize for WithDisplay<'a, T> {
-fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
-where
-S: serde::Serializer,
-{
-serializer.collect_str(&self.0)
-}
-}
-struct KeySpace<'a>(&'a crate::keyspace::KeySpace);
-impl<'a> serde::Serialize for KeySpace<'a> {
-fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
-where
-S: serde::Serializer,
-{
-use serde::ser::SerializeSeq;
-let mut seq = serializer.serialize_seq(Some(self.0.ranges.len()))?;
-for kr in &self.0.ranges {
-seq.serialize_element(&KeyRange(kr))?;
-}
-seq.end()
-}
-}
-struct KeyRange<'a>(&'a std::ops::Range<crate::repository::Key>);
-impl<'a> serde::Serialize for KeyRange<'a> {
-fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
-where
-S: serde::Serializer,
-{
-use serde::ser::SerializeTuple;
-let mut t = serializer.serialize_tuple(2)?;
-t.serialize_element(&WithDisplay(&self.0.start))?;
-t.serialize_element(&WithDisplay(&self.0.end))?;
-t.end()
-}
-}
 let at_lsn: Option<Lsn> = parse_query_param(&request, "at_lsn")?;
 async {
@@ -1561,7 +1524,9 @@ async fn timeline_collect_keyspace(
 .await
 .map_err(|e| ApiError::InternalServerError(e.into()))?;
-json_response(StatusCode::OK, Partitioning { keys, at_lsn })
+let res = pageserver_api::models::partitioning::Partitioning { keys, at_lsn };
+json_response(StatusCode::OK, res)
 }
 .instrument(info_span!("timeline_collect_keyspace", tenant_id = %tenant_shard_id.tenant_id, shard_id = %tenant_shard_id.shard_slug(), %timeline_id))
 .await
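
Note on `get_lsn_by_timestamp_handler` in this file: after the change the handler always returns an `lsn` together with a `kind` string, instead of a bare string in the legacy code path. The mapping can be sketched with the standard library alone. The enum below is a local stand-in with `u64` in place of the real `Lsn` type, and the function name `lsn_and_kind` is hypothetical; only the four variant names and the four kind strings are taken from the diff.

```rust
/// Local stand-in for `crate::pgdatadir_mapping::LsnForTimestamp`;
/// `u64` replaces the real `Lsn` type to keep the sketch std-only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LsnForTimestamp {
    Present(u64),
    Future(u64),
    Past(u64),
    NoData(u64),
}

/// Hypothetical helper: every variant carries an LSN, so the response can
/// always report one alongside the kind label.
pub fn lsn_and_kind(result: LsnForTimestamp) -> (u64, &'static str) {
    match result {
        LsnForTimestamp::Present(lsn) => (lsn, "present"),
        LsnForTimestamp::Future(lsn) => (lsn, "future"),
        LsnForTimestamp::Past(lsn) => (lsn, "past"),
        LsnForTimestamp::NoData(lsn) => (lsn, "nodata"),
    }
}
```

Returning the pair for every variant is what lets the handler drop the `version` query parameter: there is a single response shape regardless of outcome.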

@@ -10,7 +10,7 @@ pub mod deletion_queue;
 pub mod disk_usage_eviction_task;
 pub mod http;
 pub mod import_datadir;
-pub mod keyspace;
+pub use pageserver_api::keyspace;
 pub mod metrics;
 pub mod page_cache;
 pub mod page_service;

@@ -522,14 +522,18 @@ pub(crate) mod initial_logical_size {
 impl StartCalculation {
 pub(crate) fn first(&self, circumstances: StartCircumstances) -> OngoingCalculationGuard {
 let circumstances_label: &'static str = circumstances.into();
-self.0.with_label_values(&["first", circumstances_label]);
+self.0
+.with_label_values(&["first", circumstances_label])
+.inc();
 OngoingCalculationGuard {
 inc_drop_calculation: Some(DROP_CALCULATION.first.clone()),
 }
 }
 pub(crate) fn retry(&self, circumstances: StartCircumstances) -> OngoingCalculationGuard {
 let circumstances_label: &'static str = circumstances.into();
-self.0.with_label_values(&["retry", circumstances_label]);
+self.0
+.with_label_values(&["retry", circumstances_label])
+.inc();
 OngoingCalculationGuard {
 inc_drop_calculation: Some(DROP_CALCULATION.retry.clone()),
 }
@@ -684,14 +688,54 @@ pub static STARTUP_IS_LOADING: Lazy<UIntGauge> = Lazy::new(|| {
.expect("Failed to register pageserver_startup_is_loading") .expect("Failed to register pageserver_startup_is_loading")
}); });
/// How long did tenants take to go from construction to active state? /// Metrics related to the lifecycle of a [`crate::tenant::Tenant`] object: things
pub(crate) static TENANT_ACTIVATION: Lazy<Histogram> = Lazy::new(|| { /// like how long it took to load.
register_histogram!( ///
/// Note that these are process-global metrics, _not_ per-tenant metrics. Per-tenant
/// metrics are rather expensive, and usually fine-grained stuff makes more sense
/// at a timeline level than tenant level.
pub(crate) struct TenantMetrics {
/// How long did tenants take to go from construction to active state?
pub(crate) activation: Histogram,
pub(crate) preload: Histogram,
pub(crate) attach: Histogram,
/// How many tenants are included in the initial startup of the pageserver?
pub(crate) startup_scheduled: IntCounter,
pub(crate) startup_complete: IntCounter,
}
pub(crate) static TENANT: Lazy<TenantMetrics> = Lazy::new(|| {
TenantMetrics {
activation: register_histogram!(
"pageserver_tenant_activation_seconds", "pageserver_tenant_activation_seconds",
"Time taken by tenants to activate, in seconds", "Time taken by tenants to activate, in seconds",
CRITICAL_OP_BUCKETS.into() CRITICAL_OP_BUCKETS.into()
) )
.expect("Failed to register pageserver_tenant_activation_seconds metric") .expect("Failed to register metric"),
preload: register_histogram!(
"pageserver_tenant_preload_seconds",
"Time taken by tenants to load remote metadata on startup/attach, in seconds",
CRITICAL_OP_BUCKETS.into()
)
.expect("Failed to register metric"),
attach: register_histogram!(
"pageserver_tenant_attach_seconds",
"Time taken by tenants to initialize, after remote metadata is already loaded",
CRITICAL_OP_BUCKETS.into()
)
.expect("Failed to register metric"),
startup_scheduled: register_int_counter!(
"pageserver_tenant_startup_scheduled",
"Number of tenants included in pageserver startup (doesn't count tenants attached later)"
).expect("Failed to register metric"),
startup_complete: register_int_counter!(
"pageserver_tenant_startup_complete",
"Number of tenants that have completed warm-up, or activated on-demand during initial startup: \
should eventually reach `pageserver_tenant_startup_scheduled_total`. Does not include broken \
tenants: such cases will lead to this metric never reaching the scheduled count."
).expect("Failed to register metric"),
}
}); });
/// Each `Timeline`'s [`EVICTIONS_WITH_LOW_RESIDENCE_DURATION`] metric. /// Each `Timeline`'s [`EVICTIONS_WITH_LOW_RESIDENCE_DURATION`] metric.
@@ -979,12 +1023,62 @@ static SMGR_QUERY_TIME_PER_TENANT_TIMELINE: Lazy<HistogramVec> = Lazy::new(|| {
.expect("failed to define a metric") .expect("failed to define a metric")
}); });
static SMGR_QUERY_TIME_GLOBAL_BUCKETS: Lazy<Vec<f64>> = Lazy::new(|| {
[
1,
10,
20,
40,
60,
80,
100,
200,
300,
400,
500,
600,
700,
800,
900,
1_000, // 1ms
2_000,
4_000,
6_000,
8_000,
10_000, // 10ms
20_000,
40_000,
60_000,
80_000,
100_000,
200_000,
400_000,
600_000,
800_000,
1_000_000, // 1s
2_000_000,
4_000_000,
6_000_000,
8_000_000,
10_000_000, // 10s
20_000_000,
50_000_000,
100_000_000,
200_000_000,
1_000_000_000, // 1000s
]
.into_iter()
.map(Duration::from_micros)
.map(|d| d.as_secs_f64())
.collect()
});
static SMGR_QUERY_TIME_GLOBAL: Lazy<HistogramVec> = Lazy::new(|| { static SMGR_QUERY_TIME_GLOBAL: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!( register_histogram_vec!(
"pageserver_smgr_query_seconds_global", "pageserver_smgr_query_seconds_global",
"Time spent on smgr query handling, aggregated by query type.", "Time spent on smgr query handling, aggregated by query type.",
&["smgr_query_type"], &["smgr_query_type"],
CRITICAL_OP_BUCKETS.into(), SMGR_QUERY_TIME_GLOBAL_BUCKETS.clone(),
) )
.expect("failed to define a metric") .expect("failed to define a metric")
}); });
@@ -2213,6 +2307,9 @@ pub fn preinitialize_metrics() {
// Deletion queue stats // Deletion queue stats
Lazy::force(&DELETION_QUEUE); Lazy::force(&DELETION_QUEUE);
// Tenant stats
Lazy::force(&TENANT);
// Tenant manager stats // Tenant manager stats
Lazy::force(&TENANT_MANAGER); Lazy::force(&TENANT_MANAGER);
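
Review note on `SMGR_QUERY_TIME_GLOBAL_BUCKETS`: the bucket bounds are written in microseconds and converted to seconds before registration. The sketch below isolates that conversion using only the standard library. The function names `buckets_from_micros` and `is_strictly_increasing` are hypothetical helpers made up for illustration; they are not part of the pageserver code.

```rust
use std::time::Duration;

/// Convert histogram bucket upper bounds given in microseconds into
/// fractional seconds, mirroring the `.map(Duration::from_micros)`
/// / `.map(|d| d.as_secs_f64())` chain in the hunk above.
/// (Hypothetical helper, not part of the diff.)
fn buckets_from_micros(bounds_us: &[u64]) -> Vec<f64> {
    bounds_us
        .iter()
        .copied()
        .map(Duration::from_micros)
        .map(|d| d.as_secs_f64())
        .collect()
}

/// Sanity check that each bound is larger than the one before it.
/// (Hypothetical helper, not part of the diff.)
fn is_strictly_increasing(buckets: &[f64]) -> bool {
    buckets.windows(2).all(|w| w[0] < w[1])
}
```

A check like `is_strictly_increasing` could be run in a unit test over the real bucket list to catch a mistyped bound, on the assumption that the histogram is expected to have ascending buckets.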

View File

@@ -2,38 +2,11 @@ use crate::walrecord::NeonWalRecord;
use anyhow::Result; use anyhow::Result;
use bytes::Bytes; use bytes::Bytes;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use std::ops::{AddAssign, Range}; use std::ops::AddAssign;
use std::time::Duration; use std::time::Duration;
pub use pageserver_api::key::{Key, KEY_SIZE}; pub use pageserver_api::key::{Key, KEY_SIZE};
pub fn key_range_size(key_range: &Range<Key>) -> u32 {
let start = key_range.start;
let end = key_range.end;
if end.field1 != start.field1
|| end.field2 != start.field2
|| end.field3 != start.field3
|| end.field4 != start.field4
{
return u32::MAX;
}
let start = (start.field5 as u64) << 32 | start.field6 as u64;
let end = (end.field5 as u64) << 32 | end.field6 as u64;
let diff = end - start;
if diff > u32::MAX as u64 {
u32::MAX
} else {
diff as u32
}
}
pub fn singleton_range(key: Key) -> Range<Key> {
key..key.next()
}
/// A 'value' stored for one Key. /// A 'value' stored for one Key.
#[derive(Debug, Clone, Serialize, Deserialize)] #[derive(Debug, Clone, Serialize, Deserialize)]
#[cfg_attr(test, derive(PartialEq))] #[cfg_attr(test, derive(PartialEq))]
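
Review note on the removed `key_range_size`: the function reports the number of keys in a range, saturating at `u32::MAX` when the range spans more than one relation or is too large. The sketch below restates that logic against a stand-in type. `SketchKey`, its field widths, and `key_range_size_sketch` are assumptions for illustration only; the real `Key` lives in `pageserver_api::key`. The `saturating_sub` is an extra guard against reversed ranges that the removed code does not have.

```rust
use std::ops::Range;

/// Stand-in for the real `Key` type. The field names follow the removed
/// code above; the field widths are an assumption made for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SketchKey {
    field1: u8,
    field2: u32,
    field3: u32,
    field4: u32,
    field5: u8,
    field6: u32,
}

/// Number of keys in `key_range`, saturating at `u32::MAX`.
fn key_range_size_sketch(key_range: &Range<SketchKey>) -> u32 {
    let (start, end) = (key_range.start, key_range.end);
    // If any of the first four fields differ, the range is treated as "very large".
    if (start.field1, start.field2, start.field3, start.field4)
        != (end.field1, end.field2, end.field3, end.field4)
    {
        return u32::MAX;
    }
    // Combine the last two fields into one 64-bit offset.
    let start = ((start.field5 as u64) << 32) | start.field6 as u64;
    let end = ((end.field5 as u64) << 32) | end.field6 as u64;
    // Sketch-only hardening: a reversed range yields 0 instead of underflowing.
    let diff = end.saturating_sub(start);
    if diff > u32::MAX as u64 {
        u32::MAX
    } else {
        diff as u32
    }
}
```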

View File

@@ -36,6 +36,8 @@ use utils::crashsafe::path_with_suffix_extension;
use utils::fs_ext; use utils::fs_ext;
use utils::sync::gate::Gate; use utils::sync::gate::Gate;
use utils::sync::gate::GateGuard; use utils::sync::gate::GateGuard;
use utils::timeout::timeout_cancellable;
use utils::timeout::TimeoutCancellableError;
use self::config::AttachedLocationConfig; use self::config::AttachedLocationConfig;
use self::config::AttachmentMode; use self::config::AttachmentMode;
@@ -59,7 +61,7 @@ use crate::deletion_queue::DeletionQueueClient;
use crate::deletion_queue::DeletionQueueError; use crate::deletion_queue::DeletionQueueError;
use crate::import_datadir; use crate::import_datadir;
use crate::is_uninit_mark; use crate::is_uninit_mark;
use crate::metrics::TENANT_ACTIVATION; use crate::metrics::TENANT;
use crate::metrics::{remove_tenant_metrics, TENANT_STATE_METRIC, TENANT_SYNTHETIC_SIZE_METRIC}; use crate::metrics::{remove_tenant_metrics, TENANT_STATE_METRIC, TENANT_SYNTHETIC_SIZE_METRIC};
use crate::repository::GcResult; use crate::repository::GcResult;
use crate::task_mgr; use crate::task_mgr;
@@ -226,7 +228,7 @@ pub struct Tenant {
/// The value creation timestamp, used to measure activation delay, see: /// The value creation timestamp, used to measure activation delay, see:
/// <https://github.com/neondatabase/neon/issues/4025> /// <https://github.com/neondatabase/neon/issues/4025>
loading_started_at: Instant, constructed_at: Instant,
state: watch::Sender<TenantState>, state: watch::Sender<TenantState>,
@@ -276,6 +278,11 @@ pub struct Tenant {
eviction_task_tenant_state: tokio::sync::Mutex<EvictionTaskTenantState>, eviction_task_tenant_state: tokio::sync::Mutex<EvictionTaskTenantState>,
/// If the tenant is in Activating state, notify this to encourage it
/// to proceed to Active as soon as possible, rather than waiting for lazy
/// background warmup.
pub(crate) activate_now_sem: tokio::sync::Semaphore,
pub(crate) delete_progress: Arc<tokio::sync::Mutex<DeleteTenantFlow>>, pub(crate) delete_progress: Arc<tokio::sync::Mutex<DeleteTenantFlow>>,
// Cancellation token fires when we have entered shutdown(). This is a parent of // Cancellation token fires when we have entered shutdown(). This is a parent of
@@ -622,6 +629,14 @@ impl Tenant {
"attach tenant", "attach tenant",
false, false,
async move { async move {
// Is this tenant being spawned as part of process startup?
let starting_up = init_order.is_some();
scopeguard::defer! {
if starting_up {
TENANT.startup_complete.inc();
}
}
// Ideally we should use Tenant::set_broken_no_wait, but it is not supposed to be used when tenant is in loading state. // Ideally we should use Tenant::set_broken_no_wait, but it is not supposed to be used when tenant is in loading state.
let make_broken = let make_broken =
|t: &Tenant, err: anyhow::Error| { |t: &Tenant, err: anyhow::Error| {
@@ -648,8 +663,62 @@ impl Tenant {
.as_mut() .as_mut()
.and_then(|x| x.initial_tenant_load_remote.take()); .and_then(|x| x.initial_tenant_load_remote.take());
enum AttachType<'a> {
// During pageserver startup, we are attaching this tenant lazily in the background
Warmup(tokio::sync::SemaphorePermit<'a>),
// During pageserver startup, we are attaching this tenant as soon as we can,
// because a client tried to access it.
OnDemand,
// During normal operations after startup, we are attaching a tenant.
Normal,
}
// Before doing any I/O, wait for at least one of:
// - A client to attempt to access this tenant (on-demand loading)
// - A permit to become available in the warmup semaphore (background warmup)
//
// Some-ness of init_order is how we know if we're attaching during startup or later
// in process lifetime.
let attach_type = if init_order.is_some() {
tokio::select!(
_ = tenant_clone.activate_now_sem.acquire() => {
tracing::info!("Activating tenant (on-demand)");
AttachType::OnDemand
},
permit_result = conf.concurrent_tenant_warmup.inner().acquire() => {
match permit_result {
Ok(p) => {
tracing::info!("Activating tenant (warmup)");
AttachType::Warmup(p)
}
Err(_) => {
// This is unexpected: the warmup semaphore should stay alive
// for the lifetime of init_order. Log a warning and proceed.
tracing::warn!("warmup_limit semaphore unexpectedly closed");
AttachType::Normal
}
}
}
_ = tenant_clone.cancel.cancelled() => {
// This is safe, but should be pretty rare: it is interesting if a tenant
// stayed in Activating for such a long time that shutdown found it in
// that state.
tracing::info!(state=%tenant_clone.current_state(), "Tenant shut down before activation");
return Ok(());
},
)
} else {
AttachType::Normal
};
let preload_timer = TENANT.preload.start_timer();
let preload = match mode { let preload = match mode {
SpawnMode::Create => {None}, SpawnMode::Create => {
// Don't count the skipped preload into the histogram of preload durations
preload_timer.stop_and_discard();
None
},
SpawnMode::Normal => { SpawnMode::Normal => {
match &remote_storage { match &remote_storage {
Some(remote_storage) => Some( Some(remote_storage) => Some(
@@ -659,7 +728,11 @@ impl Tenant {
tracing::info_span!(parent: None, "attach_preload", tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug()), tracing::info_span!(parent: None, "attach_preload", tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug()),
) )
.await { .await {
Ok(p) => p, Ok(p) => {
preload_timer.observe_duration();
p
},
Err(e) => { Err(e) => {
make_broken(&tenant_clone, anyhow::anyhow!(e)); make_broken(&tenant_clone, anyhow::anyhow!(e));
return Ok(()); return Ok(());
@@ -721,15 +794,43 @@ impl Tenant {
} }
} }
// We will time the duration of the attach phase unless this is a creation (attach will do no work)
let attach_timer = match mode {
SpawnMode::Create => None,
SpawnMode::Normal => {Some(TENANT.attach.start_timer())}
};
match tenant_clone.attach(preload, &ctx).await { match tenant_clone.attach(preload, &ctx).await {
Ok(()) => { Ok(()) => {
info!("attach finished, activating"); info!("attach finished, activating");
if let Some(t) = attach_timer { t.observe_duration(); }
tenant_clone.activate(broker_client, None, &ctx); tenant_clone.activate(broker_client, None, &ctx);
} }
Err(e) => { Err(e) => {
if let Some(t) = attach_timer { t.observe_duration(); }
make_broken(&tenant_clone, anyhow::anyhow!(e)); make_broken(&tenant_clone, anyhow::anyhow!(e));
} }
} }
// If we are doing an opportunistic warmup attachment at startup, initialize
// logical size at the same time. This is better than starting a bunch of idle tenants
// with cold caches and then coming back later to initialize their logical sizes.
//
// It also prevents the warmup process competing with the concurrency limit on
// logical size calculations: if logical size calculation semaphore is saturated,
// then warmup will wait for that before proceeding to the next tenant.
if let AttachType::Warmup(_permit) = attach_type {
let mut futs = FuturesUnordered::new();
let timelines: Vec<_> = tenant_clone.timelines.lock().unwrap().values().cloned().collect();
for t in timelines {
futs.push(t.await_initial_logical_size())
}
tracing::info!("Waiting for initial logical sizes while warming up...");
while futs.next().await.is_some() {
}
tracing::info!("Warm-up complete");
}
Ok(()) Ok(())
} }
.instrument({ .instrument({
@@ -1451,6 +1552,10 @@ impl Tenant {
.collect() .collect()
} }
pub fn list_timeline_ids(&self) -> Vec<TimelineId> {
self.timelines.lock().unwrap().keys().cloned().collect()
}
/// This is used to create the initial 'main' timeline during bootstrapping, /// This is used to create the initial 'main' timeline during bootstrapping,
/// or when importing a new base backup. The caller is expected to load an /// or when importing a new base backup. The caller is expected to load an
/// initial image of the datadir to the new timeline after this. /// initial image of the datadir to the new timeline after this.
@@ -1696,6 +1801,15 @@ impl Tenant {
Ok(loaded_timeline) Ok(loaded_timeline)
} }
pub(crate) async fn delete_timeline(
self: Arc<Self>,
timeline_id: TimelineId,
) -> Result<(), DeleteTimelineError> {
DeleteTimelineFlow::run(&self, timeline_id, false).await?;
Ok(())
}
/// perform one garbage collection iteration, removing old data files from disk. /// perform one garbage collection iteration, removing old data files from disk.
/// this function is periodically called by gc task. /// this function is periodically called by gc task.
/// also it can be explicitly requested through page server api 'do_gc' command. /// also it can be explicitly requested through page server api 'do_gc' command.
@@ -1857,7 +1971,7 @@ impl Tenant {
); );
*current_state = TenantState::Active; *current_state = TenantState::Active;
let elapsed = self.loading_started_at.elapsed(); let elapsed = self.constructed_at.elapsed();
let total_timelines = timelines_accessor.len(); let total_timelines = timelines_accessor.len();
// log a lot of stuff, because some tenants sometimes suffer from user-visible // log a lot of stuff, because some tenants sometimes suffer from user-visible
@@ -1872,7 +1986,7 @@ impl Tenant {
"activation attempt finished" "activation attempt finished"
); );
TENANT_ACTIVATION.observe(elapsed.as_secs_f64()); TENANT.activation.observe(elapsed.as_secs_f64());
}); });
} }
} }
@@ -2127,18 +2241,41 @@ impl Tenant {
self.state.subscribe() self.state.subscribe()
} }
pub(crate) async fn wait_to_become_active(&self) -> Result<(), GetActiveTenantError> { /// The activate_now semaphore is initialized with zero units. As soon as
/// we add a unit, waiters will be able to acquire a unit and proceed.
pub(crate) fn activate_now(&self) {
self.activate_now_sem.add_permits(1);
}
pub(crate) async fn wait_to_become_active(
&self,
timeout: Duration,
) -> Result<(), GetActiveTenantError> {
let mut receiver = self.state.subscribe(); let mut receiver = self.state.subscribe();
loop { loop {
let current_state = receiver.borrow_and_update().clone(); let current_state = receiver.borrow_and_update().clone();
match current_state { match current_state {
TenantState::Loading | TenantState::Attaching | TenantState::Activating(_) => { TenantState::Loading | TenantState::Attaching | TenantState::Activating(_) => {
// in these states, there's a chance that we can reach ::Active // in these states, there's a chance that we can reach ::Active
receiver.changed().await.map_err( self.activate_now();
|_e: tokio::sync::watch::error::RecvError| match timeout_cancellable(timeout, &self.cancel, receiver.changed()).await {
// Tenant existed but was dropped: report it as non-existent Ok(r) => {
GetActiveTenantError::NotFound(GetTenantError::NotFound(self.tenant_shard_id.tenant_id)) r.map_err(
)?; |_e: tokio::sync::watch::error::RecvError|
// Tenant existed but was dropped: report it as non-existent
GetActiveTenantError::NotFound(GetTenantError::NotFound(self.tenant_shard_id.tenant_id))
)?
}
Err(TimeoutCancellableError::Cancelled) => {
return Err(GetActiveTenantError::Cancelled);
}
Err(TimeoutCancellableError::Timeout) => {
return Err(GetActiveTenantError::WaitForActiveTimeout {
latest_state: Some(self.current_state()),
wait_time: timeout,
});
}
}
} }
TenantState::Active { .. } => { TenantState::Active { .. } => {
return Ok(()); return Ok(());
@@ -2463,7 +2600,7 @@ impl Tenant {
conf, conf,
// using now here is good enough approximation to catch tenants with really long // using now here is good enough approximation to catch tenants with really long
// activation times. // activation times.
loading_started_at: Instant::now(), constructed_at: Instant::now(),
tenant_conf: Arc::new(RwLock::new(attached_conf)), tenant_conf: Arc::new(RwLock::new(attached_conf)),
timelines: Mutex::new(HashMap::new()), timelines: Mutex::new(HashMap::new()),
timelines_creating: Mutex::new(HashSet::new()), timelines_creating: Mutex::new(HashSet::new()),
@@ -2475,6 +2612,7 @@ impl Tenant {
cached_logical_sizes: tokio::sync::Mutex::new(HashMap::new()), cached_logical_sizes: tokio::sync::Mutex::new(HashMap::new()),
cached_synthetic_tenant_size: Arc::new(AtomicU64::new(0)), cached_synthetic_tenant_size: Arc::new(AtomicU64::new(0)),
eviction_task_tenant_state: tokio::sync::Mutex::new(EvictionTaskTenantState::default()), eviction_task_tenant_state: tokio::sync::Mutex::new(EvictionTaskTenantState::default()),
activate_now_sem: tokio::sync::Semaphore::new(0),
delete_progress: Arc::new(tokio::sync::Mutex::new(DeleteTenantFlow::default())), delete_progress: Arc::new(tokio::sync::Mutex::new(DeleteTenantFlow::default())),
cancel: CancellationToken::default(), cancel: CancellationToken::default(),
gate: Gate::new(format!("Tenant<{tenant_shard_id}>")), gate: Gate::new(format!("Tenant<{tenant_shard_id}>")),
@@ -3059,6 +3197,7 @@ impl Tenant {
storage, storage,
&self.tenant_shard_id, &self.tenant_shard_id,
&existing_initdb_timeline_id, &existing_initdb_timeline_id,
&self.cancel,
) )
.await .await
.context("download initdb tar")?; .context("download initdb tar")?;
@@ -3099,6 +3238,7 @@ impl Tenant {
&timeline_id, &timeline_id,
pgdata_zstd.try_clone().await?, pgdata_zstd.try_clone().await?,
tar_zst_size, tar_zst_size,
&self.cancel,
) )
.await .await
}, },
@@ -3106,9 +3246,7 @@ impl Tenant {
3, 3,
u32::MAX, u32::MAX,
"persist_initdb_tar_zst", "persist_initdb_tar_zst",
backoff::Cancel::new(self.cancel.clone(), || { backoff::Cancel::new(self.cancel.clone(), || anyhow::anyhow!("Cancelled")),
anyhow::anyhow!("initdb upload cancelled")
}),
) )
.await?; .await?;
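
Review note on `activate_now_sem`: the semaphore starts with zero permits, so the attach task blocks on it until some caller adds a permit. The sketch below shows that idea with a small blocking semaphore built from `std::sync` primitives. This is a swapped-in stand-in, not `tokio::sync::Semaphore`, and `SketchSemaphore` is a hypothetical type introduced only for illustration.

```rust
use std::sync::{Condvar, Mutex};

/// Minimal blocking counting semaphore (a std-only stand-in for the
/// async semaphore used in the diff; hypothetical type).
pub struct SketchSemaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl SketchSemaphore {
    /// Create a semaphore holding `permits` units. Passing 0 means every
    /// waiter blocks until `add_permits` is called.
    pub fn new(permits: usize) -> Self {
        Self {
            permits: Mutex::new(permits),
            available: Condvar::new(),
        }
    }

    /// Add units and wake any blocked waiters (the "activate now" signal).
    pub fn add_permits(&self, n: usize) {
        let mut permits = self.permits.lock().unwrap();
        *permits += n;
        self.available.notify_all();
    }

    /// Take one unit if available, without blocking.
    pub fn try_acquire(&self) -> bool {
        let mut permits = self.permits.lock().unwrap();
        if *permits > 0 {
            *permits -= 1;
            true
        } else {
            false
        }
    }

    /// Block until a unit is available, then take it.
    pub fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }
}
```

With zero initial permits, `try_acquire` fails until `add_permits(1)` is called, which matches the doc comment on `activate_now` in the diff: as soon as a unit is added, a waiter can proceed.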

View File

@@ -71,6 +71,7 @@ async fn create_remote_delete_mark(
conf: &PageServerConf, conf: &PageServerConf,
remote_storage: &GenericRemoteStorage, remote_storage: &GenericRemoteStorage,
tenant_shard_id: &TenantShardId, tenant_shard_id: &TenantShardId,
cancel: &CancellationToken,
) -> Result<(), DeleteTenantError> { ) -> Result<(), DeleteTenantError> {
let remote_mark_path = remote_tenant_delete_mark_path(conf, tenant_shard_id)?; let remote_mark_path = remote_tenant_delete_mark_path(conf, tenant_shard_id)?;
@@ -87,8 +88,7 @@ async fn create_remote_delete_mark(
FAILED_UPLOAD_WARN_THRESHOLD, FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES, FAILED_REMOTE_OP_RETRIES,
"mark_upload", "mark_upload",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066) backoff::Cancel::new(cancel.clone(), || anyhow::anyhow!("Cancelled")),
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
) )
.await .await
.context("mark_upload")?; .context("mark_upload")?;
@@ -170,6 +170,7 @@ async fn remove_tenant_remote_delete_mark(
conf: &PageServerConf, conf: &PageServerConf,
remote_storage: Option<&GenericRemoteStorage>, remote_storage: Option<&GenericRemoteStorage>,
tenant_shard_id: &TenantShardId, tenant_shard_id: &TenantShardId,
cancel: &CancellationToken,
) -> Result<(), DeleteTenantError> { ) -> Result<(), DeleteTenantError> {
if let Some(remote_storage) = remote_storage { if let Some(remote_storage) = remote_storage {
let path = remote_tenant_delete_mark_path(conf, tenant_shard_id)?; let path = remote_tenant_delete_mark_path(conf, tenant_shard_id)?;
@@ -179,8 +180,7 @@ async fn remove_tenant_remote_delete_mark(
FAILED_UPLOAD_WARN_THRESHOLD, FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES, FAILED_REMOTE_OP_RETRIES,
"remove_tenant_remote_delete_mark", "remove_tenant_remote_delete_mark",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066) backoff::Cancel::new(cancel.clone(), || anyhow::anyhow!("Cancelled")),
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
) )
.await .await
.context("remove_tenant_remote_delete_mark")?; .context("remove_tenant_remote_delete_mark")?;
@@ -322,9 +322,15 @@ impl DeleteTenantFlow {
// Though sounds scary, different mark name? // Though sounds scary, different mark name?
// Detach currently uses remove_dir_all so in case of a crash we can end up in a weird state. // Detach currently uses remove_dir_all so in case of a crash we can end up in a weird state.
if let Some(remote_storage) = &remote_storage { if let Some(remote_storage) = &remote_storage {
create_remote_delete_mark(conf, remote_storage, &tenant.tenant_shard_id) create_remote_delete_mark(
.await conf,
.context("remote_mark")? remote_storage,
&tenant.tenant_shard_id,
// Can't use tenant.cancel, it's already shut down. TODO: wire in an appropriate token
&CancellationToken::new(),
)
.await
.context("remote_mark")?
} }
fail::fail_point!("tenant-delete-before-create-local-mark", |_| { fail::fail_point!("tenant-delete-before-create-local-mark", |_| {
@@ -524,8 +530,14 @@ impl DeleteTenantFlow {
.context("timelines dir not empty")?; .context("timelines dir not empty")?;
} }
remove_tenant_remote_delete_mark(conf, remote_storage.as_ref(), &tenant.tenant_shard_id) remove_tenant_remote_delete_mark(
.await?; conf,
remote_storage.as_ref(),
&tenant.tenant_shard_id,
// Can't use tenant.cancel, it's already shut down. TODO: wire in an appropriate token
&CancellationToken::new(),
)
.await?;
fail::fail_point!("tenant-delete-before-cleanup-remaining-fs-traces", |_| { fail::fail_point!("tenant-delete-before-cleanup-remaining-fs-traces", |_| {
Err(anyhow::anyhow!( Err(anyhow::anyhow!(

View File

@@ -28,7 +28,7 @@ use crate::control_plane_client::{
ControlPlaneClient, ControlPlaneGenerationsApi, RetryForeverError, ControlPlaneClient, ControlPlaneGenerationsApi, RetryForeverError,
}; };
use crate::deletion_queue::DeletionQueueClient; use crate::deletion_queue::DeletionQueueClient;
use crate::metrics::TENANT_MANAGER as METRICS; use crate::metrics::{TENANT, TENANT_MANAGER as METRICS};
use crate::task_mgr::{self, TaskKind}; use crate::task_mgr::{self, TaskKind};
use crate::tenant::config::{ use crate::tenant::config::{
AttachedLocationConfig, AttachmentMode, LocationConf, LocationMode, TenantConfOpt, AttachedLocationConfig, AttachmentMode, LocationConf, LocationMode, TenantConfOpt,
@@ -44,7 +44,6 @@ use utils::generation::Generation;
use utils::id::{TenantId, TimelineId}; use utils::id::{TenantId, TimelineId};
use super::delete::DeleteTenantError; use super::delete::DeleteTenantError;
use super::timeline::delete::DeleteTimelineFlow;
use super::TenantSharedResources; use super::TenantSharedResources;
/// For a tenant that appears in TenantsMap, it may either be /// For a tenant that appears in TenantsMap, it may either be
@@ -430,6 +429,13 @@ pub async fn init_tenant_mgr(
let tenant_generations = let tenant_generations =
init_load_generations(conf, &tenant_configs, &resources, &cancel).await?; init_load_generations(conf, &tenant_configs, &resources, &cancel).await?;
tracing::info!(
"Attaching {} tenants at startup, warming up {} at a time",
tenant_configs.len(),
conf.concurrent_tenant_warmup.initial_permits()
);
TENANT.startup_scheduled.inc_by(tenant_configs.len() as u64);
// Construct `Tenant` objects and start them running // Construct `Tenant` objects and start them running
for (tenant_shard_id, location_conf) in tenant_configs { for (tenant_shard_id, location_conf) in tenant_configs {
let tenant_dir_path = conf.tenant_path(&tenant_shard_id); let tenant_dir_path = conf.tenant_path(&tenant_shard_id);
@@ -508,10 +514,7 @@ pub async fn init_tenant_mgr(
&ctx, &ctx,
) { ) {
Ok(tenant) => { Ok(tenant) => {
tenants.insert( tenants.insert(tenant_shard_id, TenantSlot::Attached(tenant));
TenantShardId::unsharded(tenant.tenant_id()),
TenantSlot::Attached(tenant),
);
} }
Err(e) => { Err(e) => {
error!(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), "Failed to start tenant: {e:#}"); error!(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), "Failed to start tenant: {e:#}");
@@ -848,17 +851,6 @@ impl TenantManager {
} }
} }
pub(crate) async fn delete_timeline(
&self,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
_ctx: &RequestContext,
) -> Result<(), DeleteTimelineError> {
let tenant = self.get_attached_tenant_shard(tenant_shard_id, true)?;
DeleteTimelineFlow::run(&tenant, timeline_id, false).await?;
Ok(())
}
#[instrument(skip_all, fields(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug()))] #[instrument(skip_all, fields(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug()))]
pub(crate) async fn upsert_location( pub(crate) async fn upsert_location(
&self, &self,
@@ -967,35 +959,27 @@ impl TenantManager {
} }
let tenant_path = self.conf.tenant_path(&tenant_shard_id); let tenant_path = self.conf.tenant_path(&tenant_shard_id);
let timelines_path = self.conf.timelines_path(&tenant_shard_id);
// Directory structure is the same for attached and secondary modes:
// create it if it doesn't exist. Timeline load/creation expects the
// timelines/ subdir to already exist.
//
// Does not need to be fsync'd because local storage is just a cache.
tokio::fs::create_dir_all(&timelines_path)
.await
.with_context(|| format!("Creating {timelines_path}"))?;
// Before activating either secondary or attached mode, persist the
// configuration, so that on restart we will re-attach (or re-start
// secondary) on the tenant.
Tenant::persist_tenant_config(self.conf, &tenant_shard_id, &new_location_config)
.await
.map_err(SetNewTenantConfigError::Persist)?;
let new_slot = match &new_location_config.mode { let new_slot = match &new_location_config.mode {
LocationMode::Secondary(_) => { LocationMode::Secondary(_) => TenantSlot::Secondary,
// Directory doesn't need to be fsync'd because if we crash it can
// safely be recreated next time this tenant location is configured.
tokio::fs::create_dir_all(&tenant_path)
.await
.with_context(|| format!("Creating {tenant_path}"))?;
Tenant::persist_tenant_config(self.conf, &tenant_shard_id, &new_location_config)
.await
.map_err(SetNewTenantConfigError::Persist)?;
TenantSlot::Secondary
}
LocationMode::Attached(_attach_config) => { LocationMode::Attached(_attach_config) => {
let timelines_path = self.conf.timelines_path(&tenant_shard_id);
// Directory doesn't need to be fsync'd because we do not depend on
// it to exist after crashes: it may be recreated when tenant is
// re-attached, see https://github.com/neondatabase/neon/issues/5550
tokio::fs::create_dir_all(&tenant_path)
.await
.with_context(|| format!("Creating {timelines_path}"))?;
Tenant::persist_tenant_config(self.conf, &tenant_shard_id, &new_location_config)
.await
.map_err(SetNewTenantConfigError::Persist)?;
let shard_identity = new_location_config.shard; let shard_identity = new_location_config.shard;
let tenant = tenant_spawn( let tenant = tenant_spawn(
self.conf, self.conf,
@@ -1221,7 +1205,10 @@ pub(crate) async fn get_active_tenant_with_timeout(
// Fast path: we don't need to do any async waiting. // Fast path: we don't need to do any async waiting.
return Ok(tenant.clone()); return Ok(tenant.clone());
} }
_ => (WaitFor::Tenant(tenant.clone()), tenant_shard_id), _ => {
tenant.activate_now();
(WaitFor::Tenant(tenant.clone()), tenant_shard_id)
}
} }
} }
Some(TenantSlot::Secondary) => { Some(TenantSlot::Secondary) => {
@@ -1275,28 +1262,10 @@ pub(crate) async fn get_active_tenant_with_timeout(
}; };
tracing::debug!("Waiting for tenant to enter active state..."); tracing::debug!("Waiting for tenant to enter active state...");
match timeout_cancellable( tenant
deadline.duration_since(Instant::now()), .wait_to_become_active(deadline.duration_since(Instant::now()))
cancel, .await?;
tenant.wait_to_become_active(), Ok(tenant)
)
.await
{
Ok(Ok(())) => Ok(tenant),
Ok(Err(e)) => Err(e),
Err(TimeoutCancellableError::Timeout) => {
let latest_state = tenant.current_state();
if latest_state == TenantState::Active {
Ok(tenant)
} else {
Err(GetActiveTenantError::WaitForActiveTimeout {
latest_state: Some(latest_state),
wait_time: timeout,
})
}
}
Err(TimeoutCancellableError::Cancelled) => Err(GetActiveTenantError::Cancelled),
}
} }
pub(crate) async fn delete_tenant( pub(crate) async fn delete_tenant(
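
Review note on the `startup_scheduled` / `startup_complete` pair: the scheduled counter is bumped once up front, and the complete counter is bumped from a `scopeguard::defer!` block so that it fires on every exit path of the attach task. The sketch below shows the same pattern with a plain `Drop` guard and an atomic counter. `StartupCompleteGuard` and `run_attach_sketch` are hypothetical names, and the atomic stands in for the Prometheus counter.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Increments `counter` when dropped, but only for tenants that were
/// spawned as part of process startup. (Hypothetical stand-in for the
/// `scopeguard::defer!` block in the diff.)
struct StartupCompleteGuard<'a> {
    counter: &'a AtomicU64,
    starting_up: bool,
}

impl Drop for StartupCompleteGuard<'_> {
    fn drop(&mut self) {
        if self.starting_up {
            self.counter.fetch_add(1, Ordering::Relaxed);
        }
    }
}

/// Simulated attach task: the guard runs whether we return early with an
/// error or fall through to the end. (Hypothetical function.)
fn run_attach_sketch(
    complete: &AtomicU64,
    starting_up: bool,
    fail_early: bool,
) -> Result<(), String> {
    let _guard = StartupCompleteGuard {
        counter: complete,
        starting_up,
    };
    if fail_early {
        return Err("attach failed".to_string());
    }
    Ok(())
}
```

The point of the guard is that nobody has to remember to increment the counter before each `return`, which matters in the attach task because it has several early exits.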

View File

@@ -196,10 +196,12 @@ pub(crate) use upload::upload_initdb_dir;
use utils::backoff::{ use utils::backoff::{
self, exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS, self, exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS,
}; };
use utils::timeout::{timeout_cancellable, TimeoutCancellableError};
use std::collections::{HashMap, VecDeque}; use std::collections::{HashMap, VecDeque};
use std::sync::atomic::{AtomicU32, Ordering}; use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Mutex}; use std::sync::{Arc, Mutex};
use std::time::Duration;
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath}; use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
use std::ops::DerefMut; use std::ops::DerefMut;
@@ -316,6 +318,47 @@ pub struct RemoteTimelineClient {
storage_impl: GenericRemoteStorage, storage_impl: GenericRemoteStorage,
deletion_queue_client: DeletionQueueClient, deletion_queue_client: DeletionQueueClient,
cancel: CancellationToken,
}
/// This timeout is intended to deal with hangs in lower layers, e.g. stuck TCP flows. It is not
/// intended to be snappy enough for prompt shutdown, as we have a CancellationToken for that.
const UPLOAD_TIMEOUT: Duration = Duration::from_secs(120);
const DOWNLOAD_TIMEOUT: Duration = Duration::from_secs(120);
/// Wrapper for timeout_cancellable that flattens result and converts TimeoutCancellableError to anyhow.
///
/// This is a convenience for the various upload functions. In future
/// the anyhow::Error result should be replaced with a more structured type that
/// enables callers to avoid handling shutdown as an error.
async fn upload_cancellable<F>(cancel: &CancellationToken, future: F) -> anyhow::Result<()>
where
F: std::future::Future<Output = anyhow::Result<()>>,
{
match timeout_cancellable(UPLOAD_TIMEOUT, cancel, future).await {
Ok(Ok(())) => Ok(()),
Ok(Err(e)) => Err(e),
Err(TimeoutCancellableError::Timeout) => Err(anyhow::anyhow!("Timeout")),
Err(TimeoutCancellableError::Cancelled) => Err(anyhow::anyhow!("Shutting down")),
}
}
/// Wrapper for timeout_cancellable that flattens result and converts TimeoutCancellableError to DownloadError.
async fn download_cancellable<F, R>(
cancel: &CancellationToken,
future: F,
) -> Result<R, DownloadError>
where
F: std::future::Future<Output = Result<R, DownloadError>>,
{
match timeout_cancellable(DOWNLOAD_TIMEOUT, cancel, future).await {
Ok(Ok(r)) => Ok(r),
Ok(Err(e)) => Err(e),
Err(TimeoutCancellableError::Timeout) => {
Err(DownloadError::Other(anyhow::anyhow!("Timed out")))
}
Err(TimeoutCancellableError::Cancelled) => Err(DownloadError::Cancelled),
}
} }
impl RemoteTimelineClient { impl RemoteTimelineClient {
@@ -351,6 +394,7 @@ impl RemoteTimelineClient {
&tenant_shard_id, &tenant_shard_id,
&timeline_id, &timeline_id,
)), )),
cancel: CancellationToken::new(),
} }
} }
@@ -501,6 +545,7 @@ impl RemoteTimelineClient {
&self, &self,
layer_file_name: &LayerFileName, layer_file_name: &LayerFileName,
layer_metadata: &LayerFileMetadata, layer_metadata: &LayerFileMetadata,
cancel: &CancellationToken,
) -> anyhow::Result<u64> { ) -> anyhow::Result<u64> {
let downloaded_size = { let downloaded_size = {
let _unfinished_gauge_guard = self.metrics.call_begin( let _unfinished_gauge_guard = self.metrics.call_begin(
@@ -517,6 +562,7 @@ impl RemoteTimelineClient {
self.timeline_id, self.timeline_id,
layer_file_name, layer_file_name,
layer_metadata, layer_metadata,
cancel,
) )
.measure_remote_op( .measure_remote_op(
self.tenant_shard_id.tenant_id, self.tenant_shard_id.tenant_id,
@@ -971,6 +1017,7 @@ impl RemoteTimelineClient {
&self.timeline_id, &self.timeline_id,
self.generation, self.generation,
&index_part_with_deleted_at, &index_part_with_deleted_at,
&self.cancel,
) )
}, },
|_e| false, |_e| false,
@@ -980,8 +1027,7 @@ impl RemoteTimelineClient {
// when executed as part of tenant deletion this happens in the background // when executed as part of tenant deletion this happens in the background
2, 2,
"persist_index_part_with_deleted_flag", "persist_index_part_with_deleted_flag",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066) backoff::Cancel::new(self.cancel.clone(), || anyhow::anyhow!("Cancelled")),
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
) )
.await?; .await?;
@@ -1281,6 +1327,7 @@ impl RemoteTimelineClient {
path, path,
layer_metadata, layer_metadata,
self.generation, self.generation,
&self.cancel,
) )
.measure_remote_op( .measure_remote_op(
self.tenant_shard_id.tenant_id, self.tenant_shard_id.tenant_id,
@@ -1307,6 +1354,7 @@ impl RemoteTimelineClient {
&self.timeline_id, &self.timeline_id,
self.generation, self.generation,
index_part, index_part,
&self.cancel,
) )
.measure_remote_op( .measure_remote_op(
self.tenant_shard_id.tenant_id, self.tenant_shard_id.tenant_id,
@@ -1828,6 +1876,7 @@ mod tests {
&self.harness.tenant_shard_id, &self.harness.tenant_shard_id,
&TIMELINE_ID, &TIMELINE_ID,
)), )),
cancel: CancellationToken::new(),
}) })
} }
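
Note on `upload_cancellable` / `download_cancellable`: both wrappers collapse the nested `Result` returned by `timeout_cancellable` into a single level, mapping a timeout to a generic error and a cancellation to a dedicated one. The sketch below shows only that flattening step, using std alone. The two enums are hypothetical stand-ins whose names mirror the diff; the `String` payload replaces `anyhow::Error` purely so the example needs no external crates.

```rust
// Hypothetical stand-ins for the real types; the definitions are assumptions
// made for illustration and are not copied from the pageserver crates.
#[derive(Debug, PartialEq)]
enum TimeoutCancellableError {
    Timeout,
    Cancelled,
}

#[derive(Debug, PartialEq)]
enum DownloadError {
    Cancelled,
    Other(String),
}

/// Collapse the nested result of a timeout-and-cancel wrapper into one level.
fn flatten_download<R>(
    outcome: Result<Result<R, DownloadError>, TimeoutCancellableError>,
) -> Result<R, DownloadError> {
    match outcome {
        // The inner future finished in time: pass its own result through.
        Ok(inner) => inner,
        // The deadline elapsed first: report a generic error.
        Err(TimeoutCancellableError::Timeout) => {
            Err(DownloadError::Other("Timed out".to_string()))
        }
        // Shutdown was requested: keep this distinguishable from failures.
        Err(TimeoutCancellableError::Cancelled) => Err(DownloadError::Cancelled),
    }
}
```

Keeping `Cancelled` as its own variant lets callers skip retries and error logging during shutdown, which the upload path cannot yet do because it still returns `anyhow::Error`.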


@@ -5,7 +5,6 @@
use std::collections::HashSet; use std::collections::HashSet;
use std::future::Future; use std::future::Future;
use std::time::Duration;
use anyhow::{anyhow, Context}; use anyhow::{anyhow, Context};
use camino::{Utf8Path, Utf8PathBuf}; use camino::{Utf8Path, Utf8PathBuf};
@@ -14,13 +13,17 @@ use tokio::fs::{self, File, OpenOptions};
use tokio::io::{AsyncSeekExt, AsyncWriteExt}; use tokio::io::{AsyncSeekExt, AsyncWriteExt};
use tokio_util::sync::CancellationToken; use tokio_util::sync::CancellationToken;
use tracing::warn; use tracing::warn;
use utils::timeout::timeout_cancellable;
use utils::{backoff, crashsafe}; use utils::{backoff, crashsafe};
use crate::config::PageServerConf; use crate::config::PageServerConf;
use crate::tenant::remote_timeline_client::{remote_layer_path, remote_timelines_path}; use crate::tenant::remote_timeline_client::{
download_cancellable, remote_layer_path, remote_timelines_path, DOWNLOAD_TIMEOUT,
};
use crate::tenant::storage_layer::LayerFileName; use crate::tenant::storage_layer::LayerFileName;
use crate::tenant::timeline::span::debug_assert_current_span_has_tenant_and_timeline_id; use crate::tenant::timeline::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::tenant::Generation; use crate::tenant::Generation;
use crate::virtual_file::on_fatal_io_error;
use crate::TEMP_FILE_SUFFIX; use crate::TEMP_FILE_SUFFIX;
use remote_storage::{DownloadError, GenericRemoteStorage, ListingMode}; use remote_storage::{DownloadError, GenericRemoteStorage, ListingMode};
use utils::crashsafe::path_with_suffix_extension; use utils::crashsafe::path_with_suffix_extension;
@@ -32,8 +35,6 @@ use super::{
FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_REMOTE_OP_RETRIES, INITDB_PATH, FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_REMOTE_OP_RETRIES, INITDB_PATH,
}; };
static MAX_DOWNLOAD_DURATION: Duration = Duration::from_secs(120);
/// ///
/// If 'metadata' is given, we will validate that the downloaded file's size matches that /// If 'metadata' is given, we will validate that the downloaded file's size matches that
/// in the metadata. (In the future, we might do more cross-checks, like CRC validation) /// in the metadata. (In the future, we might do more cross-checks, like CRC validation)
@@ -46,6 +47,7 @@ pub async fn download_layer_file<'a>(
timeline_id: TimelineId, timeline_id: TimelineId,
layer_file_name: &'a LayerFileName, layer_file_name: &'a LayerFileName,
layer_metadata: &'a LayerFileMetadata, layer_metadata: &'a LayerFileMetadata,
cancel: &CancellationToken,
) -> Result<u64, DownloadError> { ) -> Result<u64, DownloadError> {
debug_assert_current_span_has_tenant_and_timeline_id(); debug_assert_current_span_has_tenant_and_timeline_id();
@@ -73,14 +75,18 @@ pub async fn download_layer_file<'a>(
// If pageserver crashes the temp file will be deleted on startup and re-downloaded. // If pageserver crashes the temp file will be deleted on startup and re-downloaded.
let temp_file_path = path_with_suffix_extension(&local_path, TEMP_DOWNLOAD_EXTENSION); let temp_file_path = path_with_suffix_extension(&local_path, TEMP_DOWNLOAD_EXTENSION);
let cancel_inner = cancel.clone();
let (mut destination_file, bytes_amount) = download_retry( let (mut destination_file, bytes_amount) = download_retry(
|| async { || async {
let destination_file = tokio::fs::File::create(&temp_file_path) let destination_file = tokio::fs::File::create(&temp_file_path)
.await .await
.with_context(|| format!("create a destination file for layer '{temp_file_path}'")) .with_context(|| format!("create a destination file for layer '{temp_file_path}'"))
.map_err(DownloadError::Other)?; .map_err(DownloadError::Other)?;
let download = storage
.download(&remote_path) // Cancellation safety: it is safe to cancel this future, because it isn't writing to a local
// file: the write to local file doesn't start until after the request header is returned
// and we start draining the body stream below
let download = download_cancellable(&cancel_inner, storage.download(&remote_path))
.await .await
.with_context(|| { .with_context(|| {
format!( format!(
@@ -94,12 +100,33 @@ pub async fn download_layer_file<'a>(
let mut reader = tokio_util::io::StreamReader::new(download.download_stream); let mut reader = tokio_util::io::StreamReader::new(download.download_stream);
let bytes_amount = tokio::time::timeout( // Cancellation safety: it is safe to cancel this future because it is writing into a temporary file,
MAX_DOWNLOAD_DURATION, // and we will unlink the temporary file if there is an error. This unlink is important because we
// are in a retry loop, and we wouldn't want to leave behind a rogue write I/O to a file that
// we will imminently try to write to again.
let bytes_amount: u64 = match timeout_cancellable(
DOWNLOAD_TIMEOUT,
&cancel_inner,
tokio::io::copy_buf(&mut reader, &mut destination_file), tokio::io::copy_buf(&mut reader, &mut destination_file),
) )
.await .await
.map_err(|e| DownloadError::Other(anyhow::anyhow!("Timed out {:?}", e)))? .with_context(|| {
format!(
"download layer at remote path '{remote_path:?}' into file {temp_file_path:?}"
)
})
.map_err(DownloadError::Other)?
{
Ok(b) => Ok(b),
Err(e) => {
// Remove incomplete files: on restart Timeline would do this anyway, but we must
// do it here for the retry case.
if let Err(e) = tokio::fs::remove_file(&temp_file_path).await {
on_fatal_io_error(&e, &format!("Removing temporary file {temp_file_path}"));
}
Err(e)
}
}
.with_context(|| { .with_context(|| {
format!( format!(
"download layer at remote path '{remote_path:?}' into file {temp_file_path:?}" "download layer at remote path '{remote_path:?}' into file {temp_file_path:?}"
@@ -112,6 +139,7 @@ pub async fn download_layer_file<'a>(
Ok((destination_file, bytes_amount)) Ok((destination_file, bytes_amount))
}, },
&format!("download {remote_path:?}"), &format!("download {remote_path:?}"),
cancel,
) )
.await?; .await?;
@@ -188,8 +216,14 @@ pub async fn list_remote_timelines(
anyhow::bail!("storage-sync-list-remote-timelines"); anyhow::bail!("storage-sync-list-remote-timelines");
}); });
let cancel_inner = cancel.clone();
let listing = download_retry_forever( let listing = download_retry_forever(
|| storage.list(Some(&remote_path), ListingMode::WithDelimiter), || {
download_cancellable(
&cancel_inner,
storage.list(Some(&remote_path), ListingMode::WithDelimiter),
)
},
&format!("list timelines for {tenant_shard_id}"), &format!("list timelines for {tenant_shard_id}"),
cancel, cancel,
) )
@@ -230,9 +264,13 @@ async fn do_download_index_part(
let remote_path = remote_index_path(tenant_shard_id, timeline_id, index_generation); let remote_path = remote_index_path(tenant_shard_id, timeline_id, index_generation);
let cancel_inner = cancel.clone();
let index_part_bytes = download_retry_forever( let index_part_bytes = download_retry_forever(
|| async { || async {
let index_part_download = storage.download(&remote_path).await?; // Cancellation: it is safe to cancel this future because we're just downloading into
// a memory buffer, not touching local disk.
let index_part_download =
download_cancellable(&cancel_inner, storage.download(&remote_path)).await?;
let mut index_part_bytes = Vec::new(); let mut index_part_bytes = Vec::new();
let mut stream = std::pin::pin!(index_part_download.download_stream); let mut stream = std::pin::pin!(index_part_download.download_stream);
@@ -347,10 +385,7 @@ pub(super) async fn download_index_part(
FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_DOWNLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES, FAILED_REMOTE_OP_RETRIES,
"listing index_part files", "listing index_part files",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066) backoff::Cancel::new(cancel.clone(), || anyhow::anyhow!("Cancelled")),
backoff::Cancel::new(CancellationToken::new(), || -> anyhow::Error {
unreachable!()
}),
) )
.await .await
.map_err(DownloadError::Other)?; .map_err(DownloadError::Other)?;
@@ -389,6 +424,7 @@ pub(crate) async fn download_initdb_tar_zst(
storage: &GenericRemoteStorage, storage: &GenericRemoteStorage,
tenant_shard_id: &TenantShardId, tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId, timeline_id: &TimelineId,
cancel: &CancellationToken,
) -> Result<(Utf8PathBuf, File), DownloadError> { ) -> Result<(Utf8PathBuf, File), DownloadError> {
debug_assert_current_span_has_tenant_and_timeline_id(); debug_assert_current_span_has_tenant_and_timeline_id();
@@ -406,6 +442,8 @@ pub(crate) async fn download_initdb_tar_zst(
"{INITDB_PATH}.download-{timeline_id}.{TEMP_FILE_SUFFIX}" "{INITDB_PATH}.download-{timeline_id}.{TEMP_FILE_SUFFIX}"
)); ));
let cancel_inner = cancel.clone();
let file = download_retry( let file = download_retry(
|| async { || async {
let file = OpenOptions::new() let file = OpenOptions::new()
@@ -418,10 +456,14 @@ pub(crate) async fn download_initdb_tar_zst(
.with_context(|| format!("tempfile creation {temp_path}")) .with_context(|| format!("tempfile creation {temp_path}"))
.map_err(DownloadError::Other)?; .map_err(DownloadError::Other)?;
let download = storage.download(&remote_path).await?; let download =
download_cancellable(&cancel_inner, storage.download(&remote_path)).await?;
let mut download = tokio_util::io::StreamReader::new(download.download_stream); let mut download = tokio_util::io::StreamReader::new(download.download_stream);
let mut writer = tokio::io::BufWriter::with_capacity(8 * 1024, file); let mut writer = tokio::io::BufWriter::with_capacity(8 * 1024, file);
// TODO: this consumption of the response body should be subject to timeout + cancellation, but
// not without thinking carefully about how to recover safely from cancelling a write to
// local storage (e.g. by writing into a temp file as we do in download_layer)
tokio::io::copy_buf(&mut download, &mut writer) tokio::io::copy_buf(&mut download, &mut writer)
.await .await
.with_context(|| format!("download initdb.tar.zst at {remote_path:?}")) .with_context(|| format!("download initdb.tar.zst at {remote_path:?}"))
@@ -437,6 +479,7 @@ pub(crate) async fn download_initdb_tar_zst(
Ok(file) Ok(file)
}, },
&format!("download {remote_path}"), &format!("download {remote_path}"),
cancel,
) )
.await .await
.map_err(|e| { .map_err(|e| {
@@ -460,7 +503,11 @@ pub(crate) async fn download_initdb_tar_zst(
/// with backoff. /// with backoff.
/// ///
/// (See similar logic for uploads in `perform_upload_task`) /// (See similar logic for uploads in `perform_upload_task`)
async fn download_retry<T, O, F>(op: O, description: &str) -> Result<T, DownloadError> async fn download_retry<T, O, F>(
op: O,
description: &str,
cancel: &CancellationToken,
) -> Result<T, DownloadError>
where where
O: FnMut() -> F, O: FnMut() -> F,
F: Future<Output = Result<T, DownloadError>>, F: Future<Output = Result<T, DownloadError>>,
@@ -471,10 +518,7 @@ where
FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_DOWNLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES, FAILED_REMOTE_OP_RETRIES,
description, description,
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066) backoff::Cancel::new(cancel.clone(), || DownloadError::Cancelled),
backoff::Cancel::new(CancellationToken::new(), || -> DownloadError {
unreachable!()
}),
) )
.await .await
} }
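
Note on the temp-file handling in `download_layer_file`: the new code unlinks the partially written temporary file when the body copy fails, so the next iteration of the retry loop starts from a clean file. A minimal synchronous sketch of that pattern, using std only, might look as follows. The names `copy_to_temp_or_cleanup` and `FailingReader` are hypothetical, and unlike the diff (which calls `on_fatal_io_error` when the unlink fails) this sketch ignores a failed removal.

```rust
use std::fs;
use std::io::{self, Read, Write};
use std::path::Path;

/// Copy `source` into a temporary file at `temp_path`. If the copy fails,
/// unlink the partial file so that a later retry starts from a clean slate.
fn copy_to_temp_or_cleanup<R: Read>(mut source: R, temp_path: &Path) -> io::Result<u64> {
    let mut file = fs::File::create(temp_path)?;
    match io::copy(&mut source, &mut file) {
        Ok(bytes) => {
            file.flush()?;
            Ok(bytes)
        }
        Err(copy_err) => {
            // Close the handle before unlinking the incomplete file.
            drop(file);
            // Best effort only: the copy error is what the caller needs to see.
            let _ = fs::remove_file(temp_path);
            Err(copy_err)
        }
    }
}

/// A reader that always fails, standing in for a broken download stream.
struct FailingReader;

impl Read for FailingReader {
    fn read(&mut self, _buf: &mut [u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::Other, "simulated network failure"))
    }
}
```

Calling `copy_to_temp_or_cleanup(FailingReader, path)` returns the simulated error and leaves no file behind at `path`, which is the property the retry loop relies on.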


@@ -7,12 +7,14 @@ use pageserver_api::shard::TenantShardId;
use std::io::{ErrorKind, SeekFrom}; use std::io::{ErrorKind, SeekFrom};
use tokio::fs::{self, File}; use tokio::fs::{self, File};
use tokio::io::AsyncSeekExt; use tokio::io::AsyncSeekExt;
use tokio_util::sync::CancellationToken;
use super::Generation; use super::Generation;
use crate::{ use crate::{
config::PageServerConf, config::PageServerConf,
tenant::remote_timeline_client::{ tenant::remote_timeline_client::{
index::IndexPart, remote_index_path, remote_initdb_archive_path, remote_path, index::IndexPart, remote_index_path, remote_initdb_archive_path, remote_path,
upload_cancellable,
}, },
}; };
use remote_storage::GenericRemoteStorage; use remote_storage::GenericRemoteStorage;
@@ -29,6 +31,7 @@ pub(super) async fn upload_index_part<'a>(
timeline_id: &TimelineId, timeline_id: &TimelineId,
generation: Generation, generation: Generation,
index_part: &'a IndexPart, index_part: &'a IndexPart,
cancel: &CancellationToken,
) -> anyhow::Result<()> { ) -> anyhow::Result<()> {
tracing::trace!("uploading new index part"); tracing::trace!("uploading new index part");
@@ -44,14 +47,16 @@ pub(super) async fn upload_index_part<'a>(
let index_part_bytes = bytes::Bytes::from(index_part_bytes); let index_part_bytes = bytes::Bytes::from(index_part_bytes);
let remote_path = remote_index_path(tenant_shard_id, timeline_id, generation); let remote_path = remote_index_path(tenant_shard_id, timeline_id, generation);
storage upload_cancellable(
.upload_storage_object( cancel,
storage.upload_storage_object(
futures::stream::once(futures::future::ready(Ok(index_part_bytes))), futures::stream::once(futures::future::ready(Ok(index_part_bytes))),
index_part_size, index_part_size,
&remote_path, &remote_path,
) ),
.await )
.with_context(|| format!("upload index part for '{tenant_shard_id} / {timeline_id}'")) .await
.with_context(|| format!("upload index part for '{tenant_shard_id} / {timeline_id}'"))
} }
/// Attempts to upload given layer files. /// Attempts to upload given layer files.
@@ -64,6 +69,7 @@ pub(super) async fn upload_timeline_layer<'a>(
source_path: &'a Utf8Path, source_path: &'a Utf8Path,
known_metadata: &'a LayerFileMetadata, known_metadata: &'a LayerFileMetadata,
generation: Generation, generation: Generation,
cancel: &CancellationToken,
) -> anyhow::Result<()> { ) -> anyhow::Result<()> {
fail_point!("before-upload-layer", |_| { fail_point!("before-upload-layer", |_| {
bail!("failpoint before-upload-layer") bail!("failpoint before-upload-layer")
@@ -107,8 +113,7 @@ pub(super) async fn upload_timeline_layer<'a>(
let reader = tokio_util::io::ReaderStream::with_capacity(source_file, super::BUFFER_SIZE); let reader = tokio_util::io::ReaderStream::with_capacity(source_file, super::BUFFER_SIZE);
storage upload_cancellable(cancel, storage.upload(reader, fs_size, &storage_path, None))
.upload(reader, fs_size, &storage_path, None)
.await .await
.with_context(|| format!("upload layer from local path '{source_path}'"))?; .with_context(|| format!("upload layer from local path '{source_path}'"))?;
@@ -122,6 +127,7 @@ pub(crate) async fn upload_initdb_dir(
timeline_id: &TimelineId, timeline_id: &TimelineId,
mut initdb_tar_zst: File, mut initdb_tar_zst: File,
size: u64, size: u64,
cancel: &CancellationToken,
) -> anyhow::Result<()> { ) -> anyhow::Result<()> {
tracing::trace!("uploading initdb dir"); tracing::trace!("uploading initdb dir");
@@ -131,8 +137,10 @@ pub(crate) async fn upload_initdb_dir(
let file = tokio_util::io::ReaderStream::with_capacity(initdb_tar_zst, super::BUFFER_SIZE); let file = tokio_util::io::ReaderStream::with_capacity(initdb_tar_zst, super::BUFFER_SIZE);
let remote_path = remote_initdb_archive_path(tenant_id, timeline_id); let remote_path = remote_initdb_archive_path(tenant_id, timeline_id);
storage upload_cancellable(
.upload_storage_object(file, size as usize, &remote_path) cancel,
.await storage.upload_storage_object(file, size as usize, &remote_path),
.with_context(|| format!("upload initdb dir for '{tenant_id} / {timeline_id}'")) )
.await
.with_context(|| format!("upload initdb dir for '{tenant_id} / {timeline_id}'"))
} }


@@ -259,8 +259,9 @@ impl Layer {
layer layer
.get_value_reconstruct_data(key, lsn_range, reconstruct_data, &self.0, ctx) .get_value_reconstruct_data(key, lsn_range, reconstruct_data, &self.0, ctx)
.instrument(tracing::info_span!("get_value_reconstruct_data", layer=%self)) .instrument(tracing::debug_span!("get_value_reconstruct_data", layer=%self))
.await .await
.with_context(|| format!("get_value_reconstruct_data for layer {self}"))
} }
/// Download the layer if evicted. /// Download the layer if evicted.
@@ -654,7 +655,6 @@ impl LayerInner {
} }
/// Cancellation safe. /// Cancellation safe.
#[tracing::instrument(skip_all, fields(layer=%self))]
async fn get_or_maybe_download( async fn get_or_maybe_download(
self: &Arc<Self>, self: &Arc<Self>,
allow_download: bool, allow_download: bool,
@@ -663,95 +663,101 @@ impl LayerInner {
let mut init_permit = None; let mut init_permit = None;
loop { loop {
let download = move |permit| async move { let download = move |permit| {
// disable any scheduled but not yet running eviction deletions for this async move {
let next_version = 1 + self.version.fetch_add(1, Ordering::Relaxed); // disable any scheduled but not yet running eviction deletions for this
let next_version = 1 + self.version.fetch_add(1, Ordering::Relaxed);
// count cancellations, which currently remain largely unexpected // count cancellations, which currently remain largely unexpected
let init_cancelled = let init_cancelled =
scopeguard::guard((), |_| LAYER_IMPL_METRICS.inc_init_cancelled()); scopeguard::guard((), |_| LAYER_IMPL_METRICS.inc_init_cancelled());
// no need to make the evict_and_wait wait for the actual download to complete // no need to make the evict_and_wait wait for the actual download to complete
drop(self.status.send(Status::Downloaded)); drop(self.status.send(Status::Downloaded));
let timeline = self let timeline = self
.timeline .timeline
.upgrade() .upgrade()
.ok_or_else(|| DownloadError::TimelineShutdown)?; .ok_or_else(|| DownloadError::TimelineShutdown)?;
// FIXME: grab a gate // FIXME: grab a gate
let can_ever_evict = timeline.remote_client.as_ref().is_some(); let can_ever_evict = timeline.remote_client.as_ref().is_some();
// check if we really need to be downloaded; could have been already downloaded by a // check if we really need to be downloaded; could have been already downloaded by a
// cancelled previous attempt. // cancelled previous attempt.
let needs_download = self let needs_download = self
.needs_download() .needs_download()
.await .await
.map_err(DownloadError::PreStatFailed)?; .map_err(DownloadError::PreStatFailed)?;
let permit = if let Some(reason) = needs_download { let permit = if let Some(reason) = needs_download {
if let NeedsDownload::NotFile(ft) = reason { if let NeedsDownload::NotFile(ft) = reason {
return Err(DownloadError::NotFile(ft)); return Err(DownloadError::NotFile(ft));
}
// only reset this after we've decided we really need to download. otherwise it'd
// be impossible to mark cancelled downloads for eviction, like one could imagine
// we would like to do for prefetching which was not needed.
self.wanted_evicted.store(false, Ordering::Release);
if !can_ever_evict {
return Err(DownloadError::NoRemoteStorage);
}
if let Some(ctx) = ctx {
self.check_expected_download(ctx)?;
}
if !allow_download {
// this does look weird, but for LayerInner the "downloading" means also changing
// internal once related state ...
return Err(DownloadError::DownloadRequired);
}
tracing::info!(%reason, "downloading on-demand");
self.spawn_download_and_wait(timeline, permit).await?
} else {
// the file is present locally, probably by a previous but cancelled call to
// get_or_maybe_download. alternatively we might be running without remote storage.
LAYER_IMPL_METRICS.inc_init_needed_no_download();
permit
};
let since_last_eviction =
self.last_evicted_at.lock().unwrap().map(|ts| ts.elapsed());
if let Some(since_last_eviction) = since_last_eviction {
// FIXME: this will not always be recorded correctly until #6028 (the no
// download needed branch above)
LAYER_IMPL_METRICS.record_redownloaded_after(since_last_eviction);
} }
// only reset this after we've decided we really need to download. otherwise it'd let res = Arc::new(DownloadedLayer {
// be impossible to mark cancelled downloads for eviction, like one could imagine owner: Arc::downgrade(self),
// we would like to do for prefetching which was not needed. kind: tokio::sync::OnceCell::default(),
self.wanted_evicted.store(false, Ordering::Release); version: next_version,
});
if !can_ever_evict { self.access_stats.record_residence_event(
return Err(DownloadError::NoRemoteStorage); LayerResidenceStatus::Resident,
LayerResidenceEventReason::ResidenceChange,
);
let waiters = self.inner.initializer_count();
if waiters > 0 {
tracing::info!(
waiters,
"completing the on-demand download for other tasks"
);
} }
if let Some(ctx) = ctx { scopeguard::ScopeGuard::into_inner(init_cancelled);
self.check_expected_download(ctx)?;
}
if !allow_download { Ok((ResidentOrWantedEvicted::Resident(res), permit))
// this does look weird, but for LayerInner the "downloading" means also changing
// internal once related state ...
return Err(DownloadError::DownloadRequired);
}
tracing::info!(%reason, "downloading on-demand");
self.spawn_download_and_wait(timeline, permit).await?
} else {
// the file is present locally, probably by a previous but cancelled call to
// get_or_maybe_download. alternatively we might be running without remote storage.
LAYER_IMPL_METRICS.inc_init_needed_no_download();
permit
};
let since_last_eviction =
self.last_evicted_at.lock().unwrap().map(|ts| ts.elapsed());
if let Some(since_last_eviction) = since_last_eviction {
// FIXME: this will not always be recorded correctly until #6028 (the no
// download needed branch above)
LAYER_IMPL_METRICS.record_redownloaded_after(since_last_eviction);
} }
.instrument(tracing::info_span!("get_or_maybe_download", layer=%self))
let res = Arc::new(DownloadedLayer {
owner: Arc::downgrade(self),
kind: tokio::sync::OnceCell::default(),
version: next_version,
});
self.access_stats.record_residence_event(
LayerResidenceStatus::Resident,
LayerResidenceEventReason::ResidenceChange,
);
let waiters = self.inner.initializer_count();
if waiters > 0 {
tracing::info!(waiters, "completing the on-demand download for other tasks");
}
scopeguard::ScopeGuard::into_inner(init_cancelled);
Ok((ResidentOrWantedEvicted::Resident(res), permit))
}; };
if let Some(init_permit) = init_permit.take() { if let Some(init_permit) = init_permit.take() {
@@ -862,6 +868,7 @@ impl LayerInner {
let result = client.download_layer_file( let result = client.download_layer_file(
&this.desc.filename(), &this.desc.filename(),
&this.metadata(), &this.metadata(),
&crate::task_mgr::shutdown_token()
) )
.await; .await;
@@ -871,6 +878,23 @@ impl LayerInner {
Ok(()) Ok(())
} }
Err(e) => { Err(e) => {
let consecutive_failures =
this.consecutive_failures.fetch_add(1, Ordering::Relaxed);
let backoff = utils::backoff::exponential_backoff_duration_seconds(
consecutive_failures.min(u32::MAX as usize) as u32,
1.5,
60.0,
);
let backoff = std::time::Duration::from_secs_f64(backoff);
tokio::select! {
_ = tokio::time::sleep(backoff) => {},
_ = crate::task_mgr::shutdown_token().cancelled_owned() => {},
_ = timeline.cancel.cancelled() => {},
};
Err(e) Err(e)
} }
}; };
@@ -919,21 +943,9 @@ impl LayerInner {
Ok(permit) Ok(permit)
} }
Ok((Err(e), _permit)) => { Ok((Err(e), _permit)) => {
// FIXME: this should be with the spawned task and be cancellation sensitive // sleep already happened in the spawned task, if it was not cancelled
// let consecutive_failures = self.consecutive_failures.load(Ordering::Relaxed);
// while we should not need this, this backoff has turned out to be useful with
// a bug of unexpectedly deleted remote layer file (#5787).
let consecutive_failures =
self.consecutive_failures.fetch_add(1, Ordering::Relaxed);
tracing::error!(consecutive_failures, "layer file download failed: {e:#}"); tracing::error!(consecutive_failures, "layer file download failed: {e:#}");
let backoff = utils::backoff::exponential_backoff_duration_seconds(
consecutive_failures.min(u32::MAX as usize) as u32,
1.5,
60.0,
);
let backoff = std::time::Duration::from_secs_f64(backoff);
tokio::time::sleep(backoff).await;
Err(DownloadError::DownloadFailed) Err(DownloadError::DownloadFailed)
} }
Err(_gone) => Err(DownloadError::DownloadCancelled), Err(_gone) => Err(DownloadError::DownloadCancelled),
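
Note on the relocated backoff: the spawned task now derives its sleep from the `consecutive_failures` counter with a base increment of 1.5 and a cap of 60 seconds. The sketch below shows how such a counter could be turned into a `Duration`. The formula in `backoff_seconds` is an assumption chosen for illustration (no delay for the first attempt, then exponential growth up to the cap); it is not claimed to be what `utils::backoff::exponential_backoff_duration_seconds` computes. Both function names are hypothetical.

```rust
use std::time::Duration;

/// Assumed backoff curve: zero delay for n == 0, then (1 + base_increment)^n
/// seconds, capped at `max_seconds`. Illustrative only.
fn backoff_seconds(n: u32, base_increment: f64, max_seconds: f64) -> f64 {
    if n == 0 {
        0.0
    } else {
        (1.0 + base_increment).powf(f64::from(n)).min(max_seconds)
    }
}

/// Convert a failure counter into a sleep duration, clamping the counter to
/// u32 in the same way the diff does before computing the delay.
fn backoff_for(consecutive_failures: usize) -> Duration {
    let n = consecutive_failures.min(u32::MAX as usize) as u32;
    Duration::from_secs_f64(backoff_seconds(n, 1.5, 60.0))
}
```

Under these assumptions the delay grows quickly and then stays at the 60 second cap, so even a very large failure count never produces an unbounded sleep. In the diff the sleep itself is raced against two cancellation tokens, so shutdown does not have to wait out the delay.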


@@ -1734,6 +1734,7 @@ impl Timeline {
self.current_logical_size.current_size().accuracy(), self.current_logical_size.current_size().accuracy(),
logical_size::Accuracy::Exact, logical_size::Accuracy::Exact,
); );
self.current_logical_size.initialized.add_permits(1);
return; return;
}; };
@@ -1779,6 +1780,11 @@ impl Timeline {
cancel: CancellationToken, cancel: CancellationToken,
background_ctx: RequestContext, background_ctx: RequestContext,
) { ) {
scopeguard::defer! {
// Irrespective of the outcome of this operation, we should unblock anyone waiting for it.
self.current_logical_size.initialized.add_permits(1);
}
enum BackgroundCalculationError { enum BackgroundCalculationError {
Cancelled, Cancelled,
Other(anyhow::Error), Other(anyhow::Error),
@@ -3104,6 +3110,32 @@ impl Timeline {
Ok(image_layers) Ok(image_layers)
} }
/// Wait until the background initial logical size calculation is complete, or
/// this Timeline is shut down. Calling this function will cause the initial
/// logical size calculation to skip waiting for the background jobs barrier.
pub(crate) async fn await_initial_logical_size(self: Arc<Self>) {
if let Some(await_bg_cancel) = self
.current_logical_size
.cancel_wait_for_background_loop_concurrency_limit_semaphore
.get()
{
await_bg_cancel.cancel();
} else {
// We should not wait if we were not able to explicitly instruct
// the logical size cancellation to skip the concurrency limit semaphore.
// TODO: this is an unexpected case. We should restructure so that it
// can't happen.
tracing::info!(
"await_initial_logical_size: can't get semaphore cancel token, skipping"
);
}
tokio::select!(
_ = self.current_logical_size.initialized.acquire() => {},
_ = self.cancel.cancelled() => {}
)
}
} }
#[derive(Default)] #[derive(Default)]


@@ -34,6 +34,9 @@ pub(super) struct LogicalSize {
pub(crate) cancel_wait_for_background_loop_concurrency_limit_semaphore: pub(crate) cancel_wait_for_background_loop_concurrency_limit_semaphore:
OnceCell<CancellationToken>, OnceCell<CancellationToken>,
/// Once the initial logical size is initialized, this is notified.
pub(crate) initialized: tokio::sync::Semaphore,
/// Latest Lsn that has its size uncalculated, could be absent for freshly created timelines. /// Latest Lsn that has its size uncalculated, could be absent for freshly created timelines.
pub initial_part_end: Option<Lsn>, pub initial_part_end: Option<Lsn>,
@@ -125,6 +128,7 @@ impl LogicalSize {
initial_part_end: None, initial_part_end: None,
size_added_after_initial: AtomicI64::new(0), size_added_after_initial: AtomicI64::new(0),
did_return_approximate_to_walreceiver: AtomicBool::new(false), did_return_approximate_to_walreceiver: AtomicBool::new(false),
initialized: tokio::sync::Semaphore::new(0),
} }
} }
@@ -135,6 +139,7 @@ impl LogicalSize {
initial_part_end: Some(compute_to), initial_part_end: Some(compute_to),
size_added_after_initial: AtomicI64::new(0), size_added_after_initial: AtomicI64::new(0),
did_return_approximate_to_walreceiver: AtomicBool::new(false), did_return_approximate_to_walreceiver: AtomicBool::new(false),
initialized: tokio::sync::Semaphore::new(0),
} }
} }
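
Note on `initialized`: the semaphore starts with zero permits and receives exactly one when the initial size calculation ends, so it acts as a one-shot "done" signal. Because a waiter that acquires the permit releases it again when the permit is dropped, a single `add_permits(1)` lets any number of waiters through. The sketch below reproduces that behaviour with std primitives instead of `tokio::sync::Semaphore`, a substitution made only so the example needs no external crates. `Notifier` and its methods are hypothetical names.

```rust
use std::sync::{Condvar, Mutex};

/// Minimal one-shot notifier built on a permit counter, standing in for a
/// semaphore that is created with zero permits.
struct Notifier {
    permits: Mutex<usize>,
    changed: Condvar,
}

impl Notifier {
    fn new() -> Self {
        Notifier {
            permits: Mutex::new(0),
            changed: Condvar::new(),
        }
    }

    /// Signal completion by making one permit available and waking waiters.
    fn add_permit(&self) {
        *self.permits.lock().unwrap() += 1;
        self.changed.notify_all();
    }

    /// Block until a permit is available. The permit is conceptually taken and
    /// handed straight back, so the counter is left unchanged and every other
    /// waiter can also proceed.
    fn wait_initialized(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.changed.wait(permits).unwrap();
        }
    }
}
```

A waiter that arrives after the signal returns immediately, and one that arrives before it blocks until `add_permit` runs; this is why the diff adds the permit on every exit path of the calculation.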


@@ -138,7 +138,7 @@ pub(super) async fn connection_manager_loop_step(
Ok(Some(broker_update)) => connection_manager_state.register_timeline_update(broker_update), Ok(Some(broker_update)) => connection_manager_state.register_timeline_update(broker_update),
Err(status) => { Err(status) => {
match status.code() { match status.code() {
Code::Unknown if status.message().contains("stream closed because of a broken pipe") => { Code::Unknown if status.message().contains("stream closed because of a broken pipe") || status.message().contains("connection reset") => {
// tonic's error handling doesn't provide a clear code for disconnections: we get // tonic's error handling doesn't provide a clear code for disconnections: we get
// "h2 protocol error: error reading a body from connection: stream closed because of a broken pipe" // "h2 protocol error: error reading a body from connection: stream closed because of a broken pipe"
info!("broker disconnected: {status}"); info!("broker disconnected: {status}");


@@ -19,20 +19,21 @@
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
#include "postgres.h" #include "postgres.h"
#include <curl/curl.h>
#include "access/xact.h"
#include "commands/defrem.h"
#include "fmgr.h"
#include "libpq/crypt.h"
#include "miscadmin.h"
#include "tcop/pquery.h" #include "tcop/pquery.h"
#include "tcop/utility.h" #include "tcop/utility.h"
#include "access/xact.h" #include "utils/acl.h"
#include "utils/guc.h"
#include "utils/hsearch.h" #include "utils/hsearch.h"
#include "utils/memutils.h" #include "utils/memutils.h"
#include "commands/defrem.h"
#include "miscadmin.h"
#include "utils/acl.h"
#include "fmgr.h"
#include "utils/guc.h"
#include "port.h"
#include <curl/curl.h>
#include "utils/jsonb.h" #include "utils/jsonb.h"
#include "libpq/crypt.h"
static ProcessUtility_hook_type PreviousProcessUtilityHook = NULL; static ProcessUtility_hook_type PreviousProcessUtilityHook = NULL;


@@ -1,4 +1,3 @@
/*------------------------------------------------------------------------- /*-------------------------------------------------------------------------
* *
* extension_server.c * extension_server.c
@@ -10,21 +9,11 @@
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
#include "postgres.h" #include "postgres.h"
#include "tcop/pquery.h"
#include "tcop/utility.h"
#include "access/xact.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "commands/defrem.h"
#include "miscadmin.h"
#include "utils/acl.h"
#include "fmgr.h"
#include "utils/guc.h"
#include "port.h"
#include "fmgr.h"
#include <curl/curl.h> #include <curl/curl.h>
#include "utils/guc.h"
static int extension_server_port = 0; static int extension_server_port = 0;
static download_extension_file_hook_type prev_download_extension_file_hook = NULL; static download_extension_file_hook_type prev_download_extension_file_hook = NULL;


@@ -13,32 +13,30 @@
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
#include "postgres.h"
#include <sys/file.h> #include <sys/file.h>
#include <unistd.h> #include <unistd.h>
#include <fcntl.h> #include <fcntl.h>
#include "postgres.h"
#include "neon_pgversioncompat.h" #include "neon_pgversioncompat.h"
#include "access/parallel.h"
#include "funcapi.h" #include "funcapi.h"
#include "miscadmin.h" #include "miscadmin.h"
#include "pgstat.h"
#include "pagestore_client.h" #include "pagestore_client.h"
#include "access/parallel.h" #include "pgstat.h"
#include "postmaster/bgworker.h" #include "postmaster/bgworker.h"
#include RELFILEINFO_HDR #include RELFILEINFO_HDR
#include "storage/buf_internals.h" #include "storage/buf_internals.h"
#include "storage/latch.h" #include "storage/fd.h"
#include "storage/ipc.h" #include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/lwlock.h" #include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "utils/builtins.h" #include "utils/builtins.h"
#include "utils/dynahash.h" #include "utils/dynahash.h"
#include "utils/guc.h" #include "utils/guc.h"
#include "storage/fd.h"
#include "storage/pg_shmem.h"
#include "storage/buf_internals.h"
#include "pgstat.h"
/* /*
* Local file cache is used to temporary store relations pages in local file system. * Local file cache is used to temporary store relations pages in local file system.
@@ -102,8 +100,6 @@ static shmem_request_hook_type prev_shmem_request_hook;
#define LFC_ENABLED() (lfc_ctl->limit != 0) #define LFC_ENABLED() (lfc_ctl->limit != 0)
void PGDLLEXPORT FileCacheMonitorMain(Datum main_arg);
/* /*
* Local file cache is optional and Neon can work without it. * Local file cache is optional and Neon can work without it.
* In case of any any errors with this cache, we should disable it but to not throw error. * In case of any any errors with this cache, we should disable it but to not throw error.


@@ -14,28 +14,24 @@
*/ */
#include "postgres.h" #include "postgres.h"
#include "pagestore_client.h"
#include "fmgr.h"
#include "access/xlog.h" #include "access/xlog.h"
#include "access/xlogutils.h" #include "fmgr.h"
#include "storage/buf_internals.h"
#include "storage/lwlock.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "c.h"
#include "postmaster/interrupt.h"
#include "libpq-fe.h" #include "libpq-fe.h"
#include "libpq/pqformat.h"
#include "libpq/libpq.h" #include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h" #include "miscadmin.h"
#include "pgstat.h" #include "pgstat.h"
#include "postmaster/interrupt.h"
#include "storage/buf_internals.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/pg_shmem.h"
#include "utils/guc.h" #include "utils/guc.h"
#include "neon.h" #include "neon.h"
#include "walproposer.h"
#include "neon_utils.h" #include "neon_utils.h"
#include "pagestore_client.h"
#include "walproposer.h"
#define PageStoreTrace DEBUG5 #define PageStoreTrace DEBUG5
@@ -62,8 +58,8 @@ char *neon_auth_token;
int readahead_buffer_size = 128; int readahead_buffer_size = 128;
int flush_every_n_requests = 8; int flush_every_n_requests = 8;
int n_reconnect_attempts = 0; static int n_reconnect_attempts = 0;
int max_reconnect_attempts = 60; static int max_reconnect_attempts = 60;
#define MAX_PAGESERVER_CONNSTRING_SIZE 256 #define MAX_PAGESERVER_CONNSTRING_SIZE 256
@@ -83,8 +79,6 @@ static PagestoreShmemState *pagestore_shared;
static uint64 pagestore_local_counter = 0; static uint64 pagestore_local_counter = 0;
static char local_pageserver_connstring[MAX_PAGESERVER_CONNSTRING_SIZE]; static char local_pageserver_connstring[MAX_PAGESERVER_CONNSTRING_SIZE];
bool (*old_redo_read_buffer_filter) (XLogReaderState *record, uint8 block_id) = NULL;
static bool pageserver_flush(void); static bool pageserver_flush(void);
static void pageserver_disconnect(void); static void pageserver_disconnect(void);
@@ -627,8 +621,6 @@ pg_init_libpagestore(void)
smgr_hook = smgr_neon; smgr_hook = smgr_neon;
smgr_init_hook = smgr_init_neon; smgr_init_hook = smgr_init_neon;
dbsize_hook = neon_dbsize; dbsize_hook = neon_dbsize;
old_redo_read_buffer_filter = redo_read_buffer_filter;
redo_read_buffer_filter = neon_redo_read_buffer_filter;
} }
lfc_init(); lfc_init();


@@ -27,13 +27,6 @@ extern void pg_init_walproposer(void);
extern void pg_init_extension_server(void); extern void pg_init_extension_server(void);
/*
* Returns true if we shouldn't do REDO on that block in record indicated by
* block_id; false otherwise.
*/
extern bool neon_redo_read_buffer_filter(XLogReaderState *record, uint8 block_id);
extern bool (*old_redo_read_buffer_filter) (XLogReaderState *record, uint8 block_id);
extern uint64 BackpressureThrottlingTime(void); extern uint64 BackpressureThrottlingTime(void);
extern void replication_feedback_get_lsns(XLogRecPtr *writeLsn, XLogRecPtr *flushLsn, XLogRecPtr *applyLsn); extern void replication_feedback_get_lsns(XLogRecPtr *writeLsn, XLogRecPtr *flushLsn, XLogRecPtr *applyLsn);


@@ -1,32 +1,10 @@
#include <sys/resource.h>
#include "postgres.h" #include "postgres.h"
#include "access/timeline.h" #include "lib/stringinfo.h"
#include "access/xlogutils.h"
#include "common/logging.h"
#include "common/ip.h"
#include "funcapi.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h" #include "libpq/pqformat.h"
#include "miscadmin.h"
#include "postmaster/interrupt.h"
#include "replication/slot.h"
#include "replication/walsender_private.h"
#include "storage/ipc.h"
#include "utils/builtins.h"
#include "utils/ps_status.h"
#include "libpq-fe.h"
#include <netinet/tcp.h>
#include <unistd.h>
#if PG_VERSION_NUM >= 150000
#include "access/xlogutils.h"
#include "access/xlogrecovery.h"
#endif
#if PG_MAJORVERSION_NUM >= 16
#include "utils/guc.h"
#endif
/* /*
* Convert a character which represents a hexadecimal digit to an integer. * Convert a character which represents a hexadecimal digit to an integer.
@@ -114,3 +92,25 @@ pq_sendint64_le(StringInfo buf, uint64 i)
memcpy(buf->data + buf->len, &i, sizeof(uint64)); memcpy(buf->data + buf->len, &i, sizeof(uint64));
buf->len += sizeof(uint64); buf->len += sizeof(uint64);
} }
/*
* Disables core dump for the current process.
*/
void
disable_core_dump()
{
struct rlimit rlim;
#ifdef WALPROPOSER_LIB /* skip in simulation mode */
return;
#endif
rlim.rlim_cur = 0;
rlim.rlim_max = 0;
if (setrlimit(RLIMIT_CORE, &rlim))
{
int save_errno = errno;
fprintf(stderr, "WARNING: disable cores setrlimit failed: %s", strerror(save_errno));
}
}
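
The function added above zeroes the core-file size limit just before a deliberate PANIC. A minimal standalone sketch of the same idea follows, assuming a POSIX system; `disable_core_dump_sketch` and `core_dumps_disabled` are hypothetical names used for illustration and are not part of the patch. Unlike the patched function, the sketch returns a status code so the caller can check the result.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not the function from the patch): set both the soft
 * and the hard RLIMIT_CORE limit to zero so the kernel writes no core file
 * if the process aborts afterwards. Returns 0 on success, -1 on failure.
 */
static int
disable_core_dump_sketch(void)
{
	struct rlimit rlim;

	rlim.rlim_cur = 0;
	rlim.rlim_max = 0;
	if (setrlimit(RLIMIT_CORE, &rlim) != 0)
	{
		int			save_errno = errno;

		fprintf(stderr, "WARNING: setrlimit(RLIMIT_CORE) failed: %s\n",
				strerror(save_errno));
		return -1;
	}
	return 0;
}

/*
 * Hypothetical helper: report whether core dumps are currently disabled for
 * this process, i.e. both limits read back as zero.
 */
static int
core_dumps_disabled(void)
{
	struct rlimit rlim;

	if (getrlimit(RLIMIT_CORE, &rlim) != 0)
		return 0;
	return rlim.rlim_cur == 0 && rlim.rlim_max == 0;
}
```

Lowering a resource limit needs no special privilege, so the call is expected to succeed for an ordinary process. Because the hard limit is lowered as well, an unprivileged process cannot raise it again later.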


@@ -1,12 +1,11 @@
#ifndef __NEON_UTILS_H__ #ifndef __NEON_UTILS_H__
#define __NEON_UTILS_H__ #define __NEON_UTILS_H__
#include "postgres.h"
bool HexDecodeString(uint8 *result, char *input, int nbytes); bool HexDecodeString(uint8 *result, char *input, int nbytes);
uint32 pq_getmsgint32_le(StringInfo msg); uint32 pq_getmsgint32_le(StringInfo msg);
uint64 pq_getmsgint64_le(StringInfo msg); uint64 pq_getmsgint64_le(StringInfo msg);
void pq_sendint32_le(StringInfo buf, uint32 i); void pq_sendint32_le(StringInfo buf, uint32 i);
void pq_sendint64_le(StringInfo buf, uint64 i); void pq_sendint64_le(StringInfo buf, uint64 i);
extern void disable_core_dump();
#endif /* __NEON_UTILS_H__ */ #endif /* __NEON_UTILS_H__ */


@@ -13,19 +13,16 @@
#ifndef pageserver_h #ifndef pageserver_h
#define pageserver_h #define pageserver_h
#include "postgres.h"
#include "neon_pgversioncompat.h" #include "neon_pgversioncompat.h"
#include "access/xlogdefs.h" #include "access/xlogdefs.h"
#include RELFILEINFO_HDR #include RELFILEINFO_HDR
#include "storage/block.h"
#include "storage/smgr.h"
#include "lib/stringinfo.h" #include "lib/stringinfo.h"
#include "libpq/pqformat.h" #include "libpq/pqformat.h"
#include "storage/block.h"
#include "storage/smgr.h"
#include "utils/memutils.h" #include "utils/memutils.h"
#include "pg_config.h"
typedef enum typedef enum
{ {
/* pagestore_client -> pagestore */ /* pagestore_client -> pagestore */
@@ -158,11 +155,8 @@ extern page_server_api *page_server;
extern char *page_server_connstring; extern char *page_server_connstring;
extern int flush_every_n_requests; extern int flush_every_n_requests;
extern int readahead_buffer_size; extern int readahead_buffer_size;
extern bool seqscan_prefetch_enabled;
extern int seqscan_prefetch_distance;
extern char *neon_timeline; extern char *neon_timeline;
extern char *neon_tenant; extern char *neon_tenant;
extern bool wal_redo;
extern int32 max_cluster_size; extern int32 max_cluster_size;
extern const f_smgr *smgr_neon(BackendId backend, NRelFileInfo rinfo); extern const f_smgr *smgr_neon(BackendId backend, NRelFileInfo rinfo);


@@ -47,25 +47,26 @@
#include "access/xact.h" #include "access/xact.h"
#include "access/xlog.h" #include "access/xlog.h"
#include "access/xlogdefs.h"
#include "access/xloginsert.h" #include "access/xloginsert.h"
#include "access/xlog_internal.h" #include "access/xlog_internal.h"
#include "access/xlogdefs.h" #include "access/xlogutils.h"
#include "catalog/pg_class.h" #include "catalog/pg_class.h"
#include "common/hashfn.h" #include "common/hashfn.h"
#include "executor/instrument.h" #include "executor/instrument.h"
#include "pagestore_client.h" #include "pgstat.h"
#include "postmaster/interrupt.h"
#include "postmaster/autovacuum.h" #include "postmaster/autovacuum.h"
#include "postmaster/interrupt.h"
#include "replication/walsender.h" #include "replication/walsender.h"
#include "storage/bufmgr.h" #include "storage/bufmgr.h"
#include "storage/buf_internals.h" #include "storage/buf_internals.h"
#include "storage/fsm_internals.h" #include "storage/fsm_internals.h"
#include "storage/smgr.h"
#include "storage/md.h" #include "storage/md.h"
#include "pgstat.h" #include "storage/smgr.h"
#include "pagestore_client.h"
#if PG_VERSION_NUM >= 150000 #if PG_VERSION_NUM >= 150000
#include "access/xlogutils.h"
#include "access/xlogrecovery.h" #include "access/xlogrecovery.h"
#endif #endif
@@ -106,6 +107,9 @@ typedef enum
static SMgrRelation unlogged_build_rel = NULL; static SMgrRelation unlogged_build_rel = NULL;
static UnloggedBuildPhase unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS; static UnloggedBuildPhase unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
static bool neon_redo_read_buffer_filter(XLogReaderState *record, uint8 block_id);
static bool (*old_redo_read_buffer_filter) (XLogReaderState *record, uint8 block_id) = NULL;
/* /*
* Prefetch implementation: * Prefetch implementation:
* *
@@ -239,7 +243,7 @@ typedef struct PrefetchState
PrefetchRequest prf_buffer[]; /* prefetch buffers */ PrefetchRequest prf_buffer[]; /* prefetch buffers */
} PrefetchState; } PrefetchState;
PrefetchState *MyPState; static PrefetchState *MyPState;
#define GetPrfSlot(ring_index) ( \ #define GetPrfSlot(ring_index) ( \
( \ ( \
@@ -257,7 +261,7 @@ PrefetchState *MyPState;
) \ ) \
) )
XLogRecPtr prefetch_lsn = 0; static XLogRecPtr prefetch_lsn = 0;
static bool compact_prefetch_buffers(void); static bool compact_prefetch_buffers(void);
static void consume_prefetch_responses(void); static void consume_prefetch_responses(void);
@@ -1371,6 +1375,9 @@ neon_init(void)
MyPState->prf_hash = prfh_create(MyPState->hashctx, MyPState->prf_hash = prfh_create(MyPState->hashctx,
readahead_buffer_size, NULL); readahead_buffer_size, NULL);
old_redo_read_buffer_filter = redo_read_buffer_filter;
redo_read_buffer_filter = neon_redo_read_buffer_filter;
#ifdef DEBUG_COMPARE_LOCAL #ifdef DEBUG_COMPARE_LOCAL
mdinit(); mdinit();
#endif #endif
@@ -2869,7 +2876,7 @@ get_fsm_physical_block(BlockNumber heapblk)
* contents, where with REDO locking it would wait on block 1 and see * contents, where with REDO locking it would wait on block 1 and see
* block 3 with post-REDO contents only. * block 3 with post-REDO contents only.
*/ */
bool static bool
neon_redo_read_buffer_filter(XLogReaderState *record, uint8 block_id) neon_redo_read_buffer_filter(XLogReaderState *record, uint8 block_id)
{ {
XLogRecPtr end_recptr = record->EndRecPtr; XLogRecPtr end_recptr = record->EndRecPtr;
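
The hunks above move the filter and the saved previous hook into one file and make both `static`. Installation follows the usual save-then-replace pattern. A minimal sketch of that pattern in plain C follows, assuming a simplified hook signature. Every name in it (`block_filter`, `prev_block_filter`, `my_block_filter`, `skip_block_seven`, `install_block_filter`) is hypothetical, and the delegation to the saved hook is an illustrative choice, since this diff does not show whether the real filter does so.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hook type: returns true if the given block should be skipped. */
typedef bool (*block_filter_hook) (int block_id);

/* Stand-in for a global hook variable owned by the core system. */
block_filter_hook block_filter = NULL;

/* Saved previous hook; file-local, like the static variable in the patch. */
static block_filter_hook prev_block_filter = NULL;

/* Example of a hook that some other module installed earlier. */
static bool
skip_block_seven(int block_id)
{
	return block_id == 7;
}

/* Our filter: applies its own (arbitrary) rule, then defers to the old hook. */
static bool
my_block_filter(int block_id)
{
	if (block_id % 2 == 0)
		return true;			/* arbitrary rule: skip even blocks */
	if (prev_block_filter != NULL)
		return prev_block_filter(block_id);
	return false;
}

/* Save whatever hook was installed before, then install ours. */
static void
install_block_filter(void)
{
	prev_block_filter = block_filter;
	block_filter = my_block_filter;
}
```

Keeping the saved pointer and the filter `static` limits them to one translation unit, which is the same visibility change the patch makes.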


@@ -35,6 +35,8 @@
* *
*------------------------------------------------------------------------- *-------------------------------------------------------------------------
*/ */
#include <sys/resource.h>
#include "postgres.h" #include "postgres.h"
#include "libpq/pqformat.h" #include "libpq/pqformat.h"
#include "neon.h" #include "neon.h"
@@ -1069,6 +1071,12 @@ DetermineEpochStartLsn(WalProposer *wp)
if (!((dth->n_entries >= 1) && (dth->entries[dth->n_entries - 1].term == if (!((dth->n_entries >= 1) && (dth->entries[dth->n_entries - 1].term ==
walprop_shared->mineLastElectedTerm))) walprop_shared->mineLastElectedTerm)))
{ {
/*
* Panic to restart PG as we need to retake basebackup.
* However, don't dump core as this is kinda expected
* scenario.
*/
disable_core_dump();
walprop_log(PANIC, walprop_log(PANIC,
"collected propEpochStartLsn %X/%X, but basebackup LSN %X/%X", "collected propEpochStartLsn %X/%X, but basebackup LSN %X/%X",
LSN_FORMAT_ARGS(wp->propEpochStartLsn), LSN_FORMAT_ARGS(wp->propEpochStartLsn),
@@ -1445,7 +1453,12 @@ RecvAppendResponses(Safekeeper *sk)
if (sk->appendResponse.term > wp->propTerm) if (sk->appendResponse.term > wp->propTerm)
{ {
/* Another compute with higher term is running. */ /*
* Another compute with higher term is running. Panic to restart
* PG as we likely need to retake basebackup. However, don't dump
* core as this is kinda expected scenario.
*/
disable_core_dump();
walprop_log(PANIC, "WAL acceptor %s:%s with term " INT64_FORMAT " rejected our request, our term " INT64_FORMAT "", walprop_log(PANIC, "WAL acceptor %s:%s with term " INT64_FORMAT " rejected our request, our term " INT64_FORMAT "",
sk->host, sk->port, sk->host, sk->port,
sk->appendResponse.term, wp->propTerm); sk->appendResponse.term, wp->propTerm);


@@ -1,14 +1,12 @@
#ifndef __NEON_WALPROPOSER_H__ #ifndef __NEON_WALPROPOSER_H__
#define __NEON_WALPROPOSER_H__ #define __NEON_WALPROPOSER_H__
#include "postgres.h"
#include "access/xlogdefs.h"
#include "port.h"
#include "access/xlog_internal.h"
#include "access/transam.h" #include "access/transam.h"
#include "access/xlogdefs.h"
#include "access/xlog_internal.h"
#include "nodes/replnodes.h" #include "nodes/replnodes.h"
#include "utils/uuid.h"
#include "replication/walreceiver.h" #include "replication/walreceiver.h"
#include "utils/uuid.h"
#define SK_MAGIC 0xCafeCeefu #define SK_MAGIC 0xCafeCeefu
#define SK_PROTOCOL_VERSION 2 #define SK_PROTOCOL_VERSION 2


@@ -3,11 +3,13 @@
* This is needed to avoid linking to full postgres server installation. This file * This is needed to avoid linking to full postgres server installation. This file
* is compiled as a part of libwalproposer static library. * is compiled as a part of libwalproposer static library.
*/ */
#include "postgres.h"
#include <stdio.h> #include <stdio.h>
#include "walproposer.h"
#include "utils/datetime.h"
#include "miscadmin.h" #include "miscadmin.h"
#include "utils/datetime.h"
#include "walproposer.h"
void void
ExceptionalCondition(const char *conditionName, ExceptionalCondition(const char *conditionName,


@@ -1482,6 +1482,21 @@ walprop_pg_wait_event_set(WalProposer *wp, long timeout, Safekeeper **sk, uint32
#if PG_MAJORVERSION_NUM >= 16 #if PG_MAJORVERSION_NUM >= 16
if (WalSndCtl != NULL) if (WalSndCtl != NULL)
ConditionVariablePrepareToSleep(&WalSndCtl->wal_flush_cv); ConditionVariablePrepareToSleep(&WalSndCtl->wal_flush_cv);
/*
* Now that we prepared the condvar, check flush ptr again -- it might have
* changed before we subscribed to cv so we missed the wakeup.
*
* Do that only when we're interested in new WAL: without sync-safekeepers
* and if election already passed.
*/
if (!wp->config->syncSafekeepers && wp->availableLsn != InvalidXLogRecPtr && GetFlushRecPtr(NULL) > wp->availableLsn)
{
ConditionVariableCancelSleep();
ResetLatch(MyLatch);
*events = WL_LATCH_SET;
return 1;
}
#endif #endif
/* /*
@@ -1697,9 +1712,9 @@ walprop_pg_after_election(WalProposer *wp)
f = fopen("restart.lsn", "rb"); f = fopen("restart.lsn", "rb");
if (f != NULL && !wp->config->syncSafekeepers) if (f != NULL && !wp->config->syncSafekeepers)
{ {
fread(&lrRestartLsn, sizeof(lrRestartLsn), 1, f); size_t rc = fread(&lrRestartLsn, sizeof(lrRestartLsn), 1, f);
fclose(f); fclose(f);
if (lrRestartLsn != InvalidXLogRecPtr) if (rc == 1 && lrRestartLsn != InvalidXLogRecPtr)
{ {
elog(LOG, "Logical replication restart LSN %X/%X", LSN_FORMAT_ARGS(lrRestartLsn)); elog(LOG, "Logical replication restart LSN %X/%X", LSN_FORMAT_ARGS(lrRestartLsn));
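
The change above stops using `lrRestartLsn` unless `fread` reports that one complete item was read. A minimal standalone sketch of the same check follows, assuming a file that holds a single 64-bit value; `read_u64_file` is a hypothetical helper and is not part of the patch.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: read exactly one uint64_t from the file at 'path'.
 * Returns 1 and stores the value in *out on success. Returns 0 and leaves
 * *out untouched if the file is missing, too short, or unreadable.
 */
static int
read_u64_file(const char *path, uint64_t *out)
{
	FILE	   *f = fopen(path, "rb");
	uint64_t	value = 0;
	size_t		rc;

	if (f == NULL)
		return 0;

	/* fread returns the number of complete items read, here 0 or 1. */
	rc = fread(&value, sizeof(value), 1, f);
	fclose(f);

	if (rc != 1)
		return 0;

	*out = value;
	return 1;
}
```

Without the `rc` check, a truncated or empty file would leave the destination holding whatever it contained before the call, and the caller would act on that stale or uninitialized value.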

258
poetry.lock generated

@@ -2092,51 +2092,61 @@ files = [
[[package]] [[package]]
name = "pyyaml" name = "pyyaml"
version = "6.0" version = "6.0.1"
description = "YAML parser and emitter for Python" description = "YAML parser and emitter for Python"
optional = false optional = false
python-versions = ">=3.6" python-versions = ">=3.6"
files = [ files = [
{file = "PyYAML-6.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:d4db7c7aef085872ef65a8fd7d6d09a14ae91f691dec3e87ee5ee0539d516f53"}, {file = "PyYAML-6.0.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:d858aa552c999bc8a8d57426ed01e40bef403cd8ccdd0fc5f6f04a00414cac2a"},
{file = "PyYAML-6.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:9df7ed3b3d2e0ecfe09e14741b857df43adb5a3ddadc919a2d94fbdf78fea53c"}, {file = "PyYAML-6.0.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:fd66fc5d0da6d9815ba2cebeb4205f95818ff4b79c3ebe268e75d961704af52f"},
{file = "PyYAML-6.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:77f396e6ef4c73fdc33a9157446466f1cff553d979bd00ecb64385760c6babdc"}, {file = "PyYAML-6.0.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:69b023b2b4daa7548bcfbd4aa3da05b3a74b772db9e23b982788168117739938"},
{file = "PyYAML-6.0-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:a80a78046a72361de73f8f395f1f1e49f956c6be882eed58505a15f3e430962b"}, {file = "PyYAML-6.0.1-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:81e0b275a9ecc9c0c0c07b4b90ba548307583c125f54d5b6946cfee6360c733d"},
{file = "PyYAML-6.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:f84fbc98b019fef2ee9a1cb3ce93e3187a6df0b2538a651bfb890254ba9f90b5"}, {file = "PyYAML-6.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ba336e390cd8e4d1739f42dfe9bb83a3cc2e80f567d8805e11b46f4a943f5515"},
{file = "PyYAML-6.0-cp310-cp310-win32.whl", hash = "sha256:2cd5df3de48857ed0544b34e2d40e9fac445930039f3cfe4bcc592a1f836d513"}, {file = "PyYAML-6.0.1-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:326c013efe8048858a6d312ddd31d56e468118ad4cdeda36c719bf5bb6192290"},
{file = "PyYAML-6.0-cp310-cp310-win_amd64.whl", hash = "sha256:daf496c58a8c52083df09b80c860005194014c3698698d1a57cbcfa182142a3a"}, {file = "PyYAML-6.0.1-cp310-cp310-win32.whl", hash = "sha256:bd4af7373a854424dabd882decdc5579653d7868b8fb26dc7d0e99f823aa5924"},
{file = "PyYAML-6.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:d4b0ba9512519522b118090257be113b9468d804b19d63c71dbcf4a48fa32358"}, {file = "PyYAML-6.0.1-cp310-cp310-win_amd64.whl", hash = "sha256:fd1592b3fdf65fff2ad0004b5e363300ef59ced41c2e6b3a99d4089fa8c5435d"},
{file = "PyYAML-6.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:81957921f441d50af23654aa6c5e5eaf9b06aba7f0a19c18a538dc7ef291c5a1"}, {file = "PyYAML-6.0.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:6965a7bc3cf88e5a1c3bd2e0b5c22f8d677dc88a455344035f03399034eb3007"},
{file = "PyYAML-6.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:afa17f5bc4d1b10afd4466fd3a44dc0e245382deca5b3c353d8b757f9e3ecb8d"}, {file = "PyYAML-6.0.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:f003ed9ad21d6a4713f0a9b5a7a0a79e08dd0f221aff4525a2be4c346ee60aab"},
{file = "PyYAML-6.0-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:dbad0e9d368bb989f4515da330b88a057617d16b6a8245084f1b05400f24609f"}, {file = "PyYAML-6.0.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:42f8152b8dbc4fe7d96729ec2b99c7097d656dc1213a3229ca5383f973a5ed6d"},
{file = "PyYAML-6.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:432557aa2c09802be39460360ddffd48156e30721f5e8d917f01d31694216782"}, {file = "PyYAML-6.0.1-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:062582fca9fabdd2c8b54a3ef1c978d786e0f6b3a1510e0ac93ef59e0ddae2bc"},
{file = "PyYAML-6.0-cp311-cp311-win32.whl", hash = "sha256:bfaef573a63ba8923503d27530362590ff4f576c626d86a9fed95822a8255fd7"}, {file = "PyYAML-6.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d2b04aac4d386b172d5b9692e2d2da8de7bfb6c387fa4f801fbf6fb2e6ba4673"},
{file = "PyYAML-6.0-cp311-cp311-win_amd64.whl", hash = "sha256:01b45c0191e6d66c470b6cf1b9531a771a83c1c4208272ead47a3ae4f2f603bf"}, {file = "PyYAML-6.0.1-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:e7d73685e87afe9f3b36c799222440d6cf362062f78be1013661b00c5c6f678b"},
{file = "PyYAML-6.0-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:897b80890765f037df3403d22bab41627ca8811ae55e9a722fd0392850ec4d86"}, {file = "PyYAML-6.0.1-cp311-cp311-win32.whl", hash = "sha256:1635fd110e8d85d55237ab316b5b011de701ea0f29d07611174a1b42f1444741"},
{file = "PyYAML-6.0-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:50602afada6d6cbfad699b0c7bb50d5ccffa7e46a3d738092afddc1f9758427f"}, {file = "PyYAML-6.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:bf07ee2fef7014951eeb99f56f39c9bb4af143d8aa3c21b1677805985307da34"},
{file = "PyYAML-6.0-cp36-cp36m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:48c346915c114f5fdb3ead70312bd042a953a8ce5c7106d5bfb1a5254e47da92"}, {file = "PyYAML-6.0.1-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:855fb52b0dc35af121542a76b9a84f8d1cd886ea97c84703eaa6d88e37a2ad28"},
{file = "PyYAML-6.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:98c4d36e99714e55cfbaaee6dd5badbc9a1ec339ebfc3b1f52e293aee6bb71a4"}, {file = "PyYAML-6.0.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:40df9b996c2b73138957fe23a16a4f0ba614f4c0efce1e9406a184b6d07fa3a9"},
{file = "PyYAML-6.0-cp36-cp36m-win32.whl", hash = "sha256:0283c35a6a9fbf047493e3a0ce8d79ef5030852c51e9d911a27badfde0605293"}, {file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6c22bec3fbe2524cde73d7ada88f6566758a8f7227bfbf93a408a9d86bcc12a0"},
{file = "PyYAML-6.0-cp36-cp36m-win_amd64.whl", hash = "sha256:07751360502caac1c067a8132d150cf3d61339af5691fe9e87803040dbc5db57"}, {file = "PyYAML-6.0.1-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:8d4e9c88387b0f5c7d5f281e55304de64cf7f9c0021a3525bd3b1c542da3b0e4"},
{file = "PyYAML-6.0-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:819b3830a1543db06c4d4b865e70ded25be52a2e0631ccd2f6a47a2822f2fd7c"}, {file = "PyYAML-6.0.1-cp312-cp312-win32.whl", hash = "sha256:d483d2cdf104e7c9fa60c544d92981f12ad66a457afae824d146093b8c294c54"},
{file = "PyYAML-6.0-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:473f9edb243cb1935ab5a084eb238d842fb8f404ed2193a915d1784b5a6b5fc0"}, {file = "PyYAML-6.0.1-cp312-cp312-win_amd64.whl", hash = "sha256:0d3304d8c0adc42be59c5f8a4d9e3d7379e6955ad754aa9d6ab7a398b59dd1df"},
{file = "PyYAML-6.0-cp37-cp37m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:0ce82d761c532fe4ec3f87fc45688bdd3a4c1dc5e0b4a19814b9009a29baefd4"}, {file = "PyYAML-6.0.1-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:50550eb667afee136e9a77d6dc71ae76a44df8b3e51e41b77f6de2932bfe0f47"},
{file = "PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:231710d57adfd809ef5d34183b8ed1eeae3f76459c18fb4a0b373ad56bedcdd9"}, {file = "PyYAML-6.0.1-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1fe35611261b29bd1de0070f0b2f47cb6ff71fa6595c077e42bd0c419fa27b98"},
{file = "PyYAML-6.0-cp37-cp37m-win32.whl", hash = "sha256:c5687b8d43cf58545ade1fe3e055f70eac7a5a1a0bf42824308d868289a95737"}, {file = "PyYAML-6.0.1-cp36-cp36m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:704219a11b772aea0d8ecd7058d0082713c3562b4e271b849ad7dc4a5c90c13c"},
{file = "PyYAML-6.0-cp37-cp37m-win_amd64.whl", hash = "sha256:d15a181d1ecd0d4270dc32edb46f7cb7733c7c508857278d3d378d14d606db2d"}, {file = "PyYAML-6.0.1-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:afd7e57eddb1a54f0f1a974bc4391af8bcce0b444685d936840f125cf046d5bd"},
{file = "PyYAML-6.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:0b4624f379dab24d3725ffde76559cff63d9ec94e1736b556dacdfebe5ab6d4b"}, {file = "PyYAML-6.0.1-cp36-cp36m-win32.whl", hash = "sha256:fca0e3a251908a499833aa292323f32437106001d436eca0e6e7833256674585"},
{file = "PyYAML-6.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:213c60cd50106436cc818accf5baa1aba61c0189ff610f64f4a3e8c6726218ba"}, {file = "PyYAML-6.0.1-cp36-cp36m-win_amd64.whl", hash = "sha256:f22ac1c3cac4dbc50079e965eba2c1058622631e526bd9afd45fedd49ba781fa"},
{file = "PyYAML-6.0-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:9fa600030013c4de8165339db93d182b9431076eb98eb40ee068700c9c813e34"}, {file = "PyYAML-6.0.1-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:b1275ad35a5d18c62a7220633c913e1b42d44b46ee12554e5fd39c70a243d6a3"},
{file = "PyYAML-6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:277a0ef2981ca40581a47093e9e2d13b3f1fbbeffae064c1d21bfceba2030287"}, {file = "PyYAML-6.0.1-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:18aeb1bf9a78867dc38b259769503436b7c72f7a1f1f4c93ff9a17de54319b27"},
{file = "PyYAML-6.0-cp38-cp38-win32.whl", hash = "sha256:d4eccecf9adf6fbcc6861a38015c2a64f38b9d94838ac1810a9023a0609e1b78"}, {file = "PyYAML-6.0.1-cp37-cp37m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:596106435fa6ad000c2991a98fa58eeb8656ef2325d7e158344fb33864ed87e3"},
{file = "PyYAML-6.0-cp38-cp38-win_amd64.whl", hash = "sha256:1e4747bc279b4f613a09eb64bba2ba602d8a6664c6ce6396a4d0cd413a50ce07"}, {file = "PyYAML-6.0.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:baa90d3f661d43131ca170712d903e6295d1f7a0f595074f151c0aed377c9b9c"},
{file = "PyYAML-6.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:055d937d65826939cb044fc8c9b08889e8c743fdc6a32b33e2390f66013e449b"}, {file = "PyYAML-6.0.1-cp37-cp37m-win32.whl", hash = "sha256:9046c58c4395dff28dd494285c82ba00b546adfc7ef001486fbf0324bc174fba"},
{file = "PyYAML-6.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:e61ceaab6f49fb8bdfaa0f92c4b57bcfbea54c09277b1b4f7ac376bfb7a7c174"}, {file = "PyYAML-6.0.1-cp37-cp37m-win_amd64.whl", hash = "sha256:4fb147e7a67ef577a588a0e2c17b6db51dda102c71de36f8549b6816a96e1867"},
{file = "PyYAML-6.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d67d839ede4ed1b28a4e8909735fc992a923cdb84e618544973d7dfc71540803"}, {file = "PyYAML-6.0.1-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:1d4c7e777c441b20e32f52bd377e0c409713e8bb1386e1099c2415f26e479595"},
{file = "PyYAML-6.0-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:cba8c411ef271aa037d7357a2bc8f9ee8b58b9965831d9e51baf703280dc73d3"}, {file = "PyYAML-6.0.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a0cd17c15d3bb3fa06978b4e8958dcdc6e0174ccea823003a106c7d4d7899ac5"},
{file = "PyYAML-6.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:40527857252b61eacd1d9af500c3337ba8deb8fc298940291486c465c8b46ec0"}, {file = "PyYAML-6.0.1-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:28c119d996beec18c05208a8bd78cbe4007878c6dd15091efb73a30e90539696"},
{file = "PyYAML-6.0-cp39-cp39-win32.whl", hash = "sha256:b5b9eccad747aabaaffbc6064800670f0c297e52c12754eb1d976c57e4f74dcb"}, {file = "PyYAML-6.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7e07cbde391ba96ab58e532ff4803f79c4129397514e1413a7dc761ccd755735"},
{file = "PyYAML-6.0-cp39-cp39-win_amd64.whl", hash = "sha256:b3d267842bf12586ba6c734f89d1f5b871df0273157918b0ccefa29deb05c21c"}, {file = "PyYAML-6.0.1-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:49a183be227561de579b4a36efbb21b3eab9651dd81b1858589f796549873dd6"},
{file = "PyYAML-6.0.tar.gz", hash = "sha256:68fb519c14306fec9720a2a5b45bc9f0c8d1b9c72adf45c37baedfcd949c35a2"}, {file = "PyYAML-6.0.1-cp38-cp38-win32.whl", hash = "sha256:184c5108a2aca3c5b3d3bf9395d50893a7ab82a38004c8f61c258d4428e80206"},
{file = "PyYAML-6.0.1-cp38-cp38-win_amd64.whl", hash = "sha256:1e2722cc9fbb45d9b87631ac70924c11d3a401b2d7f410cc0e3bbf249f2dca62"},
{file = "PyYAML-6.0.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:9eb6caa9a297fc2c2fb8862bc5370d0303ddba53ba97e71f08023b6cd73d16a8"},
{file = "PyYAML-6.0.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:c8098ddcc2a85b61647b2590f825f3db38891662cfc2fc776415143f599bb859"},
{file = "PyYAML-6.0.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5773183b6446b2c99bb77e77595dd486303b4faab2b086e7b17bc6bef28865f6"},
{file = "PyYAML-6.0.1-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:b786eecbdf8499b9ca1d697215862083bd6d2a99965554781d0d8d1ad31e13a0"},
{file = "PyYAML-6.0.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bc1bf2925a1ecd43da378f4db9e4f799775d6367bdb94671027b73b393a7c42c"},
{file = "PyYAML-6.0.1-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:04ac92ad1925b2cff1db0cfebffb6ffc43457495c9b3c39d3fcae417d7125dc5"},
{file = "PyYAML-6.0.1-cp39-cp39-win32.whl", hash = "sha256:faca3bdcf85b2fc05d06ff3fbc1f83e1391b3e724afa3feba7d13eeab355484c"},
{file = "PyYAML-6.0.1-cp39-cp39-win_amd64.whl", hash = "sha256:510c9deebc5c0225e8c96813043e62b680ba2f9c50a08d3724c7f28a747d1486"},
{file = "PyYAML-6.0.1.tar.gz", hash = "sha256:bfdf460b1736c775f2ba9f6a92bca30bc2095067b8a9d77876d1fad6cc3b4a43"},
] ]
[[package]] [[package]]
@@ -2553,85 +2563,101 @@ files = [
[[package]] [[package]]
name = "yarl" name = "yarl"
version = "1.8.2" version = "1.9.4"
description = "Yet another URL library" description = "Yet another URL library"
optional = false optional = false
python-versions = ">=3.7" python-versions = ">=3.7"
files = [ files = [
{file = "yarl-1.8.2-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:bb81f753c815f6b8e2ddd2eef3c855cf7da193b82396ac013c661aaa6cc6b0a5"}, {file = "yarl-1.9.4-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:a8c1df72eb746f4136fe9a2e72b0c9dc1da1cbd23b5372f94b5820ff8ae30e0e"},
{file = "yarl-1.8.2-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:47d49ac96156f0928f002e2424299b2c91d9db73e08c4cd6742923a086f1c863"}, {file = "yarl-1.9.4-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:a3a6ed1d525bfb91b3fc9b690c5a21bb52de28c018530ad85093cc488bee2dd2"},
{file = "yarl-1.8.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:3fc056e35fa6fba63248d93ff6e672c096f95f7836938241ebc8260e062832fe"}, {file = "yarl-1.9.4-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:c38c9ddb6103ceae4e4498f9c08fac9b590c5c71b0370f98714768e22ac6fa66"},
{file = "yarl-1.8.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:58a3c13d1c3005dbbac5c9f0d3210b60220a65a999b1833aa46bd6677c69b08e"}, {file = "yarl-1.9.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d9e09c9d74f4566e905a0b8fa668c58109f7624db96a2171f21747abc7524234"},
{file = "yarl-1.8.2-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:10b08293cda921157f1e7c2790999d903b3fd28cd5c208cf8826b3b508026996"}, {file = "yarl-1.9.4-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b8477c1ee4bd47c57d49621a062121c3023609f7a13b8a46953eb6c9716ca392"},
{file = "yarl-1.8.2-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:de986979bbd87272fe557e0a8fcb66fd40ae2ddfe28a8b1ce4eae22681728fef"}, {file = "yarl-1.9.4-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:d5ff2c858f5f6a42c2a8e751100f237c5e869cbde669a724f2062d4c4ef93551"},
{file = "yarl-1.8.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6c4fcfa71e2c6a3cb568cf81aadc12768b9995323186a10827beccf5fa23d4f8"}, {file = "yarl-1.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:357495293086c5b6d34ca9616a43d329317feab7917518bc97a08f9e55648455"},
{file = "yarl-1.8.2-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ae4d7ff1049f36accde9e1ef7301912a751e5bae0a9d142459646114c70ecba6"}, {file = "yarl-1.9.4-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:54525ae423d7b7a8ee81ba189f131054defdb122cde31ff17477951464c1691c"},
{file = "yarl-1.8.2-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:bf071f797aec5b96abfc735ab97da9fd8f8768b43ce2abd85356a3127909d146"}, {file = "yarl-1.9.4-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:801e9264d19643548651b9db361ce3287176671fb0117f96b5ac0ee1c3530d53"},
{file = "yarl-1.8.2-cp310-cp310-musllinux_1_1_i686.whl", hash = "sha256:74dece2bfc60f0f70907c34b857ee98f2c6dd0f75185db133770cd67300d505f"}, {file = "yarl-1.9.4-cp310-cp310-musllinux_1_1_i686.whl", hash = "sha256:e516dc8baf7b380e6c1c26792610230f37147bb754d6426462ab115a02944385"},
{file = "yarl-1.8.2-cp310-cp310-musllinux_1_1_ppc64le.whl", hash = "sha256:df60a94d332158b444301c7f569659c926168e4d4aad2cfbf4bce0e8fb8be826"}, {file = "yarl-1.9.4-cp310-cp310-musllinux_1_1_ppc64le.whl", hash = "sha256:7d5aaac37d19b2904bb9dfe12cdb08c8443e7ba7d2852894ad448d4b8f442863"},
{file = "yarl-1.8.2-cp310-cp310-musllinux_1_1_s390x.whl", hash = "sha256:63243b21c6e28ec2375f932a10ce7eda65139b5b854c0f6b82ed945ba526bff3"}, {file = "yarl-1.9.4-cp310-cp310-musllinux_1_1_s390x.whl", hash = "sha256:54beabb809ffcacbd9d28ac57b0db46e42a6e341a030293fb3185c409e626b8b"},
{file = "yarl-1.8.2-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:cfa2bbca929aa742b5084fd4663dd4b87c191c844326fcb21c3afd2d11497f80"}, {file = "yarl-1.9.4-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:bac8d525a8dbc2a1507ec731d2867025d11ceadcb4dd421423a5d42c56818541"},
{file = "yarl-1.8.2-cp310-cp310-win32.whl", hash = "sha256:b05df9ea7496df11b710081bd90ecc3a3db6adb4fee36f6a411e7bc91a18aa42"}, {file = "yarl-1.9.4-cp310-cp310-win32.whl", hash = "sha256:7855426dfbddac81896b6e533ebefc0af2f132d4a47340cee6d22cac7190022d"},
{file = "yarl-1.8.2-cp310-cp310-win_amd64.whl", hash = "sha256:24ad1d10c9db1953291f56b5fe76203977f1ed05f82d09ec97acb623a7976574"}, {file = "yarl-1.9.4-cp310-cp310-win_amd64.whl", hash = "sha256:848cd2a1df56ddbffeb375535fb62c9d1645dde33ca4d51341378b3f5954429b"},
{file = "yarl-1.8.2-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:2a1fca9588f360036242f379bfea2b8b44cae2721859b1c56d033adfd5893634"}, {file = "yarl-1.9.4-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:35a2b9396879ce32754bd457d31a51ff0a9d426fd9e0e3c33394bf4b9036b099"},
{file = "yarl-1.8.2-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:f37db05c6051eff17bc832914fe46869f8849de5b92dc4a3466cd63095d23dfd"}, {file = "yarl-1.9.4-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:4c7d56b293cc071e82532f70adcbd8b61909eec973ae9d2d1f9b233f3d943f2c"},
{file = "yarl-1.8.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:77e913b846a6b9c5f767b14dc1e759e5aff05502fe73079f6f4176359d832581"}, {file = "yarl-1.9.4-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:d8a1c6c0be645c745a081c192e747c5de06e944a0d21245f4cf7c05e457c36e0"},
{file = "yarl-1.8.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0978f29222e649c351b173da2b9b4665ad1feb8d1daa9d971eb90df08702668a"}, {file = "yarl-1.9.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4b3c1ffe10069f655ea2d731808e76e0f452fc6c749bea04781daf18e6039525"},
{file = "yarl-1.8.2-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:388a45dc77198b2460eac0aca1efd6a7c09e976ee768b0d5109173e521a19daf"}, {file = "yarl-1.9.4-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:549d19c84c55d11687ddbd47eeb348a89df9cb30e1993f1b128f4685cd0ebbf8"},
{file = "yarl-1.8.2-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:2305517e332a862ef75be8fad3606ea10108662bc6fe08509d5ca99503ac2aee"}, {file = "yarl-1.9.4-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:a7409f968456111140c1c95301cadf071bd30a81cbd7ab829169fb9e3d72eae9"},
{file = "yarl-1.8.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:42430ff511571940d51e75cf42f1e4dbdded477e71c1b7a17f4da76c1da8ea76"}, {file = "yarl-1.9.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e23a6d84d9d1738dbc6e38167776107e63307dfc8ad108e580548d1f2c587f42"},
{file = "yarl-1.8.2-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:3150078118f62371375e1e69b13b48288e44f6691c1069340081c3fd12c94d5b"}, {file = "yarl-1.9.4-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:d8b889777de69897406c9fb0b76cdf2fd0f31267861ae7501d93003d55f54fbe"},
{file = "yarl-1.8.2-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:c15163b6125db87c8f53c98baa5e785782078fbd2dbeaa04c6141935eb6dab7a"}, {file = "yarl-1.9.4-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:03caa9507d3d3c83bca08650678e25364e1843b484f19986a527630ca376ecce"},
{file = "yarl-1.8.2-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:4d04acba75c72e6eb90745447d69f84e6c9056390f7a9724605ca9c56b4afcc6"}, {file = "yarl-1.9.4-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:4e9035df8d0880b2f1c7f5031f33f69e071dfe72ee9310cfc76f7b605958ceb9"},
{file = "yarl-1.8.2-cp311-cp311-musllinux_1_1_ppc64le.whl", hash = "sha256:e7fd20d6576c10306dea2d6a5765f46f0ac5d6f53436217913e952d19237efc4"}, {file = "yarl-1.9.4-cp311-cp311-musllinux_1_1_ppc64le.whl", hash = "sha256:c0ec0ed476f77db9fb29bca17f0a8fcc7bc97ad4c6c1d8959c507decb22e8572"},
{file = "yarl-1.8.2-cp311-cp311-musllinux_1_1_s390x.whl", hash = "sha256:75c16b2a900b3536dfc7014905a128a2bea8fb01f9ee26d2d7d8db0a08e7cb2c"}, {file = "yarl-1.9.4-cp311-cp311-musllinux_1_1_s390x.whl", hash = "sha256:ee04010f26d5102399bd17f8df8bc38dc7ccd7701dc77f4a68c5b8d733406958"},
{file = "yarl-1.8.2-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:6d88056a04860a98341a0cf53e950e3ac9f4e51d1b6f61a53b0609df342cc8b2"}, {file = "yarl-1.9.4-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:49a180c2e0743d5d6e0b4d1a9e5f633c62eca3f8a86ba5dd3c471060e352ca98"},
{file = "yarl-1.8.2-cp311-cp311-win32.whl", hash = "sha256:fb742dcdd5eec9f26b61224c23baea46c9055cf16f62475e11b9b15dfd5c117b"}, {file = "yarl-1.9.4-cp311-cp311-win32.whl", hash = "sha256:81eb57278deb6098a5b62e88ad8281b2ba09f2f1147c4767522353eaa6260b31"},
{file = "yarl-1.8.2-cp311-cp311-win_amd64.whl", hash = "sha256:8c46d3d89902c393a1d1e243ac847e0442d0196bbd81aecc94fcebbc2fd5857c"}, {file = "yarl-1.9.4-cp311-cp311-win_amd64.whl", hash = "sha256:d1d2532b340b692880261c15aee4dc94dd22ca5d61b9db9a8a361953d36410b1"},
{file = "yarl-1.8.2-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:ceff9722e0df2e0a9e8a79c610842004fa54e5b309fe6d218e47cd52f791d7ef"}, {file = "yarl-1.9.4-cp312-cp312-macosx_10_9_universal2.whl", hash = "sha256:0d2454f0aef65ea81037759be5ca9947539667eecebca092733b2eb43c965a81"},
{file = "yarl-1.8.2-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3f6b4aca43b602ba0f1459de647af954769919c4714706be36af670a5f44c9c1"}, {file = "yarl-1.9.4-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:44d8ffbb9c06e5a7f529f38f53eda23e50d1ed33c6c869e01481d3fafa6b8142"},
{file = "yarl-1.8.2-cp37-cp37m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1684a9bd9077e922300ecd48003ddae7a7474e0412bea38d4631443a91d61077"}, {file = "yarl-1.9.4-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:aaaea1e536f98754a6e5c56091baa1b6ce2f2700cc4a00b0d49eca8dea471074"},
{file = "yarl-1.8.2-cp37-cp37m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:ebb78745273e51b9832ef90c0898501006670d6e059f2cdb0e999494eb1450c2"}, {file = "yarl-1.9.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3777ce5536d17989c91696db1d459574e9a9bd37660ea7ee4d3344579bb6f129"},
{file = "yarl-1.8.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3adeef150d528ded2a8e734ebf9ae2e658f4c49bf413f5f157a470e17a4a2e89"}, {file = "yarl-1.9.4-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:9fc5fc1eeb029757349ad26bbc5880557389a03fa6ada41703db5e068881e5f2"},
{file = "yarl-1.8.2-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:57a7c87927a468e5a1dc60c17caf9597161d66457a34273ab1760219953f7f4c"}, {file = "yarl-1.9.4-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:ea65804b5dc88dacd4a40279af0cdadcfe74b3e5b4c897aa0d81cf86927fee78"},
{file = "yarl-1.8.2-cp37-cp37m-musllinux_1_1_aarch64.whl", hash = "sha256:efff27bd8cbe1f9bd127e7894942ccc20c857aa8b5a0327874f30201e5ce83d0"}, {file = "yarl-1.9.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:aa102d6d280a5455ad6a0f9e6d769989638718e938a6a0a2ff3f4a7ff8c62cc4"},
{file = "yarl-1.8.2-cp37-cp37m-musllinux_1_1_i686.whl", hash = "sha256:a783cd344113cb88c5ff7ca32f1f16532a6f2142185147822187913eb989f739"}, {file = "yarl-1.9.4-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:09efe4615ada057ba2d30df871d2f668af661e971dfeedf0c159927d48bbeff0"},
{file = "yarl-1.8.2-cp37-cp37m-musllinux_1_1_ppc64le.whl", hash = "sha256:705227dccbe96ab02c7cb2c43e1228e2826e7ead880bb19ec94ef279e9555b5b"}, {file = "yarl-1.9.4-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:008d3e808d03ef28542372d01057fd09168419cdc8f848efe2804f894ae03e51"},
{file = "yarl-1.8.2-cp37-cp37m-musllinux_1_1_s390x.whl", hash = "sha256:34c09b43bd538bf6c4b891ecce94b6fa4f1f10663a8d4ca589a079a5018f6ed7"}, {file = "yarl-1.9.4-cp312-cp312-musllinux_1_1_i686.whl", hash = "sha256:6f5cb257bc2ec58f437da2b37a8cd48f666db96d47b8a3115c29f316313654ff"},
{file = "yarl-1.8.2-cp37-cp37m-musllinux_1_1_x86_64.whl", hash = "sha256:a48f4f7fea9a51098b02209d90297ac324241bf37ff6be6d2b0149ab2bd51b37"}, {file = "yarl-1.9.4-cp312-cp312-musllinux_1_1_ppc64le.whl", hash = "sha256:992f18e0ea248ee03b5a6e8b3b4738850ae7dbb172cc41c966462801cbf62cf7"},
{file = "yarl-1.8.2-cp37-cp37m-win32.whl", hash = "sha256:0414fd91ce0b763d4eadb4456795b307a71524dbacd015c657bb2a39db2eab89"}, {file = "yarl-1.9.4-cp312-cp312-musllinux_1_1_s390x.whl", hash = "sha256:0e9d124c191d5b881060a9e5060627694c3bdd1fe24c5eecc8d5d7d0eb6faabc"},
{file = "yarl-1.8.2-cp37-cp37m-win_amd64.whl", hash = "sha256:d881d152ae0007809c2c02e22aa534e702f12071e6b285e90945aa3c376463c5"}, {file = "yarl-1.9.4-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:3986b6f41ad22988e53d5778f91855dc0399b043fc8946d4f2e68af22ee9ff10"},
{file = "yarl-1.8.2-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:5df5e3d04101c1e5c3b1d69710b0574171cc02fddc4b23d1b2813e75f35a30b1"}, {file = "yarl-1.9.4-cp312-cp312-win32.whl", hash = "sha256:4b21516d181cd77ebd06ce160ef8cc2a5e9ad35fb1c5930882baff5ac865eee7"},
{file = "yarl-1.8.2-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:7a66c506ec67eb3159eea5096acd05f5e788ceec7b96087d30c7d2865a243918"}, {file = "yarl-1.9.4-cp312-cp312-win_amd64.whl", hash = "sha256:a9bd00dc3bc395a662900f33f74feb3e757429e545d831eef5bb280252631984"},
{file = "yarl-1.8.2-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:2b4fa2606adf392051d990c3b3877d768771adc3faf2e117b9de7eb977741229"}, {file = "yarl-1.9.4-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:63b20738b5aac74e239622d2fe30df4fca4942a86e31bf47a81a0e94c14df94f"},
{file = "yarl-1.8.2-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1e21fb44e1eff06dd6ef971d4bdc611807d6bd3691223d9c01a18cec3677939e"}, {file = "yarl-1.9.4-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d7d7f7de27b8944f1fee2c26a88b4dabc2409d2fea7a9ed3df79b67277644e17"},
{file = "yarl-1.8.2-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:93202666046d9edadfe9f2e7bf5e0782ea0d497b6d63da322e541665d65a044e"}, {file = "yarl-1.9.4-cp37-cp37m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:c74018551e31269d56fab81a728f683667e7c28c04e807ba08f8c9e3bba32f14"},
{file = "yarl-1.8.2-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:fc77086ce244453e074e445104f0ecb27530d6fd3a46698e33f6c38951d5a0f1"}, {file = "yarl-1.9.4-cp37-cp37m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:ca06675212f94e7a610e85ca36948bb8fc023e458dd6c63ef71abfd482481aa5"},
{file = "yarl-1.8.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:64dd68a92cab699a233641f5929a40f02a4ede8c009068ca8aa1fe87b8c20ae3"}, {file = "yarl-1.9.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5aef935237d60a51a62b86249839b51345f47564208c6ee615ed2a40878dccdd"},
{file = "yarl-1.8.2-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:1b372aad2b5f81db66ee7ec085cbad72c4da660d994e8e590c997e9b01e44901"}, {file = "yarl-1.9.4-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:2b134fd795e2322b7684155b7855cc99409d10b2e408056db2b93b51a52accc7"},
{file = "yarl-1.8.2-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:e6f3515aafe0209dd17fb9bdd3b4e892963370b3de781f53e1746a521fb39fc0"}, {file = "yarl-1.9.4-cp37-cp37m-musllinux_1_1_aarch64.whl", hash = "sha256:d25039a474c4c72a5ad4b52495056f843a7ff07b632c1b92ea9043a3d9950f6e"},
{file = "yarl-1.8.2-cp38-cp38-musllinux_1_1_i686.whl", hash = "sha256:dfef7350ee369197106805e193d420b75467b6cceac646ea5ed3049fcc950a05"}, {file = "yarl-1.9.4-cp37-cp37m-musllinux_1_1_i686.whl", hash = "sha256:f7d6b36dd2e029b6bcb8a13cf19664c7b8e19ab3a58e0fefbb5b8461447ed5ec"},
{file = "yarl-1.8.2-cp38-cp38-musllinux_1_1_ppc64le.whl", hash = "sha256:728be34f70a190566d20aa13dc1f01dc44b6aa74580e10a3fb159691bc76909d"}, {file = "yarl-1.9.4-cp37-cp37m-musllinux_1_1_ppc64le.whl", hash = "sha256:957b4774373cf6f709359e5c8c4a0af9f6d7875db657adb0feaf8d6cb3c3964c"},
{file = "yarl-1.8.2-cp38-cp38-musllinux_1_1_s390x.whl", hash = "sha256:ff205b58dc2929191f68162633d5e10e8044398d7a45265f90a0f1d51f85f72c"}, {file = "yarl-1.9.4-cp37-cp37m-musllinux_1_1_s390x.whl", hash = "sha256:d7eeb6d22331e2fd42fce928a81c697c9ee2d51400bd1a28803965883e13cead"},
{file = "yarl-1.8.2-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:baf211dcad448a87a0d9047dc8282d7de59473ade7d7fdf22150b1d23859f946"}, {file = "yarl-1.9.4-cp37-cp37m-musllinux_1_1_x86_64.whl", hash = "sha256:6a962e04b8f91f8c4e5917e518d17958e3bdee71fd1d8b88cdce74dd0ebbf434"},
{file = "yarl-1.8.2-cp38-cp38-win32.whl", hash = "sha256:272b4f1599f1b621bf2aabe4e5b54f39a933971f4e7c9aa311d6d7dc06965165"}, {file = "yarl-1.9.4-cp37-cp37m-win32.whl", hash = "sha256:f3bc6af6e2b8f92eced34ef6a96ffb248e863af20ef4fde9448cc8c9b858b749"},
{file = "yarl-1.8.2-cp38-cp38-win_amd64.whl", hash = "sha256:326dd1d3caf910cd26a26ccbfb84c03b608ba32499b5d6eeb09252c920bcbe4f"}, {file = "yarl-1.9.4-cp37-cp37m-win_amd64.whl", hash = "sha256:ad4d7a90a92e528aadf4965d685c17dacff3df282db1121136c382dc0b6014d2"},
{file = "yarl-1.8.2-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:f8ca8ad414c85bbc50f49c0a106f951613dfa5f948ab69c10ce9b128d368baf8"}, {file = "yarl-1.9.4-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:ec61d826d80fc293ed46c9dd26995921e3a82146feacd952ef0757236fc137be"},
{file = "yarl-1.8.2-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:418857f837347e8aaef682679f41e36c24250097f9e2f315d39bae3a99a34cbf"}, {file = "yarl-1.9.4-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:8be9e837ea9113676e5754b43b940b50cce76d9ed7d2461df1af39a8ee674d9f"},
{file = "yarl-1.8.2-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:ae0eec05ab49e91a78700761777f284c2df119376e391db42c38ab46fd662b77"}, {file = "yarl-1.9.4-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:bef596fdaa8f26e3d66af846bbe77057237cb6e8efff8cd7cc8dff9a62278bbf"},
{file = "yarl-1.8.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:009a028127e0a1755c38b03244c0bea9d5565630db9c4cf9572496e947137a87"}, {file = "yarl-1.9.4-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2d47552b6e52c3319fede1b60b3de120fe83bde9b7bddad11a69fb0af7db32f1"},
{file = "yarl-1.8.2-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:3edac5d74bb3209c418805bda77f973117836e1de7c000e9755e572c1f7850d0"}, {file = "yarl-1.9.4-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:84fc30f71689d7fc9168b92788abc977dc8cefa806909565fc2951d02f6b7d57"},
{file = "yarl-1.8.2-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:da65c3f263729e47351261351b8679c6429151ef9649bba08ef2528ff2c423b2"}, {file = "yarl-1.9.4-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:4aa9741085f635934f3a2583e16fcf62ba835719a8b2b28fb2917bb0537c1dfa"},
{file = "yarl-1.8.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0ef8fb25e52663a1c85d608f6dd72e19bd390e2ecaf29c17fb08f730226e3a08"}, {file = "yarl-1.9.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:206a55215e6d05dbc6c98ce598a59e6fbd0c493e2de4ea6cc2f4934d5a18d130"},
{file = "yarl-1.8.2-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:bcd7bb1e5c45274af9a1dd7494d3c52b2be5e6bd8d7e49c612705fd45420b12d"}, {file = "yarl-1.9.4-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:07574b007ee20e5c375a8fe4a0789fad26db905f9813be0f9fef5a68080de559"},
{file = "yarl-1.8.2-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:44ceac0450e648de86da8e42674f9b7077d763ea80c8ceb9d1c3e41f0f0a9951"}, {file = "yarl-1.9.4-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:5a2e2433eb9344a163aced6a5f6c9222c0786e5a9e9cac2c89f0b28433f56e23"},
{file = "yarl-1.8.2-cp39-cp39-musllinux_1_1_i686.whl", hash = "sha256:97209cc91189b48e7cfe777237c04af8e7cc51eb369004e061809bcdf4e55220"}, {file = "yarl-1.9.4-cp38-cp38-musllinux_1_1_i686.whl", hash = "sha256:6ad6d10ed9b67a382b45f29ea028f92d25bc0bc1daf6c5b801b90b5aa70fb9ec"},
{file = "yarl-1.8.2-cp39-cp39-musllinux_1_1_ppc64le.whl", hash = "sha256:48dd18adcf98ea9cd721a25313aef49d70d413a999d7d89df44f469edfb38a06"}, {file = "yarl-1.9.4-cp38-cp38-musllinux_1_1_ppc64le.whl", hash = "sha256:6fe79f998a4052d79e1c30eeb7d6c1c1056ad33300f682465e1b4e9b5a188b78"},
{file = "yarl-1.8.2-cp39-cp39-musllinux_1_1_s390x.whl", hash = "sha256:e59399dda559688461762800d7fb34d9e8a6a7444fd76ec33220a926c8be1516"}, {file = "yarl-1.9.4-cp38-cp38-musllinux_1_1_s390x.whl", hash = "sha256:a825ec844298c791fd28ed14ed1bffc56a98d15b8c58a20e0e08c1f5f2bea1be"},
{file = "yarl-1.8.2-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:d617c241c8c3ad5c4e78a08429fa49e4b04bedfc507b34b4d8dceb83b4af3588"}, {file = "yarl-1.9.4-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:8619d6915b3b0b34420cf9b2bb6d81ef59d984cb0fde7544e9ece32b4b3043c3"},
{file = "yarl-1.8.2-cp39-cp39-win32.whl", hash = "sha256:cb6d48d80a41f68de41212f3dfd1a9d9898d7841c8f7ce6696cf2fd9cb57ef83"}, {file = "yarl-1.9.4-cp38-cp38-win32.whl", hash = "sha256:686a0c2f85f83463272ddffd4deb5e591c98aac1897d65e92319f729c320eece"},
{file = "yarl-1.8.2-cp39-cp39-win_amd64.whl", hash = "sha256:6604711362f2dbf7160df21c416f81fac0de6dbcf0b5445a2ef25478ecc4c778"}, {file = "yarl-1.9.4-cp38-cp38-win_amd64.whl", hash = "sha256:a00862fb23195b6b8322f7d781b0dc1d82cb3bcac346d1e38689370cc1cc398b"},
{file = "yarl-1.8.2.tar.gz", hash = "sha256:49d43402c6e3013ad0978602bf6bf5328535c48d192304b91b97a3c6790b1562"}, {file = "yarl-1.9.4-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:604f31d97fa493083ea21bd9b92c419012531c4e17ea6da0f65cacdcf5d0bd27"},
{file = "yarl-1.9.4-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:8a854227cf581330ffa2c4824d96e52ee621dd571078a252c25e3a3b3d94a1b1"},
{file = "yarl-1.9.4-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:ba6f52cbc7809cd8d74604cce9c14868306ae4aa0282016b641c661f981a6e91"},
{file = "yarl-1.9.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a6327976c7c2f4ee6816eff196e25385ccc02cb81427952414a64811037bbc8b"},
{file = "yarl-1.9.4-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:8397a3817d7dcdd14bb266283cd1d6fc7264a48c186b986f32e86d86d35fbac5"},
{file = "yarl-1.9.4-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:e0381b4ce23ff92f8170080c97678040fc5b08da85e9e292292aba67fdac6c34"},
{file = "yarl-1.9.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:23d32a2594cb5d565d358a92e151315d1b2268bc10f4610d098f96b147370136"},
{file = "yarl-1.9.4-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ddb2a5c08a4eaaba605340fdee8fc08e406c56617566d9643ad8bf6852778fc7"},
{file = "yarl-1.9.4-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:26a1dc6285e03f3cc9e839a2da83bcbf31dcb0d004c72d0730e755b33466c30e"},
{file = "yarl-1.9.4-cp39-cp39-musllinux_1_1_i686.whl", hash = "sha256:18580f672e44ce1238b82f7fb87d727c4a131f3a9d33a5e0e82b793362bf18b4"},
{file = "yarl-1.9.4-cp39-cp39-musllinux_1_1_ppc64le.whl", hash = "sha256:29e0f83f37610f173eb7e7b5562dd71467993495e568e708d99e9d1944f561ec"},
{file = "yarl-1.9.4-cp39-cp39-musllinux_1_1_s390x.whl", hash = "sha256:1f23e4fe1e8794f74b6027d7cf19dc25f8b63af1483d91d595d4a07eca1fb26c"},
{file = "yarl-1.9.4-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:db8e58b9d79200c76956cefd14d5c90af54416ff5353c5bfd7cbe58818e26ef0"},
{file = "yarl-1.9.4-cp39-cp39-win32.whl", hash = "sha256:c7224cab95645c7ab53791022ae77a4509472613e839dab722a72abe5a684575"},
{file = "yarl-1.9.4-cp39-cp39-win_amd64.whl", hash = "sha256:824d6c50492add5da9374875ce72db7a0733b29c2394890aef23d533106e2b15"},
{file = "yarl-1.9.4-py3-none-any.whl", hash = "sha256:928cecb0ef9d5a7946eb6ff58417ad2fe9375762382f1bf5c55e61645f2c43ad"},
{file = "yarl-1.9.4.tar.gz", hash = "sha256:566db86717cf8080b99b58b083b773a908ae40f06681e87e589a976faf8246bf"},
] ]
[package.dependencies] [package.dependencies]


@@ -87,6 +87,10 @@ impl AuthError {
pub fn too_many_connections() -> Self { pub fn too_many_connections() -> Self {
AuthErrorImpl::TooManyConnections.into() AuthErrorImpl::TooManyConnections.into()
} }
pub fn is_auth_failed(&self) -> bool {
matches!(self.0.as_ref(), AuthErrorImpl::AuthFailed(_))
}
} }
impl<E: Into<AuthErrorImpl>> From<E> for AuthError { impl<E: Into<AuthErrorImpl>> From<E> for AuthError {
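The `is_auth_failed` helper added in this hunk can be illustrated with a minimal, standalone sketch. The type shapes below are assumptions made for illustration: a boxed inner enum, a `String` payload on `AuthFailed`, a plain `From<AuthErrorImpl>` impl, and an `auth_failed` constructor. They are not the proxy's actual definitions.

```rust
/// Hypothetical inner error enum; the real `AuthErrorImpl` has more variants.
#[derive(Debug)]
enum AuthErrorImpl {
    AuthFailed(String),
    TooManyConnections,
}

/// Public wrapper, assumed here to box the inner enum.
#[derive(Debug)]
struct AuthError(Box<AuthErrorImpl>);

impl From<AuthErrorImpl> for AuthError {
    fn from(e: AuthErrorImpl) -> Self {
        AuthError(Box::new(e))
    }
}

impl AuthError {
    /// Hypothetical constructor, used only to build a value for the example.
    fn auth_failed(user: impl Into<String>) -> Self {
        AuthErrorImpl::AuthFailed(user.into()).into()
    }

    fn too_many_connections() -> Self {
        AuthErrorImpl::TooManyConnections.into()
    }

    /// True only for the `AuthFailed` variant, whatever its payload.
    fn is_auth_failed(&self) -> bool {
        matches!(self.0.as_ref(), AuthErrorImpl::AuthFailed(_))
    }
}
```

Exposing a predicate like this lets callers branch on the failure kind without the inner enum becoming part of the public surface.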


@@ -9,9 +9,9 @@ use tokio_postgres::config::AuthKeys;
use crate::auth::credentials::check_peer_addr_is_in_list; use crate::auth::credentials::check_peer_addr_is_in_list;
use crate::auth::validate_password_and_exchange; use crate::auth::validate_password_and_exchange;
use crate::console::errors::GetAuthInfoError; use crate::console::errors::GetAuthInfoError;
use crate::console::provider::AuthInfo;
use crate::console::AuthSecret; use crate::console::AuthSecret;
use crate::proxy::{handle_try_wake, retry_after, LatencyTimer}; use crate::proxy::connect_compute::handle_try_wake;
use crate::proxy::retry::retry_after;
use crate::scram; use crate::scram;
use crate::stream::Stream; use crate::stream::Stream;
use crate::{ use crate::{
@@ -22,6 +22,7 @@ use crate::{
provider::{CachedNodeInfo, ConsoleReqExtra}, provider::{CachedNodeInfo, ConsoleReqExtra},
Api, Api,
}, },
metrics::LatencyTimer,
stream, url, stream, url,
}; };
use futures::TryFutureExt; use futures::TryFutureExt;
@@ -185,24 +186,52 @@ async fn auth_quirks(
}; };
info!("fetching user's authentication info"); info!("fetching user's authentication info");
// TODO(anna): this will slow down both "hacks" below; we probably need a cache. let allowed_ips = api.get_allowed_ips(extra, &info).await?;
let AuthInfo {
secret,
allowed_ips,
} = api.get_auth_info(extra, &info).await?;
// check allowed list // check allowed list
if !check_peer_addr_is_in_list(&info.inner.peer_addr, &allowed_ips) { if !check_peer_addr_is_in_list(&info.inner.peer_addr, &allowed_ips) {
return Err(auth::AuthError::ip_address_not_allowed()); return Err(auth::AuthError::ip_address_not_allowed());
} }
let secret = secret.unwrap_or_else(|| { let cached_secret = api.get_role_secret(extra, &info).await?;
let secret = cached_secret.clone().unwrap_or_else(|| {
// If we don't have an authentication secret, we mock one to // If we don't have an authentication secret, we mock one to
// prevent malicious probing (possible due to missing protocol steps). // prevent malicious probing (possible due to missing protocol steps).
// This mocked secret will never lead to successful authentication. // This mocked secret will never lead to successful authentication.
info!("authentication info not found, mocking it"); info!("authentication info not found, mocking it");
AuthSecret::Scram(scram::ServerSecret::mock(&info.inner.user, rand::random())) AuthSecret::Scram(scram::ServerSecret::mock(&info.inner.user, rand::random()))
}); });
match authenticate_with_secret(
secret,
info,
client,
unauthenticated_password,
allow_cleartext,
config,
latency_timer,
)
.await
{
Ok(keys) => Ok(keys),
Err(e) => {
if e.is_auth_failed() {
// The password could have been changed, so we invalidate the cache.
cached_secret.invalidate();
}
Err(e)
}
}
}
async fn authenticate_with_secret(
secret: AuthSecret,
info: ComputeUserInfo,
client: &mut stream::PqStream<Stream<impl AsyncRead + AsyncWrite + Unpin>>,
unauthenticated_password: Option<Vec<u8>>,
allow_cleartext: bool,
config: &'static AuthenticationConfig,
latency_timer: &mut LatencyTimer,
) -> auth::Result<ComputeCredentials<ComputeCredentialKeys>> {
if let Some(password) = unauthenticated_password { if let Some(password) = unauthenticated_password {
let auth_outcome = validate_password_and_exchange(&password, secret)?; let auth_outcome = validate_password_and_exchange(&password, secret)?;
let keys = match auth_outcome { let keys = match auth_outcome {
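The control flow introduced in `auth_quirks` (authenticate against a cached secret, and drop the cache entry when authentication fails because the password may have changed) can be sketched in isolation. Everything below is a simplified assumption for illustration: `SecretCache`, `check_password`, and the plain-string secrets are hypothetical stand-ins, not the proxy's `RoleSecretCache` or its SCRAM exchange.

```rust
use std::collections::HashMap;

/// Hypothetical error type; only `AuthFailed` triggers invalidation.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum AuthError {
    AuthFailed,
    TooManyConnections,
}

/// Hypothetical in-memory cache of role secrets, keyed by role name.
struct SecretCache {
    entries: HashMap<String, String>,
}

impl SecretCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Return the cached secret, or call `fetch` once and cache its result.
    fn get_or_fetch(&mut self, role: &str, fetch: impl FnOnce() -> String) -> String {
        self.entries
            .entry(role.to_string())
            .or_insert_with(fetch)
            .clone()
    }

    fn invalidate(&mut self, role: &str) {
        self.entries.remove(role);
    }

    fn contains(&self, role: &str) -> bool {
        self.entries.contains_key(role)
    }
}

/// Stand-in for the real password validation.
fn check_password(secret: &str, password: &str) -> Result<(), AuthError> {
    if secret == password {
        Ok(())
    } else {
        Err(AuthError::AuthFailed)
    }
}

/// Authenticate with a possibly stale cached secret; on an auth failure,
/// invalidate the entry so the next attempt refetches it.
fn authenticate(
    cache: &mut SecretCache,
    role: &str,
    password: &str,
    fetch: impl FnOnce() -> String,
) -> Result<(), AuthError> {
    let secret = cache.get_or_fetch(role, fetch);
    let result = check_password(&secret, password);
    if matches!(result, Err(AuthError::AuthFailed)) {
        // The password could have been changed upstream.
        cache.invalidate(role);
    }
    result
}
```

In this sketch the first attempt after a password change still fails, since it was checked against the stale secret, and the following attempt refetches and succeeds.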


@@ -4,7 +4,7 @@ use crate::{
compute, compute,
config::AuthenticationConfig, config::AuthenticationConfig,
console::AuthSecret, console::AuthSecret,
proxy::LatencyTimer, metrics::LatencyTimer,
sasl, sasl,
stream::{PqStream, Stream}, stream::{PqStream, Stream},
}; };


@@ -4,7 +4,7 @@ use super::{
use crate::{ use crate::{
auth::{self, AuthFlow}, auth::{self, AuthFlow},
console::AuthSecret, console::AuthSecret,
proxy::LatencyTimer, metrics::LatencyTimer,
sasl, sasl,
stream::{self, Stream}, stream::{self, Stream},
}; };


@@ -1,9 +1,8 @@
//! User credentials used in authentication. //! User credentials used in authentication.
use crate::{ use crate::{
auth::password_hack::parse_endpoint_param, auth::password_hack::parse_endpoint_param, error::UserFacingError,
error::UserFacingError, metrics::NUM_CONNECTION_ACCEPTED_BY_SNI, proxy::neon_options_str,
proxy::{neon_options_str, NUM_CONNECTION_ACCEPTED_BY_SNI},
}; };
use itertools::Itertools; use itertools::Itertools;
use pq_proto::StartupMessageParams; use pq_proto::StartupMessageParams;


@@ -6,6 +6,7 @@ use proxy::config::HttpConfig;
use proxy::console; use proxy::console;
use proxy::console::provider::AllowedIpsCache; use proxy::console::provider::AllowedIpsCache;
use proxy::console::provider::NodeInfoCache; use proxy::console::provider::NodeInfoCache;
use proxy::console::provider::RoleSecretCache;
use proxy::http; use proxy::http;
use proxy::rate_limiter::EndpointRateLimiter; use proxy::rate_limiter::EndpointRateLimiter;
use proxy::rate_limiter::RateBucketInfo; use proxy::rate_limiter::RateBucketInfo;
@@ -86,7 +87,7 @@ struct ProxyCliArgs {
#[clap(long)] #[clap(long)]
metric_collection_interval: Option<String>, metric_collection_interval: Option<String>,
/// cache for `wake_compute` api method (use `size=0` to disable) /// cache for `wake_compute` api method (use `size=0` to disable)
#[clap(long, default_value = config::CacheOptions::DEFAULT_OPTIONS_NODE_INFO)] #[clap(long, default_value = config::CacheOptions::CACHE_DEFAULT_OPTIONS)]
wake_compute_cache: String, wake_compute_cache: String,
/// lock for `wake_compute` api method. example: "shards=32,permits=4,epoch=10m,timeout=1s". (use `permits=0` to disable). /// lock for `wake_compute` api method. example: "shards=32,permits=4,epoch=10m,timeout=1s". (use `permits=0` to disable).
#[clap(long, default_value = config::WakeComputeLockOptions::DEFAULT_OPTIONS_WAKE_COMPUTE_LOCK)] #[clap(long, default_value = config::WakeComputeLockOptions::DEFAULT_OPTIONS_WAKE_COMPUTE_LOCK)]
@@ -127,8 +128,11 @@ struct ProxyCliArgs {
#[clap(flatten)] #[clap(flatten)]
aimd_config: proxy::rate_limiter::AimdConfig, aimd_config: proxy::rate_limiter::AimdConfig,
/// cache for `allowed_ips` (use `size=0` to disable) /// cache for `allowed_ips` (use `size=0` to disable)
#[clap(long, default_value = config::CacheOptions::DEFAULT_OPTIONS_NODE_INFO)] #[clap(long, default_value = config::CacheOptions::CACHE_DEFAULT_OPTIONS)]
allowed_ips_cache: String, allowed_ips_cache: String,
/// cache for `role_secret` (use `size=0` to disable)
#[clap(long, default_value = config::CacheOptions::CACHE_DEFAULT_OPTIONS)]
role_secret_cache: String,
/// disable ip check for http requests. If it is too time consuming, it could be turned off. /// disable ip check for http requests. If it is too time consuming, it could be turned off.
#[clap(long, default_value_t = false, value_parser = clap::builder::BoolishValueParser::new(), action = clap::ArgAction::Set)] #[clap(long, default_value_t = false, value_parser = clap::builder::BoolishValueParser::new(), action = clap::ArgAction::Set)]
disable_ip_check_for_http: bool, disable_ip_check_for_http: bool,
@@ -266,9 +270,11 @@ fn build_config(args: &ProxyCliArgs) -> anyhow::Result<&'static ProxyConfig> {
AuthBackend::Console => { AuthBackend::Console => {
let wake_compute_cache_config: CacheOptions = args.wake_compute_cache.parse()?; let wake_compute_cache_config: CacheOptions = args.wake_compute_cache.parse()?;
let allowed_ips_cache_config: CacheOptions = args.allowed_ips_cache.parse()?; let allowed_ips_cache_config: CacheOptions = args.allowed_ips_cache.parse()?;
let role_secret_cache_config: CacheOptions = args.role_secret_cache.parse()?;
info!("Using NodeInfoCache (wake_compute) with options={wake_compute_cache_config:?}"); info!("Using NodeInfoCache (wake_compute) with options={wake_compute_cache_config:?}");
info!("Using AllowedIpsCache (wake_compute) with options={allowed_ips_cache_config:?}"); info!("Using AllowedIpsCache (wake_compute) with options={allowed_ips_cache_config:?}");
info!("Using RoleSecretCache (wake_compute) with options={role_secret_cache_config:?}");
let caches = Box::leak(Box::new(console::caches::ApiCaches { let caches = Box::leak(Box::new(console::caches::ApiCaches {
node_info: NodeInfoCache::new( node_info: NodeInfoCache::new(
"node_info_cache", "node_info_cache",
@@ -282,6 +288,12 @@ fn build_config(args: &ProxyCliArgs) -> anyhow::Result<&'static ProxyConfig> {
allowed_ips_cache_config.ttl, allowed_ips_cache_config.ttl,
false, false,
), ),
role_secret: RoleSecretCache::new(
"role_secret_cache",
role_secret_cache_config.size,
role_secret_cache_config.ttl,
false,
),
})); }));
let config::WakeComputeLockOptions { let config::WakeComputeLockOptions {

@@ -1,9 +1,6 @@
use crate::{ use crate::{
auth::parse_endpoint_param, auth::parse_endpoint_param, cancellation::CancelClosure, console::errors::WakeComputeError,
cancellation::CancelClosure, error::UserFacingError, metrics::NUM_DB_CONNECTIONS_GAUGE, proxy::neon_option,
console::errors::WakeComputeError,
error::UserFacingError,
proxy::{neon_option, NUM_DB_CONNECTIONS_GAUGE},
}; };
use futures::{FutureExt, TryFutureExt}; use futures::{FutureExt, TryFutureExt};
use itertools::Itertools; use itertools::Itertools;

@@ -310,10 +310,10 @@ pub struct CacheOptions {
impl CacheOptions { impl CacheOptions {
/// Default options for [`crate::console::provider::NodeInfoCache`]. /// Default options for [`crate::console::provider::NodeInfoCache`].
pub const DEFAULT_OPTIONS_NODE_INFO: &'static str = "size=4000,ttl=4m"; pub const CACHE_DEFAULT_OPTIONS: &'static str = "size=4000,ttl=4m";
/// Parse cache options passed via cmdline. /// Parse cache options passed via cmdline.
/// Example: [`Self::DEFAULT_OPTIONS_NODE_INFO`]. /// Example: [`Self::CACHE_DEFAULT_OPTIONS`].
fn parse(options: &str) -> anyhow::Result<Self> { fn parse(options: &str) -> anyhow::Result<Self> {
let mut size = None; let mut size = None;
let mut ttl = None; let mut ttl = None;
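
The option string `size=4000,ttl=4m` is a comma-separated list of `key=value` pairs. The following is a minimal sketch of that parsing, not the proxy's actual implementation: `parse_cache_options` and `parse_ttl` are hypothetical names, the error type is a plain `String` rather than `anyhow`, and only the `s`, `m` and `h` duration suffixes are assumed to be supported.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the proxy's `CacheOptions`.
#[derive(Debug, PartialEq)]
struct CacheOptions {
    size: usize,
    ttl: Duration,
}

/// Parse a duration such as `30s`, `4m` or `1h`.
/// Assumption: only these three suffixes are handled in this sketch.
fn parse_ttl(s: &str) -> Result<Duration, String> {
    let (digits, secs_per_unit) = if let Some(d) = s.strip_suffix('s') {
        (d, 1)
    } else if let Some(d) = s.strip_suffix('m') {
        (d, 60)
    } else if let Some(d) = s.strip_suffix('h') {
        (d, 3600)
    } else {
        return Err(format!("unsupported ttl: {s}"));
    };
    let n: u64 = digits.parse().map_err(|e| format!("bad ttl `{s}`: {e}"))?;
    Ok(Duration::from_secs(n.saturating_mul(secs_per_unit)))
}

/// Parse `size=<entries>,ttl=<duration>`; both keys are required.
fn parse_cache_options(options: &str) -> Result<CacheOptions, String> {
    let mut size = None;
    let mut ttl = None;
    for option in options.split(',') {
        let (key, value) = option
            .split_once('=')
            .ok_or_else(|| format!("bad key-value pair: {option}"))?;
        match key {
            "size" => size = Some(value.parse::<usize>().map_err(|e| e.to_string())?),
            "ttl" => ttl = Some(parse_ttl(value)?),
            unknown => return Err(format!("unknown key: {unknown}")),
        }
    }
    Ok(CacheOptions {
        size: size.ok_or("missing `size`")?,
        ttl: ttl.ok_or("missing `ttl`")?,
    })
}
```

Calling `parse_cache_options("size=4000,ttl=4m")` yields a size of 4000 entries and a TTL of 240 seconds; a missing key or an unknown suffix is reported as an error.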

@@ -10,6 +10,7 @@ use crate::{
}; };
use async_trait::async_trait; use async_trait::async_trait;
use dashmap::DashMap; use dashmap::DashMap;
use smol_str::SmolStr;
use std::{sync::Arc, time::Duration}; use std::{sync::Arc, time::Duration};
use tokio::{ use tokio::{
sync::{OwnedSemaphorePermit, Semaphore}, sync::{OwnedSemaphorePermit, Semaphore},
@@ -21,7 +22,7 @@ pub mod errors {
use crate::{ use crate::{
error::{io_error, UserFacingError}, error::{io_error, UserFacingError},
http, http,
proxy::ShouldRetry, proxy::retry::ShouldRetry,
}; };
use thiserror::Error; use thiserror::Error;
@@ -216,6 +217,7 @@ impl ConsoleReqExtra {
} }
/// Auth secret which is managed by the cloud. /// Auth secret which is managed by the cloud.
#[derive(Clone)]
pub enum AuthSecret { pub enum AuthSecret {
#[cfg(feature = "testing")] #[cfg(feature = "testing")]
/// Md5 hash of user's password. /// Md5 hash of user's password.
@@ -250,18 +252,20 @@ pub struct NodeInfo {
pub type NodeInfoCache = TimedLru<Arc<str>, NodeInfo>; pub type NodeInfoCache = TimedLru<Arc<str>, NodeInfo>;
pub type CachedNodeInfo = timed_lru::Cached<&'static NodeInfoCache>; pub type CachedNodeInfo = timed_lru::Cached<&'static NodeInfoCache>;
pub type AllowedIpsCache = TimedLru<Arc<str>, Arc<Vec<String>>>; pub type AllowedIpsCache = TimedLru<SmolStr, Arc<Vec<String>>>;
pub type RoleSecretCache = TimedLru<(SmolStr, SmolStr), Option<AuthSecret>>;
pub type CachedRoleSecret = timed_lru::Cached<&'static RoleSecretCache>;
/// This will allocate per each call, but the http requests alone /// This will allocate per each call, but the http requests alone
/// already require a few allocations, so it should be fine. /// already require a few allocations, so it should be fine.
#[async_trait] #[async_trait]
pub trait Api { pub trait Api {
/// Get the client's auth secret for authentication. /// Get the client's auth secret for authentication.
async fn get_auth_info( async fn get_role_secret(
&self, &self,
extra: &ConsoleReqExtra, extra: &ConsoleReqExtra,
creds: &ComputeUserInfo, creds: &ComputeUserInfo,
) -> Result<AuthInfo, errors::GetAuthInfoError>; ) -> Result<CachedRoleSecret, errors::GetAuthInfoError>;
async fn get_allowed_ips( async fn get_allowed_ips(
&self, &self,
@@ -282,7 +286,9 @@ pub struct ApiCaches {
/// Cache for the `wake_compute` API method. /// Cache for the `wake_compute` API method.
pub node_info: NodeInfoCache, pub node_info: NodeInfoCache,
/// Cache for the `get_allowed_ips`. TODO(anna): use notifications listener instead. /// Cache for the `get_allowed_ips`. TODO(anna): use notifications listener instead.
pub allowed_ips: TimedLru<Arc<str>, Arc<Vec<String>>>, pub allowed_ips: AllowedIpsCache,
/// Cache for the `get_role_secret`. TODO(anna): use notifications listener instead.
pub role_secret: RoleSecretCache,
} }
/// Various caches for [`console`](super). /// Various caches for [`console`](super).

@@ -6,6 +6,7 @@ use super::{
errors::{ApiError, GetAuthInfoError, WakeComputeError}, errors::{ApiError, GetAuthInfoError, WakeComputeError},
AuthInfo, AuthSecret, CachedNodeInfo, ConsoleReqExtra, NodeInfo, AuthInfo, AuthSecret, CachedNodeInfo, ConsoleReqExtra, NodeInfo,
}; };
use crate::console::provider::CachedRoleSecret;
use crate::{auth::backend::ComputeUserInfo, compute, error::io_error, scram, url::ApiUrl}; use crate::{auth::backend::ComputeUserInfo, compute, error::io_error, scram, url::ApiUrl};
use async_trait::async_trait; use async_trait::async_trait;
use futures::TryFutureExt; use futures::TryFutureExt;
@@ -142,12 +143,14 @@ async fn get_execute_postgres_query(
#[async_trait] #[async_trait]
impl super::Api for Api { impl super::Api for Api {
#[tracing::instrument(skip_all)] #[tracing::instrument(skip_all)]
async fn get_auth_info( async fn get_role_secret(
&self, &self,
_extra: &ConsoleReqExtra, _extra: &ConsoleReqExtra,
creds: &ComputeUserInfo, creds: &ComputeUserInfo,
) -> Result<AuthInfo, GetAuthInfoError> { ) -> Result<CachedRoleSecret, GetAuthInfoError> {
self.do_get_auth_info(creds).await Ok(CachedRoleSecret::new_uncached(
self.do_get_auth_info(creds).await?.secret,
))
} }
async fn get_allowed_ips( async fn get_allowed_ips(

@@ -3,9 +3,10 @@
use super::{ use super::{
super::messages::{ConsoleError, GetRoleSecret, WakeCompute}, super::messages::{ConsoleError, GetRoleSecret, WakeCompute},
errors::{ApiError, GetAuthInfoError, WakeComputeError}, errors::{ApiError, GetAuthInfoError, WakeComputeError},
ApiCaches, ApiLocks, AuthInfo, AuthSecret, CachedNodeInfo, ConsoleReqExtra, NodeInfo, ApiCaches, ApiLocks, AuthInfo, AuthSecret, CachedNodeInfo, CachedRoleSecret, ConsoleReqExtra,
NodeInfo,
}; };
use crate::proxy::{ALLOWED_IPS_BY_CACHE_OUTCOME, ALLOWED_IPS_NUMBER}; use crate::metrics::{ALLOWED_IPS_BY_CACHE_OUTCOME, ALLOWED_IPS_NUMBER};
use crate::{auth::backend::ComputeUserInfo, compute, http, scram}; use crate::{auth::backend::ComputeUserInfo, compute, http, scram};
use async_trait::async_trait; use async_trait::async_trait;
use futures::TryFutureExt; use futures::TryFutureExt;
@@ -159,12 +160,25 @@ impl Api {
#[async_trait] #[async_trait]
impl super::Api for Api { impl super::Api for Api {
#[tracing::instrument(skip_all)] #[tracing::instrument(skip_all)]
async fn get_auth_info( async fn get_role_secret(
&self, &self,
extra: &ConsoleReqExtra, extra: &ConsoleReqExtra,
creds: &ComputeUserInfo, creds: &ComputeUserInfo,
) -> Result<AuthInfo, GetAuthInfoError> { ) -> Result<CachedRoleSecret, GetAuthInfoError> {
self.do_get_auth_info(extra, creds).await let ep = creds.endpoint.clone();
let user = creds.inner.user.clone();
if let Some(role_secret) = self.caches.role_secret.get(&(ep.clone(), user.clone())) {
return Ok(role_secret);
}
let auth_info = self.do_get_auth_info(extra, creds).await?;
let (_, secret) = self
.caches
.role_secret
.insert((ep.clone(), user), auth_info.secret.clone());
self.caches
.allowed_ips
.insert(ep, Arc::new(auth_info.allowed_ips));
Ok(secret)
} }
async fn get_allowed_ips( async fn get_allowed_ips(
@@ -172,8 +186,7 @@ impl super::Api for Api {
extra: &ConsoleReqExtra, extra: &ConsoleReqExtra,
creds: &ComputeUserInfo, creds: &ComputeUserInfo,
) -> Result<Arc<Vec<String>>, GetAuthInfoError> { ) -> Result<Arc<Vec<String>>, GetAuthInfoError> {
let key: &str = &creds.endpoint; if let Some(allowed_ips) = self.caches.allowed_ips.get(&creds.endpoint) {
if let Some(allowed_ips) = self.caches.allowed_ips.get(key) {
ALLOWED_IPS_BY_CACHE_OUTCOME ALLOWED_IPS_BY_CACHE_OUTCOME
.with_label_values(&["hit"]) .with_label_values(&["hit"])
.inc(); .inc();
@@ -182,10 +195,14 @@ impl super::Api for Api {
ALLOWED_IPS_BY_CACHE_OUTCOME ALLOWED_IPS_BY_CACHE_OUTCOME
.with_label_values(&["miss"]) .with_label_values(&["miss"])
.inc(); .inc();
let allowed_ips = Arc::new(self.do_get_auth_info(extra, creds).await?.allowed_ips); let auth_info = self.do_get_auth_info(extra, creds).await?;
let allowed_ips = Arc::new(auth_info.allowed_ips);
let ep = creds.endpoint.clone();
let user = creds.inner.user.clone();
self.caches self.caches
.allowed_ips .role_secret
.insert(key.into(), allowed_ips.clone()); .insert((ep.clone(), user), auth_info.secret);
self.caches.allowed_ips.insert(ep, allowed_ips.clone());
Ok(allowed_ips) Ok(allowed_ips)
} }
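
Both methods in this hunk fill both caches from a single `do_get_auth_info` response, so a miss on either path warms the other. A minimal sketch of that idea follows, under loud assumptions: `TtlCache` is a hypothetical stand-in for `TimedLru` (no size bound, no LRU eviction), `fetch_auth_info` fakes the control-plane request, and plain `String`s replace `SmolStr` and `AuthSecret`.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

/// Hypothetical TTL-only cache; the real `TimedLru` also bounds its size.
struct TtlCache<K, V> {
    ttl: Duration,
    entries: HashMap<K, (Instant, V)>,
}

impl<K: Hash + Eq, V: Clone> TtlCache<K, V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return a clone of the value if it exists and has not expired.
    fn get(&self, key: &K) -> Option<V> {
        let (inserted, value) = self.entries.get(key)?;
        (inserted.elapsed() < self.ttl).then(|| value.clone())
    }

    fn insert(&mut self, key: K, value: V) {
        self.entries.insert(key, (Instant::now(), value));
    }
}

/// Hypothetical shape of one control-plane response.
struct AuthInfo {
    secret: Option<String>,
    allowed_ips: Vec<String>,
}

struct Api {
    role_secret: TtlCache<(String, String), Option<String>>,
    allowed_ips: TtlCache<String, Vec<String>>,
    /// Counts simulated control-plane requests.
    fetches: u32,
}

impl Api {
    fn new(ttl: Duration) -> Self {
        Self { role_secret: TtlCache::new(ttl), allowed_ips: TtlCache::new(ttl), fetches: 0 }
    }

    /// Stand-in for the HTTP request; returns canned data.
    fn fetch_auth_info(&mut self, _endpoint: &str, role: &str) -> AuthInfo {
        self.fetches += 1;
        AuthInfo {
            secret: Some(format!("secret-for-{role}")),
            allowed_ips: vec!["127.0.0.1".to_string()],
        }
    }

    /// The secret is keyed by (endpoint, role); a miss also fills `allowed_ips`.
    fn get_role_secret(&mut self, endpoint: &str, role: &str) -> Option<String> {
        let key = (endpoint.to_string(), role.to_string());
        if let Some(secret) = self.role_secret.get(&key) {
            return secret;
        }
        let info = self.fetch_auth_info(endpoint, role);
        let secret = info.secret.clone();
        self.role_secret.insert(key, info.secret);
        self.allowed_ips.insert(endpoint.to_string(), info.allowed_ips);
        secret
    }

    /// Allowed IPs are keyed by endpoint only; a miss also fills `role_secret`.
    fn get_allowed_ips(&mut self, endpoint: &str, role: &str) -> Vec<String> {
        if let Some(ips) = self.allowed_ips.get(&endpoint.to_string()) {
            return ips;
        }
        let info = self.fetch_auth_info(endpoint, role);
        self.role_secret
            .insert((endpoint.to_string(), role.to_string()), info.secret);
        self.allowed_ips
            .insert(endpoint.to_string(), info.allowed_ips.clone());
        info.allowed_ips
    }
}
```

In this sketch, calling `get_allowed_ips("ep-1", "alice")` and then `get_role_secret("ep-1", "alice")` performs one fetch in total, while a different role on the same endpoint triggers a second fetch because the secret key includes the role.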

@@ -13,7 +13,7 @@ pub use reqwest_retry::{policies::ExponentialBackoff, RetryTransientMiddleware};
use tokio::time::Instant; use tokio::time::Instant;
use tracing::trace; use tracing::trace;
use crate::{proxy::CONSOLE_REQUEST_LATENCY, rate_limiter, url::ApiUrl}; use crate::{metrics::CONSOLE_REQUEST_LATENCY, rate_limiter, url::ApiUrl};
use reqwest_middleware::RequestBuilder; use reqwest_middleware::RequestBuilder;
/// This is the preferred way to create new http clients, /// This is the preferred way to create new http clients,

@@ -16,6 +16,7 @@ pub mod console;
pub mod error; pub mod error;
pub mod http; pub mod http;
pub mod logging; pub mod logging;
pub mod metrics;
pub mod parse; pub mod parse;
pub mod protocol2; pub mod protocol2;
pub mod proxy; pub mod proxy;

proxy/src/metrics.rs (new file, 232 lines)

@@ -0,0 +1,232 @@
use ::metrics::{
exponential_buckets, register_int_counter_pair_vec, register_int_counter_vec,
IntCounterPairVec, IntCounterVec,
};
use prometheus::{
register_histogram, register_histogram_vec, register_int_gauge_vec, Histogram, HistogramVec,
IntGaugeVec,
};
use once_cell::sync::Lazy;
use tokio::time;
pub static NUM_DB_CONNECTIONS_GAUGE: Lazy<IntCounterPairVec> = Lazy::new(|| {
register_int_counter_pair_vec!(
"proxy_opened_db_connections_total",
"Number of opened connections to a database.",
"proxy_closed_db_connections_total",
"Number of closed connections to a database.",
&["protocol"],
)
.unwrap()
});
pub static NUM_CLIENT_CONNECTION_GAUGE: Lazy<IntCounterPairVec> = Lazy::new(|| {
register_int_counter_pair_vec!(
"proxy_opened_client_connections_total",
"Number of opened connections from a client.",
"proxy_closed_client_connections_total",
"Number of closed connections from a client.",
&["protocol"],
)
.unwrap()
});
pub static NUM_CONNECTION_REQUESTS_GAUGE: Lazy<IntCounterPairVec> = Lazy::new(|| {
register_int_counter_pair_vec!(
"proxy_accepted_connections_total",
"Number of client connections accepted.",
"proxy_closed_connections_total",
"Number of client connections closed.",
&["protocol"],
)
.unwrap()
});
pub static COMPUTE_CONNECTION_LATENCY: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"proxy_compute_connection_latency_seconds",
"Time it took for proxy to establish a connection to the compute endpoint",
// http/ws/tcp, true/false, true/false, success/failure
// 3 * 2 * 2 * 2 = 24 counters
&["protocol", "cache_miss", "pool_miss", "outcome"],
// 16 buckets; largest finite bound = 2^15 * 0.5ms ≈ 16.4s
exponential_buckets(0.0005, 2.0, 16).unwrap(),
)
.unwrap()
});
pub static CONSOLE_REQUEST_LATENCY: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"proxy_console_request_latency",
"Time it took for proxy to complete a request to the console",
// proxy_wake_compute/proxy_get_role_info
&["request"],
// 16 buckets; largest finite bound = 2^15 * 0.2ms ≈ 6.6s
exponential_buckets(0.0002, 2.0, 16).unwrap(),
)
.unwrap()
});
pub static ALLOWED_IPS_BY_CACHE_OUTCOME: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_allowed_ips_cache_misses",
"Number of cache hits/misses for allowed ips",
// hit/miss
&["outcome"],
)
.unwrap()
});
pub static RATE_LIMITER_ACQUIRE_LATENCY: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"proxy_control_plane_token_acquire_seconds",
"Time it took for proxy to acquire a control plane rate limiter token",
// 16 buckets; largest finite bound = 3^15 * 0.05ms ≈ 717s
exponential_buckets(0.00005, 3.0, 16).unwrap(),
)
.unwrap()
});
pub static RATE_LIMITER_LIMIT: Lazy<IntGaugeVec> = Lazy::new(|| {
register_int_gauge_vec!(
"semaphore_control_plane_limit",
"Current limit of the semaphore control plane",
&["limit"], // 2 counters
)
.unwrap()
});
pub static NUM_CONNECTION_ACCEPTED_BY_SNI: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_accepted_connections_by_sni",
"Number of connections (per sni).",
&["kind"],
)
.unwrap()
});
pub static ALLOWED_IPS_NUMBER: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"proxy_allowed_ips_number",
"Number of allowed ips",
vec![0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 50.0, 100.0],
)
.unwrap()
});
pub struct LatencyTimer {
// time since the stopwatch was started
start: Option<time::Instant>,
// accumulated time on the stopwatch
accumulated: std::time::Duration,
// label data
protocol: &'static str,
cache_miss: bool,
pool_miss: bool,
outcome: &'static str,
}
pub struct LatencyTimerPause<'a> {
timer: &'a mut LatencyTimer,
}
impl LatencyTimer {
pub fn new(protocol: &'static str) -> Self {
Self {
start: Some(time::Instant::now()),
accumulated: std::time::Duration::ZERO,
protocol,
cache_miss: false,
// by default we don't do pooling
pool_miss: true,
// assume failed unless otherwise specified
outcome: "failed",
}
}
pub fn pause(&mut self) -> LatencyTimerPause<'_> {
// stop the stopwatch and record the time that we have accumulated
let start = self.start.take().expect("latency timer should be started");
self.accumulated += start.elapsed();
LatencyTimerPause { timer: self }
}
pub fn cache_miss(&mut self) {
self.cache_miss = true;
}
pub fn pool_hit(&mut self) {
self.pool_miss = false;
}
pub fn success(mut self) {
self.outcome = "success";
}
}
impl Drop for LatencyTimerPause<'_> {
fn drop(&mut self) {
// start the stopwatch again
self.timer.start = Some(time::Instant::now());
}
}
impl Drop for LatencyTimer {
fn drop(&mut self) {
let duration =
self.start.map(|start| start.elapsed()).unwrap_or_default() + self.accumulated;
COMPUTE_CONNECTION_LATENCY
.with_label_values(&[
self.protocol,
bool_to_str(self.cache_miss),
bool_to_str(self.pool_miss),
self.outcome,
])
.observe(duration.as_secs_f64())
}
}
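
The pause/resume behaviour of `LatencyTimer` rests on a guard object: `pause` banks the elapsed time and clears `start`, and dropping the guard restarts the clock. A stripped-down sketch of just that pattern, with hypothetical names (`Stopwatch`, `PauseGuard`) and without the metric labels or the `expect` on a double pause:

```rust
use std::time::{Duration, Instant};

/// Hypothetical reduction of `LatencyTimer` to its stopwatch part.
struct Stopwatch {
    /// `Some` while running, `None` while paused.
    start: Option<Instant>,
    /// Time banked by earlier running periods.
    accumulated: Duration,
}

/// While this guard is alive the stopwatch is paused.
struct PauseGuard<'a> {
    timer: &'a mut Stopwatch,
}

impl Stopwatch {
    fn new() -> Self {
        Self { start: Some(Instant::now()), accumulated: Duration::ZERO }
    }

    /// Bank the time of the current running period and stop the clock.
    fn pause(&mut self) -> PauseGuard<'_> {
        if let Some(start) = self.start.take() {
            self.accumulated += start.elapsed();
        }
        PauseGuard { timer: self }
    }

    fn is_running(&self) -> bool {
        self.start.is_some()
    }

    /// Banked time plus the current running period, if any.
    fn elapsed(&self) -> Duration {
        self.start.map(|s| s.elapsed()).unwrap_or_default() + self.accumulated
    }
}

impl Drop for PauseGuard<'_> {
    fn drop(&mut self) {
        // Resume: begin a fresh running period.
        self.timer.start = Some(Instant::now());
    }
}
```

Because the guard holds `&mut Stopwatch`, the borrow checker prevents the timer from being touched directly while it is paused, and the resume cannot be forgotten on early returns.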
pub static NUM_CONNECTION_FAILURES: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_connection_failures_total",
"Number of connection failures (per kind).",
&["kind"],
)
.unwrap()
});
pub static NUM_WAKEUP_FAILURES: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_connection_failures_breakdown",
"Number of wake-up failures (per kind).",
&["retry", "kind"],
)
.unwrap()
});
pub static NUM_BYTES_PROXIED_PER_CLIENT_COUNTER: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_io_bytes_per_client",
"Number of bytes sent/received between client and backend.",
crate::console::messages::MetricsAuxInfo::TRAFFIC_LABELS,
)
.unwrap()
});
pub static NUM_BYTES_PROXIED_COUNTER: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_io_bytes",
"Number of bytes sent/received between all clients and backends.",
&["direction"],
)
.unwrap()
});
pub const fn bool_to_str(x: bool) -> &'static str {
if x {
"true"
} else {
"false"
}
}
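
The histogram ranges in this file can be checked by hand. `exponential_buckets(start, factor, count)` from the `prometheus` crate returns `count` upper bounds, the first being `start` and each later one `factor` times the previous, so the largest finite bound is `start * factor^(count - 1)`. The sketch below re-implements that arithmetic locally so it runs without the crate; `exponential_bounds` and `largest_bound` are hypothetical helper names.

```rust
/// Local re-implementation of the bucket arithmetic:
/// `count` bounds, starting at `start`, each `factor` times the previous one.
fn exponential_bounds(start: f64, factor: f64, count: usize) -> Vec<f64> {
    let mut bounds = Vec::with_capacity(count);
    let mut next = start;
    for _ in 0..count {
        bounds.push(next);
        next *= factor;
    }
    bounds
}

/// Largest finite upper bound, i.e. `start * factor^(count - 1)`.
fn largest_bound(start: f64, factor: f64, count: usize) -> Option<f64> {
    exponential_bounds(start, factor, count).last().copied()
}
```

For the three parameter sets used above, this gives roughly 16.4 s for `(0.0005, 2.0, 16)`, 6.55 s for `(0.0002, 2.0, 16)` and 717 s for `(0.00005, 3.0, 16)`. Observations above the largest bound land in the implicit `+Inf` bucket.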

@@ -1,265 +1,41 @@
#[cfg(test)] #[cfg(test)]
mod tests; mod tests;
pub mod connect_compute;
pub mod retry;
use crate::{ use crate::{
auth, auth,
cancellation::{self, CancelMap}, cancellation::{self, CancelMap},
compute::{self, PostgresConnection}, compute,
config::{AuthenticationConfig, ProxyConfig, TlsConfig}, config::{AuthenticationConfig, ProxyConfig, TlsConfig},
console::{self, errors::WakeComputeError, messages::MetricsAuxInfo, Api}, console::{self, messages::MetricsAuxInfo},
http::StatusCode, metrics::{
LatencyTimer, NUM_BYTES_PROXIED_COUNTER, NUM_BYTES_PROXIED_PER_CLIENT_COUNTER,
NUM_CLIENT_CONNECTION_GAUGE, NUM_CONNECTION_REQUESTS_GAUGE,
},
protocol2::WithClientIp, protocol2::WithClientIp,
rate_limiter::EndpointRateLimiter, rate_limiter::EndpointRateLimiter,
stream::{PqStream, Stream}, stream::{PqStream, Stream},
usage_metrics::{Ids, USAGE_METRICS}, usage_metrics::{Ids, USAGE_METRICS},
}; };
use anyhow::{bail, Context}; use anyhow::{bail, Context};
use async_trait::async_trait;
use futures::TryFutureExt; use futures::TryFutureExt;
use itertools::Itertools; use itertools::Itertools;
use metrics::{ use once_cell::sync::OnceCell;
exponential_buckets, register_int_counter_pair_vec, register_int_counter_vec,
IntCounterPairVec, IntCounterVec,
};
use once_cell::sync::{Lazy, OnceCell};
use pq_proto::{BeMessage as Be, FeStartupPacket, StartupMessageParams}; use pq_proto::{BeMessage as Be, FeStartupPacket, StartupMessageParams};
use prometheus::{
register_histogram, register_histogram_vec, register_int_gauge_vec, Histogram, HistogramVec,
IntGaugeVec,
};
use regex::Regex; use regex::Regex;
use std::{error::Error, io, net::IpAddr, ops::ControlFlow, sync::Arc, time::Instant}; use std::{net::IpAddr, sync::Arc};
use tokio::{ use tokio::io::{AsyncRead, AsyncWrite, AsyncWriteExt};
io::{AsyncRead, AsyncWrite, AsyncWriteExt},
time,
};
use tokio_util::sync::CancellationToken; use tokio_util::sync::CancellationToken;
use tracing::{error, info, info_span, warn, Instrument}; use tracing::{error, info, info_span, Instrument};
use utils::measured_stream::MeasuredStream; use utils::measured_stream::MeasuredStream;
/// Number of times we should retry the `/proxy_wake_compute` http request. use self::connect_compute::{connect_to_compute, TcpMechanism};
/// Retry duration is BASE_RETRY_WAIT_DURATION * RETRY_WAIT_EXPONENT_BASE ^ n, where n starts at 0
pub const NUM_RETRIES_CONNECT: u32 = 16;
const CONNECT_TIMEOUT: time::Duration = time::Duration::from_secs(2);
const BASE_RETRY_WAIT_DURATION: time::Duration = time::Duration::from_millis(25);
const RETRY_WAIT_EXPONENT_BASE: f64 = std::f64::consts::SQRT_2;
const ERR_INSECURE_CONNECTION: &str = "connection is insecure (try using `sslmode=require`)"; const ERR_INSECURE_CONNECTION: &str = "connection is insecure (try using `sslmode=require`)";
const ERR_PROTO_VIOLATION: &str = "protocol violation"; const ERR_PROTO_VIOLATION: &str = "protocol violation";
pub static NUM_DB_CONNECTIONS_GAUGE: Lazy<IntCounterPairVec> = Lazy::new(|| {
register_int_counter_pair_vec!(
"proxy_opened_db_connections_total",
"Number of opened connections to a database.",
"proxy_closed_db_connections_total",
"Number of closed connections to a database.",
&["protocol"],
)
.unwrap()
});
pub static NUM_CLIENT_CONNECTION_GAUGE: Lazy<IntCounterPairVec> = Lazy::new(|| {
register_int_counter_pair_vec!(
"proxy_opened_client_connections_total",
"Number of opened connections from a client.",
"proxy_closed_client_connections_total",
"Number of closed connections from a client.",
&["protocol"],
)
.unwrap()
});
pub static NUM_CONNECTION_REQUESTS_GAUGE: Lazy<IntCounterPairVec> = Lazy::new(|| {
register_int_counter_pair_vec!(
"proxy_accepted_connections_total",
"Number of client connections accepted.",
"proxy_closed_connections_total",
"Number of client connections closed.",
&["protocol"],
)
.unwrap()
});
static COMPUTE_CONNECTION_LATENCY: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"proxy_compute_connection_latency_seconds",
"Time it took for proxy to establish a connection to the compute endpoint",
// http/ws/tcp, true/false, true/false, success/failure
// 3 * 2 * 2 * 2 = 24 counters
&["protocol", "cache_miss", "pool_miss", "outcome"],
// largest bucket = 2^16 * 0.5ms = 32s
exponential_buckets(0.0005, 2.0, 16).unwrap(),
)
.unwrap()
});
pub static CONSOLE_REQUEST_LATENCY: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"proxy_console_request_latency",
"Time it took for proxy to establish a connection to the compute endpoint",
// proxy_wake_compute/proxy_get_role_info
&["request"],
// largest bucket = 2^16 * 0.2ms = 13s
exponential_buckets(0.0002, 2.0, 16).unwrap(),
)
.unwrap()
});
pub static ALLOWED_IPS_BY_CACHE_OUTCOME: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_allowed_ips_cache_misses",
"Number of cache hits/misses for allowed ips",
// hit/miss
&["outcome"],
)
.unwrap()
});
pub static RATE_LIMITER_ACQUIRE_LATENCY: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"proxy_control_plane_token_acquire_seconds",
"Time it took for proxy to establish a connection to the compute endpoint",
// largest bucket = 3^16 * 0.05ms = 2.15s
exponential_buckets(0.00005, 3.0, 16).unwrap(),
)
.unwrap()
});
pub static RATE_LIMITER_LIMIT: Lazy<IntGaugeVec> = Lazy::new(|| {
register_int_gauge_vec!(
"semaphore_control_plane_limit",
"Current limit of the semaphore control plane",
&["limit"], // 2 counters
)
.unwrap()
});
pub static NUM_CONNECTION_ACCEPTED_BY_SNI: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_accepted_connections_by_sni",
"Number of connections (per sni).",
&["kind"],
)
.unwrap()
});
pub static ALLOWED_IPS_NUMBER: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"proxy_allowed_ips_number",
"Number of allowed ips",
vec![0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 50.0, 100.0],
)
.unwrap()
});
pub struct LatencyTimer {
// time since the stopwatch was started
start: Option<Instant>,
// accumulated time on the stopwatch
accumulated: std::time::Duration,
// label data
protocol: &'static str,
cache_miss: bool,
pool_miss: bool,
outcome: &'static str,
}
pub struct LatencyTimerPause<'a> {
timer: &'a mut LatencyTimer,
}
impl LatencyTimer {
pub fn new(protocol: &'static str) -> Self {
Self {
start: Some(Instant::now()),
accumulated: std::time::Duration::ZERO,
protocol,
cache_miss: false,
// by default we don't do pooling
pool_miss: true,
// assume failed unless otherwise specified
outcome: "failed",
}
}
pub fn pause(&mut self) -> LatencyTimerPause<'_> {
// stop the stopwatch and record the time that we have accumulated
let start = self.start.take().expect("latency timer should be started");
self.accumulated += start.elapsed();
LatencyTimerPause { timer: self }
}
pub fn cache_miss(&mut self) {
self.cache_miss = true;
}
pub fn pool_hit(&mut self) {
self.pool_miss = false;
}
pub fn success(mut self) {
self.outcome = "success";
}
}
impl Drop for LatencyTimerPause<'_> {
fn drop(&mut self) {
// start the stopwatch again
self.timer.start = Some(Instant::now());
}
}
impl Drop for LatencyTimer {
fn drop(&mut self) {
let duration =
self.start.map(|start| start.elapsed()).unwrap_or_default() + self.accumulated;
COMPUTE_CONNECTION_LATENCY
.with_label_values(&[
self.protocol,
bool_to_str(self.cache_miss),
bool_to_str(self.pool_miss),
self.outcome,
])
.observe(duration.as_secs_f64())
}
}
static NUM_CONNECTION_FAILURES: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_connection_failures_total",
"Number of connection failures (per kind).",
&["kind"],
)
.unwrap()
});
static NUM_WAKEUP_FAILURES: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_connection_failures_breakdown",
"Number of wake-up failures (per kind).",
&["retry", "kind"],
)
.unwrap()
});
static NUM_BYTES_PROXIED_PER_CLIENT_COUNTER: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_io_bytes_per_client",
"Number of bytes sent/received between client and backend.",
crate::console::messages::MetricsAuxInfo::TRAFFIC_LABELS,
)
.unwrap()
});
static NUM_BYTES_PROXIED_COUNTER: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_io_bytes",
"Number of bytes sent/received between all clients and backends.",
&["direction"],
)
.unwrap()
});
pub async fn run_until_cancelled<F: std::future::Future>( pub async fn run_until_cancelled<F: std::future::Future>(
f: F, f: F,
cancellation_token: &CancellationToken, cancellation_token: &CancellationToken,
@@ -539,296 +315,6 @@ async fn handshake<S: AsyncRead + AsyncWrite + Unpin>(
} }
} }
/// If we couldn't connect, a cached connection info might be to blame
/// (e.g. the compute node's address might've changed at the wrong time).
/// Invalidate the cache entry (if any) to prevent subsequent errors.
#[tracing::instrument(name = "invalidate_cache", skip_all)]
pub fn invalidate_cache(node_info: console::CachedNodeInfo) -> compute::ConnCfg {
let is_cached = node_info.cached();
if is_cached {
warn!("invalidating stalled compute node info cache entry");
}
let label = match is_cached {
true => "compute_cached",
false => "compute_uncached",
};
NUM_CONNECTION_FAILURES.with_label_values(&[label]).inc();
node_info.invalidate().config
}
/// Try to connect to the compute node once.
#[tracing::instrument(name = "connect_once", fields(pid = tracing::field::Empty), skip_all)]
async fn connect_to_compute_once(
node_info: &console::CachedNodeInfo,
timeout: time::Duration,
proto: &'static str,
) -> Result<PostgresConnection, compute::ConnectionError> {
let allow_self_signed_compute = node_info.allow_self_signed_compute;
node_info
.config
.connect(allow_self_signed_compute, timeout, proto)
.await
}
#[async_trait]
pub trait ConnectMechanism {
type Connection;
type ConnectError;
type Error: From<Self::ConnectError>;
async fn connect_once(
&self,
node_info: &console::CachedNodeInfo,
timeout: time::Duration,
) -> Result<Self::Connection, Self::ConnectError>;
fn update_connect_config(&self, conf: &mut compute::ConnCfg);
}
pub struct TcpMechanism<'a> {
/// KV-dictionary with PostgreSQL connection params.
pub params: &'a StartupMessageParams,
pub proto: &'static str,
}
#[async_trait]
impl ConnectMechanism for TcpMechanism<'_> {
type Connection = PostgresConnection;
type ConnectError = compute::ConnectionError;
type Error = compute::ConnectionError;
async fn connect_once(
&self,
node_info: &console::CachedNodeInfo,
timeout: time::Duration,
) -> Result<PostgresConnection, Self::Error> {
connect_to_compute_once(node_info, timeout, self.proto).await
}
fn update_connect_config(&self, config: &mut compute::ConnCfg) {
config.set_startup_params(self.params);
}
}
const fn bool_to_str(x: bool) -> &'static str {
if x {
"true"
} else {
"false"
}
}
fn report_error(e: &WakeComputeError, retry: bool) {
use crate::console::errors::ApiError;
let retry = bool_to_str(retry);
let kind = match e {
WakeComputeError::BadComputeAddress(_) => "bad_compute_address",
WakeComputeError::ApiError(ApiError::Transport(_)) => "api_transport_error",
WakeComputeError::ApiError(ApiError::Console {
status: StatusCode::LOCKED,
ref text,
}) if text.contains("written data quota exceeded")
|| text.contains("the limit for current plan reached") =>
{
"quota_exceeded"
}
WakeComputeError::ApiError(ApiError::Console {
status: StatusCode::LOCKED,
..
}) => "api_console_locked",
WakeComputeError::ApiError(ApiError::Console {
status: StatusCode::BAD_REQUEST,
..
}) => "api_console_bad_request",
WakeComputeError::ApiError(ApiError::Console { status, .. })
if status.is_server_error() =>
{
"api_console_other_server_error"
}
WakeComputeError::ApiError(ApiError::Console { .. }) => "api_console_other_error",
WakeComputeError::TimeoutError => "timeout_error",
};
NUM_WAKEUP_FAILURES.with_label_values(&[retry, kind]).inc();
}
/// Try to connect to the compute node, retrying if necessary.
/// This function might update `node_info`, so we take it by `&mut`.
#[tracing::instrument(skip_all)]
pub async fn connect_to_compute<M: ConnectMechanism>(
mechanism: &M,
mut node_info: console::CachedNodeInfo,
extra: &console::ConsoleReqExtra,
creds: &auth::BackendType<'_, auth::backend::ComputeUserInfo>,
mut latency_timer: LatencyTimer,
) -> Result<M::Connection, M::Error>
where
M::ConnectError: ShouldRetry + std::fmt::Debug,
M::Error: From<WakeComputeError>,
{
mechanism.update_connect_config(&mut node_info.config);
// try once
let (config, err) = match mechanism.connect_once(&node_info, CONNECT_TIMEOUT).await {
Ok(res) => {
latency_timer.success();
return Ok(res);
}
Err(e) => {
error!(error = ?e, "could not connect to compute node");
(invalidate_cache(node_info), e)
}
};
latency_timer.cache_miss();
let mut num_retries = 1;
// if we failed to connect, it's likely that the compute node was suspended, wake a new compute node
info!("compute node's state has likely changed; requesting a wake-up");
let node_info = loop {
let wake_res = match creds {
auth::BackendType::Console(api, creds) => api.wake_compute(extra, creds).await,
#[cfg(feature = "testing")]
auth::BackendType::Postgres(api, creds) => api.wake_compute(extra, creds).await,
// nothing to do?
auth::BackendType::Link(_) => return Err(err.into()),
// test backend
#[cfg(test)]
auth::BackendType::Test(x) => x.wake_compute(),
};
match handle_try_wake(wake_res, num_retries) {
Err(e) => {
error!(error = ?e, num_retries, retriable = false, "couldn't wake compute node");
report_error(&e, false);
return Err(e.into());
}
// failed to wake up but we can continue to retry
Ok(ControlFlow::Continue(e)) => {
report_error(&e, true);
warn!(error = ?e, num_retries, retriable = true, "couldn't wake compute node");
}
// successfully woke up a compute node and can break the wakeup loop
Ok(ControlFlow::Break(mut node_info)) => {
node_info.config.reuse_password(&config);
mechanism.update_connect_config(&mut node_info.config);
break node_info;
}
}
let wait_duration = retry_after(num_retries);
num_retries += 1;
time::sleep(wait_duration).await;
};
// now that we have a new node, try connect to it repeatedly.
// this can error for a few reasons, for instance:
// * DNS connection settings haven't quite propagated yet
info!("wake_compute success. attempting to connect");
loop {
match mechanism.connect_once(&node_info, CONNECT_TIMEOUT).await {
Ok(res) => {
latency_timer.success();
return Ok(res);
}
Err(e) => {
let retriable = e.should_retry(num_retries);
if !retriable {
error!(error = ?e, num_retries, retriable, "couldn't connect to compute node");
return Err(e.into());
}
warn!(error = ?e, num_retries, retriable, "couldn't connect to compute node");
}
}
let wait_duration = retry_after(num_retries);
num_retries += 1;
time::sleep(wait_duration).await;
}
}
/// Attempts to wake up the compute node.
/// * Returns Ok(Continue(e)) if there was an error waking but retries are acceptable
/// * Returns Ok(Break(node)) if the wakeup succeeded
/// * Returns Err(e) if there was an error
pub fn handle_try_wake(
result: Result<console::CachedNodeInfo, WakeComputeError>,
num_retries: u32,
) -> Result<ControlFlow<console::CachedNodeInfo, WakeComputeError>, WakeComputeError> {
match result {
Err(err) => match &err {
WakeComputeError::ApiError(api) if api.should_retry(num_retries) => {
Ok(ControlFlow::Continue(err))
}
_ => Err(err),
},
// Ready to try again.
Ok(new) => Ok(ControlFlow::Break(new)),
}
}
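
`handle_try_wake` encodes three outcomes in `Result<ControlFlow<..>, ..>`: `Ok(Break(node))` means stop with success, `Ok(Continue(err))` means the error is retriable, and `Err(err)` means give up. A self-contained sketch of how a caller loop can consume that shape is below. `WakeError`, `classify`, `wake_with_retries` and `MAX_RETRIES` are hypothetical stand-ins, and the sleep between attempts is left out.

```rust
use std::ops::ControlFlow;

/// Hypothetical error type: only `Transient` errors may be retried.
#[derive(Debug, PartialEq)]
enum WakeError {
    Transient(&'static str),
    Fatal(&'static str),
}

/// Assumed retry cap for this sketch.
const MAX_RETRIES: u32 = 16;

/// Break = woke up, Continue = retriable failure, Err = give up.
fn classify<T>(
    result: Result<T, WakeError>,
    num_retries: u32,
) -> Result<ControlFlow<T, WakeError>, WakeError> {
    match result {
        Ok(node) => Ok(ControlFlow::Break(node)),
        Err(err @ WakeError::Transient(_)) if num_retries < MAX_RETRIES => {
            Ok(ControlFlow::Continue(err))
        }
        Err(err) => Err(err),
    }
}

/// Call `attempt` until it succeeds, fails fatally, or the cap is reached.
fn wake_with_retries<T>(
    mut attempt: impl FnMut(u32) -> Result<T, WakeError>,
) -> Result<T, WakeError> {
    let mut num_retries = 1;
    loop {
        match classify(attempt(num_retries), num_retries)? {
            ControlFlow::Break(node) => return Ok(node),
            // A real caller would log the error and sleep before retrying.
            ControlFlow::Continue(_err) => num_retries += 1,
        }
    }
}
```

The `?` on `classify` is what turns a non-retriable error into an immediate return, which keeps the loop body down to the two `ControlFlow` arms.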
pub trait ShouldRetry {
fn could_retry(&self) -> bool;
fn should_retry(&self, num_retries: u32) -> bool {
match self {
_ if num_retries >= NUM_RETRIES_CONNECT => false,
err => err.could_retry(),
}
}
}
impl ShouldRetry for io::Error {
fn could_retry(&self) -> bool {
use std::io::ErrorKind;
matches!(
self.kind(),
ErrorKind::ConnectionRefused | ErrorKind::AddrNotAvailable | ErrorKind::TimedOut
)
}
}
impl ShouldRetry for tokio_postgres::error::DbError {
fn could_retry(&self) -> bool {
use tokio_postgres::error::SqlState;
matches!(
self.code(),
&SqlState::CONNECTION_FAILURE
| &SqlState::CONNECTION_EXCEPTION
| &SqlState::CONNECTION_DOES_NOT_EXIST
| &SqlState::SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION,
)
}
}
impl ShouldRetry for tokio_postgres::Error {
fn could_retry(&self) -> bool {
if let Some(io_err) = self.source().and_then(|x| x.downcast_ref()) {
io::Error::could_retry(io_err)
} else if let Some(db_err) = self.source().and_then(|x| x.downcast_ref()) {
tokio_postgres::error::DbError::could_retry(db_err)
} else {
false
}
}
}
impl ShouldRetry for compute::ConnectionError {
fn could_retry(&self) -> bool {
match self {
compute::ConnectionError::Postgres(err) => err.could_retry(),
compute::ConnectionError::CouldNotConnect(err) => err.could_retry(),
_ => false,
}
}
}
pub fn retry_after(num_retries: u32) -> time::Duration {
BASE_RETRY_WAIT_DURATION.mul_f64(RETRY_WAIT_EXPONENT_BASE.powi((num_retries as i32) - 1))
}
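
`retry_after` implements exponential backoff with base `sqrt(2)`: the wait before retry `n` is `25ms * sqrt(2)^(n - 1)`, so the delay doubles every two retries. The sketch below copies the constants and formula from the removed lines and adds a hypothetical `backoff_schedule` helper that lists the waits for `n = 1..=15`, assuming every attempt below `NUM_RETRIES_CONNECT` fails and sleeps.

```rust
use std::time::Duration;

// Constants as they appear in the removed lines of this diff.
const NUM_RETRIES_CONNECT: u32 = 16;
const BASE_RETRY_WAIT_DURATION: Duration = Duration::from_millis(25);
const RETRY_WAIT_EXPONENT_BASE: f64 = std::f64::consts::SQRT_2;

/// Wait before retry number `num_retries` (counting from 1).
fn retry_after(num_retries: u32) -> Duration {
    BASE_RETRY_WAIT_DURATION.mul_f64(RETRY_WAIT_EXPONENT_BASE.powi((num_retries as i32) - 1))
}

/// Hypothetical helper: the waits for retries 1..=15, i.e. every retry
/// that is still allowed before the cap of `NUM_RETRIES_CONNECT`.
fn backoff_schedule() -> Vec<Duration> {
    (1..NUM_RETRIES_CONNECT).map(retry_after).collect()
}
```

Under that assumption the first wait is 25 ms, the third about 50 ms, the last about 3.2 s, and the waits sum to roughly 10.9 s, not counting the 2 s `CONNECT_TIMEOUT` that each attempt may additionally consume.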
 /// Finish client connection initialization: confirm auth success, send params, etc.
 #[tracing::instrument(skip_all)]
 async fn prepare_client_connection(


@@ -0,0 +1,238 @@
use crate::{
auth,
compute::{self, PostgresConnection},
console::{self, errors::WakeComputeError, Api},
metrics::{bool_to_str, LatencyTimer, NUM_CONNECTION_FAILURES, NUM_WAKEUP_FAILURES},
proxy::retry::{retry_after, ShouldRetry},
};
use async_trait::async_trait;
use hyper::StatusCode;
use pq_proto::StartupMessageParams;
use std::ops::ControlFlow;
use tokio::time;
use tracing::{error, info, warn};
const CONNECT_TIMEOUT: time::Duration = time::Duration::from_secs(2);
/// If we couldn't connect, a cached connection info might be to blame
/// (e.g. the compute node's address might've changed at the wrong time).
/// Invalidate the cache entry (if any) to prevent subsequent errors.
#[tracing::instrument(name = "invalidate_cache", skip_all)]
pub fn invalidate_cache(node_info: console::CachedNodeInfo) -> compute::ConnCfg {
let is_cached = node_info.cached();
if is_cached {
warn!("invalidating stalled compute node info cache entry");
}
let label = match is_cached {
true => "compute_cached",
false => "compute_uncached",
};
NUM_CONNECTION_FAILURES.with_label_values(&[label]).inc();
node_info.invalidate().config
}
/// Try to connect to the compute node once.
#[tracing::instrument(name = "connect_once", fields(pid = tracing::field::Empty), skip_all)]
async fn connect_to_compute_once(
node_info: &console::CachedNodeInfo,
timeout: time::Duration,
proto: &'static str,
) -> Result<PostgresConnection, compute::ConnectionError> {
let allow_self_signed_compute = node_info.allow_self_signed_compute;
node_info
.config
.connect(allow_self_signed_compute, timeout, proto)
.await
}
#[async_trait]
pub trait ConnectMechanism {
type Connection;
type ConnectError;
type Error: From<Self::ConnectError>;
async fn connect_once(
&self,
node_info: &console::CachedNodeInfo,
timeout: time::Duration,
) -> Result<Self::Connection, Self::ConnectError>;
fn update_connect_config(&self, conf: &mut compute::ConnCfg);
}
pub struct TcpMechanism<'a> {
/// KV-dictionary with PostgreSQL connection params.
pub params: &'a StartupMessageParams,
pub proto: &'static str,
}
#[async_trait]
impl ConnectMechanism for TcpMechanism<'_> {
type Connection = PostgresConnection;
type ConnectError = compute::ConnectionError;
type Error = compute::ConnectionError;
async fn connect_once(
&self,
node_info: &console::CachedNodeInfo,
timeout: time::Duration,
) -> Result<PostgresConnection, Self::Error> {
connect_to_compute_once(node_info, timeout, self.proto).await
}
fn update_connect_config(&self, config: &mut compute::ConnCfg) {
config.set_startup_params(self.params);
}
}
fn report_error(e: &WakeComputeError, retry: bool) {
use crate::console::errors::ApiError;
let retry = bool_to_str(retry);
let kind = match e {
WakeComputeError::BadComputeAddress(_) => "bad_compute_address",
WakeComputeError::ApiError(ApiError::Transport(_)) => "api_transport_error",
WakeComputeError::ApiError(ApiError::Console {
status: StatusCode::LOCKED,
ref text,
}) if text.contains("written data quota exceeded")
|| text.contains("the limit for current plan reached") =>
{
"quota_exceeded"
}
WakeComputeError::ApiError(ApiError::Console {
status: StatusCode::LOCKED,
..
}) => "api_console_locked",
WakeComputeError::ApiError(ApiError::Console {
status: StatusCode::BAD_REQUEST,
..
}) => "api_console_bad_request",
WakeComputeError::ApiError(ApiError::Console { status, .. })
if status.is_server_error() =>
{
"api_console_other_server_error"
}
WakeComputeError::ApiError(ApiError::Console { .. }) => "api_console_other_error",
WakeComputeError::TimeoutError => "timeout_error",
};
NUM_WAKEUP_FAILURES.with_label_values(&[retry, kind]).inc();
}
/// Try to connect to the compute node, retrying if necessary.
/// This function might replace `node_info` after a wake-up, so we take ownership of it (as `mut`).
#[tracing::instrument(skip_all)]
pub async fn connect_to_compute<M: ConnectMechanism>(
mechanism: &M,
mut node_info: console::CachedNodeInfo,
extra: &console::ConsoleReqExtra,
creds: &auth::BackendType<'_, auth::backend::ComputeUserInfo>,
mut latency_timer: LatencyTimer,
) -> Result<M::Connection, M::Error>
where
M::ConnectError: ShouldRetry + std::fmt::Debug,
M::Error: From<WakeComputeError>,
{
mechanism.update_connect_config(&mut node_info.config);
// try once
let (config, err) = match mechanism.connect_once(&node_info, CONNECT_TIMEOUT).await {
Ok(res) => {
latency_timer.success();
return Ok(res);
}
Err(e) => {
error!(error = ?e, "could not connect to compute node");
(invalidate_cache(node_info), e)
}
};
latency_timer.cache_miss();
let mut num_retries = 1;
// if we failed to connect, it's likely that the compute node was suspended, wake a new compute node
info!("compute node's state has likely changed; requesting a wake-up");
let node_info = loop {
let wake_res = match creds {
auth::BackendType::Console(api, creds) => api.wake_compute(extra, creds).await,
#[cfg(feature = "testing")]
auth::BackendType::Postgres(api, creds) => api.wake_compute(extra, creds).await,
// nothing to do?
auth::BackendType::Link(_) => return Err(err.into()),
// test backend
#[cfg(test)]
auth::BackendType::Test(x) => x.wake_compute(),
};
match handle_try_wake(wake_res, num_retries) {
Err(e) => {
error!(error = ?e, num_retries, retriable = false, "couldn't wake compute node");
report_error(&e, false);
return Err(e.into());
}
// failed to wake up but we can continue to retry
Ok(ControlFlow::Continue(e)) => {
report_error(&e, true);
warn!(error = ?e, num_retries, retriable = true, "couldn't wake compute node");
}
// successfully woke up a compute node and can break the wakeup loop
Ok(ControlFlow::Break(mut node_info)) => {
node_info.config.reuse_password(&config);
mechanism.update_connect_config(&mut node_info.config);
break node_info;
}
}
let wait_duration = retry_after(num_retries);
num_retries += 1;
time::sleep(wait_duration).await;
};
// now that we have a new node, try to connect to it repeatedly.
// this can fail for a few reasons, for instance:
// * DNS connection settings haven't quite propagated yet
info!("wake_compute success. attempting to connect");
loop {
match mechanism.connect_once(&node_info, CONNECT_TIMEOUT).await {
Ok(res) => {
latency_timer.success();
return Ok(res);
}
Err(e) => {
let retriable = e.should_retry(num_retries);
if !retriable {
error!(error = ?e, num_retries, retriable, "couldn't connect to compute node");
return Err(e.into());
}
warn!(error = ?e, num_retries, retriable, "couldn't connect to compute node");
}
}
let wait_duration = retry_after(num_retries);
num_retries += 1;
time::sleep(wait_duration).await;
}
}
/// Classifies the result of an attempt to wake up the compute node.
/// * Returns Ok(Continue(e)) if there was an error waking but retries are acceptable
/// * Returns Ok(Break(node)) if the wakeup succeeded
/// * Returns Err(e) if there was an error that is not retriable, or retries are exhausted
pub fn handle_try_wake(
result: Result<console::CachedNodeInfo, WakeComputeError>,
num_retries: u32,
) -> Result<ControlFlow<console::CachedNodeInfo, WakeComputeError>, WakeComputeError> {
match result {
Err(err) => match &err {
WakeComputeError::ApiError(api) if api.should_retry(num_retries) => {
Ok(ControlFlow::Continue(err))
}
_ => Err(err),
},
// Ready to try again.
Ok(new) => Ok(ControlFlow::Break(new)),
}
}
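
The `ControlFlow`-based classification used by `handle_try_wake` can be sketched in isolation. This is a minimal illustration, not the proxy's code: `WakeError`, `classify`, `run` and `MAX_RETRIES` are hypothetical names, and the real loop sleeps for `retry_after(num_retries)` between attempts.

```rust
use std::ops::ControlFlow;

/// Hypothetical retry cap, standing in for NUM_RETRIES_CONNECT.
const MAX_RETRIES: u32 = 16;

/// Hypothetical error type: only `Transient` errors may be retried.
#[derive(Debug, PartialEq)]
pub enum WakeError {
    Transient(&'static str),
    Fatal(&'static str),
}

/// Break(value) on success, Continue(err) when a retry is allowed,
/// Err(err) when the error is fatal or the retry budget is spent.
pub fn classify<T>(
    result: Result<T, WakeError>,
    num_retries: u32,
) -> Result<ControlFlow<T, WakeError>, WakeError> {
    match result {
        Ok(value) => Ok(ControlFlow::Break(value)),
        Err(e @ WakeError::Transient(_)) if num_retries < MAX_RETRIES => {
            Ok(ControlFlow::Continue(e))
        }
        Err(e) => Err(e),
    }
}

/// Drive `attempt` until it succeeds or fails for good.
/// Returns the value together with the number of attempts made.
pub fn run<T>(
    mut attempt: impl FnMut(u32) -> Result<T, WakeError>,
) -> Result<(T, u32), WakeError> {
    let mut num_retries = 1;
    loop {
        match classify(attempt(num_retries), num_retries)? {
            ControlFlow::Break(value) => return Ok((value, num_retries)),
            // A real caller would log the error and sleep here before retrying.
            ControlFlow::Continue(_err) => num_retries += 1,
        }
    }
}
```

Calling `run` with a closure that fails transiently a few times and then succeeds returns the value and the attempt count; a fatal error, or a transient one once the cap is reached, is propagated through `?`.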

proxy/src/proxy/retry.rs Normal file

@@ -0,0 +1,68 @@
use crate::compute;
use std::{error::Error, io};
use tokio::time;
/// Number of times we should retry the `/proxy_wake_compute` http request.
/// Retry duration is BASE_RETRY_WAIT_DURATION * RETRY_WAIT_EXPONENT_BASE ^ n, where n starts at 0
pub const NUM_RETRIES_CONNECT: u32 = 16;
const BASE_RETRY_WAIT_DURATION: time::Duration = time::Duration::from_millis(25);
const RETRY_WAIT_EXPONENT_BASE: f64 = std::f64::consts::SQRT_2;
pub trait ShouldRetry {
fn could_retry(&self) -> bool;
fn should_retry(&self, num_retries: u32) -> bool {
match self {
_ if num_retries >= NUM_RETRIES_CONNECT => false,
err => err.could_retry(),
}
}
}
impl ShouldRetry for io::Error {
fn could_retry(&self) -> bool {
use std::io::ErrorKind;
matches!(
self.kind(),
ErrorKind::ConnectionRefused | ErrorKind::AddrNotAvailable | ErrorKind::TimedOut
)
}
}
impl ShouldRetry for tokio_postgres::error::DbError {
fn could_retry(&self) -> bool {
use tokio_postgres::error::SqlState;
matches!(
self.code(),
&SqlState::CONNECTION_FAILURE
| &SqlState::CONNECTION_EXCEPTION
| &SqlState::CONNECTION_DOES_NOT_EXIST
| &SqlState::SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION,
)
}
}
impl ShouldRetry for tokio_postgres::Error {
fn could_retry(&self) -> bool {
if let Some(io_err) = self.source().and_then(|x| x.downcast_ref()) {
io::Error::could_retry(io_err)
} else if let Some(db_err) = self.source().and_then(|x| x.downcast_ref()) {
tokio_postgres::error::DbError::could_retry(db_err)
} else {
false
}
}
}
impl ShouldRetry for compute::ConnectionError {
fn could_retry(&self) -> bool {
match self {
compute::ConnectionError::Postgres(err) => err.could_retry(),
compute::ConnectionError::CouldNotConnect(err) => err.could_retry(),
_ => false,
}
}
}
pub fn retry_after(num_retries: u32) -> time::Duration {
BASE_RETRY_WAIT_DURATION.mul_f64(RETRY_WAIT_EXPONENT_BASE.powi((num_retries as i32) - 1))
}
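
To make the backoff schedule concrete, the following sketch reproduces the `retry_after` formula using only the standard library (`std::time::Duration` instead of `tokio::time::Duration`). The constants are copied from the file above; `total_wait` is a hypothetical helper added for illustration only.

```rust
use std::time::Duration;

// Constants copied from retry.rs above.
const BASE_RETRY_WAIT_DURATION: Duration = Duration::from_millis(25);
const RETRY_WAIT_EXPONENT_BASE: f64 = std::f64::consts::SQRT_2;

/// Same formula as in the diff: base * sqrt(2)^(num_retries - 1).
pub fn retry_after(num_retries: u32) -> Duration {
    BASE_RETRY_WAIT_DURATION.mul_f64(RETRY_WAIT_EXPONENT_BASE.powi((num_retries as i32) - 1))
}

/// Hypothetical helper: total time slept if every attempt 1..=last is followed by a wait.
pub fn total_wait(last: u32) -> Duration {
    (1..=last).map(retry_after).sum()
}
```

Because the exponent base is the square root of two, the wait doubles every two attempts: roughly 25 ms for `retry_after(1)`, 50 ms for `retry_after(3)`, and about 3.2 s for `retry_after(15)`.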


@@ -2,10 +2,13 @@
 mod mitm;
+use super::connect_compute::ConnectMechanism;
+use super::retry::ShouldRetry;
 use super::*;
 use crate::auth::backend::{ComputeUserInfo, TestBackend};
 use crate::config::CertResolver;
 use crate::console::{CachedNodeInfo, NodeInfo};
+use crate::proxy::retry::{retry_after, NUM_RETRIES_CONNECT};
 use crate::{auth, http, sasl, scram};
 use async_trait::async_trait;
 use rstest::rstest;
@@ -423,7 +426,7 @@ impl ConnectMechanism for TestConnectMechanism {
 async fn connect_once(
 &self,
 _node_info: &console::CachedNodeInfo,
-_timeout: time::Duration,
+_timeout: std::time::Duration,
 ) -> Result<Self::Connection, Self::ConnectError> {
 let mut counter = self.counter.lock().unwrap();
 let action = self.sequence[*counter];


@@ -120,7 +120,7 @@ where
 struct PgFrame;
 impl Decoder for PgFrame {
 type Item = Bytes;
-type Error = io::Error;
+type Error = std::io::Error;
 fn decode(&mut self, src: &mut BytesMut) -> Result<Option<Self::Item>, Self::Error> {
 if src.len() < 5 {
@@ -136,7 +136,7 @@ impl Decoder for PgFrame {
 }
 }
 impl Encoder<Bytes> for PgFrame {
-type Error = io::Error;
+type Error = std::io::Error;
 fn encode(&mut self, item: Bytes, dst: &mut BytesMut) -> Result<(), Self::Error> {
 dst.extend_from_slice(&item);


@@ -393,10 +393,10 @@ impl Limiter {
 }
 new_limit
 };
-crate::proxy::RATE_LIMITER_LIMIT
+crate::metrics::RATE_LIMITER_LIMIT
 .with_label_values(&["expected"])
 .set(new_limit as i64);
-crate::proxy::RATE_LIMITER_LIMIT
+crate::metrics::RATE_LIMITER_LIMIT
 .with_label_values(&["actual"])
 .set(actual_limit as i64);
 self.limits.store(new_limit, Ordering::Release);
@@ -470,7 +470,7 @@ impl reqwest_middleware::Middleware for Limiter {
 )
 })?;
 info!(duration = ?start.elapsed(), "waiting for token to connect to the control plane");
-crate::proxy::RATE_LIMITER_ACQUIRE_LATENCY.observe(start.elapsed().as_secs_f64());
+crate::metrics::RATE_LIMITER_ACQUIRE_LATENCY.observe(start.elapsed().as_secs_f64());
 match next.run(req, extensions).await {
 Ok(response) => {
 self.release(token, Some(Outcome::from_reqwest_response(&response)))


@@ -6,7 +6,7 @@ pub const SCRAM_KEY_LEN: usize = 32;
 /// One of the keys derived from the [password](super::password::SaltedPassword).
 /// We use the same structure for all keys, i.e.
 /// `ClientKey`, `StoredKey`, and `ServerKey`.
-#[derive(Default, PartialEq, Eq)]
+#[derive(Clone, Default, PartialEq, Eq)]
 #[repr(transparent)]
 pub struct ScramKey {
 bytes: [u8; SCRAM_KEY_LEN],


@@ -5,6 +5,7 @@ use super::key::ScramKey;
 /// Server secret is produced from [password](super::password::SaltedPassword)
 /// and is used throughout the authentication process.
+#[derive(Clone)]
 pub struct ServerSecret {
 /// Number of iterations for `PBKDF2` function.
 pub iterations: u32,


@@ -13,8 +13,8 @@ pub use reqwest_middleware::{ClientWithMiddleware, Error};
 pub use reqwest_retry::{policies::ExponentialBackoff, RetryTransientMiddleware};
 use tokio_util::task::TaskTracker;
+use crate::metrics::NUM_CLIENT_CONNECTION_GAUGE;
 use crate::protocol2::{ProxyProtocolAccept, WithClientIp};
-use crate::proxy::NUM_CLIENT_CONNECTION_GAUGE;
 use crate::rate_limiter::EndpointRateLimiter;
 use crate::{cancellation::CancelMap, config::ProxyConfig};
 use futures::StreamExt;


@@ -24,13 +24,12 @@ use tokio_postgres::{AsyncMessage, ReadyForQueryStatus};
 use crate::{
 auth::{self, backend::ComputeUserInfo, check_peer_addr_is_in_list},
 console,
-proxy::{neon_options, LatencyTimer, NUM_DB_CONNECTIONS_GAUGE},
+metrics::{LatencyTimer, NUM_DB_CONNECTIONS_GAUGE},
+proxy::{connect_compute::ConnectMechanism, neon_options},
 usage_metrics::{Ids, MetricCounter, USAGE_METRICS},
 };
 use crate::{compute, config};
-use crate::proxy::ConnectMechanism;
 use tracing::{error, warn, Span};
 use tracing::{info, info_span, Instrument};
@@ -432,7 +431,6 @@ async fn connect_to_compute(
 application_name: APP_NAME.to_string(),
 options: console_options,
 };
-// TODO(anna): this is a bit hacky way, consider using console notification listener.
 if !config.disable_ip_check_for_http {
 let allowed_ips = backend.get_allowed_ips(&extra).await?;
 if !check_peer_addr_is_in_list(&peer_addr, &allowed_ips) {
@@ -444,7 +442,7 @@ async fn connect_to_compute(
 .await?
 .context("missing cache entry from wake_compute")?;
-crate::proxy::connect_to_compute(
+crate::proxy::connect_compute::connect_to_compute(
 &TokioMechanism {
 conn_id,
 conn_info,


@@ -29,7 +29,7 @@ use utils::http::error::ApiError;
 use utils::http::json::json_response;
 use crate::config::HttpConfig;
-use crate::proxy::NUM_CONNECTION_REQUESTS_GAUGE;
+use crate::metrics::NUM_CONNECTION_REQUESTS_GAUGE;
 use super::conn_pool::ConnInfo;
 use super::conn_pool::GlobalConnPool;


@@ -31,6 +31,7 @@ reqwest = { workspace = true, default-features = false, features = ["rustls-tls"
 aws-config = { workspace = true, default-features = false, features = ["rustls", "sso"] }
 pageserver = { path = "../pageserver" }
+pageserver_api = { path = "../libs/pageserver_api" }
 remote_storage = { path = "../libs/remote_storage" }
 tracing.workspace = true


@@ -1,19 +1,21 @@
-use std::collections::HashSet;
+use std::collections::{HashMap, HashSet};
 use anyhow::Context;
 use aws_sdk_s3::{types::ObjectIdentifier, Client};
+use pageserver::tenant::remote_timeline_client::index::IndexLayerMetadata;
+use pageserver_api::shard::ShardIndex;
 use tracing::{error, info, warn};
 use utils::generation::Generation;
+use utils::id::TimelineId;
 use crate::cloud_admin_api::BranchData;
 use crate::metadata_stream::stream_listing;
-use crate::{download_object_with_retries, RootTarget};
+use crate::{download_object_with_retries, RootTarget, TenantShardTimelineId};
 use futures_util::{pin_mut, StreamExt};
 use pageserver::tenant::remote_timeline_client::parse_remote_index_path;
 use pageserver::tenant::storage_layer::LayerFileName;
 use pageserver::tenant::IndexPart;
 use remote_storage::RemotePath;
-use utils::id::TenantTimelineId;
 pub(crate) struct TimelineAnalysis {
 /// Anomalies detected
@@ -39,9 +41,9 @@ impl TimelineAnalysis {
 }
 }
-pub(crate) async fn branch_cleanup_and_check_errors(
-id: &TenantTimelineId,
-s3_root: &RootTarget,
+pub(crate) fn branch_cleanup_and_check_errors(
+id: &TenantShardTimelineId,
+tenant_objects: &mut TenantObjectListing,
 s3_active_branch: Option<&BranchData>,
 console_branch: Option<BranchData>,
 s3_data: Option<S3TimelineBlobData>,
@@ -73,8 +75,8 @@ pub(crate) async fn branch_cleanup_and_check_errors(
 match s3_data.blob_data {
 BlobDataParseResult::Parsed {
 index_part,
-index_part_generation,
-mut s3_layers,
+index_part_generation: _index_part_generation,
+s3_layers: _s3_layers,
 } => {
 if !IndexPart::KNOWN_VERSIONS.contains(&index_part.get_version()) {
 result.errors.push(format!(
@@ -112,65 +114,19 @@ pub(crate) async fn branch_cleanup_and_check_errors(
 ))
 }
-let layer_map_key = (layer, metadata.generation);
-if !s3_layers.remove(&layer_map_key) {
+if !tenant_objects.check_ref(id.timeline_id, &layer, &metadata) {
 // FIXME: this will emit false positives if an index was
 // uploaded concurrently with our scan. To make this check
 // correct, we need to try sending a HEAD request for the
 // layer we think is missing.
 result.errors.push(format!(
-"index_part.json contains a layer {}{} that is not present in remote storage",
-layer_map_key.0.file_name(),
-layer_map_key.1.get_suffix()
+"index_part.json contains a layer {}{} (shard {}) that is not present in remote storage",
+layer.file_name(),
+metadata.generation.get_suffix(),
+metadata.shard
 ))
 }
 }
-let orphan_layers: Vec<(LayerFileName, Generation)> = s3_layers
-.into_iter()
-.filter(|(_layer_name, gen)|
-// A layer is only considered orphaned if it has a generation below
-// the index. If the generation is >= the index, then the layer may
-// be an upload from a running pageserver, or even an upload from
-// a new generation that didn't upload an index yet.
-//
-// Even so, a layer that is not referenced by the index could just
-// be something enqueued for deletion, so while this check is valid
-// for indicating that a layer is garbage, it is not an indicator
-// of a problem.
-gen < &index_part_generation)
-.collect();
-if !orphan_layers.is_empty() {
-// An orphan layer is not an error: it's arguably not even a warning, but it is helpful to report
-// these as a hint that there is something worth cleaning up here.
-result.warnings.push(format!(
-"index_part.json does not contain layers from S3: {:?}",
-orphan_layers
-.iter()
-.map(|(layer_name, gen)| format!(
-"{}{}",
-layer_name.file_name(),
-gen.get_suffix()
-))
-.collect::<Vec<_>>(),
-));
-result.garbage_keys.extend(orphan_layers.iter().map(
-|(layer_name, layer_gen)| {
-let mut key = s3_root.timeline_root(id).prefix_in_bucket;
-let delimiter = s3_root.delimiter();
-if !key.ends_with(delimiter) {
-key.push_str(delimiter);
-}
-key.push_str(&format!(
-"{}{}",
-&layer_name.file_name(),
-layer_gen.get_suffix()
-));
-key
-},
-));
-}
 }
 BlobDataParseResult::Relic => {}
 BlobDataParseResult::Incorrect(parse_errors) => result.errors.extend(
@@ -205,6 +161,83 @@ pub(crate) async fn branch_cleanup_and_check_errors(
 result
 }
+#[derive(Default)]
+pub(crate) struct LayerRef {
+ref_count: usize,
+}
+/// Top-level index of objects in a tenant. This may be used by any shard-timeline within
+/// the tenant to query whether an object exists.
+#[derive(Default)]
+pub(crate) struct TenantObjectListing {
+shard_timelines:
+HashMap<(ShardIndex, TimelineId), HashMap<(LayerFileName, Generation), LayerRef>>,
+}
+impl TenantObjectListing {
+/// Having done an S3 listing of the keys within a timeline prefix, merge them into the overall
+/// list of layer keys for the Tenant.
+pub(crate) fn push(
+&mut self,
+ttid: TenantShardTimelineId,
+layers: HashSet<(LayerFileName, Generation)>,
+) {
+let shard_index = ShardIndex::new(
+ttid.tenant_shard_id.shard_number,
+ttid.tenant_shard_id.shard_count,
+);
+let replaced = self.shard_timelines.insert(
+(shard_index, ttid.timeline_id),
+layers
+.into_iter()
+.map(|l| (l, LayerRef::default()))
+.collect(),
+);
+assert!(
+replaced.is_none(),
+"Built from an S3 object listing, which should never repeat a key"
+);
+}
+/// Having loaded a timeline index, check if a layer referenced by the index exists. If it does,
+/// the layer's refcount will be incremented. Later, after calling this for all references in all indices
+/// in a tenant, orphan layers may be detected by their zero refcounts.
+///
+/// Returns true if the layer exists
+pub(crate) fn check_ref(
+&mut self,
+timeline_id: TimelineId,
+layer_file: &LayerFileName,
+metadata: &IndexLayerMetadata,
+) -> bool {
+let Some(shard_tl) = self.shard_timelines.get_mut(&(metadata.shard, timeline_id)) else {
+return false;
+};
+let Some(layer_ref) = shard_tl.get_mut(&(layer_file.clone(), metadata.generation)) else {
+return false;
+};
+layer_ref.ref_count += 1;
+true
+}
+pub(crate) fn get_orphans(&self) -> Vec<(ShardIndex, TimelineId, LayerFileName, Generation)> {
+let mut result = Vec::new();
+for ((shard_index, timeline_id), layers) in &self.shard_timelines {
+for ((layer_file, generation), layer_ref) in layers {
+if layer_ref.ref_count == 0 {
+result.push((*shard_index, *timeline_id, layer_file.clone(), *generation))
+}
+}
+}
+result
+}
+}
 #[derive(Debug)]
 pub(crate) struct S3TimelineBlobData {
 pub(crate) blob_data: BlobDataParseResult,
@@ -238,7 +271,7 @@ fn parse_layer_object_name(name: &str) -> Result<(LayerFileName, Generation), St
 pub(crate) async fn list_timeline_blobs(
 s3_client: &Client,
-id: TenantTimelineId,
+id: TenantShardTimelineId,
 s3_root: &RootTarget,
 ) -> anyhow::Result<S3TimelineBlobData> {
 let mut s3_layers = HashSet::new();
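
The refcounting scheme behind `TenantObjectListing` (list every object first, bump a counter for each index reference, then report zero-count entries as orphans) can be sketched with plain standard-library types. This is a simplified illustration under stated assumptions: `TimelineKey` and `LayerKey` are hypothetical stand-ins for `(ShardIndex, TimelineId)` and `(LayerFileName, Generation)`, and `Listing` is not the scrubber's type.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for (ShardIndex, TimelineId).
pub type TimelineKey = (u8, String);
/// Hypothetical stand-in for (LayerFileName, Generation).
pub type LayerKey = (String, u32);

/// Per-timeline map from layer key to the number of index references seen.
#[derive(Default)]
pub struct Listing {
    timelines: HashMap<TimelineKey, HashMap<LayerKey, usize>>,
}

impl Listing {
    /// Record the layers found by listing one timeline prefix; all start unreferenced.
    pub fn push(&mut self, timeline: TimelineKey, layers: HashSet<LayerKey>) {
        let replaced = self
            .timelines
            .insert(timeline, layers.into_iter().map(|l| (l, 0)).collect());
        assert!(replaced.is_none(), "each timeline should be listed only once");
    }

    /// Returns true and bumps the refcount if the referenced layer was listed.
    pub fn check_ref(&mut self, timeline: &TimelineKey, layer: &LayerKey) -> bool {
        match self.timelines.get_mut(timeline).and_then(|m| m.get_mut(layer)) {
            Some(count) => {
                *count += 1;
                true
            }
            None => false,
        }
    }

    /// Layers that no index referenced, sorted for deterministic output.
    pub fn orphans(&self) -> Vec<(TimelineKey, LayerKey)> {
        let mut out: Vec<_> = self
            .timelines
            .iter()
            .flat_map(|(tl, layers)| {
                layers
                    .iter()
                    .filter(|(_, count)| **count == 0)
                    .map(move |(layer, _)| (tl.clone(), layer.clone()))
            })
            .collect();
        out.sort();
        out
    }
}
```

Keeping the counts at tenant level rather than per timeline is what lets an index in one shard reference a layer stored under another shard's prefix without the layer being reported as missing or orphaned.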


@@ -10,15 +10,16 @@ use aws_sdk_s3::{
 Client,
 };
 use futures_util::{pin_mut, TryStreamExt};
+use pageserver_api::shard::TenantShardId;
 use serde::{Deserialize, Serialize};
 use tokio_stream::StreamExt;
-use utils::id::{TenantId, TenantTimelineId};
+use utils::id::TenantId;
 use crate::{
 cloud_admin_api::{CloudAdminApiClient, MaybeDeleted, ProjectData},
 init_remote,
 metadata_stream::{stream_listing, stream_tenant_timelines, stream_tenants},
-BucketConfig, ConsoleConfig, NodeKind, RootTarget, TraversingDepth,
+BucketConfig, ConsoleConfig, NodeKind, RootTarget, TenantShardTimelineId, TraversingDepth,
 };
 #[derive(Serialize, Deserialize, Debug)]
@@ -29,8 +30,8 @@ enum GarbageReason {
 #[derive(Serialize, Deserialize, Debug)]
 enum GarbageEntity {
-Tenant(TenantId),
-Timeline(TenantTimelineId),
+Tenant(TenantShardId),
+Timeline(TenantShardTimelineId),
 }
 #[derive(Serialize, Deserialize, Debug)]
@@ -142,6 +143,9 @@ async fn find_garbage_inner(
 console_projects.len()
 );
+// TODO(sharding): batch calls into Console so that we only call once for each TenantId,
+// rather than checking the same TenantId for multiple TenantShardId
 // Enumerate Tenants in S3, and check if each one exists in Console
 tracing::info!("Finding all tenants in bucket {}...", bucket_config.bucket);
 let tenants = stream_tenants(&s3_client, &target);
@@ -149,10 +153,10 @@ async fn find_garbage_inner(
 let api_client = cloud_admin_api_client.clone();
 let console_projects = &console_projects;
 async move {
-match console_projects.get(&t) {
+match console_projects.get(&t.tenant_id) {
 Some(project_data) => Ok((t, Some(project_data.clone()))),
 None => api_client
-.find_tenant_project(t)
+.find_tenant_project(t.tenant_id)
 .await
 .map_err(|e| anyhow::anyhow!(e))
 .map(|r| (t, r)),
@@ -166,21 +170,21 @@ async fn find_garbage_inner(
 // checks if they are enabled by the `depth` parameter.
 pin_mut!(tenants_checked);
 let mut garbage = GarbageList::new(node_kind, bucket_config);
-let mut active_tenants: Vec<TenantId> = vec![];
+let mut active_tenants: Vec<TenantShardId> = vec![];
 let mut counter = 0;
 while let Some(result) = tenants_checked.next().await {
-let (tenant_id, console_result) = result?;
+let (tenant_shard_id, console_result) = result?;
 // Paranoia check
 if let Some(project) = &console_result {
-assert!(project.tenant == tenant_id);
+assert!(project.tenant == tenant_shard_id.tenant_id);
 }
-if garbage.maybe_append(GarbageEntity::Tenant(tenant_id), console_result) {
-tracing::debug!("Tenant {tenant_id} is garbage");
+if garbage.maybe_append(GarbageEntity::Tenant(tenant_shard_id), console_result) {
+tracing::debug!("Tenant {tenant_shard_id} is garbage");
 } else {
-tracing::debug!("Tenant {tenant_id} is active");
-active_tenants.push(tenant_id);
+tracing::debug!("Tenant {tenant_shard_id} is active");
+active_tenants.push(tenant_shard_id);
 }
 counter += 1;
@@ -266,13 +270,13 @@ impl std::fmt::Display for PurgeMode {
 pub async fn get_tenant_objects(
 s3_client: &Arc<Client>,
 target: RootTarget,
-tenant_id: TenantId,
+tenant_shard_id: TenantShardId,
 ) -> anyhow::Result<Vec<ObjectIdentifier>> {
-tracing::debug!("Listing objects in tenant {tenant_id}");
+tracing::debug!("Listing objects in tenant {tenant_shard_id}");
 // TODO: apply extra validation based on object modification time. Don't purge
 // tenants where any timeline's index_part.json has been touched recently.
-let mut tenant_root = target.tenant_root(&tenant_id);
+let mut tenant_root = target.tenant_root(&tenant_shard_id);
 // Remove delimiter, so that object listing lists all keys in the prefix and not just
 // common prefixes.
@@ -285,7 +289,7 @@ pub async fn get_tenant_objects(
 pub async fn get_timeline_objects(
 s3_client: &Arc<Client>,
 target: RootTarget,
-ttid: TenantTimelineId,
+ttid: TenantShardTimelineId,
 ) -> anyhow::Result<Vec<ObjectIdentifier>> {
 tracing::debug!("Listing objects in timeline {ttid}");
 let mut timeline_root = target.timeline_root(&ttid);


@@ -22,6 +22,7 @@ use aws_sdk_s3::{Client, Config};
 use clap::ValueEnum;
 use pageserver::tenant::TENANTS_SEGMENT_NAME;
+use pageserver_api::shard::TenantShardId;
 use reqwest::Url;
 use serde::{Deserialize, Serialize};
 use std::io::IsTerminal;
@@ -29,7 +30,7 @@ use tokio::io::AsyncReadExt;
 use tracing::error;
 use tracing_appender::non_blocking::WorkerGuard;
 use tracing_subscriber::{fmt, prelude::*, EnvFilter};
-use utils::id::{TenantId, TenantTimelineId};
+use utils::id::TimelineId;
 const MAX_RETRIES: usize = 20;
 const CLOUD_ADMIN_API_TOKEN_ENV_VAR: &str = "CLOUD_ADMIN_API_TOKEN";
@@ -44,6 +45,35 @@ pub struct S3Target {
pub delimiter: String, pub delimiter: String,
} }
/// Convenience for referring to timelines within a particular shard: more ergonomic
/// than using a 2-tuple.
///
/// This is the shard-aware equivalent of TenantTimelineId. It's defined here rather
/// than somewhere more broadly exposed, because this kind of thing is rarely needed
/// in the pageserver, as all timeline objects exist in the scope of a particular
/// tenant: the scrubber is different in that it handles collections of data referring to many
/// TenantShardTimelineIds in one place.
#[derive(Serialize, Deserialize, Debug, Clone, Copy, Hash, PartialEq, Eq)]
pub struct TenantShardTimelineId {
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
}
impl TenantShardTimelineId {
fn new(tenant_shard_id: TenantShardId, timeline_id: TimelineId) -> Self {
Self {
tenant_shard_id,
timeline_id,
}
}
}
impl Display for TenantShardTimelineId {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}/{}", self.tenant_shard_id, self.timeline_id)
}
}
#[derive(clap::ValueEnum, Debug, Clone, Copy, PartialEq, Eq)] #[derive(clap::ValueEnum, Debug, Clone, Copy, PartialEq, Eq)]
pub enum TraversingDepth { pub enum TraversingDepth {
Tenant, Tenant,
@@ -110,19 +140,19 @@ impl RootTarget {
} }
} }
pub fn tenant_root(&self, tenant_id: &TenantId) -> S3Target { pub fn tenant_root(&self, tenant_id: &TenantShardId) -> S3Target {
self.tenants_root().with_sub_segment(&tenant_id.to_string()) self.tenants_root().with_sub_segment(&tenant_id.to_string())
} }
pub fn timelines_root(&self, tenant_id: &TenantId) -> S3Target { pub fn timelines_root(&self, tenant_id: &TenantShardId) -> S3Target {
match self { match self {
Self::Pageserver(_) => self.tenant_root(tenant_id).with_sub_segment("timelines"), Self::Pageserver(_) => self.tenant_root(tenant_id).with_sub_segment("timelines"),
Self::Safekeeper(_) => self.tenant_root(tenant_id), Self::Safekeeper(_) => self.tenant_root(tenant_id),
} }
} }
pub fn timeline_root(&self, id: &TenantTimelineId) -> S3Target { pub fn timeline_root(&self, id: &TenantShardTimelineId) -> S3Target {
self.timelines_root(&id.tenant_id) self.timelines_root(&id.tenant_shard_id)
.with_sub_segment(&id.timeline_id.to_string()) .with_sub_segment(&id.timeline_id.to_string())
} }
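
The composite ID introduced in this hunk can be illustrated in isolation. The following is a minimal sketch, not the scrubber's code: `ShardId`, `TimelineId` and `ShardTimelineId` are hypothetical stand-ins for `TenantShardId`, `TimelineId` and `TenantShardTimelineId`, reduced to integers so the example compiles without the neon crates. It shows why a named struct with `Hash`, `Eq` and `Display` is more convenient than a 2-tuple when the value is used as a set key and in log messages.

```rust
use std::fmt;

// Hypothetical stand-ins for the real ID types, which live in other crates
// and carry more structure than a single integer.
#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq)]
pub struct ShardId(pub u32);

#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq)]
pub struct TimelineId(pub u32);

impl fmt::Display for ShardId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "shard-{}", self.0)
    }
}

impl fmt::Display for TimelineId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "timeline-{}", self.0)
    }
}

// Composite key: one timeline within one shard.
#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq)]
pub struct ShardTimelineId {
    pub shard_id: ShardId,
    pub timeline_id: TimelineId,
}

impl ShardTimelineId {
    pub fn new(shard_id: ShardId, timeline_id: TimelineId) -> Self {
        Self {
            shard_id,
            timeline_id,
        }
    }
}

// Same "{shard}/{timeline}" layout as the Display impl in the diff.
impl fmt::Display for ShardTimelineId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}/{}", self.shard_id, self.timeline_id)
    }
}
```

Because the struct is `Copy` and hashable, it can be inserted into a `HashSet` by value, which is how `with_errors`, `with_warnings` and `with_orphans` use the real type.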


@@ -3,14 +3,15 @@ use async_stream::{stream, try_stream};
use aws_sdk_s3::{types::ObjectIdentifier, Client}; use aws_sdk_s3::{types::ObjectIdentifier, Client};
use tokio_stream::Stream; use tokio_stream::Stream;
use crate::{list_objects_with_retries, RootTarget, S3Target, TenantId}; use crate::{list_objects_with_retries, RootTarget, S3Target, TenantShardTimelineId};
use utils::id::{TenantTimelineId, TimelineId}; use pageserver_api::shard::TenantShardId;
use utils::id::TimelineId;
/// Given an S3 bucket, output a stream of TenantIds discovered via ListObjectsv2 /// Given an S3 bucket, output a stream of TenantIds discovered via ListObjectsv2
pub fn stream_tenants<'a>( pub fn stream_tenants<'a>(
s3_client: &'a Client, s3_client: &'a Client,
target: &'a RootTarget, target: &'a RootTarget,
) -> impl Stream<Item = anyhow::Result<TenantId>> + 'a { ) -> impl Stream<Item = anyhow::Result<TenantShardId>> + 'a {
try_stream! { try_stream! {
let mut continuation_token = None; let mut continuation_token = None;
let tenants_target = target.tenants_root(); let tenants_target = target.tenants_root();
@@ -44,14 +45,14 @@ pub fn stream_tenants<'a>(
} }
} }
/// Given a TenantId, output a stream of the timelines within that tenant, discovered /// Given a TenantShardId, output a stream of the timelines within that tenant, discovered
/// using ListObjectsv2. The listing is done before the stream is built, so that this /// using ListObjectsv2. The listing is done before the stream is built, so that this
/// function can be used to generate concurrency on a stream using buffer_unordered. /// function can be used to generate concurrency on a stream using buffer_unordered.
pub async fn stream_tenant_timelines<'a>( pub async fn stream_tenant_timelines<'a>(
s3_client: &'a Client, s3_client: &'a Client,
target: &'a RootTarget, target: &'a RootTarget,
tenant: TenantId, tenant: TenantShardId,
) -> anyhow::Result<impl Stream<Item = Result<TenantTimelineId, anyhow::Error>> + 'a> { ) -> anyhow::Result<impl Stream<Item = Result<TenantShardTimelineId, anyhow::Error>> + 'a> {
let mut timeline_ids: Vec<Result<TimelineId, anyhow::Error>> = Vec::new(); let mut timeline_ids: Vec<Result<TimelineId, anyhow::Error>> = Vec::new();
let mut continuation_token = None; let mut continuation_token = None;
let timelines_target = target.timelines_root(&tenant); let timelines_target = target.timelines_root(&tenant);
@@ -98,7 +99,7 @@ pub async fn stream_tenant_timelines<'a>(
Ok(stream! { Ok(stream! {
for i in timeline_ids { for i in timeline_ids {
let id = i?; let id = i?;
yield Ok(TenantTimelineId::new(tenant, id)); yield Ok(TenantShardTimelineId::new(tenant, id));
} }
}) })
} }


@@ -2,23 +2,25 @@ use std::collections::{HashMap, HashSet};
use crate::checks::{ use crate::checks::{
branch_cleanup_and_check_errors, list_timeline_blobs, BlobDataParseResult, S3TimelineBlobData, branch_cleanup_and_check_errors, list_timeline_blobs, BlobDataParseResult, S3TimelineBlobData,
TimelineAnalysis, TenantObjectListing, TimelineAnalysis,
}; };
use crate::metadata_stream::{stream_tenant_timelines, stream_tenants}; use crate::metadata_stream::{stream_tenant_timelines, stream_tenants};
use crate::{init_remote, BucketConfig, NodeKind, RootTarget}; use crate::{init_remote, BucketConfig, NodeKind, RootTarget, TenantShardTimelineId};
use aws_sdk_s3::Client; use aws_sdk_s3::Client;
use futures_util::{pin_mut, StreamExt, TryStreamExt}; use futures_util::{pin_mut, StreamExt, TryStreamExt};
use histogram::Histogram; use histogram::Histogram;
use pageserver::tenant::remote_timeline_client::remote_layer_path;
use pageserver::tenant::IndexPart; use pageserver::tenant::IndexPart;
use pageserver_api::shard::TenantShardId;
use serde::Serialize; use serde::Serialize;
use utils::id::TenantTimelineId; use utils::id::TenantId;
#[derive(Serialize)] #[derive(Serialize)]
pub struct MetadataSummary { pub struct MetadataSummary {
count: usize, count: usize,
with_errors: HashSet<TenantTimelineId>, with_errors: HashSet<TenantShardTimelineId>,
with_warnings: HashSet<TenantTimelineId>, with_warnings: HashSet<TenantShardTimelineId>,
with_garbage: HashSet<TenantTimelineId>, with_orphans: HashSet<TenantShardTimelineId>,
indices_by_version: HashMap<usize, usize>, indices_by_version: HashMap<usize, usize>,
layer_count: MinMaxHisto, layer_count: MinMaxHisto,
@@ -88,7 +90,7 @@ impl MetadataSummary {
count: 0, count: 0,
with_errors: HashSet::new(), with_errors: HashSet::new(),
with_warnings: HashSet::new(), with_warnings: HashSet::new(),
with_garbage: HashSet::new(), with_orphans: HashSet::new(),
indices_by_version: HashMap::new(), indices_by_version: HashMap::new(),
layer_count: MinMaxHisto::new(), layer_count: MinMaxHisto::new(),
timeline_size_bytes: MinMaxHisto::new(), timeline_size_bytes: MinMaxHisto::new(),
@@ -132,7 +134,7 @@ impl MetadataSummary {
} }
} }
fn update_analysis(&mut self, id: &TenantTimelineId, analysis: &TimelineAnalysis) { fn update_analysis(&mut self, id: &TenantShardTimelineId, analysis: &TimelineAnalysis) {
if !analysis.errors.is_empty() { if !analysis.errors.is_empty() {
self.with_errors.insert(*id); self.with_errors.insert(*id);
} }
@@ -142,6 +144,10 @@ impl MetadataSummary {
} }
} }
fn notify_timeline_orphan(&mut self, ttid: &TenantShardTimelineId) {
self.with_orphans.insert(*ttid);
}
/// Long-form output for printing at end of a scan /// Long-form output for printing at end of a scan
pub fn summary_string(&self) -> String { pub fn summary_string(&self) -> String {
let version_summary: String = itertools::join( let version_summary: String = itertools::join(
@@ -155,7 +161,7 @@ impl MetadataSummary {
"Timelines: {0} "Timelines: {0}
With errors: {1} With errors: {1}
With warnings: {2} With warnings: {2}
With garbage: {3} With orphan layers: {3}
Index versions: {version_summary} Index versions: {version_summary}
Timeline size bytes: {4} Timeline size bytes: {4}
Layer size bytes: {5} Layer size bytes: {5}
@@ -164,7 +170,7 @@ Timeline layer count: {6}
self.count, self.count,
self.with_errors.len(), self.with_errors.len(),
self.with_warnings.len(), self.with_warnings.len(),
self.with_garbage.len(), self.with_orphans.len(),
self.timeline_size_bytes.oneline(), self.timeline_size_bytes.oneline(),
self.layer_size_bytes.oneline(), self.layer_size_bytes.oneline(),
self.layer_count.oneline(), self.layer_count.oneline(),
@@ -192,31 +198,131 @@ pub async fn scan_metadata(bucket_config: BucketConfig) -> anyhow::Result<Metada
// Generate a stream of TenantTimelineId // Generate a stream of TenantTimelineId
let timelines = tenants.map_ok(|t| stream_tenant_timelines(&s3_client, &target, t)); let timelines = tenants.map_ok(|t| stream_tenant_timelines(&s3_client, &target, t));
let timelines = timelines.try_buffer_unordered(CONCURRENCY); let timelines = timelines.try_buffered(CONCURRENCY);
let timelines = timelines.try_flatten(); let timelines = timelines.try_flatten();
// Generate a stream of S3TimelineBlobData // Generate a stream of S3TimelineBlobData
async fn report_on_timeline( async fn report_on_timeline(
s3_client: &Client, s3_client: &Client,
target: &RootTarget, target: &RootTarget,
ttid: TenantTimelineId, ttid: TenantShardTimelineId,
) -> anyhow::Result<(TenantTimelineId, S3TimelineBlobData)> { ) -> anyhow::Result<(TenantShardTimelineId, S3TimelineBlobData)> {
let data = list_timeline_blobs(s3_client, ttid, target).await?; let data = list_timeline_blobs(s3_client, ttid, target).await?;
Ok((ttid, data)) Ok((ttid, data))
} }
let timelines = timelines.map_ok(|ttid| report_on_timeline(&s3_client, &target, ttid)); let timelines = timelines.map_ok(|ttid| report_on_timeline(&s3_client, &target, ttid));
let timelines = timelines.try_buffer_unordered(CONCURRENCY); let timelines = timelines.try_buffered(CONCURRENCY);
// We must gather all the TenantShardTimelineId->S3TimelineBlobData for each tenant, because different
// shards in the same tenant might refer to one another's keys if a shard split has happened.
let mut tenant_id = None;
let mut tenant_objects = TenantObjectListing::default();
let mut tenant_timeline_results = Vec::new();
fn analyze_tenant(
tenant_id: TenantId,
summary: &mut MetadataSummary,
mut tenant_objects: TenantObjectListing,
timelines: Vec<(TenantShardTimelineId, S3TimelineBlobData)>,
) {
let mut timeline_generations = HashMap::new();
for (ttid, data) in timelines {
// Stash the generation of each timeline, for later use identifying orphan layers
if let BlobDataParseResult::Parsed {
index_part: _index_part,
index_part_generation,
s3_layers: _s3_layers,
} = &data.blob_data
{
timeline_generations.insert(ttid, *index_part_generation);
}
// Apply checks to this timeline shard's metadata, and in the process update `tenant_objects`
// reference counts for layers across the tenant.
let analysis =
branch_cleanup_and_check_errors(&ttid, &mut tenant_objects, None, None, Some(data));
summary.update_analysis(&ttid, &analysis);
}
// Identifying orphan layers must be done on a tenant-wide basis, because individual
// shards' layers may be referenced by other shards.
//
// Orphan layers are not a corruption, and not an indication of a problem. They are just
// consuming some space in remote storage, and may be cleaned up at leisure.
for (shard_index, timeline_id, layer_file, generation) in tenant_objects.get_orphans() {
let ttid = TenantShardTimelineId {
tenant_shard_id: TenantShardId {
tenant_id,
shard_count: shard_index.shard_count,
shard_number: shard_index.shard_number,
},
timeline_id,
};
if let Some(timeline_generation) = timeline_generations.get(&ttid) {
if &generation >= timeline_generation {
// Candidate orphan layer is in the current or future generation relative
// to the index we read for this timeline shard, so its absence from the index
// doesn't make it an orphan: more likely, it is a case where the layer was
// uploaded, but the index referencing the layer wasn't written yet.
continue;
}
}
let orphan_path = remote_layer_path(
&tenant_id,
&timeline_id,
shard_index,
&layer_file,
generation,
);
tracing::info!("Orphan layer detected: {orphan_path}");
summary.notify_timeline_orphan(&ttid);
}
}
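
The generation check inside `analyze_tenant` can be read as a small predicate. The sketch below is an illustration only: `is_reportable_orphan` is a hypothetical helper name, and generations are modelled as plain `u32` values rather than the real generation type. It mirrors the rule in the comments above: an unreferenced layer is skipped when its generation is at or ahead of the index generation read for that timeline shard, and reported otherwise, including when no index generation is known.

```rust
/// Hypothetical helper: should an unreferenced layer be reported as an orphan?
///
/// `index_generation` is `None` when no index was successfully parsed for the
/// timeline shard, in which case the layer is reported.
pub fn is_reportable_orphan(layer_generation: u32, index_generation: Option<u32>) -> bool {
    match index_generation {
        // Layer is in the current or a future generation relative to the
        // index: it was probably uploaded before the index referencing it
        // was written, so it is not treated as an orphan.
        Some(index_gen) if layer_generation >= index_gen => false,
        // Older generation, or no index generation known: report it.
        _ => true,
    }
}
```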
// Iterate through all the timeline results. These are in key-order, so
// all results for the same tenant will be adjacent. We accumulate these,
// and then call `analyze_tenant` to flush, when we see the next tenant ID.
let mut summary = MetadataSummary::new(); let mut summary = MetadataSummary::new();
pin_mut!(timelines); pin_mut!(timelines);
while let Some(i) = timelines.next().await { while let Some(i) = timelines.next().await {
let (ttid, data) = i?; let (ttid, data) = i?;
summary.update_data(&data); summary.update_data(&data);
let analysis = match tenant_id {
branch_cleanup_and_check_errors(&ttid, &target, None, None, Some(data)).await; None => tenant_id = Some(ttid.tenant_shard_id.tenant_id),
Some(prev_tenant_id) => {
if prev_tenant_id != ttid.tenant_shard_id.tenant_id {
let tenant_objects = std::mem::take(&mut tenant_objects);
let timelines = std::mem::take(&mut tenant_timeline_results);
analyze_tenant(prev_tenant_id, &mut summary, tenant_objects, timelines);
tenant_id = Some(ttid.tenant_shard_id.tenant_id);
}
}
}
summary.update_analysis(&ttid, &analysis); if let BlobDataParseResult::Parsed {
index_part: _index_part,
index_part_generation: _index_part_generation,
s3_layers,
} = &data.blob_data
{
tenant_objects.push(ttid, s3_layers.clone());
}
tenant_timeline_results.push((ttid, data));
}
if !tenant_timeline_results.is_empty() {
analyze_tenant(
tenant_id.expect("Must be set if results are present"),
&mut summary,
tenant_objects,
tenant_timeline_results,
);
} }
Ok(summary) Ok(summary)
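
The accumulate-and-flush loop above depends on results arriving in key order, so that all entries for one tenant are adjacent. A minimal synchronous sketch of the same pattern follows, under the assumption that the input is already ordered by key; `group_adjacent` is a hypothetical name, and a plain iterator stands in for the async stream. Where the scrubber calls `analyze_tenant` on each flush, this version just collects the groups.

```rust
/// Groups consecutive `(key, value)` items that share a key.
/// Hypothetical helper: the input must already be ordered so that equal keys
/// are adjacent, otherwise one key will produce several groups.
pub fn group_adjacent<K, V>(items: impl IntoIterator<Item = (K, V)>) -> Vec<(K, Vec<V>)>
where
    K: PartialEq + Copy,
{
    let mut groups = Vec::new();
    let mut current_key: Option<K> = None;
    let mut pending: Vec<V> = Vec::new();

    for (key, value) in items {
        match current_key {
            // First item: start the first group.
            None => current_key = Some(key),
            // Key changed: flush what was accumulated for the previous key.
            Some(prev) if prev != key => {
                groups.push((prev, std::mem::take(&mut pending)));
                current_key = Some(key);
            }
            // Same key as before: keep accumulating.
            Some(_) => {}
        }
        pending.push(value);
    }

    // Flush the final group, which no later key change will trigger.
    if let Some(key) = current_key {
        groups.push((key, pending));
    }
    groups
}
```

The final flush after the loop corresponds to the `if !tenant_timeline_results.is_empty()` block in the diff; without it the last tenant would never be analyzed.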

Some files were not shown because too many files have changed in this diff