Compare commits

..

40 Commits

Author SHA1 Message Date
Heikki Linnakangas
6862a803b1 Cache initdb output to speed up tenant creation in tests
initdb takes about 1 s. Our tests create and destroy a lot of tenants,
so that adds up. Cache the initdb result to speed it up.

This is currently only enabled in tests. Out of caution, mostly. But
also because when you reuse the initdb result, all the postgres
clusters end up having the same system_identifier, which is supposed
to be unique. It's not necessary for it to be unique for correctness,
nothing critical relies on it and you can easily end up with duplicate
system_identifiers in standalone PostgreSQL too, if you e.g. create a
backup and restore it on a different system. But it is used in various
checks to reduce the chance that you e.g. accidentally apply WAL
belonging to a different cluster.

Because this is aimed at tests, there are a few things that might be
surprising:

- The initdb cache directory is configurable, and can be outside the
  pageserver's repo directory. This allows reuse across different
  pageservers running on the same host. In production use, that'd be
  pointless, but our tests create a lot of pageservers.

- The cache is not automatically purged at start / shutdown. For
  production use, we'd probably want that, so that we'd pick up any
  changes in what an empty cluster looks like after a Postgres minor
  version upgrade, for example. But again tests create and destroy a
  lot of pageservers, so it's important to retain the cache.

- The locking on the cache directory relies purely on filesystem
  operations and atomic rename(). Using e.g. a rust Mutex() would be
  more straightforward, but that's not enough because the cache needs
  to be shared between different pageservers running on the same
  system.
2024-09-21 02:09:39 +03:00
Anastasia Lubennikova
f03f7b3868 Bump vendor/postgres to include extension path fix (#9076)
This is a pre requisite for
https://github.com/neondatabase/neon/pull/8681
2024-09-20 20:24:40 +03:00
Christian Schwarz
ec5dce04eb pageserver: throttling: per-tenant metrics + more metrics to help understand throttle queue depth (#9077) 2024-09-20 16:48:26 +00:00
John Spray
6014f15157 pageserver: suppress noisy "layer became visible" logs (#9064)
## Problem

When layer visibility was added, an info log was included for the
situation where actual access to a layer disagrees with the visibility
calculation. This situation is safe, but I was interested in seeing when
it happens.

The log is pretty high volume, so this PR refines it to fire less often.

## Summary of changes

- For cases where accessing non-visible layers is normal, don't log at
all.
- Extend a unit test to increase confidence that the updates to
visibility on access are working as expected
- During compaction, only call the visibility calculation routine if
some image layers were created: previously, frequent calls resulted in
the visibility of layers getting reset every time we passed through
create_image_layers.
2024-09-20 16:07:09 +00:00
Conrad Ludgate
e675a21346 utils: leaky bucket should only report throttled if the notify queue is blocked on sleep (#9072)
## Problem

Seems that PS might be too eager in reporting throttled tasks

## Summary of changes

Introduce a sleep counter. If the sleep counter increases, then the
acquire tasks was throttled.
2024-09-20 16:09:39 +01:00
Alex Chi Z.
6b93230270 fix(pageserver): receive body error now 500 (#9052)
close https://github.com/neondatabase/neon/issues/8903

In https://github.com/neondatabase/neon/issues/8903 we observed JSON
decoding error to have the following error message in the log:

```
Error processing HTTP request: Resource temporarily unavailable: 3956 (pageserver-6.ap-southeast-1.aws.neon.tech) error receiving body: error decoding response body
```

This is hard to understand. In this patch, we make the error message
more reasonable.

## Summary of changes

* receive body error is now an internal server error, passthrough the
`reqwest::Error` (only decoding error) as `anyhow::Error`.
* instead of formatting the error using `to_string`, we use the
alternative `anyhow::Error` formatting, so that it prints out the cause
of the error (i.e., what exactly cannot serde decode).

I would expect seeing something like `error receiving body: error
decoding response body: XXX field not found` after this patch, though I
didn't set up a testing environment to observe the exact behavior.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-09-20 10:37:28 -04:00
Heikki Linnakangas
797aa4ffaa Skip running clippy in --release mode. (#9073)
It's pretty expensive to run, and there is very little difference
between debug and release builds that could lead to different clippy
warnings.

This is extracted from PR #8912. That PR wandered off into various
improvements we could make, but we seem to have consensus on this part
at least.
2024-09-20 17:22:58 +03:00
Christian Schwarz
c45b56e0bb pageserver: add counters for started smgr/getpage requests (#9069)
After this PR

```
curl localhost:9898/metrics | grep smgr_ | grep start
```

```
pageserver_smgr_query_started_count{shard_id="0000",smgr_query_type="get_page_at_lsn",tenant_id="...",timeline_id="..."} 0
pageserver_smgr_query_started_global_count{smgr_query_type="get_db_size"} 0
pageserver_smgr_query_started_global_count{smgr_query_type="get_page_at_lsn"} 0
pageserver_smgr_query_started_global_count{smgr_query_type="get_rel_exists"} 0
pageserver_smgr_query_started_global_count{smgr_query_type="get_rel_size"} 0
pageserver_smgr_query_started_global_count{smgr_query_type="get_slru_segment"} 0
```

We instantiate the per-tenant counter only for `get_page_at_lsn`.
2024-09-20 14:55:50 +01:00
Alexander Bayandin
3104f0f250 Safekeeper: fix OpenAPI spec (#9066)
## Problem

Safekeeper's OpenAPI spec is incorrect:

```
Semantic error at paths./v1/tenant/{tenant_id}/timeline/{timeline_id}.get.responses.404.content.application/json.schema.$ref
$refs must reference a valid location in the document
Jump to line 126
```
Checked on https://editor.swagger.io

## Summary of changes
- Add `NotFoundError` 
- Add `description` and `license` fields to make Cloud OpenAPI spec
linter happy
2024-09-20 12:00:05 +01:00
Arseny Sher
f2c08195f0 Bump vendor/postgres.
Includes PRs:
- ERROR out instead of segfaulting when walsender slots are full.
- logical worker: respond to publisher even under dense stream.
2024-09-20 12:38:42 +03:00
Alex Chi Z.
d0cbfda15c refactor(pageserver): check layer map valid in one place (#9051)
We have 3 places where we implement layer map checks.

## Summary of changes

Now we have a single check function being called in all places.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-09-19 20:29:28 +00:00
Yuchen Liang
1708743e78 pageserver: wait for lsn lease duration after transition into AttachedSingle (#9024)
Part of #7497, closes https://github.com/neondatabase/neon/issues/8890.

## Problem

Since leases are in-memory objects, we need to take special care of them
after pageserver restarts and while doing a live migration. The approach
we took for pageserver restart is to wait for at least lease duration
before doing first GC. We want to do the same for live migration. Since
we do not do any GC when a tenant is in `AttachedStale` or
`AttachedMulti` mode, only the transition from `AttachedMulti` to
`AttachedSingle` requires this treatment.

## Summary of changes

- Added `lsn_lease_deadline` field in `GcBlock::reasons`: the tenant is
temporarily blocked from GC until we reach the deadline. This
information does not persist to S3.
- In `GCBlock::start`, skip the GC iteration if we are blocked by the
lsn lease deadline.
- In `TenantManager::upsert_location`, set the lsn_lease_deadline to
`Instant::now() + lsn_lease_length` so the granted leases have a chance
to be renewed before we run GC for the first time after transitioned
from AttachedMulti to AttachedSingle.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-09-19 17:27:10 +01:00
Conrad Ludgate
0a1ca7670c proxy: remove auth info from http conn info & fixup jwt api trait (#9047)
misc changes split out from #8855 

- **allow cloning the request context in a read-only fashion for
background tasks**
- **propagate endpoint and request context through the jwk cache**
- **only allow password based auth for md5 during testing**
- **remove auth info from conn info**
2024-09-19 15:09:30 +00:00
Alex Chi Z.
ff9f065c43 impr(pageserver): log image layer creation (#9050)
https://github.com/neondatabase/neon/pull/9028 changed the image layer
creation log into trace level. However, I personally find logging image
layer creation useful when reading the logs -- it makes it clear that
the image layer creation is happening and gives a clear idea of the
progress. Therefore, I propose to continue logging them for
create_image_layers set of functions.

## Summary of changes

* Add info logging for all image layers created in legacy compaction.
* Add info logging for all layers creation in testing functions.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-09-19 10:43:12 -04:00
Vlad Lazar
21eeafaaa5 pageserver: simple fix for vectored read image layer skip (#9026)
## Problem

Different keyspaces may require different floor LSNs in vectored
delta layer visits. This patch adds support for such cases.

## Summary of changes

Different keyspaces wishing to read the same layer might
require different stop lsns (or lsn floor). The start LSN
of the read (or the lsn ceil) will always be the same.

With this observation, we fix skipping of image layers by
indexing the fringe by layer id plus lsn floor.

This is very simple, but means that we can visit delta layers twice
in certain cases. Still, I think it's very unlikely for any extra
merging to have taken place in this case, so perhaps it makes sense to go
with the simpler patch.

Fixes https://github.com/neondatabase/neon/issues/9012
Alternative to https://github.com/neondatabase/neon/pull/9025
2024-09-19 14:51:00 +01:00
Arseny Sher
32a0e759bd safekeeper: add wal_last_modified to debug_dump.
Adds to debug_dump option to include highest modified time among all WAL
segments. In passing replace some str with OsStr to have less unwraps.
2024-09-19 16:17:25 +03:00
Heikki Linnakangas
7c489092b7 Remove unused duplicate DEFAULT_INGEST_BATCH_SIZE constant
This constant in 'tenant_conf_defaults' was unused, but there's
another constant with the same name in the global 'defaults'. I wish
the setting was configurable per-tenant, but it isn't, so let's remove
the confusing duplicate.
2024-09-19 15:41:35 +03:00
Heikki Linnakangas
06d55a3b12 Clean up concurrent logical size calc semaphore initialization
The DEFAULT_CONCURRENT_TENANT_SIZE_LOGICAL_SIZE_QUERIES constant was
unused, because we had just hardcoded it to 1 where the constant
should've been used.

Remove the ConfigurableSemaphore::Default implementation, since it was
unused.
2024-09-19 15:41:35 +03:00
Heikki Linnakangas
5c68e6a172 Remove unused constant
The code that used it was removed in commit b9d2c7bdd5
2024-09-19 15:41:35 +03:00
Heikki Linnakangas
2753abc0d8 Remove leftover enums for configuring vectored get implementation
The settings were removed in commit corb9d2c7b.
2024-09-19 15:41:35 +03:00
Heikki Linnakangas
a523548ed1 Remove unused cleanup_remaining_timeline_fs_traces function
There's some more code that still checks for uninit and delete
markers, see callers of is_delete_mark and is_uninit_mark, and github
issue #5718. But these functions were outright dead.
2024-09-19 11:57:10 +03:00
Heikki Linnakangas
2d4e5af18b Remove unused code for parsing a postgresql.conf file 2024-09-19 11:57:10 +03:00
Heikki Linnakangas
5da2340e74 Remove misc dead code in control_plane/ 2024-09-19 11:57:10 +03:00
Heikki Linnakangas
7b34c2d7af Remove misc dead code in libs/ 2024-09-19 11:57:10 +03:00
Heikki Linnakangas
15ae1fc3df Remove a few postgres constants that were not used
Dead code is generally useless, but with Postgres constants in
particular, I'm also worried that if they're not used anywhere, we
might fail to update them at a Postgres version update, and get very
confused later when they have wrong values.
2024-09-19 11:57:10 +03:00
Heikki Linnakangas
728b79b9dd Remove some unnecessary derives 2024-09-19 11:57:10 +03:00
Alex Chi Z.
9d1c6f23d3 fix(storage-scrubber): log version after initialize the logger (#9049)
When I checked the log in Grafana I couldn't find the scrubber version.
Then I realized that it should be logged after the logger gets
initialized.

## Summary of changes

Log after initializing the logger for the scrubber.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-09-18 14:13:57 -04:00
Christian Schwarz
035a49a6b2 neon_local start: parallel startup to break cyclic dependency (#8950)
(Found this useful during investigation
https://github.com/neondatabase/cloud/issues/16886.)

Problem
-------

Before this PR, `neon_local` sequentially does the following:
1. launch storcon process
2. wait for storcon to signal readiness
[here](75310fe441/control_plane/src/storage_controller.rs (L804-L808))
3. start pageserver
4. wait for pageserver to become ready
[here](c43e664ff5/control_plane/src/pageserver.rs (L343-L346))
5. etc

The problem is that storcon's readiness waits for the
[`startup_reconcile`](cbcd4058ed/storage_controller/src/service.rs (L520-L523))
to complete.

But pageservers aren't started at this point.

So, worst case we wait for `STARTUP_RECONCILE_TIMEOUT/2`, i.e., 15s.

This is more than the 10s default timeout allowed by neon_local.

So, the result is that `neon_local start` fails to start storcon and
stops everything.

Solution
--------

In this PR I choose the the radical solution to start everything in
parallel.

It junks up the output because we do stuff like `print!(".")` to
indicate progress.
We should just abandon that.
And switch to `utils::logging` + `tracing` with separate spans for each
component.
I can do that in this PR or we leave it as a follow-up.

Alternatives Considered
-----------------------

The Pageserver's `/v1/status` or in fact any endpoint of the mgmt API
will not `accept()` on the mgmt API socket until after the `re-attach`
call to storcon returned success.

So, it's insufficient to change the startup order to start Pageservers
first.

We cannot easily change Pageserver startup order because
`init_tenant_mgr` must complete before we start serving the mgmt API.
Otherwise tenant detach calls et al can race with `init_tenant_mgr`.

We'd have to add a "loading" state to tenant mgr and make all API
endpoints except `/v1/status` wait for _that_ to complete.


Related
-------

- https://github.com/neondatabase/neon/pull/6475
2024-09-18 18:17:55 +02:00
Folke Behrens
794bd4b866 proxy: mock cplane usable without allowed-ips table (#9046) 2024-09-18 17:14:53 +02:00
Alexander Bayandin
ac6a1151ae test_postgres_version: reenable version check for prereleased versions 2024-09-18 14:51:59 +01:00
Tristan Partin
2f37f0384c Add v17 to revisions.json
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-09-18 14:51:59 +01:00
Alexander Bayandin
e161a2fa42 CI(deploy): fix deploy to staging and prod (#9030)
## Problem

It turns out the previous approach (with `skip_if` input) doesn't work
(from https://github.com/neondatabase/neon/pull/9017).
Revert it and use more straightforward if-conditions

## Summary of changes
- Revert efbe8db7f1
- Add if-condition to`promote-compatibility-data` job and relevant
comments
2024-09-18 14:26:47 +01:00
Folke Behrens
c5cd8577ff proxy: make sql-over-http max request/response sizes configurable (#9029) 2024-09-18 13:58:51 +02:00
Heikki Linnakangas
3454ef7507 Refactor ImageLayerWriter to avoid passing a Timeline to finish() (#9028)
Commit ca5390a89d made a similar change to DeltaLayerWriter.

We bumped into this with Stas with our hackathon project, to create a
standalong program to create image layers directly from a Postgres data
directory. It needs to create image layers without having a Timeline and
other pageserver machinery.

This downgrades the "created image layer {}" message from INFO to TRACE
level. TRACE is used for the corresponding message on delta layer
creation too. The path logged in the message is now the temporary path,
before the file is renamed to its final name. Again commit ca5390a89d
made the same change for the message on delta layer creation.
2024-09-18 13:16:51 +03:00
Christian Schwarz
135e7e4306 add neon_local subcommand for the broker & use that from regression tests (#8948)
There's currently no way to just start/stop broker from `neon_local`.

This PR
* adds a sub-command
* uses that sub-command from the test suite instead of the pre-existing
Python `subprocess` based approach.

Found this useful during investigation
https://github.com/neondatabase/cloud/issues/16886.
2024-09-18 09:10:27 +02:00
Christian Schwarz
3cd2a3f931 refactor(walredo): process launch & kill-on-error machinery (#8951)
Immediate benefit: easier to spot what's going on.

Later benefit: use the extracted method in PR

- https://github.com/neondatabase/neon/pull/8952

which adds a `ping` command to walredo.

Found this useful during investigation
https://github.com/neondatabase/cloud/issues/16886.
2024-09-17 19:16:33 +00:00
Alexander Bayandin
d78f5ce6da CI: don't fetch the whole git history if it's not required (#9021)
## Problem
We do use `actions/checkout` with `fetch-depth: 0` when it's not
required

## Summary of changes
- Remove unneeded `fetch-depth: 0`
- Add a comment if `fetch-depth: 0` is required
2024-09-17 18:40:05 +01:00
Arpad Müller
a1b71b73fe Rename some S3 usages to "remote storage" in exposed messages (#8999)
In exposed messages like log messages we mentioned "S3", which is not
entirely accurate as we support Azure blob storage now as well.
2024-09-17 19:15:01 +02:00
Tristan Partin
6138eb50e9 Fix test code related to migrations
We added another migration in 5876c441ab,
but didn't bump this value. This had no effect, but best to fix it
anyway.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-09-17 15:56:05 +01:00
Heikki Linnakangas
d211f00f05 Remove unnecessary dependencies (#9000)
Found by "cargo machete"
2024-09-17 17:55:45 +03:00
116 changed files with 1753 additions and 1379 deletions

View File

@@ -26,15 +26,9 @@ on:
description: Azure tenant ID
required: true
type: string
skip_if:
description: Skip the job if this expression is true
required: true
type: boolean
jobs:
push-to-acr:
if: ${{ !inputs.skip_if }}
runs-on: ubuntu-22.04
permissions:
contents: read # This is required for actions/checkout

View File

@@ -54,8 +54,8 @@ jobs:
build-tag: ${{steps.build-tag.outputs.tag}}
steps:
- name: Checkout
uses: actions/checkout@v4
# Need `fetch-depth: 0` to count the number of commits in the branch
- uses: actions/checkout@v4
with:
fetch-depth: 0
@@ -159,6 +159,10 @@ jobs:
# This will catch compiler & clippy warnings in all feature combinations.
# TODO: use cargo hack for build and test as well, but, that's quite expensive.
# NB: keep clippy args in sync with ./run_clippy.sh
#
# The only difference between "clippy --debug" and "clippy --release" is that in --release mode,
# #[cfg(debug_assertions)] blocks are not built. It's not worth building everything for second
# time just for that, so skip "clippy --release".
- run: |
CLIPPY_COMMON_ARGS="$( source .neon_clippy_args; echo "$CLIPPY_COMMON_ARGS")"
if [ "$CLIPPY_COMMON_ARGS" = "" ]; then
@@ -168,8 +172,6 @@ jobs:
echo "CLIPPY_COMMON_ARGS=${CLIPPY_COMMON_ARGS}" >> $GITHUB_ENV
- name: Run cargo clippy (debug)
run: cargo hack --feature-powerset clippy $CLIPPY_COMMON_ARGS
- name: Run cargo clippy (release)
run: cargo hack --feature-powerset clippy --release $CLIPPY_COMMON_ARGS
- name: Check documentation generation
run: cargo doc --workspace --no-deps --document-private-items
@@ -374,8 +376,8 @@ jobs:
coverage-html: ${{ steps.upload-coverage-report-new.outputs.report-url }}
coverage-json: ${{ steps.upload-coverage-report-new.outputs.summary-json }}
steps:
- name: Checkout
uses: actions/checkout@v4
# Need `fetch-depth: 0` for differential coverage (to get diff between two commits)
- uses: actions/checkout@v4
with:
submodules: true
fetch-depth: 0
@@ -476,11 +478,9 @@ jobs:
runs-on: ${{ fromJson(format('["self-hosted", "{0}"]', matrix.arch == 'arm64' && 'large-arm64' || 'large')) }}
steps:
- name: Checkout
uses: actions/checkout@v4
- uses: actions/checkout@v4
with:
submodules: true
fetch-depth: 0
- uses: ./.github/actions/set-docker-config-dir
- uses: docker/setup-buildx-action@v3
@@ -555,11 +555,9 @@ jobs:
runs-on: ${{ fromJson(format('["self-hosted", "{0}"]', matrix.arch == 'arm64' && 'large-arm64' || 'large')) }}
steps:
- name: Checkout
uses: actions/checkout@v4
- uses: actions/checkout@v4
with:
submodules: true
fetch-depth: 0
- uses: ./.github/actions/set-docker-config-dir
- uses: docker/setup-buildx-action@v3
@@ -706,10 +704,7 @@ jobs:
VM_BUILDER_VERSION: v0.29.3
steps:
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- uses: actions/checkout@v4
- name: Downloading vm-builder
run: |
@@ -749,10 +744,7 @@ jobs:
runs-on: ${{ fromJson(format('["self-hosted", "{0}"]', matrix.arch == 'arm64' && 'small-arm64' || 'small')) }}
steps:
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- uses: actions/checkout@v4
- uses: ./.github/actions/set-docker-config-dir
- uses: docker/login-action@v3
@@ -859,6 +851,7 @@ jobs:
done
push-to-acr-dev:
if: github.ref_name == 'main'
needs: [ tag, promote-images ]
uses: ./.github/workflows/_push-to-acr.yml
with:
@@ -868,9 +861,9 @@ jobs:
registry_name: ${{ vars.AZURE_DEV_REGISTRY_NAME }}
subscription_id: ${{ vars.AZURE_DEV_SUBSCRIPTION_ID }}
tenant_id: ${{ vars.AZURE_TENANT_ID }}
skip_if: ${{ github.ref_name != 'main' }}
push-to-acr-prod:
if: github.ref_name == 'release'|| github.ref_name == 'release-proxy'
needs: [ tag, promote-images ]
uses: ./.github/workflows/_push-to-acr.yml
with:
@@ -880,7 +873,6 @@ jobs:
registry_name: ${{ vars.AZURE_PROD_REGISTRY_NAME }}
subscription_id: ${{ vars.AZURE_PROD_SUBSCRIPTION_ID }}
tenant_id: ${{ vars.AZURE_TENANT_ID }}
skip_if: ${{ !startsWith(github.ref_name, 'release') }}
trigger-custom-extensions-build-and-wait:
needs: [ check-permissions, tag ]
@@ -958,7 +950,8 @@ jobs:
deploy:
needs: [ check-permissions, promote-images, tag, build-and-test-locally, trigger-custom-extensions-build-and-wait, push-to-acr-dev, push-to-acr-prod ]
if: github.ref_name == 'main' || github.ref_name == 'release' || github.ref_name == 'release-proxy'
# `!failure() && !cancelled()` is required because the workflow depends on the job that can be skipped: `push-to-acr-dev` and `push-to-acr-prod`
if: (github.ref_name == 'main' || github.ref_name == 'release' || github.ref_name == 'release-proxy') && !failure() && !cancelled()
runs-on: [ self-hosted, small ]
container: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/ansible:latest
@@ -977,10 +970,7 @@ jobs:
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- uses: actions/checkout@v4
- name: Trigger deploy workflow
env:
@@ -1059,7 +1049,8 @@ jobs:
# The job runs on `release` branch and copies compatibility data and Neon artifact from the last *release PR* to the latest directory
promote-compatibility-data:
needs: [ deploy ]
if: github.ref_name == 'release'
# `!failure() && !cancelled()` is required because the workflow transitively depends on the job that can be skipped: `push-to-acr-dev` and `push-to-acr-prod`
if: github.ref_name == 'release' && !failure() && !cancelled()
runs-on: ubuntu-22.04
steps:

View File

@@ -34,8 +34,8 @@ jobs:
build-tag: ${{ steps.build-tag.outputs.tag }}
steps:
- name: Checkout
uses: actions/checkout@v4
# Need `fetch-depth: 0` to count the number of commits in the branch
- uses: actions/checkout@v4
with:
fetch-depth: 0

144
Cargo.lock generated
View File

@@ -1209,7 +1209,6 @@ dependencies = [
"remote_storage",
"serde",
"serde_json",
"serde_with",
"utils",
]
@@ -1218,7 +1217,6 @@ name = "compute_tools"
version = "0.1.0"
dependencies = [
"anyhow",
"async-compression",
"bytes",
"cfg-if",
"chrono",
@@ -1237,7 +1235,6 @@ dependencies = [
"reqwest 0.12.4",
"rlimit",
"rust-ini",
"serde",
"serde_json",
"signal-hook",
"tar",
@@ -1246,7 +1243,6 @@ dependencies = [
"tokio-postgres",
"tokio-stream",
"tokio-util",
"toml_edit",
"tracing",
"tracing-opentelemetry",
"tracing-subscriber",
@@ -1317,12 +1313,9 @@ dependencies = [
name = "consumption_metrics"
version = "0.1.0"
dependencies = [
"anyhow",
"chrono",
"rand 0.8.5",
"serde",
"serde_with",
"utils",
]
[[package]]
@@ -1334,9 +1327,7 @@ dependencies = [
"clap",
"comfy-table",
"compute_api",
"futures",
"git-version",
"hex",
"humantime",
"humantime-serde",
"hyper 0.14.26",
@@ -1344,7 +1335,6 @@ dependencies = [
"once_cell",
"pageserver_api",
"pageserver_client",
"postgres",
"postgres_backend",
"postgres_connection",
"regex",
@@ -1353,9 +1343,7 @@ dependencies = [
"scopeguard",
"serde",
"serde_json",
"serde_with",
"storage_broker",
"tar",
"thiserror",
"tokio",
"tokio-postgres",
@@ -1663,7 +1651,6 @@ dependencies = [
"hex",
"parking_lot 0.12.1",
"rand 0.8.5",
"scopeguard",
"smallvec",
"tracing",
"utils",
@@ -2233,24 +2220,22 @@ checksum = "4271d37baee1b8c7e4b708028c57d816cf9d2434acb33a549475f78c181f6253"
[[package]]
name = "git-version"
version = "0.3.5"
version = "0.3.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f6b0decc02f4636b9ccad390dcbe77b722a77efedfa393caf8379a51d5c61899"
checksum = "1ad568aa3db0fcbc81f2f116137f263d7304f512a1209b35b85150d3ef88ad19"
dependencies = [
"git-version-macro",
"proc-macro-hack",
]
[[package]]
name = "git-version-macro"
version = "0.3.5"
version = "0.3.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fe69f1cbdb6e28af2bac214e943b99ce8a0a06b447d15d3e61161b0423139f3f"
checksum = "53010ccb100b96a67bc32c0175f0ed1426b31b655d562898e57325f81c023ac0"
dependencies = [
"proc-macro-hack",
"proc-macro2",
"quote",
"syn 1.0.109",
"syn 2.0.52",
]
[[package]]
@@ -2744,19 +2729,6 @@ dependencies = [
"libc",
]
[[package]]
name = "inotify"
version = "0.10.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fdd168d97690d0b8c412d6b6c10360277f4d7ee495c5d0d5d5fe0854923255cc"
dependencies = [
"bitflags 1.3.2",
"futures-core",
"inotify-sys",
"libc",
"tokio",
]
[[package]]
name = "inotify-sys"
version = "0.1.5"
@@ -3251,7 +3223,7 @@ dependencies = [
"crossbeam-channel",
"filetime",
"fsevent-sys",
"inotify 0.9.6",
"inotify",
"kqueue",
"libc",
"log",
@@ -3642,7 +3614,6 @@ name = "pagectl"
version = "0.1.0"
dependencies = [
"anyhow",
"bytes",
"camino",
"clap",
"git-version",
@@ -3651,7 +3622,6 @@ dependencies = [
"pageserver_api",
"postgres_ffi",
"remote_storage",
"serde",
"serde_json",
"svg_fmt",
"thiserror",
@@ -3670,7 +3640,6 @@ dependencies = [
"arc-swap",
"async-compression",
"async-stream",
"async-trait",
"bit_field",
"byteorder",
"bytes",
@@ -3678,16 +3647,13 @@ dependencies = [
"camino-tempfile",
"chrono",
"clap",
"const_format",
"consumption_metrics",
"crc32c",
"criterion",
"crossbeam-utils",
"either",
"enum-map",
"enumset",
"fail",
"flate2",
"futures",
"git-version",
"hex",
@@ -3726,13 +3692,9 @@ dependencies = [
"serde_json",
"serde_path_to_error",
"serde_with",
"signal-hook",
"smallvec",
"storage_broker",
"strum",
"strum_macros",
"svg_fmt",
"sync_wrapper",
"sysinfo",
"tenant_size_model",
"thiserror",
@@ -3746,7 +3708,6 @@ dependencies = [
"tokio-util",
"toml_edit",
"tracing",
"twox-hash",
"url",
"utils",
"walkdir",
@@ -3810,44 +3771,22 @@ name = "pageserver_compaction"
version = "0.1.0"
dependencies = [
"anyhow",
"async-compression",
"async-stream",
"byteorder",
"bytes",
"chrono",
"clap",
"const_format",
"consumption_metrics",
"criterion",
"crossbeam-utils",
"either",
"fail",
"flate2",
"futures",
"git-version",
"hex",
"hex-literal",
"humantime",
"humantime-serde",
"itertools 0.10.5",
"metrics",
"once_cell",
"pageserver_api",
"pin-project-lite",
"rand 0.8.5",
"smallvec",
"svg_fmt",
"sync_wrapper",
"thiserror",
"tokio",
"tokio-io-timeout",
"tokio-util",
"tracing",
"tracing-error",
"tracing-subscriber",
"url",
"utils",
"walkdir",
"workspace_hack",
]
@@ -4164,9 +4103,7 @@ name = "postgres_backend"
version = "0.1.0"
dependencies = [
"anyhow",
"async-trait",
"bytes",
"futures",
"once_cell",
"pq_proto",
"rustls 0.22.4",
@@ -4199,16 +4136,13 @@ version = "0.1.0"
dependencies = [
"anyhow",
"bindgen",
"byteorder",
"bytes",
"crc32c",
"env_logger",
"hex",
"log",
"memoffset 0.8.0",
"once_cell",
"postgres",
"rand 0.8.5",
"regex",
"serde",
"thiserror",
@@ -4243,13 +4177,11 @@ dependencies = [
"byteorder",
"bytes",
"itertools 0.10.5",
"pin-project-lite",
"postgres-protocol",
"rand 0.8.5",
"serde",
"thiserror",
"tokio",
"tracing",
]
[[package]]
@@ -4281,12 +4213,6 @@ dependencies = [
"elliptic-curve 0.13.8",
]
[[package]]
name = "proc-macro-hack"
version = "0.5.20+deprecated"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "dc375e1527247fe1a97d8b7156678dfe7c1af2fc075c9a4db3690ecd2a148068"
[[package]]
name = "proc-macro2"
version = "1.0.78"
@@ -4405,7 +4331,6 @@ dependencies = [
"aws-config",
"aws-sdk-iam",
"aws-sigv4",
"aws-types",
"base64 0.13.1",
"bstr",
"bytes",
@@ -4414,7 +4339,6 @@ dependencies = [
"chrono",
"clap",
"consumption_metrics",
"crossbeam-deque",
"dashmap",
"ecdsa 0.16.9",
"env_logger",
@@ -4440,11 +4364,9 @@ dependencies = [
"jose-jwa",
"jose-jwk",
"lasso",
"md5",
"measured",
"metrics",
"once_cell",
"opentelemetry",
"p256 0.13.2",
"parking_lot 0.12.1",
"parquet",
@@ -4465,7 +4387,6 @@ dependencies = [
"reqwest-middleware",
"reqwest-retry",
"reqwest-tracing",
"routerify",
"rsa",
"rstest",
"rustc-hash",
@@ -4481,7 +4402,6 @@ dependencies = [
"smol_str",
"socket2 0.5.5",
"subtle",
"task-local-extensions",
"thiserror",
"tikv-jemalloc-ctl",
"tikv-jemallocator",
@@ -4491,7 +4411,6 @@ dependencies = [
"tokio-rustls 0.25.0",
"tokio-tungstenite",
"tokio-util",
"tower-service",
"tracing",
"tracing-opentelemetry",
"tracing-subscriber",
@@ -4781,7 +4700,6 @@ dependencies = [
"async-stream",
"async-trait",
"aws-config",
"aws-credential-types",
"aws-sdk-s3",
"aws-smithy-async",
"aws-smithy-types",
@@ -4795,7 +4713,6 @@ dependencies = [
"futures",
"futures-util",
"http-types",
"humantime",
"humantime-serde",
"hyper 0.14.26",
"itertools 0.10.5",
@@ -5275,14 +5192,12 @@ version = "0.1.0"
dependencies = [
"anyhow",
"async-stream",
"async-trait",
"byteorder",
"bytes",
"camino",
"camino-tempfile",
"chrono",
"clap",
"const_format",
"crc32c",
"desim",
"fail",
@@ -5308,9 +5223,7 @@ dependencies = [
"sd-notify",
"serde",
"serde_json",
"serde_with",
"sha2",
"signal-hook",
"storage_broker",
"strum",
"strum_macros",
@@ -5321,7 +5234,6 @@ dependencies = [
"tokio-stream",
"tokio-tar",
"tokio-util",
"toml_edit",
"tracing",
"tracing-subscriber",
"url",
@@ -5336,7 +5248,6 @@ version = "0.1.0"
dependencies = [
"const_format",
"serde",
"serde_with",
"utils",
]
@@ -5865,7 +5776,6 @@ version = "0.1.0"
dependencies = [
"anyhow",
"async-stream",
"bytes",
"clap",
"const_format",
"futures",
@@ -5879,7 +5789,6 @@ dependencies = [
"parking_lot 0.12.1",
"prost",
"tokio",
"tokio-stream",
"tonic",
"tonic-build",
"tracing",
@@ -5892,9 +5801,7 @@ name = "storage_controller"
version = "0.1.0"
dependencies = [
"anyhow",
"aws-config",
"bytes",
"camino",
"chrono",
"clap",
"control_plane",
@@ -5935,20 +5842,9 @@ dependencies = [
name = "storage_controller_client"
version = "0.1.0"
dependencies = [
"anyhow",
"bytes",
"futures",
"pageserver_api",
"pageserver_client",
"postgres",
"reqwest 0.12.4",
"serde",
"thiserror",
"tokio",
"tokio-postgres",
"tokio-stream",
"tokio-util",
"utils",
"workspace_hack",
]
@@ -5960,13 +5856,9 @@ dependencies = [
"async-stream",
"aws-config",
"aws-sdk-s3",
"aws-smithy-async",
"bincode",
"bytes",
"camino",
"chrono",
"clap",
"crc32c",
"either",
"futures",
"futures-util",
@@ -5978,20 +5870,16 @@ dependencies = [
"pageserver",
"pageserver_api",
"postgres_ffi",
"rand 0.8.5",
"remote_storage",
"reqwest 0.12.4",
"rustls 0.22.4",
"rustls-native-certs 0.7.0",
"serde",
"serde_json",
"serde_with",
"storage_controller_client",
"thiserror",
"tokio",
"tokio-postgres",
"tokio-postgres-rustls",
"tokio-rustls 0.25.0",
"tokio-stream",
"tokio-util",
"tracing",
@@ -6010,14 +5898,11 @@ dependencies = [
"comfy-table",
"futures",
"humantime",
"hyper 0.14.26",
"pageserver_api",
"pageserver_client",
"reqwest 0.12.4",
"serde",
"serde_json",
"storage_controller_client",
"thiserror",
"tokio",
"tracing",
"utils",
@@ -6140,15 +6025,6 @@ dependencies = [
"xattr",
]
[[package]]
name = "task-local-extensions"
version = "0.1.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ba323866e5d033818e3240feeb9f7db2c4296674e4d9e16b97b7bf8f490434e8"
dependencies = [
"pin-utils",
]
[[package]]
name = "tempfile"
version = "3.9.0"
@@ -6739,7 +6615,6 @@ dependencies = [
"opentelemetry",
"opentelemetry-otlp",
"opentelemetry-semantic-conventions",
"reqwest 0.12.4",
"tokio",
"tracing",
"tracing-opentelemetry",
@@ -6943,7 +6818,6 @@ dependencies = [
"serde_assert",
"serde_json",
"serde_path_to_error",
"serde_with",
"signal-hook",
"strum",
"strum_macros",
@@ -6999,13 +6873,11 @@ dependencies = [
"cgroups-rs",
"clap",
"futures",
"inotify 0.10.2",
"serde",
"serde_json",
"sysinfo",
"tokio",
"tokio-postgres",
"tokio-stream",
"tokio-util",
"tracing",
"tracing-subscriber",
@@ -7032,7 +6904,6 @@ dependencies = [
"clap",
"env_logger",
"log",
"once_cell",
"postgres",
"postgres_ffi",
"regex",
@@ -7555,6 +7426,7 @@ dependencies = [
"digest",
"either",
"fail",
"futures",
"futures-channel",
"futures-executor",
"futures-io",
@@ -7610,6 +7482,8 @@ dependencies = [
"tower",
"tracing",
"tracing-core",
"tracing-log",
"tracing-subscriber",
"url",
"uuid",
"zeroize",

View File

@@ -57,7 +57,7 @@ RUN set -e \
# Build final image
#
FROM debian:bookworm-slim
FROM debian:bullseye-slim
ARG DEFAULT_PG_VERSION
WORKDIR /data

View File

@@ -1,4 +1,4 @@
FROM debian:bookworm-slim
FROM debian:bullseye-slim
# Use ARG as a build-time environment variable here to allow.
# It's not supposed to be set outside.
@@ -6,7 +6,7 @@ FROM debian:bookworm-slim
# ```
# . /etc/os-release && echo "${VERSION_CODENAME}"
# ```
ARG DEBIAN_VERSION_CODENAME=bookworm
ARG DEBIAN_VERSION_CODENAME=bullseye
# Add nonroot user
RUN useradd -ms /bin/bash nonroot -b /home
@@ -38,14 +38,14 @@ RUN set -e \
libseccomp-dev \
libsqlite3-dev \
libssl-dev \
libstdc++-11-dev \
libstdc++-10-dev \
libtool \
libxml2-dev \
libxmlsec1-dev \
libxxhash-dev \
lsof \
make \
netcat-traditional \
netcat \
net-tools \
openssh-client \
parallel \
@@ -81,11 +81,8 @@ RUN curl -fsSL 'https://apt.llvm.org/llvm-snapshot.gpg.key' | apt-key add - \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Install docker
RUN curl -fsSL 'https://www.postgresql.org/media/keys/ACCC4CF8.asc' | apt-key add - \
&& echo 'deb http://apt.postgresql.org/pub/repos/apt bookworm-pgdg main' > /etc/apt/sources.list.d/pgdg.list \
&& apt install -y apt-transport-https ca-certificates gnupg-agent \
&& curl -fsSL https://download.docker.com/linux/debian/gpg | gpg --dearmor > /usr/share/keyrings/docker-ce-archive-keyring.gpg \
&& echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-ce-archive-keyring.gpg] https://download.docker.com/linux/debian ${DEBIAN_VERSION_CODENAME} stable" > /etc/apt/sources.list.d/docker-ce.list \
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg \
&& echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian ${DEBIAN_VERSION_CODENAME} stable" > /etc/apt/sources.list.d/docker.list \
&& apt update \
&& apt install -y docker-ce docker-ce-cli \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

View File

@@ -9,7 +9,7 @@ ARG BUILD_TAG
# Layer "build-deps"
#
#########################################################################################
FROM debian:bookworm-slim AS build-deps
FROM debian:bullseye-slim AS build-deps
RUN apt update && \
apt install -y git autoconf automake libtool build-essential bison flex libreadline-dev \
zlib1g-dev libxml2-dev libcurl4-openssl-dev libossp-uuid-dev wget pkg-config libssl-dev \
@@ -97,9 +97,9 @@ RUN case "${PG_VERSION}" in "v17") \
mkdir -p /sfcgal && \
echo "Postgis doensn't yet support PG17 (needs 3.4.3, if not higher)" && exit 0;; \
esac && \
wget https://gitlab.com/Oslandia/SFCGAL/-/archive/v1.4.1/SFCGAL-v1.4.1.tar.gz -O SFCGAL.tar.gz && \
echo "1800c8a26241588f11cddcf433049e9b9aea902e923414d2ecef33a3295626c3 SFCGAL.tar.gz" | sha256sum --check && \
mkdir sfcgal-src && cd sfcgal-src && tar xvzf ../SFCGAL.tar.gz --strip-components=1 -C . && \
wget https://gitlab.com/Oslandia/SFCGAL/-/archive/v1.3.10/SFCGAL-v1.3.10.tar.gz -O SFCGAL.tar.gz && \
echo "4e39b3b2adada6254a7bdba6d297bb28e1a9835a9f879b74f37e2dab70203232 SFCGAL.tar.gz" | sha256sum --check && \
mkdir sfcgal-src && cd sfcgal-src && tar xzf ../SFCGAL.tar.gz --strip-components=1 -C . && \
cmake -DCMAKE_BUILD_TYPE=Release . && make -j $(getconf _NPROCESSORS_ONLN) && \
DESTDIR=/sfcgal make install -j $(getconf _NPROCESSORS_ONLN) && \
make clean && cp -R /sfcgal/* /
@@ -160,28 +160,31 @@ RUN case "${PG_VERSION}" in "v17") \
FROM build-deps AS plv8-build
ARG PG_VERSION
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PLV8_BRANCH=r3.2
ENV PLV8_VERSION=3.2.2
RUN case "${PG_VERSION}" in "v17") \
echo "v17 extensions are not supported yet. Quit" && exit 0;; \
esac && \
apt update && \
apt install -y ninja-build python3-dev git libncurses5 binutils clang libstdc++-12-dev libtinfo5 pkg-config cmake
apt install -y ninja-build python3-dev libncurses5 binutils clang
RUN case "${PG_VERSION}" in "v17") \
echo "v17 extensions are not supported yet. Quit" && exit 0;; \
esac\
&& export PATH="/usr/local/pgsql/bin:$PATH" \
&& git clone --branch ${PLV8_BRANCH} --single-branch --depth 1 https://github.com/plv8/plv8 \
&& cd plv8 \
&& make DOCKER=1 -j $(getconf _NPROCESSORS_ONLN) install \
&& find /usr/local/pgsql/ -name "plv8-*.so" | xargs strip \
&& cd .. && rm -rf ./plv8 \
esac && \
wget https://github.com/plv8/plv8/archive/refs/tags/v3.1.10.tar.gz -O plv8.tar.gz && \
echo "7096c3290928561f0d4901b7a52794295dc47f6303102fae3f8e42dd575ad97d plv8.tar.gz" | sha256sum --check && \
mkdir plv8-src && cd plv8-src && tar xzf ../plv8.tar.gz --strip-components=1 -C . && \
# generate and copy upgrade scripts
mkdir -p upgrade && ./generate_upgrade.sh 3.1.10 && \
cp upgrade/* /usr/local/pgsql/share/extension/ && \
export PATH="/usr/local/pgsql/bin:$PATH" && \
make DOCKER=1 -j $(getconf _NPROCESSORS_ONLN) install && \
rm -rf /plv8-* && \
find /usr/local/pgsql/ -name "plv8-*.so" | xargs strip && \
# don't break computes with installed old version of plv8
&& ln -s /usr/local/pgsql/lib/plv8-${PLV8_VERSION}.so /usr/local/pgsql/lib/plv8-3.1.5.so \
&& ln -s /usr/local/pgsql/lib/plv8-${PLV8_VERSION}.so /usr/local/pgsql/lib/plv8-3.1.8.so \
&& echo 'trusted = true' >> /usr/local/pgsql/share/extension/plv8.control && \
cd /usr/local/pgsql/lib/ && \
ln -s plv8-3.1.10.so plv8-3.1.5.so && \
ln -s plv8-3.1.10.so plv8-3.1.8.so && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/plv8.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/plcoffee.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/plls.control
@@ -1024,7 +1027,7 @@ RUN cd compute_tools && mold -run cargo build --locked --profile release-line-de
#
#########################################################################################
FROM debian:bookworm-slim AS compute-tools-image
FROM debian:bullseye-slim AS compute-tools-image
COPY --from=compute-tools /home/nonroot/target/release-line-debug-size-lto/compute_ctl /usr/local/bin/compute_ctl
@@ -1063,7 +1066,7 @@ RUN case "${PG_VERSION}" in "v17") \
#COPY --from=postgis-build /postgis.tar.gz /ext-src/
#COPY --from=postgis-build /sfcgal/* /usr
# COPY --from=plv8-build /plv8.tar.gz /ext-src/
COPY --from=plv8-build /plv8.tar.gz /ext-src/
COPY --from=h3-pg-build /h3-pg.tar.gz /ext-src/
COPY --from=unit-pg-build /postgresql-unit.tar.gz /ext-src/
COPY --from=vector-pg-build /pgvector.tar.gz /ext-src/
@@ -1141,7 +1144,7 @@ ENV PGDATABASE=postgres
# Put it all together into the final image
#
#########################################################################################
FROM debian:bookworm-slim
FROM debian:bullseye-slim
# Add user postgres
RUN mkdir /var/db && useradd -m -d /var/db/postgres postgres && \
echo "postgres:test_console_pass" | chpasswd && \
@@ -1162,7 +1165,7 @@ RUN mkdir /usr/local/download_extensions && chown -R postgres:postgres /usr/loca
# Install:
# libreadline8 for psql
# libicu72, locales for collations (including ICU and plpgsql_check)
# libicu67, locales for collations (including ICU and plpgsql_check)
# liblz4-1 for lz4
# libossp-uuid16 for extension ossp-uuid
# libgeos, libgdal, libsfcgal1, libproj and libprotobuf-c1 for PostGIS
@@ -1173,7 +1176,7 @@ RUN mkdir /usr/local/download_extensions && chown -R postgres:postgres /usr/loca
RUN apt update && \
apt install --no-install-recommends -y \
gdb \
libicu72 \
libicu67 \
liblz4-1 \
libreadline8 \
libboost-iostreams1.74.0 \
@@ -1182,8 +1185,8 @@ RUN apt update && \
libboost-system1.74.0 \
libossp-uuid16 \
libgeos-c1v5 \
libgdal32 \
libproj25 \
libgdal28 \
libproj19 \
libprotobuf-c1 \
libsfcgal1 \
libxml2 \

View File

@@ -11,7 +11,6 @@ testing = []
[dependencies]
anyhow.workspace = true
async-compression.workspace = true
chrono.workspace = true
cfg-if.workspace = true
clap.workspace = true
@@ -24,7 +23,6 @@ num_cpus.workspace = true
opentelemetry.workspace = true
postgres.workspace = true
regex.workspace = true
serde.workspace = true
serde_json.workspace = true
signal-hook.workspace = true
tar.workspace = true
@@ -43,7 +41,6 @@ url.workspace = true
compute_api.workspace = true
utils.workspace = true
workspace_hack.workspace = true
toml_edit.workspace = true
remote_storage = { version = "0.1", path = "../libs/remote_storage/" }
vm_monitor = { version = "0.1", path = "../libs/vm_monitor/" }
zstd = "0.13"

View File

@@ -9,13 +9,10 @@ anyhow.workspace = true
camino.workspace = true
clap.workspace = true
comfy-table.workspace = true
futures.workspace = true
git-version.workspace = true
humantime.workspace = true
nix.workspace = true
once_cell.workspace = true
postgres.workspace = true
hex.workspace = true
humantime-serde.workspace = true
hyper.workspace = true
regex.workspace = true
@@ -23,8 +20,6 @@ reqwest = { workspace = true, features = ["blocking", "json"] }
scopeguard.workspace = true
serde.workspace = true
serde_json.workspace = true
serde_with.workspace = true
tar.workspace = true
thiserror.workspace = true
toml.workspace = true
toml_edit.workspace = true

View File

@@ -151,7 +151,7 @@ where
print!(".");
io::stdout().flush().unwrap();
}
thread::sleep(RETRY_INTERVAL);
tokio::time::sleep(RETRY_INTERVAL).await;
}
Err(e) => {
println!("error starting process {process_name:?}: {e:#}");

View File

@@ -34,12 +34,14 @@ use safekeeper_api::{
DEFAULT_HTTP_LISTEN_PORT as DEFAULT_SAFEKEEPER_HTTP_PORT,
DEFAULT_PG_LISTEN_PORT as DEFAULT_SAFEKEEPER_PG_PORT,
};
use std::borrow::Cow;
use std::collections::{BTreeSet, HashMap};
use std::path::PathBuf;
use std::process::exit;
use std::str::FromStr;
use std::time::Duration;
use storage_broker::DEFAULT_LISTEN_ADDR as DEFAULT_BROKER_ADDR;
use tokio::task::JoinSet;
use url::Host;
use utils::{
auth::{Claims, Scope},
@@ -87,34 +89,35 @@ fn main() -> Result<()> {
// Check for 'neon init' command first.
let subcommand_result = if sub_name == "init" {
handle_init(sub_args).map(Some)
handle_init(sub_args).map(|env| Some(Cow::Owned(env)))
} else {
// all other commands need an existing config
let mut env =
LocalEnv::load_config(&local_env::base_path()).context("Error loading config")?;
let original_env = env.clone();
let env = LocalEnv::load_config(&local_env::base_path()).context("Error loading config")?;
let original_env = env.clone();
let env = Box::leak(Box::new(env));
let rt = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()
.unwrap();
let subcommand_result = match sub_name {
"tenant" => rt.block_on(handle_tenant(sub_args, &mut env)),
"timeline" => rt.block_on(handle_timeline(sub_args, &mut env)),
"start" => rt.block_on(handle_start_all(&env, get_start_timeout(sub_args))),
"stop" => rt.block_on(handle_stop_all(sub_args, &env)),
"pageserver" => rt.block_on(handle_pageserver(sub_args, &env)),
"storage_controller" => rt.block_on(handle_storage_controller(sub_args, &env)),
"safekeeper" => rt.block_on(handle_safekeeper(sub_args, &env)),
"endpoint" => rt.block_on(handle_endpoint(sub_args, &env)),
"mappings" => handle_mappings(sub_args, &mut env),
"tenant" => rt.block_on(handle_tenant(sub_args, env)),
"timeline" => rt.block_on(handle_timeline(sub_args, env)),
"start" => rt.block_on(handle_start_all(env, get_start_timeout(sub_args))),
"stop" => rt.block_on(handle_stop_all(sub_args, env)),
"pageserver" => rt.block_on(handle_pageserver(sub_args, env)),
"storage_controller" => rt.block_on(handle_storage_controller(sub_args, env)),
"storage_broker" => rt.block_on(handle_storage_broker(sub_args, env)),
"safekeeper" => rt.block_on(handle_safekeeper(sub_args, env)),
"endpoint" => rt.block_on(handle_endpoint(sub_args, env)),
"mappings" => handle_mappings(sub_args, env),
"pg" => bail!("'pg' subcommand has been renamed to 'endpoint'"),
_ => bail!("unexpected subcommand {sub_name}"),
};
if original_env != env {
subcommand_result.map(|()| Some(env))
if &original_env != env {
subcommand_result.map(|()| Some(Cow::Borrowed(env)))
} else {
subcommand_result.map(|()| None)
}
@@ -1245,49 +1248,122 @@ async fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) ->
Ok(())
}
async fn handle_storage_broker(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
let (sub_name, sub_args) = match sub_match.subcommand() {
Some(broker_command_data) => broker_command_data,
None => bail!("no broker subcommand provided"),
};
match sub_name {
"start" => {
if let Err(e) = broker::start_broker_process(env, get_start_timeout(sub_args)).await {
eprintln!("broker start failed: {e}");
exit(1);
}
}
"stop" => {
if let Err(e) = broker::stop_broker_process(env) {
eprintln!("broker stop failed: {e}");
exit(1);
}
}
_ => bail!("Unexpected broker subcommand '{}'", sub_name),
}
Ok(())
}
async fn handle_start_all(
env: &local_env::LocalEnv,
env: &'static local_env::LocalEnv,
retry_timeout: &Duration,
) -> anyhow::Result<()> {
let Err(errors) = handle_start_all_impl(env, *retry_timeout).await else {
neon_start_status_check(env, retry_timeout)
.await
.context("status check after successful startup of all services")?;
return Ok(());
};
eprintln!("startup failed because one or more services could not be started");
for e in errors {
eprintln!("{e}");
let debug_repr = format!("{e:?}");
for line in debug_repr.lines() {
eprintln!(" {line}");
}
}
try_stop_all(env, true).await;
exit(2);
}
/// Returns Ok() if and only if all services could be started successfully.
/// Otherwise, returns the list of errors that occurred during startup.
async fn handle_start_all_impl(
env: &'static local_env::LocalEnv,
retry_timeout: Duration,
) -> Result<(), Vec<anyhow::Error>> {
// Endpoints are not started automatically
broker::start_broker_process(env, retry_timeout).await?;
let mut js = JoinSet::new();
// Only start the storage controller if the pageserver is configured to need it
if env.control_plane_api.is_some() {
let storage_controller = StorageController::from_env(env);
if let Err(e) = storage_controller
.start(NeonStorageControllerStartArgs::with_default_instance_id(
(*retry_timeout).into(),
))
.await
{
eprintln!("storage_controller start failed: {:#}", e);
try_stop_all(env, true).await;
exit(1);
// force infalliblity through closure
#[allow(clippy::redundant_closure_call)]
(|| {
js.spawn(async move {
let retry_timeout = retry_timeout;
broker::start_broker_process(env, &retry_timeout).await
});
// Only start the storage controller if the pageserver is configured to need it
if env.control_plane_api.is_some() {
js.spawn(async move {
let storage_controller = StorageController::from_env(env);
storage_controller
.start(NeonStorageControllerStartArgs::with_default_instance_id(
retry_timeout.into(),
))
.await
.map_err(|e| e.context("start storage_controller"))
});
}
for ps_conf in &env.pageservers {
js.spawn(async move {
let pageserver = PageServerNode::from_env(env, ps_conf);
pageserver
.start(&retry_timeout)
.await
.map_err(|e| e.context(format!("start pageserver {}", ps_conf.id)))
});
}
for node in env.safekeepers.iter() {
js.spawn(async move {
let safekeeper = SafekeeperNode::from_env(env, node);
safekeeper
.start(vec![], &retry_timeout)
.await
.map_err(|e| e.context(format!("start safekeeper {}", safekeeper.id)))
});
}
})();
let mut errors = Vec::new();
while let Some(result) = js.join_next().await {
let result = result.expect("we don't panic or cancel the tasks");
if let Err(e) = result {
errors.push(e);
}
}
for ps_conf in &env.pageservers {
let pageserver = PageServerNode::from_env(env, ps_conf);
if let Err(e) = pageserver.start(retry_timeout).await {
eprintln!("pageserver {} start failed: {:#}", ps_conf.id, e);
try_stop_all(env, true).await;
exit(1);
}
if !errors.is_empty() {
return Err(errors);
}
for node in env.safekeepers.iter() {
let safekeeper = SafekeeperNode::from_env(env, node);
if let Err(e) = safekeeper.start(vec![], retry_timeout).await {
eprintln!("safekeeper {} start failed: {:#}", safekeeper.id, e);
try_stop_all(env, false).await;
exit(1);
}
}
neon_start_status_check(env, retry_timeout).await?;
Ok(())
}
@@ -1672,6 +1748,19 @@ fn cli() -> Command {
.arg(stop_mode_arg.clone())
.arg(instance_id))
)
.subcommand(
Command::new("storage_broker")
.arg_required_else_help(true)
.about("Manage broker")
.subcommand(Command::new("start")
.about("Start broker")
.arg(timeout_arg.clone())
)
.subcommand(Command::new("stop")
.about("Stop broker")
.arg(stop_mode_arg.clone())
)
)
.subcommand(
Command::new("safekeeper")
.arg_required_else_help(true)

View File

@@ -702,7 +702,7 @@ impl Endpoint {
}
}
}
std::thread::sleep(ATTEMPT_INTERVAL);
tokio::time::sleep(ATTEMPT_INTERVAL).await;
}
// disarm the scopeguard, let the child outlive this function (and neon_local invoction)

View File

@@ -17,9 +17,7 @@ use std::time::Duration;
use anyhow::{bail, Context};
use camino::Utf8PathBuf;
use pageserver_api::models::{
self, AuxFilePolicy, LocationConfig, TenantHistorySize, TenantInfo, TimelineInfo,
};
use pageserver_api::models::{self, AuxFilePolicy, TenantInfo, TimelineInfo};
use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api;
use postgres_backend::AuthType;
@@ -324,22 +322,6 @@ impl PageServerNode {
background_process::stop_process(immediate, "pageserver", &self.pid_file())
}
pub async fn page_server_psql_client(
&self,
) -> anyhow::Result<(
tokio_postgres::Client,
tokio_postgres::Connection<tokio_postgres::Socket, tokio_postgres::tls::NoTlsStream>,
)> {
let mut config = self.pg_connection_config.clone();
if self.conf.pg_auth_type == AuthType::NeonJWT {
let token = self
.env
.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?;
config = config.set_password(Some(token));
}
Ok(config.connect_no_tls().await?)
}
pub async fn check_status(&self) -> mgmt_api::Result<()> {
self.http_client.status().await
}
@@ -540,19 +522,6 @@ impl PageServerNode {
Ok(())
}
pub async fn location_config(
&self,
tenant_shard_id: TenantShardId,
config: LocationConfig,
flush_ms: Option<Duration>,
lazy: bool,
) -> anyhow::Result<()> {
Ok(self
.http_client
.location_config(tenant_shard_id, config, flush_ms, lazy)
.await?)
}
pub async fn timeline_list(
&self,
tenant_shard_id: &TenantShardId,
@@ -636,14 +605,4 @@ impl PageServerNode {
Ok(())
}
pub async fn tenant_synthetic_size(
&self,
tenant_shard_id: TenantShardId,
) -> anyhow::Result<TenantHistorySize> {
Ok(self
.http_client
.tenant_synthetic_size(tenant_shard_id)
.await?)
}
}

View File

@@ -4,13 +4,10 @@
/// NOTE: This doesn't implement the full, correct postgresql.conf syntax. Just
/// enough to extract a few settings we need in Neon, assuming you don't do
/// funny stuff like include-directives or funny escaping.
use anyhow::{bail, Context, Result};
use once_cell::sync::Lazy;
use regex::Regex;
use std::collections::HashMap;
use std::fmt;
use std::io::BufRead;
use std::str::FromStr;
/// In-memory representation of a postgresql.conf file
#[derive(Default, Debug)]
@@ -19,84 +16,16 @@ pub struct PostgresConf {
hash: HashMap<String, String>,
}
static CONF_LINE_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"^((?:\w|\.)+)\s*=\s*(\S+)$").unwrap());
impl PostgresConf {
pub fn new() -> PostgresConf {
PostgresConf::default()
}
/// Read file into memory
pub fn read(read: impl std::io::Read) -> Result<PostgresConf> {
let mut result = Self::new();
for line in std::io::BufReader::new(read).lines() {
let line = line?;
// Store each line in a vector, in original format
result.lines.push(line.clone());
// Also parse each line and insert key=value lines into a hash map.
//
// FIXME: This doesn't match exactly the flex/bison grammar in PostgreSQL.
// But it's close enough for our usage.
let line = line.trim();
if line.starts_with('#') {
// comment, ignore
continue;
} else if let Some(caps) = CONF_LINE_RE.captures(line) {
let name = caps.get(1).unwrap().as_str();
let raw_val = caps.get(2).unwrap().as_str();
if let Ok(val) = deescape_str(raw_val) {
// Note: if there's already an entry in the hash map for
// this key, this will replace it. That's the behavior what
// we want; when PostgreSQL reads the file, each line
// overrides any previous value for the same setting.
result.hash.insert(name.to_string(), val.to_string());
}
}
}
Ok(result)
}
/// Return the current value of 'option'
pub fn get(&self, option: &str) -> Option<&str> {
self.hash.get(option).map(|x| x.as_ref())
}
/// Return the current value of a field, parsed to the right datatype.
///
/// This calls the FromStr::parse() function on the value of the field. If
/// the field does not exist, or parsing fails, returns an error.
///
pub fn parse_field<T>(&self, field_name: &str, context: &str) -> Result<T>
where
T: FromStr,
<T as FromStr>::Err: std::error::Error + Send + Sync + 'static,
{
self.get(field_name)
.with_context(|| format!("could not find '{}' option {}", field_name, context))?
.parse::<T>()
.with_context(|| format!("could not parse '{}' option {}", field_name, context))
}
pub fn parse_field_optional<T>(&self, field_name: &str, context: &str) -> Result<Option<T>>
where
T: FromStr,
<T as FromStr>::Err: std::error::Error + Send + Sync + 'static,
{
if let Some(val) = self.get(field_name) {
let result = val
.parse::<T>()
.with_context(|| format!("could not parse '{}' option {}", field_name, context))?;
Ok(Some(result))
} else {
Ok(None)
}
}
///
/// Note: if you call this multiple times for the same option, the config
/// file will a line for each call. It would be nice to have a function
@@ -154,48 +83,8 @@ fn escape_str(s: &str) -> String {
}
}
/// De-escape a possibly-quoted value.
///
/// See `DeescapeQuotedString` function in PostgreSQL sources for how PostgreSQL
/// does this.
fn deescape_str(s: &str) -> Result<String> {
// If the string has a quote at the beginning and end, strip them out.
if s.len() >= 2 && s.starts_with('\'') && s.ends_with('\'') {
let mut result = String::new();
let mut iter = s[1..(s.len() - 1)].chars().peekable();
while let Some(c) = iter.next() {
let newc = if c == '\\' {
match iter.next() {
Some('b') => '\x08',
Some('f') => '\x0c',
Some('n') => '\n',
Some('r') => '\r',
Some('t') => '\t',
Some('0'..='7') => {
// TODO
bail!("octal escapes not supported");
}
Some(n) => n,
None => break,
}
} else if c == '\'' && iter.peek() == Some(&'\'') {
// doubled quote becomes just one quote
iter.next().unwrap()
} else {
c
};
result.push(newc);
}
Ok(result)
} else {
Ok(s.to_string())
}
}
#[test]
fn test_postgresql_conf_escapes() -> Result<()> {
fn test_postgresql_conf_escapes() -> anyhow::Result<()> {
assert_eq!(escape_str("foo bar"), "'foo bar'");
// these don't need to be quoted
assert_eq!(escape_str("foo"), "foo");
@@ -214,13 +103,5 @@ fn test_postgresql_conf_escapes() -> Result<()> {
assert_eq!(escape_str("fo\\o"), "'fo\\\\o'");
assert_eq!(escape_str("10 cats"), "'10 cats'");
// Test de-escaping
assert_eq!(deescape_str(&escape_str("foo"))?, "foo");
assert_eq!(deescape_str(&escape_str("fo'o\nba\\r"))?, "fo'o\nba\\r");
assert_eq!(deescape_str("'\\b\\f\\n\\r\\t'")?, "\x08\x0c\n\r\t");
// octal-escapes are currently not supported
assert!(deescape_str("'foo\\7\\07\\007'").is_err());
Ok(())
}

View File

@@ -11,14 +11,11 @@ clap.workspace = true
comfy-table.workspace = true
futures.workspace = true
humantime.workspace = true
hyper.workspace = true
pageserver_api.workspace = true
pageserver_client.workspace = true
reqwest.workspace = true
serde.workspace = true
serde_json = { workspace = true, features = ["raw_value"] }
storage_controller_client.workspace = true
thiserror.workspace = true
tokio.workspace = true
tracing.workspace = true
utils.workspace = true

View File

@@ -5,13 +5,13 @@ ARG TAG=latest
FROM $REPOSITORY/${COMPUTE_IMAGE}:$TAG
USER root
#Faker is required for the pg_anon test
RUN apt-get update && \
apt-get install -y curl \
jq \
python3-pip \
netcat-traditional \
faker
netcat
#Faker is required for the pg_anon test
RUN pip3 install Faker
#This is required for the pg_hintplan test
RUN mkdir -p /ext-src/pg_hint_plan-src && chown postgres /ext-src/pg_hint_plan-src

View File

@@ -8,7 +8,6 @@ license.workspace = true
anyhow.workspace = true
chrono.workspace = true
serde.workspace = true
serde_with.workspace = true
serde_json.workspace = true
regex.workspace = true

View File

@@ -5,9 +5,6 @@ edition = "2021"
license = "Apache-2.0"
[dependencies]
anyhow.workspace = true
chrono = { workspace = true, features = ["serde"] }
rand.workspace = true
serde.workspace = true
serde_with.workspace = true
utils.workspace = true

View File

@@ -5,7 +5,7 @@ use chrono::{DateTime, Utc};
use rand::Rng;
use serde::{Deserialize, Serialize};
#[derive(Serialize, serde::Deserialize, Debug, Clone, Copy, Eq, PartialEq, Ord, PartialOrd)]
#[derive(Serialize, Deserialize, Debug, Clone, Copy, Eq, PartialEq, Ord, PartialOrd)]
#[serde(tag = "type")]
pub enum EventType {
#[serde(rename = "absolute")]
@@ -107,7 +107,7 @@ pub const CHUNK_SIZE: usize = 1000;
// Just a wrapper around a slice of events
// to serialize it as `{"events" : [ ] }
#[derive(serde::Serialize, serde::Deserialize)]
#[derive(serde::Serialize, Deserialize)]
pub struct EventChunk<'a, T: Clone> {
pub events: std::borrow::Cow<'a, [T]>,
}

View File

@@ -12,5 +12,4 @@ bytes.workspace = true
utils.workspace = true
parking_lot.workspace = true
hex.workspace = true
scopeguard.workspace = true
smallvec = { workspace = true, features = ["write"] }

View File

@@ -64,6 +64,7 @@ pub struct ConfigToml {
#[serde(with = "humantime_serde")]
pub wal_redo_timeout: Duration,
pub superuser: String,
pub initdb_cache_dir: Option<Utf8PathBuf>,
pub page_cache_size: usize,
pub max_file_descriptors: usize,
pub pg_distrib_dir: Option<Utf8PathBuf>,
@@ -173,40 +174,6 @@ impl Default for EvictionOrder {
}
}
#[derive(
Eq,
PartialEq,
Debug,
Copy,
Clone,
strum_macros::EnumString,
strum_macros::Display,
serde_with::DeserializeFromStr,
serde_with::SerializeDisplay,
)]
#[strum(serialize_all = "kebab-case")]
pub enum GetVectoredImpl {
Sequential,
Vectored,
}
#[derive(
Eq,
PartialEq,
Debug,
Copy,
Clone,
strum_macros::EnumString,
strum_macros::Display,
serde_with::DeserializeFromStr,
serde_with::SerializeDisplay,
)]
#[strum(serialize_all = "kebab-case")]
pub enum GetImpl {
Legacy,
Vectored,
}
#[derive(Copy, Clone, Debug, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
#[serde(transparent)]
pub struct MaxVectoredReadBytes(pub NonZeroUsize);
@@ -338,8 +305,6 @@ pub mod defaults {
pub const DEFAULT_IMAGE_COMPRESSION: ImageCompressionAlgorithm =
ImageCompressionAlgorithm::Zstd { level: Some(1) };
pub const DEFAULT_VALIDATE_VECTORED_GET: bool = false;
pub const DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB: usize = 0;
pub const DEFAULT_IO_BUFFER_ALIGNMENT: usize = 512;
@@ -358,6 +323,7 @@ impl Default for ConfigToml {
wal_redo_timeout: (humantime::parse_duration(DEFAULT_WAL_REDO_TIMEOUT)
.expect("cannot parse default wal redo timeout")),
superuser: (DEFAULT_SUPERUSER.to_string()),
initdb_cache_dir: None,
page_cache_size: (DEFAULT_PAGE_CACHE_SIZE),
max_file_descriptors: (DEFAULT_MAX_FILE_DESCRIPTORS),
pg_distrib_dir: None, // Utf8PathBuf::from("./pg_install"), // TODO: formely, this was std::env::current_dir()
@@ -376,7 +342,10 @@ impl Default for ConfigToml {
concurrent_tenant_warmup: (NonZeroUsize::new(DEFAULT_CONCURRENT_TENANT_WARMUP)
.expect("Invalid default constant")),
concurrent_tenant_size_logical_size_queries: NonZeroUsize::new(1).unwrap(),
concurrent_tenant_size_logical_size_queries: NonZeroUsize::new(
DEFAULT_CONCURRENT_TENANT_SIZE_LOGICAL_SIZE_QUERIES,
)
.unwrap(),
metric_collection_interval: (humantime::parse_duration(
DEFAULT_METRIC_COLLECTION_INTERVAL,
)
@@ -467,8 +436,6 @@ pub mod tenant_conf_defaults {
// By default ingest enough WAL for two new L0 layers before checking if new image
// image layers should be created.
pub const DEFAULT_IMAGE_LAYER_CREATION_CHECK_THRESHOLD: u8 = 2;
pub const DEFAULT_INGEST_BATCH_SIZE: u64 = 100;
}
impl Default for TenantConfigToml {

View File

@@ -495,7 +495,7 @@ pub struct CompactionAlgorithmSettings {
pub kind: CompactionAlgorithm,
}
#[derive(Debug, PartialEq, Eq, Clone, serde::Deserialize, serde::Serialize)]
#[derive(Debug, PartialEq, Eq, Clone, Deserialize, Serialize)]
#[serde(tag = "mode", rename_all = "kebab-case", deny_unknown_fields)]
pub enum L0FlushConfig {
#[serde(rename_all = "snake_case")]

View File

@@ -5,10 +5,8 @@ edition.workspace = true
license.workspace = true
[dependencies]
async-trait.workspace = true
anyhow.workspace = true
bytes.workspace = true
futures.workspace = true
rustls.workspace = true
serde.workspace = true
thiserror.workspace = true

View File

@@ -280,16 +280,6 @@ pub struct PostgresBackend<IO> {
pub type PostgresBackendTCP = PostgresBackend<tokio::net::TcpStream>;
pub fn query_from_cstring(query_string: Bytes) -> Vec<u8> {
let mut query_string = query_string.to_vec();
if let Some(ch) = query_string.last() {
if *ch == 0 {
query_string.pop();
}
}
query_string
}
/// Cast a byte slice to a string slice, dropping null terminator if there's one.
fn cstr_to_str(bytes: &[u8]) -> anyhow::Result<&str> {
let without_null = bytes.strip_suffix(&[0]).unwrap_or(bytes);

View File

@@ -5,13 +5,10 @@ edition.workspace = true
license.workspace = true
[dependencies]
rand.workspace = true
regex.workspace = true
bytes.workspace = true
byteorder.workspace = true
anyhow.workspace = true
crc32c.workspace = true
hex.workspace = true
once_cell.workspace = true
log.workspace = true
memoffset.workspace = true

View File

@@ -9,8 +9,8 @@
//! comments on them.
//!
use crate::PageHeaderData;
use crate::BLCKSZ;
use crate::{PageHeaderData, XLogRecord};
//
// From pg_tablespace_d.h
@@ -194,8 +194,6 @@ pub const XLR_RMGR_INFO_MASK: u8 = 0xF0;
pub const XLOG_TBLSPC_CREATE: u8 = 0x00;
pub const XLOG_TBLSPC_DROP: u8 = 0x10;
pub const SIZEOF_XLOGRECORD: u32 = size_of::<XLogRecord>() as u32;
//
// from xlogrecord.h
//
@@ -219,8 +217,6 @@ pub const BKPIMAGE_HAS_HOLE: u8 = 0x01; /* page image has "hole" */
/* From transam.h */
pub const FIRST_NORMAL_TRANSACTION_ID: u32 = 3;
pub const INVALID_TRANSACTION_ID: u32 = 0;
pub const FIRST_BOOTSTRAP_OBJECT_ID: u32 = 12000;
pub const FIRST_NORMAL_OBJECT_ID: u32 = 16384;
/* pg_control.h */
pub const XLOG_CHECKPOINT_SHUTDOWN: u8 = 0x00;

View File

@@ -26,6 +26,7 @@ use bytes::{Buf, Bytes};
use log::*;
use serde::Serialize;
use std::ffi::OsStr;
use std::fs::File;
use std::io::prelude::*;
use std::io::ErrorKind;
@@ -78,19 +79,34 @@ pub fn XLogFileName(tli: TimeLineID, logSegNo: XLogSegNo, wal_segsz_bytes: usize
)
}
pub fn XLogFromFileName(fname: &str, wal_seg_size: usize) -> (XLogSegNo, TimeLineID) {
let tli = u32::from_str_radix(&fname[0..8], 16).unwrap();
let log = u32::from_str_radix(&fname[8..16], 16).unwrap() as XLogSegNo;
let seg = u32::from_str_radix(&fname[16..24], 16).unwrap() as XLogSegNo;
(log * XLogSegmentsPerXLogId(wal_seg_size) + seg, tli)
pub fn XLogFromFileName(
fname: &OsStr,
wal_seg_size: usize,
) -> anyhow::Result<(XLogSegNo, TimeLineID)> {
if let Some(fname_str) = fname.to_str() {
let tli = u32::from_str_radix(&fname_str[0..8], 16)?;
let log = u32::from_str_radix(&fname_str[8..16], 16)? as XLogSegNo;
let seg = u32::from_str_radix(&fname_str[16..24], 16)? as XLogSegNo;
Ok((log * XLogSegmentsPerXLogId(wal_seg_size) + seg, tli))
} else {
anyhow::bail!("non-ut8 filename: {:?}", fname);
}
}
pub fn IsXLogFileName(fname: &str) -> bool {
return fname.len() == XLOG_FNAME_LEN && fname.chars().all(|c| c.is_ascii_hexdigit());
pub fn IsXLogFileName(fname: &OsStr) -> bool {
if let Some(fname) = fname.to_str() {
fname.len() == XLOG_FNAME_LEN && fname.chars().all(|c| c.is_ascii_hexdigit())
} else {
false
}
}
pub fn IsPartialXLogFileName(fname: &str) -> bool {
fname.ends_with(".partial") && IsXLogFileName(&fname[0..fname.len() - 8])
pub fn IsPartialXLogFileName(fname: &OsStr) -> bool {
if let Some(fname) = fname.to_str() {
fname.ends_with(".partial") && IsXLogFileName(OsStr::new(&fname[0..fname.len() - 8]))
} else {
false
}
}
/// If LSN points to the beginning of the page, then shift it to first record,

View File

@@ -9,7 +9,6 @@ anyhow.workspace = true
clap.workspace = true
env_logger.workspace = true
log.workspace = true
once_cell.workspace = true
postgres.workspace = true
postgres_ffi.workspace = true
camino-tempfile.workspace = true

View File

@@ -7,6 +7,7 @@ use postgres_ffi::{WAL_SEGMENT_SIZE, XLOG_BLCKSZ};
use postgres_ffi::{
XLOG_SIZE_OF_XLOG_LONG_PHD, XLOG_SIZE_OF_XLOG_RECORD, XLOG_SIZE_OF_XLOG_SHORT_PHD,
};
use std::ffi::OsStr;
use std::path::{Path, PathBuf};
use std::process::Command;
use std::time::{Duration, Instant};
@@ -26,7 +27,6 @@ macro_rules! xlog_utils_test {
postgres_ffi::for_all_postgres_versions! { xlog_utils_test }
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Conf {
pub pg_version: u32,
pub pg_distrib_dir: PathBuf,
@@ -136,8 +136,8 @@ impl Conf {
pub fn pg_waldump(
&self,
first_segment_name: &str,
last_segment_name: &str,
first_segment_name: &OsStr,
last_segment_name: &OsStr,
) -> anyhow::Result<std::process::Output> {
let first_segment_file = self.datadir.join(first_segment_name);
let last_segment_file = self.datadir.join(last_segment_name);

View File

@@ -4,6 +4,7 @@ use super::*;
use crate::{error, info};
use regex::Regex;
use std::cmp::min;
use std::ffi::OsStr;
use std::fs::{self, File};
use std::io::Write;
use std::{env, str::FromStr};
@@ -54,7 +55,7 @@ fn test_end_of_wal<C: crate::Crafter>(test_name: &str) {
.wal_dir()
.read_dir()
.unwrap()
.map(|f| f.unwrap().file_name().into_string().unwrap())
.map(|f| f.unwrap().file_name())
.filter(|fname| IsXLogFileName(fname))
.max()
.unwrap();
@@ -70,11 +71,11 @@ fn test_end_of_wal<C: crate::Crafter>(test_name: &str) {
start_lsn
);
for file in fs::read_dir(cfg.wal_dir()).unwrap().flatten() {
let fname = file.file_name().into_string().unwrap();
let fname = file.file_name();
if !IsXLogFileName(&fname) {
continue;
}
let (segno, _) = XLogFromFileName(&fname, WAL_SEGMENT_SIZE);
let (segno, _) = XLogFromFileName(&fname, WAL_SEGMENT_SIZE).unwrap();
let seg_start_lsn = XLogSegNoOffsetToRecPtr(segno, 0, WAL_SEGMENT_SIZE);
if seg_start_lsn > u64::from(*start_lsn) {
continue;
@@ -93,10 +94,10 @@ fn test_end_of_wal<C: crate::Crafter>(test_name: &str) {
}
}
fn find_pg_waldump_end_of_wal(cfg: &crate::Conf, last_segment: &str) -> Lsn {
fn find_pg_waldump_end_of_wal(cfg: &crate::Conf, last_segment: &OsStr) -> Lsn {
// Get the actual end of WAL by pg_waldump
let waldump_output = cfg
.pg_waldump("000000010000000000000001", last_segment)
.pg_waldump(OsStr::new("000000010000000000000001"), last_segment)
.unwrap()
.stderr;
let waldump_output = std::str::from_utf8(&waldump_output).unwrap();
@@ -117,7 +118,7 @@ fn find_pg_waldump_end_of_wal(cfg: &crate::Conf, last_segment: &str) -> Lsn {
fn check_end_of_wal(
cfg: &crate::Conf,
last_segment: &str,
last_segment: &OsStr,
start_lsn: Lsn,
expected_end_of_wal: Lsn,
) {
@@ -132,7 +133,8 @@ fn check_end_of_wal(
// Rename file to partial to actually find last valid lsn, then rename it back.
fs::rename(
cfg.wal_dir().join(last_segment),
cfg.wal_dir().join(format!("{}.partial", last_segment)),
cfg.wal_dir()
.join(format!("{}.partial", last_segment.to_str().unwrap())),
)
.unwrap();
let wal_end = find_end_of_wal(&cfg.wal_dir(), WAL_SEGMENT_SIZE, start_lsn).unwrap();
@@ -142,7 +144,8 @@ fn check_end_of_wal(
);
assert_eq!(wal_end, expected_end_of_wal);
fs::rename(
cfg.wal_dir().join(format!("{}.partial", last_segment)),
cfg.wal_dir()
.join(format!("{}.partial", last_segment.to_str().unwrap())),
cfg.wal_dir().join(last_segment),
)
.unwrap();

View File

@@ -8,10 +8,8 @@ license.workspace = true
bytes.workspace = true
byteorder.workspace = true
itertools.workspace = true
pin-project-lite.workspace = true
postgres-protocol.workspace = true
rand.workspace = true
tokio = { workspace = true, features = ["io-util"] }
tracing.workspace = true
thiserror.workspace = true
serde.workspace = true

View File

@@ -13,14 +13,11 @@ aws-smithy-async.workspace = true
aws-smithy-types.workspace = true
aws-config.workspace = true
aws-sdk-s3.workspace = true
aws-credential-types.workspace = true
bytes.workspace = true
camino = { workspace = true, features = ["serde1"] }
humantime.workspace = true
humantime-serde.workspace = true
hyper = { workspace = true, features = ["stream"] }
futures.workspace = true
rand.workspace = true
serde.workspace = true
serde_json.workspace = true
tokio = { workspace = true, features = ["sync", "fs", "io-util"] }

View File

@@ -127,10 +127,6 @@ impl RemotePath {
&self.0
}
pub fn extension(&self) -> Option<&str> {
self.0.extension()
}
pub fn strip_prefix(&self, p: &RemotePath) -> Result<&Utf8Path, std::path::StripPrefixError> {
self.0.strip_prefix(&p.0)
}

View File

@@ -6,6 +6,5 @@ license.workspace = true
[dependencies]
serde.workspace = true
serde_with.workspace = true
const_format.workspace = true
utils.workspace = true

View File

@@ -9,8 +9,9 @@ hyper.workspace = true
opentelemetry = { workspace = true, features=["rt-tokio"] }
opentelemetry-otlp = { workspace = true, default-features=false, features = ["http-proto", "trace", "http", "reqwest-client"] }
opentelemetry-semantic-conventions.workspace = true
reqwest = { workspace = true, default-features = false, features = ["rustls-tls"] }
tokio = { workspace = true, features = ["rt", "rt-multi-thread"] }
tracing.workspace = true
tracing-opentelemetry.workspace = true
tracing-subscriber.workspace = true
[dev-dependencies]
tracing-subscriber.workspace = true # For examples in docs

View File

@@ -42,7 +42,6 @@ tracing.workspace = true
tracing-error.workspace = true
tracing-subscriber = { workspace = true, features = ["json", "registry"] }
rand.workspace = true
serde_with.workspace = true
strum.workspace = true
strum_macros.workspace = true
url.workspace = true

View File

@@ -82,7 +82,7 @@ impl ApiError {
StatusCode::INTERNAL_SERVER_ERROR,
),
ApiError::InternalServerError(err) => HttpErrorBody::response_from_msg_and_status(
err.to_string(),
format!("{err:#}"), // use alternative formatting so that we give the cause without backtrace
StatusCode::INTERNAL_SERVER_ERROR,
),
}

View File

@@ -21,7 +21,13 @@
//!
//! Another explaination can be found here: <https://brandur.org/rate-limiting>
use std::{sync::Mutex, time::Duration};
use std::{
sync::{
atomic::{AtomicU64, Ordering},
Mutex,
},
time::Duration,
};
use tokio::{sync::Notify, time::Instant};
@@ -128,6 +134,7 @@ impl LeakyBucketState {
pub struct RateLimiter {
pub config: LeakyBucketConfig,
pub sleep_counter: AtomicU64,
pub state: Mutex<LeakyBucketState>,
/// a queue to provide this fair ordering.
pub queue: Notify,
@@ -144,6 +151,7 @@ impl Drop for Requeue<'_> {
impl RateLimiter {
pub fn with_initial_tokens(config: LeakyBucketConfig, initial_tokens: f64) -> Self {
RateLimiter {
sleep_counter: AtomicU64::new(0),
state: Mutex::new(LeakyBucketState::with_initial_tokens(
&config,
initial_tokens,
@@ -163,15 +171,16 @@ impl RateLimiter {
/// returns true if we did throttle
pub async fn acquire(&self, count: usize) -> bool {
let mut throttled = false;
let start = tokio::time::Instant::now();
let start_count = self.sleep_counter.load(Ordering::Acquire);
let mut end_count = start_count;
// wait until we are the first in the queue
let mut notified = std::pin::pin!(self.queue.notified());
if !notified.as_mut().enable() {
throttled = true;
notified.await;
end_count = self.sleep_counter.load(Ordering::Acquire);
}
// notify the next waiter in the queue when we are done.
@@ -184,9 +193,22 @@ impl RateLimiter {
.unwrap()
.add_tokens(&self.config, start, count as f64);
match res {
Ok(()) => return throttled,
Ok(()) => return end_count > start_count,
Err(ready_at) => {
throttled = true;
struct Increment<'a>(&'a AtomicU64);
impl Drop for Increment<'_> {
fn drop(&mut self) {
self.0.fetch_add(1, Ordering::AcqRel);
}
}
// increment the counter after we finish sleeping (or cancel this task).
// this ensures that tasks that have already started the acquire will observe
// the new sleep count when they are allowed to resume on the notify.
let _inc = Increment(&self.sleep_counter);
end_count += 1;
tokio::time::sleep_until(ready_at).await;
}
}

View File

@@ -120,32 +120,6 @@ impl<K: Ord, V> VecMap<K, V> {
Ok((None, delta_size))
}
/// Split the map into two.
///
/// The left map contains everything before `cutoff` (exclusive).
/// Right map contains `cutoff` and everything after (inclusive).
pub fn split_at(&self, cutoff: &K) -> (Self, Self)
where
K: Clone,
V: Clone,
{
let split_idx = self
.data
.binary_search_by_key(&cutoff, extract_key)
.unwrap_or_else(std::convert::identity);
(
VecMap {
data: self.data[..split_idx].to_vec(),
ordering: self.ordering,
},
VecMap {
data: self.data[split_idx..].to_vec(),
ordering: self.ordering,
},
)
}
/// Move items from `other` to the end of `self`, leaving `other` empty.
/// If the `other` ordering is different from `self` ordering
/// `ExtendOrderingError` error will be returned.

View File

@@ -15,13 +15,11 @@ anyhow.workspace = true
axum.workspace = true
clap.workspace = true
futures.workspace = true
inotify.workspace = true
serde.workspace = true
serde_json.workspace = true
sysinfo.workspace = true
tokio = { workspace = true, features = ["rt-multi-thread"] }
tokio-postgres.workspace = true
tokio-stream.workspace = true
tokio-util.workspace = true
tracing.workspace = true
tracing-subscriber.workspace = true

View File

@@ -15,7 +15,6 @@ anyhow.workspace = true
arc-swap.workspace = true
async-compression.workspace = true
async-stream.workspace = true
async-trait.workspace = true
bit_field.workspace = true
byteorder.workspace = true
bytes.workspace = true
@@ -23,12 +22,9 @@ camino.workspace = true
camino-tempfile.workspace = true
chrono = { workspace = true, features = ["serde"] }
clap = { workspace = true, features = ["string"] }
const_format.workspace = true
consumption_metrics.workspace = true
crc32c.workspace = true
crossbeam-utils.workspace = true
either.workspace = true
flate2.workspace = true
fail.workspace = true
futures.workspace = true
git-version.workspace = true
@@ -57,10 +53,6 @@ serde.workspace = true
serde_json = { workspace = true, features = ["raw_value"] }
serde_path_to_error.workspace = true
serde_with.workspace = true
signal-hook.workspace = true
smallvec = { workspace = true, features = ["write"] }
svg_fmt.workspace = true
sync_wrapper.workspace = true
sysinfo.workspace = true
tokio-tar.workspace = true
thiserror.workspace = true
@@ -73,7 +65,6 @@ tokio-stream.workspace = true
tokio-util.workspace = true
toml_edit = { workspace = true, features = [ "serde" ] }
tracing.workspace = true
twox-hash.workspace = true
url.workspace = true
walkdir.workspace = true
metrics.workspace = true

View File

@@ -9,41 +9,19 @@ default = []
[dependencies]
anyhow.workspace = true
async-compression.workspace = true
async-stream.workspace = true
byteorder.workspace = true
bytes.workspace = true
chrono = { workspace = true, features = ["serde"] }
clap = { workspace = true, features = ["string"] }
const_format.workspace = true
consumption_metrics.workspace = true
crossbeam-utils.workspace = true
either.workspace = true
flate2.workspace = true
fail.workspace = true
futures.workspace = true
git-version.workspace = true
hex.workspace = true
humantime.workspace = true
humantime-serde.workspace = true
itertools.workspace = true
once_cell.workspace = true
pageserver_api.workspace = true
pin-project-lite.workspace = true
rand.workspace = true
smallvec = { workspace = true, features = ["write"] }
svg_fmt.workspace = true
sync_wrapper.workspace = true
thiserror.workspace = true
tokio = { workspace = true, features = ["process", "sync", "fs", "rt", "io-util", "time"] }
tokio-io-timeout.workspace = true
tokio-util.workspace = true
tracing.workspace = true
tracing-error.workspace = true
tracing-subscriber.workspace = true
url.workspace = true
walkdir.workspace = true
metrics.workspace = true
utils.workspace = true
workspace_hack.workspace = true

View File

@@ -8,7 +8,6 @@ license.workspace = true
[dependencies]
anyhow.workspace = true
bytes.workspace = true
camino.workspace = true
clap = { workspace = true, features = ["string"] }
git-version.workspace = true
@@ -24,5 +23,4 @@ toml_edit.workspace = true
utils.workspace = true
svg_fmt.workspace = true
workspace_hack.workspace = true
serde.workspace = true
serde_json.workspace = true

View File

@@ -13,7 +13,6 @@ use pageserver_api::{
use remote_storage::{RemotePath, RemoteStorageConfig};
use std::env;
use storage_broker::Uri;
use utils::crashsafe::path_with_suffix_extension;
use utils::logging::SecretString;
use once_cell::sync::OnceCell;
@@ -33,7 +32,7 @@ use crate::tenant::storage_layer::inmemory_layer::IndexEntry;
use crate::tenant::{TENANTS_SEGMENT_NAME, TIMELINES_SEGMENT_NAME};
use crate::virtual_file;
use crate::virtual_file::io_engine;
use crate::{TENANT_HEATMAP_BASENAME, TENANT_LOCATION_CONFIG_NAME, TIMELINE_DELETE_MARK_SUFFIX};
use crate::{TENANT_HEATMAP_BASENAME, TENANT_LOCATION_CONFIG_NAME};
/// Global state of pageserver.
///
@@ -71,6 +70,8 @@ pub struct PageServerConf {
pub superuser: String,
pub initdb_cache_dir: Option<Utf8PathBuf>,
pub page_cache_size: usize,
pub max_file_descriptors: usize,
@@ -257,17 +258,6 @@ impl PageServerConf {
.join(timeline_id.to_string())
}
pub(crate) fn timeline_delete_mark_file_path(
&self,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
) -> Utf8PathBuf {
path_with_suffix_extension(
self.timeline_path(&tenant_shard_id, &timeline_id),
TIMELINE_DELETE_MARK_SUFFIX,
)
}
/// Turns storage remote path of a file into its local path.
pub fn local_path(&self, remote_path: &RemotePath) -> Utf8PathBuf {
remote_path.with_base(&self.workdir)
@@ -309,6 +299,7 @@ impl PageServerConf {
wait_lsn_timeout,
wal_redo_timeout,
superuser,
initdb_cache_dir,
page_cache_size,
max_file_descriptors,
pg_distrib_dir,
@@ -356,6 +347,7 @@ impl PageServerConf {
wait_lsn_timeout,
wal_redo_timeout,
superuser,
initdb_cache_dir,
page_cache_size,
max_file_descriptors,
http_auth_type,
@@ -491,11 +483,6 @@ pub struct ConfigurableSemaphore {
}
impl ConfigurableSemaphore {
pub const DEFAULT_INITIAL: NonZeroUsize = match NonZeroUsize::new(1) {
Some(x) => x,
None => panic!("const unwrap is not yet stable"),
};
/// Initializse using a non-zero amount of permits.
///
/// Require a non-zero initial permits, because using permits == 0 is a crude way to disable a
@@ -516,12 +503,6 @@ impl ConfigurableSemaphore {
}
}
impl Default for ConfigurableSemaphore {
fn default() -> Self {
Self::new(Self::DEFAULT_INITIAL)
}
}
impl PartialEq for ConfigurableSemaphore {
fn eq(&self, other: &Self) -> bool {
// the number of permits can be increased at runtime, so we cannot really fulfill the

View File

@@ -178,7 +178,7 @@ async fn collect_metrics(
)
.await;
if let Err(e) = res {
tracing::error!("failed to upload to S3: {e:#}");
tracing::error!("failed to upload to remote storage: {e:#}");
}
}
};

View File

@@ -1177,10 +1177,10 @@ pub(crate) mod virtual_file_io_engine {
}
struct GlobalAndPerTimelineHistogramTimer<'a, 'c> {
global_metric: &'a Histogram,
global_latency_histo: &'a Histogram,
// Optional because not all op types are tracked per-timeline
timeline_metric: Option<&'a Histogram>,
per_timeline_latency_histo: Option<&'a Histogram>,
ctx: &'c RequestContext,
start: std::time::Instant,
@@ -1212,9 +1212,10 @@ impl<'a, 'c> Drop for GlobalAndPerTimelineHistogramTimer<'a, 'c> {
elapsed
}
};
self.global_metric.observe(ex_throttled.as_secs_f64());
if let Some(timeline_metric) = self.timeline_metric {
timeline_metric.observe(ex_throttled.as_secs_f64());
self.global_latency_histo
.observe(ex_throttled.as_secs_f64());
if let Some(per_timeline_getpage_histo) = self.per_timeline_latency_histo {
per_timeline_getpage_histo.observe(ex_throttled.as_secs_f64());
}
}
}
@@ -1240,10 +1241,32 @@ pub enum SmgrQueryType {
#[derive(Debug)]
pub(crate) struct SmgrQueryTimePerTimeline {
global_metrics: [Histogram; SmgrQueryType::COUNT],
per_timeline_getpage: Histogram,
global_started: [IntCounter; SmgrQueryType::COUNT],
global_latency: [Histogram; SmgrQueryType::COUNT],
per_timeline_getpage_started: IntCounter,
per_timeline_getpage_latency: Histogram,
}
static SMGR_QUERY_STARTED_GLOBAL: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
// it's a counter, but, name is prepared to extend it to a histogram of queue depth
"pageserver_smgr_query_started_global_count",
"Number of smgr queries started, aggregated by query type.",
&["smgr_query_type"],
)
.expect("failed to define a metric")
});
static SMGR_QUERY_STARTED_PER_TENANT_TIMELINE: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
// it's a counter, but, name is prepared to extend it to a histogram of queue depth
"pageserver_smgr_query_started_count",
"Number of smgr queries started, aggregated by query type and tenant/timeline.",
&["smgr_query_type", "tenant_id", "shard_id", "timeline_id"],
)
.expect("failed to define a metric")
});
static SMGR_QUERY_TIME_PER_TENANT_TIMELINE: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_smgr_query_seconds",
@@ -1319,14 +1342,20 @@ impl SmgrQueryTimePerTimeline {
let tenant_id = tenant_shard_id.tenant_id.to_string();
let shard_slug = format!("{}", tenant_shard_id.shard_slug());
let timeline_id = timeline_id.to_string();
let global_metrics = std::array::from_fn(|i| {
let global_started = std::array::from_fn(|i| {
let op = SmgrQueryType::from_repr(i).unwrap();
SMGR_QUERY_STARTED_GLOBAL
.get_metric_with_label_values(&[op.into()])
.unwrap()
});
let global_latency = std::array::from_fn(|i| {
let op = SmgrQueryType::from_repr(i).unwrap();
SMGR_QUERY_TIME_GLOBAL
.get_metric_with_label_values(&[op.into()])
.unwrap()
});
let per_timeline_getpage = SMGR_QUERY_TIME_PER_TENANT_TIMELINE
let per_timeline_getpage_started = SMGR_QUERY_STARTED_PER_TENANT_TIMELINE
.get_metric_with_label_values(&[
SmgrQueryType::GetPageAtLsn.into(),
&tenant_id,
@@ -1334,9 +1363,20 @@ impl SmgrQueryTimePerTimeline {
&timeline_id,
])
.unwrap();
let per_timeline_getpage_latency = SMGR_QUERY_TIME_PER_TENANT_TIMELINE
.get_metric_with_label_values(&[
SmgrQueryType::GetPageAtLsn.into(),
&tenant_id,
&shard_slug,
&timeline_id,
])
.unwrap();
Self {
global_metrics,
per_timeline_getpage,
global_started,
global_latency,
per_timeline_getpage_latency,
per_timeline_getpage_started,
}
}
pub(crate) fn start_timer<'c: 'a, 'a>(
@@ -1344,8 +1384,11 @@ impl SmgrQueryTimePerTimeline {
op: SmgrQueryType,
ctx: &'c RequestContext,
) -> Option<impl Drop + '_> {
let global_metric = &self.global_metrics[op as usize];
let start = Instant::now();
self.global_started[op as usize].inc();
// We subtract time spent throttled from the observed latency.
match ctx.micros_spent_throttled.open() {
Ok(()) => (),
Err(error) => {
@@ -1364,15 +1407,16 @@ impl SmgrQueryTimePerTimeline {
}
}
let timeline_metric = if matches!(op, SmgrQueryType::GetPageAtLsn) {
Some(&self.per_timeline_getpage)
let per_timeline_latency_histo = if matches!(op, SmgrQueryType::GetPageAtLsn) {
self.per_timeline_getpage_started.inc();
Some(&self.per_timeline_getpage_latency)
} else {
None
};
Some(GlobalAndPerTimelineHistogramTimer {
global_metric,
timeline_metric,
global_latency_histo: &self.global_latency[op as usize],
per_timeline_latency_histo,
ctx,
start,
op,
@@ -1423,9 +1467,12 @@ mod smgr_query_time_tests {
let get_counts = || {
let global: u64 = ops
.iter()
.map(|op| metrics.global_metrics[*op as usize].get_sample_count())
.map(|op| metrics.global_latency[*op as usize].get_sample_count())
.sum();
(global, metrics.per_timeline_getpage.get_sample_count())
(
global,
metrics.per_timeline_getpage_latency.get_sample_count(),
)
};
let (pre_global, pre_per_tenant_timeline) = get_counts();
@@ -1777,7 +1824,7 @@ pub(crate) static SECONDARY_MODE: Lazy<SecondaryModeMetrics> = Lazy::new(|| {
.expect("failed to define a metric"),
upload_heatmap_duration: register_histogram!(
"pageserver_secondary_upload_heatmap_duration",
"Time to build and upload a heatmap, including any waiting inside the S3 client"
"Time to build and upload a heatmap, including any waiting inside the remote storage client"
)
.expect("failed to define a metric"),
download_heatmap: register_int_counter!(
@@ -2576,6 +2623,12 @@ impl TimelineMetrics {
let _ = STORAGE_IO_SIZE.remove_label_values(&[op, tenant_id, shard_id, timeline_id]);
}
let _ = SMGR_QUERY_STARTED_PER_TENANT_TIMELINE.remove_label_values(&[
SmgrQueryType::GetPageAtLsn.into(),
tenant_id,
shard_id,
timeline_id,
]);
let _ = SMGR_QUERY_TIME_PER_TENANT_TIMELINE.remove_label_values(&[
SmgrQueryType::GetPageAtLsn.into(),
tenant_id,
@@ -2592,6 +2645,8 @@ pub(crate) fn remove_tenant_metrics(tenant_shard_id: &TenantShardId) {
let _ = TENANT_SYNTHETIC_SIZE_METRIC.remove_label_values(&[&tid]);
}
tenant_throttling::remove_tenant_metrics(tenant_shard_id);
// we leave the BROKEN_TENANTS_SET entry if any
}
@@ -3055,41 +3110,180 @@ pub mod tokio_epoll_uring {
pub(crate) mod tenant_throttling {
use metrics::{register_int_counter_vec, IntCounter};
use once_cell::sync::Lazy;
use utils::shard::TenantShardId;
use crate::tenant::{self, throttle::Metric};
pub(crate) struct TimelineGet {
wait_time: IntCounter,
count: IntCounter,
struct GlobalAndPerTenantIntCounter {
global: IntCounter,
per_tenant: IntCounter,
}
pub(crate) static TIMELINE_GET: Lazy<TimelineGet> = Lazy::new(|| {
static WAIT_USECS: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_throttling_wait_usecs_sum_global",
"Sum of microseconds that tenants spent waiting for a tenant throttle of a given kind.",
impl GlobalAndPerTenantIntCounter {
#[inline(always)]
pub(crate) fn inc(&self) {
self.inc_by(1)
}
#[inline(always)]
pub(crate) fn inc_by(&self, n: u64) {
self.global.inc_by(n);
self.per_tenant.inc_by(n);
}
}
pub(crate) struct TimelineGet {
count_accounted_start: GlobalAndPerTenantIntCounter,
count_accounted_finish: GlobalAndPerTenantIntCounter,
wait_time: GlobalAndPerTenantIntCounter,
count_throttled: GlobalAndPerTenantIntCounter,
}
static COUNT_ACCOUNTED_START: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_throttling_count_accounted_start_global",
"Count of tenant throttling starts, by kind of throttle.",
&["kind"]
)
.unwrap()
});
static WAIT_COUNT: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_throttling_count_global",
"Count of tenant throttlings, by kind of throttle.",
&["kind"]
)
.unwrap()
});
let kind = "timeline_get";
TimelineGet {
wait_time: WAIT_USECS.with_label_values(&[kind]),
count: WAIT_COUNT.with_label_values(&[kind]),
}
.unwrap()
});
static COUNT_ACCOUNTED_START_PER_TENANT: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_throttling_count_accounted_start",
"Count of tenant throttling starts, by kind of throttle.",
&["kind", "tenant_id", "shard_id"]
)
.unwrap()
});
static COUNT_ACCOUNTED_FINISH: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_throttling_count_accounted_finish_global",
"Count of tenant throttling finishes, by kind of throttle.",
&["kind"]
)
.unwrap()
});
static COUNT_ACCOUNTED_FINISH_PER_TENANT: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_throttling_count_accounted_finish",
"Count of tenant throttling finishes, by kind of throttle.",
&["kind", "tenant_id", "shard_id"]
)
.unwrap()
});
static WAIT_USECS: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_throttling_wait_usecs_sum_global",
"Sum of microseconds that spent waiting throttle by kind of throttle.",
&["kind"]
)
.unwrap()
});
static WAIT_USECS_PER_TENANT: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_throttling_wait_usecs_sum",
"Sum of microseconds that spent waiting throttle by kind of throttle.",
&["kind", "tenant_id", "shard_id"]
)
.unwrap()
});
impl Metric for &'static TimelineGet {
static WAIT_COUNT: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_throttling_count_global",
"Count of tenant throttlings, by kind of throttle.",
&["kind"]
)
.unwrap()
});
static WAIT_COUNT_PER_TENANT: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_throttling_count",
"Count of tenant throttlings, by kind of throttle.",
&["kind", "tenant_id", "shard_id"]
)
.unwrap()
});
const KIND: &str = "timeline_get";
impl TimelineGet {
pub(crate) fn new(tenant_shard_id: &TenantShardId) -> Self {
TimelineGet {
count_accounted_start: {
GlobalAndPerTenantIntCounter {
global: COUNT_ACCOUNTED_START.with_label_values(&[KIND]),
per_tenant: COUNT_ACCOUNTED_START_PER_TENANT.with_label_values(&[
KIND,
&tenant_shard_id.tenant_id.to_string(),
&tenant_shard_id.shard_slug().to_string(),
]),
}
},
count_accounted_finish: {
GlobalAndPerTenantIntCounter {
global: COUNT_ACCOUNTED_FINISH.with_label_values(&[KIND]),
per_tenant: COUNT_ACCOUNTED_FINISH_PER_TENANT.with_label_values(&[
KIND,
&tenant_shard_id.tenant_id.to_string(),
&tenant_shard_id.shard_slug().to_string(),
]),
}
},
wait_time: {
GlobalAndPerTenantIntCounter {
global: WAIT_USECS.with_label_values(&[KIND]),
per_tenant: WAIT_USECS_PER_TENANT.with_label_values(&[
KIND,
&tenant_shard_id.tenant_id.to_string(),
&tenant_shard_id.shard_slug().to_string(),
]),
}
},
count_throttled: {
GlobalAndPerTenantIntCounter {
global: WAIT_COUNT.with_label_values(&[KIND]),
per_tenant: WAIT_COUNT_PER_TENANT.with_label_values(&[
KIND,
&tenant_shard_id.tenant_id.to_string(),
&tenant_shard_id.shard_slug().to_string(),
]),
}
},
}
}
}
pub(crate) fn preinitialize_global_metrics() {
Lazy::force(&COUNT_ACCOUNTED_START);
Lazy::force(&COUNT_ACCOUNTED_FINISH);
Lazy::force(&WAIT_USECS);
Lazy::force(&WAIT_COUNT);
}
pub(crate) fn remove_tenant_metrics(tenant_shard_id: &TenantShardId) {
for m in &[
&COUNT_ACCOUNTED_START_PER_TENANT,
&COUNT_ACCOUNTED_FINISH_PER_TENANT,
&WAIT_USECS_PER_TENANT,
&WAIT_COUNT_PER_TENANT,
] {
let _ = m.remove_label_values(&[
KIND,
&tenant_shard_id.tenant_id.to_string(),
&tenant_shard_id.shard_slug().to_string(),
]);
}
}
impl Metric for TimelineGet {
#[inline(always)]
fn accounting_start(&self) {
self.count_accounted_start.inc();
}
#[inline(always)]
fn accounting_finish(&self) {
self.count_accounted_finish.inc();
}
#[inline(always)]
fn observe_throttling(
&self,
@@ -3097,7 +3291,7 @@ pub(crate) mod tenant_throttling {
) {
let val = u64::try_from(wait_time.as_micros()).unwrap();
self.wait_time.inc_by(val);
self.count.inc();
self.count_throttled.inc();
}
}
}
@@ -3227,11 +3421,14 @@ pub fn preinitialize_metrics() {
}
// countervecs
[&BACKGROUND_LOOP_PERIOD_OVERRUN_COUNT]
.into_iter()
.for_each(|c| {
Lazy::force(c);
});
[
&BACKGROUND_LOOP_PERIOD_OVERRUN_COUNT,
&SMGR_QUERY_STARTED_GLOBAL,
]
.into_iter()
.for_each(|c| {
Lazy::force(c);
});
// gauges
WALRECEIVER_ACTIVE_MANAGERS.get();
@@ -3253,7 +3450,8 @@ pub fn preinitialize_metrics() {
// Custom
Lazy::force(&RECONSTRUCT_TIME);
Lazy::force(&tenant_throttling::TIMELINE_GET);
Lazy::force(&BASEBACKUP_QUERY_TIME);
Lazy::force(&COMPUTE_COMMANDS_COUNTERS);
tenant_throttling::preinitialize_global_metrics();
}

View File

@@ -140,6 +140,7 @@ pub mod metadata;
pub mod remote_timeline_client;
pub mod storage_layer;
pub mod checks;
pub mod config;
pub mod mgr;
pub mod secondary;
@@ -301,7 +302,7 @@ pub struct Tenant {
/// Throttle applied at the top of [`Timeline::get`].
/// All [`Tenant::timelines`] of a given [`Tenant`] instance share the same [`throttle::Throttle`] instance.
pub(crate) timeline_get_throttle:
Arc<throttle::Throttle<&'static crate::metrics::tenant_throttling::TimelineGet>>,
Arc<throttle::Throttle<crate::metrics::tenant_throttling::TimelineGet>>,
/// An ongoing timeline detach concurrency limiter.
///
@@ -1573,6 +1574,9 @@ impl Tenant {
image_layer_desc: Vec<(Lsn, Vec<(pageserver_api::key::Key, bytes::Bytes)>)>,
end_lsn: Lsn,
) -> anyhow::Result<Arc<Timeline>> {
use checks::check_valid_layermap;
use itertools::Itertools;
let tline = self
.create_test_timeline(new_timeline_id, initdb_lsn, pg_version, ctx)
.await?;
@@ -1587,6 +1591,18 @@ impl Tenant {
.force_create_image_layer(lsn, images, Some(initdb_lsn), ctx)
.await?;
}
let layer_names = tline
.layers
.read()
.await
.layer_map()
.unwrap()
.iter_historic_layers()
.map(|layer| layer.layer_name())
.collect_vec();
if let Some(err) = check_valid_layermap(&layer_names) {
bail!("invalid layermap: {err}");
}
Ok(tline)
}
@@ -2815,7 +2831,7 @@ impl Tenant {
gate: Gate::default(),
timeline_get_throttle: Arc::new(throttle::Throttle::new(
Tenant::get_timeline_get_throttle_config(conf, &attached_conf.tenant_conf),
&crate::metrics::tenant_throttling::TIMELINE_GET,
crate::metrics::tenant_throttling::TimelineGet::new(&tenant_shard_id),
)),
tenant_conf: Arc::new(ArcSwap::from_pointee(attached_conf)),
ongoing_timeline_detach: std::sync::Mutex::default(),
@@ -3197,6 +3213,9 @@ impl Tenant {
image_layer_desc: Vec<(Lsn, Vec<(pageserver_api::key::Key, bytes::Bytes)>)>,
end_lsn: Lsn,
) -> anyhow::Result<Arc<Timeline>> {
use checks::check_valid_layermap;
use itertools::Itertools;
let tline = self
.branch_timeline_test(src_timeline, dst_id, ancestor_lsn, ctx)
.await?;
@@ -3217,6 +3236,18 @@ impl Tenant {
.force_create_image_layer(lsn, images, Some(ancestor_lsn), ctx)
.await?;
}
let layer_names = tline
.layers
.read()
.await
.layer_map()
.unwrap()
.iter_historic_layers()
.map(|layer| layer.layer_name())
.collect_vec();
if let Some(err) = check_valid_layermap(&layer_names) {
bail!("invalid layermap: {err}");
}
Ok(tline)
}
@@ -3491,7 +3522,7 @@ impl Tenant {
.context("extract initdb tar")?;
} else {
// Init temporarily repo to get bootstrap data, this creates a directory in the `pgdata_path` path
run_initdb(self.conf, &pgdata_path, pg_version, &self.cancel).await?;
run_initdb_with_cache(self.conf, &pgdata_path, pg_version, &self.cancel).await?;
// Upload the created data dir to S3
if self.tenant_shard_id().is_shard_zero() {
@@ -3837,6 +3868,118 @@ impl Tenant {
}
}
fn cached_initdb_dirname(initial_superuser_name: &str, pg_version: u32) -> String
{
use std::hash::Hash;
use std::hash::Hasher;
use std::collections::hash_map::DefaultHasher;
let mut hasher = DefaultHasher::new();
initial_superuser_name.hash(&mut hasher);
let hash = hasher.finish();
format!("cached_initial_pgdata_{pg_version}_{:016}", hash)
}
fn copy_dir_all(src: impl AsRef<std::path::Path>, dst: impl AsRef<std::path::Path>) -> std::io::Result<()> {
for entry in fs::read_dir(src.as_ref())? {
let entry = entry?;
let subsrc = entry.path();
let subdst = dst.as_ref().join(&entry.file_name());
if entry.file_type()?.is_dir() {
std::fs::create_dir(&subdst)?;
copy_dir_all(&subsrc, &subdst)?;
} else {
std::fs::copy(&subsrc, &subdst)?;
}
}
Ok(())
}
fn restore_cached_initdb_dir(
cached_path: &Utf8Path,
target_path: &Utf8Path,
) -> anyhow::Result<bool> {
if !cached_path.exists() {
info!("cached initdb dir \"{cached_path}\" does not exist yet");
return Ok(false);
}
std::fs::create_dir(target_path)?;
copy_dir_all(cached_path, target_path)?;
info!("restored initdb result from cache dir \"{cached_path}\"");
Ok(true)
}
fn save_cached_initdb_dir(
src_path: &Utf8Path,
cache_path: &Utf8Path,
) -> anyhow::Result<()> {
match std::fs::create_dir(cache_path) {
Ok(()) => {
info!("saving initdb result to cache dir \"{cache_path}\"");
},
Err(err) if err.kind() == std::io::ErrorKind::AlreadyExists => {
info!("cache initdb dir \"{cache_path}\" already exists, not saving");
return Ok(())
},
Err(err) => { return Err(anyhow::Error::from(err))},
};
let cache_dir_guard = scopeguard::guard(cache_path, |cp| {
if let Err(err) = std::fs::remove_dir_all(&cp) {
error!("could not remove cached initdb directory {cp}: {err}");
}
});
let cache_parent_path = cache_path.parent().ok_or(anyhow::Error::msg("no cache parent path"))?;
let tmp_dirpath = camino_tempfile::tempdir_in(cache_parent_path)?;
copy_dir_all(src_path, &tmp_dirpath)?;
std::fs::rename(tmp_dirpath, &*cache_dir_guard)?;
// disarm the guard
scopeguard::ScopeGuard::into_inner(cache_dir_guard);
Ok(())
}
async fn run_initdb_with_cache(
conf: &'static PageServerConf,
initdb_target_dir: &Utf8Path,
pg_version: u32,
cancel: &CancellationToken,
) -> Result<(), InitdbError> {
let cache_dir = conf.initdb_cache_dir.as_ref().map(|initdb_cache_dir| {
initdb_cache_dir.join(cached_initdb_dirname(&conf.superuser, pg_version))
});
if let Some(cache_dir) = &cache_dir {
match restore_cached_initdb_dir(&cache_dir, initdb_target_dir) {
Ok(true) => return Ok(()),
Ok(false) => {},
Err(err) => {
warn!("Error restoring from cached initdb directory \"{cache_dir}\": {err}");
if initdb_target_dir.exists() {
if let Err(err) = std::fs::remove_dir_all(&initdb_target_dir) {
error!("could not remove temporary initdb target directory {initdb_target_dir}: {err}");
}
}
},
}
}
run_initdb(conf, initdb_target_dir, pg_version, cancel).await?;
if let Some(cache_dir) = &cache_dir {
if let Err(err) = save_cached_initdb_dir(initdb_target_dir, &cache_dir) {
warn!("error saving initdb result to cache directory \"{cache_dir}\": {err}");
}
}
Ok(())
}
/// Create the cluster temporarily in 'initdbpath' directory inside the repository
/// to get bootstrap data for timeline initialization.
async fn run_initdb(
@@ -4164,9 +4307,18 @@ pub(crate) mod harness {
let records_neon = records.iter().all(|r| apply_neon::can_apply_in_neon(&r.1));
if records_neon {
// For Neon wal records, we can decode without spawning postgres, so do so.
let base_img = base_img.expect("Neon WAL redo requires base image").1;
let mut page = BytesMut::new();
page.extend_from_slice(&base_img);
let mut page = match (base_img, records.first()) {
(Some((_lsn, img)), _) => {
let mut page = BytesMut::new();
page.extend_from_slice(&img);
page
}
(_, Some((_lsn, rec))) if rec.will_init() => BytesMut::new(),
_ => {
panic!("Neon WAL redo requires base image or will init record");
}
};
for (record_lsn, record) in records {
apply_neon::apply_in_neon(&record, record_lsn, key, &mut page)?;
}
@@ -8470,4 +8622,135 @@ mod tests {
Ok(())
}
// Regression test for https://github.com/neondatabase/neon/issues/9012
// Create an image arrangement where we have to read at different LSN ranges
// from a delta layer. This is achieved by overlapping an image layer on top of
// a delta layer. Like so:
//
// A B
// +----------------+ -> delta_layer
// | | ^ lsn
// | =========|-> nested_image_layer |
// | C | |
// +----------------+ |
// ======== -> baseline_image_layer +-------> key
//
//
// When querying the key range [A, B) we need to read at different LSN ranges
// for [A, C) and [C, B). This test checks that the described edge case is handled correctly.
#[tokio::test]
async fn test_vectored_read_with_nested_image_layer() -> anyhow::Result<()> {
let harness = TenantHarness::create("test_vectored_read_with_nested_image_layer").await?;
let (tenant, ctx) = harness.load().await;
let will_init_keys = [2, 6];
fn get_key(id: u32) -> Key {
let mut key = Key::from_hex("110000000033333333444444445500000000").unwrap();
key.field6 = id;
key
}
let mut expected_key_values = HashMap::new();
let baseline_image_layer_lsn = Lsn(0x10);
let mut baseline_img_layer = Vec::new();
for i in 0..5 {
let key = get_key(i);
let value = format!("value {i}@{baseline_image_layer_lsn}");
let removed = expected_key_values.insert(key, value.clone());
assert!(removed.is_none());
baseline_img_layer.push((key, Bytes::from(value)));
}
let nested_image_layer_lsn = Lsn(0x50);
let mut nested_img_layer = Vec::new();
for i in 5..10 {
let key = get_key(i);
let value = format!("value {i}@{nested_image_layer_lsn}");
let removed = expected_key_values.insert(key, value.clone());
assert!(removed.is_none());
nested_img_layer.push((key, Bytes::from(value)));
}
let mut delta_layer_spec = Vec::default();
let delta_layer_start_lsn = Lsn(0x20);
let mut delta_layer_end_lsn = delta_layer_start_lsn;
for i in 0..10 {
let key = get_key(i);
let key_in_nested = nested_img_layer
.iter()
.any(|(key_with_img, _)| *key_with_img == key);
let lsn = {
if key_in_nested {
Lsn(nested_image_layer_lsn.0 + 0x10)
} else {
delta_layer_start_lsn
}
};
let will_init = will_init_keys.contains(&i);
if will_init {
delta_layer_spec.push((key, lsn, Value::WalRecord(NeonWalRecord::wal_init())));
expected_key_values.insert(key, "".to_string());
} else {
let delta = format!("@{lsn}");
delta_layer_spec.push((
key,
lsn,
Value::WalRecord(NeonWalRecord::wal_append(&delta)),
));
expected_key_values
.get_mut(&key)
.expect("An image exists for each key")
.push_str(delta.as_str());
}
delta_layer_end_lsn = std::cmp::max(delta_layer_start_lsn, lsn);
}
delta_layer_end_lsn = Lsn(delta_layer_end_lsn.0 + 1);
assert!(
nested_image_layer_lsn > delta_layer_start_lsn
&& nested_image_layer_lsn < delta_layer_end_lsn
);
let tline = tenant
.create_test_timeline_with_layers(
TIMELINE_ID,
baseline_image_layer_lsn,
DEFAULT_PG_VERSION,
&ctx,
vec![DeltaLayerTestDesc::new_with_inferred_key_range(
delta_layer_start_lsn..delta_layer_end_lsn,
delta_layer_spec,
)], // delta layers
vec![
(baseline_image_layer_lsn, baseline_img_layer),
(nested_image_layer_lsn, nested_img_layer),
], // image layers
delta_layer_end_lsn,
)
.await?;
let keyspace = KeySpace::single(get_key(0)..get_key(10));
let results = tline
.get_vectored(keyspace, delta_layer_end_lsn, &ctx)
.await
.expect("No vectored errors");
for (key, res) in results {
let value = res.expect("No key errors");
let expected_value = expected_key_values.remove(&key).expect("No unknown keys");
assert_eq!(value, Bytes::from(expected_value));
}
Ok(())
}
}

View File

@@ -0,0 +1,55 @@
use std::collections::BTreeSet;
use itertools::Itertools;
use super::storage_layer::LayerName;
/// Checks whether a layer map is valid (i.e., is a valid result of the current compaction algorithm if nothing goes wrong).
/// The function checks if we can split the LSN range of a delta layer only at the LSNs of the delta layers. For example,
///
/// ```plain
/// | | | |
/// | 1 | | 2 | | 3 |
/// | | | | | |
/// ```
///
/// This is not a valid layer map because the LSN range of layer 1 intersects with the LSN range of layer 2. 1 and 2 should have
/// the same LSN range.
///
/// The exception is that when layer 2 only contains a single key, it could be split over the LSN range. For example,
///
/// ```plain
/// | | | 2 | | |
/// | 1 | |-------| | 3 |
/// | | | 4 | | |
///
/// If layer 2 and 4 contain the same single key, this is also a valid layer map.
pub fn check_valid_layermap(metadata: &[LayerName]) -> Option<String> {
let mut lsn_split_point = BTreeSet::new(); // TODO: use a better data structure (range tree / range set?)
let mut all_delta_layers = Vec::new();
for name in metadata {
if let LayerName::Delta(layer) = name {
if layer.key_range.start.next() != layer.key_range.end {
all_delta_layers.push(layer.clone());
}
}
}
for layer in &all_delta_layers {
let lsn_range = &layer.lsn_range;
lsn_split_point.insert(lsn_range.start);
lsn_split_point.insert(lsn_range.end);
}
for layer in &all_delta_layers {
let lsn_range = layer.lsn_range.clone();
let intersects = lsn_split_point.range(lsn_range).collect_vec();
if intersects.len() > 1 {
let err = format!(
"layer violates the layer map LSN split assumption: layer {} intersects with LSN [{}]",
layer,
intersects.into_iter().map(|lsn| lsn.to_string()).join(", ")
);
return Some(err);
}
}
None
}

View File

@@ -1,11 +1,29 @@
use std::collections::HashMap;
use utils::id::TimelineId;
use std::{collections::HashMap, time::Duration};
use super::remote_timeline_client::index::GcBlockingReason;
use tokio::time::Instant;
use utils::id::TimelineId;
type Storage = HashMap<TimelineId, enumset::EnumSet<GcBlockingReason>>;
type TimelinesBlocked = HashMap<TimelineId, enumset::EnumSet<GcBlockingReason>>;
#[derive(Default)]
struct Storage {
timelines_blocked: TimelinesBlocked,
/// The deadline before which we are blocked from GC so that
/// leases have a chance to be renewed.
lsn_lease_deadline: Option<Instant>,
}
impl Storage {
fn is_blocked_by_lsn_lease_deadline(&self) -> bool {
self.lsn_lease_deadline
.map(|d| Instant::now() < d)
.unwrap_or(false)
}
}
/// GcBlock provides persistent (per-timeline) gc blocking and facilitates transient time based gc
/// blocking.
#[derive(Default)]
pub(crate) struct GcBlock {
/// The timelines which have current reasons to block gc.
@@ -13,6 +31,12 @@ pub(crate) struct GcBlock {
/// LOCK ORDER: this is held locked while scheduling the next index_part update. This is done
/// to keep the this field up to date with RemoteTimelineClient `upload_queue.dirty`.
reasons: std::sync::Mutex<Storage>,
/// GC background task or manually run `Tenant::gc_iteration` holds a lock on this.
///
/// Do not add any more features taking and forbidding taking this lock. It should be
/// `tokio::sync::Notify`, but that is rarely used. On the other side, [`GcBlock::insert`]
/// synchronizes with gc attempts by locking and unlocking this mutex.
blocking: tokio::sync::Mutex<()>,
}
@@ -42,6 +66,20 @@ impl GcBlock {
}
}
/// Sets a deadline before which we cannot proceed to GC due to lsn lease.
///
/// We do this as the leases mapping are not persisted to disk. By delaying GC by lease
/// length, we guarantee that all the leases we granted before will have a chance to renew
/// when we run GC for the first time after restart / transition from AttachedMulti to AttachedSingle.
pub(super) fn set_lsn_lease_deadline(&self, lsn_lease_length: Duration) {
let deadline = Instant::now() + lsn_lease_length;
let mut g = self.reasons.lock().unwrap();
g.lsn_lease_deadline = Some(deadline);
}
/// Describe the current gc blocking reasons.
///
/// TODO: make this json serializable.
pub(crate) fn summary(&self) -> Option<BlockingReasons> {
let g = self.reasons.lock().unwrap();
@@ -64,7 +102,7 @@ impl GcBlock {
) -> anyhow::Result<bool> {
let (added, uploaded) = {
let mut g = self.reasons.lock().unwrap();
let set = g.entry(timeline.timeline_id).or_default();
let set = g.timelines_blocked.entry(timeline.timeline_id).or_default();
let added = set.insert(reason);
// LOCK ORDER: intentionally hold the lock, see self.reasons.
@@ -95,7 +133,7 @@ impl GcBlock {
let (remaining_blocks, uploaded) = {
let mut g = self.reasons.lock().unwrap();
match g.entry(timeline.timeline_id) {
match g.timelines_blocked.entry(timeline.timeline_id) {
Entry::Occupied(mut oe) => {
let set = oe.get_mut();
set.remove(reason);
@@ -109,7 +147,7 @@ impl GcBlock {
}
}
let remaining_blocks = g.len();
let remaining_blocks = g.timelines_blocked.len();
// LOCK ORDER: intentionally hold the lock while scheduling; see self.reasons
let uploaded = timeline
@@ -134,11 +172,11 @@ impl GcBlock {
pub(crate) fn before_delete(&self, timeline: &super::Timeline) {
let unblocked = {
let mut g = self.reasons.lock().unwrap();
if g.is_empty() {
if g.timelines_blocked.is_empty() {
return;
}
g.remove(&timeline.timeline_id);
g.timelines_blocked.remove(&timeline.timeline_id);
BlockingReasons::clean_and_summarize(g).is_none()
};
@@ -149,10 +187,11 @@ impl GcBlock {
}
/// Initialize with the non-deleted timelines of this tenant.
pub(crate) fn set_scanned(&self, scanned: Storage) {
pub(crate) fn set_scanned(&self, scanned: TimelinesBlocked) {
let mut g = self.reasons.lock().unwrap();
assert!(g.is_empty());
g.extend(scanned.into_iter().filter(|(_, v)| !v.is_empty()));
assert!(g.timelines_blocked.is_empty());
g.timelines_blocked
.extend(scanned.into_iter().filter(|(_, v)| !v.is_empty()));
if let Some(reasons) = BlockingReasons::clean_and_summarize(g) {
tracing::info!(summary=?reasons, "initialized with gc blocked");
@@ -166,6 +205,7 @@ pub(super) struct Guard<'a> {
#[derive(Debug)]
pub(crate) struct BlockingReasons {
tenant_blocked_by_lsn_lease_deadline: bool,
timelines: usize,
reasons: enumset::EnumSet<GcBlockingReason>,
}
@@ -174,8 +214,8 @@ impl std::fmt::Display for BlockingReasons {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(
f,
"{} timelines block for {:?}",
self.timelines, self.reasons
"tenant_blocked_by_lsn_lease_deadline: {}, {} timelines block for {:?}",
self.tenant_blocked_by_lsn_lease_deadline, self.timelines, self.reasons
)
}
}
@@ -183,13 +223,15 @@ impl std::fmt::Display for BlockingReasons {
impl BlockingReasons {
fn clean_and_summarize(mut g: std::sync::MutexGuard<'_, Storage>) -> Option<Self> {
let mut reasons = enumset::EnumSet::empty();
g.retain(|_key, value| {
g.timelines_blocked.retain(|_key, value| {
reasons = reasons.union(*value);
!value.is_empty()
});
if !g.is_empty() {
let blocked_by_lsn_lease_deadline = g.is_blocked_by_lsn_lease_deadline();
if !g.timelines_blocked.is_empty() || blocked_by_lsn_lease_deadline {
Some(BlockingReasons {
timelines: g.len(),
tenant_blocked_by_lsn_lease_deadline: blocked_by_lsn_lease_deadline,
timelines: g.timelines_blocked.len(),
reasons,
})
} else {
@@ -198,14 +240,17 @@ impl BlockingReasons {
}
fn summarize(g: &std::sync::MutexGuard<'_, Storage>) -> Option<Self> {
if g.is_empty() {
let blocked_by_lsn_lease_deadline = g.is_blocked_by_lsn_lease_deadline();
if g.timelines_blocked.is_empty() && !blocked_by_lsn_lease_deadline {
None
} else {
let reasons = g
.timelines_blocked
.values()
.fold(enumset::EnumSet::empty(), |acc, next| acc.union(*next));
Some(BlockingReasons {
timelines: g.len(),
tenant_blocked_by_lsn_lease_deadline: blocked_by_lsn_lease_deadline,
timelines: g.timelines_blocked.len(),
reasons,
})
}

View File

@@ -949,6 +949,12 @@ impl TenantManager {
(LocationMode::Attached(attach_conf), Some(TenantSlot::Attached(tenant))) => {
match attach_conf.generation.cmp(&tenant.generation) {
Ordering::Equal => {
if attach_conf.attach_mode == AttachmentMode::Single {
tenant
.gc_block
.set_lsn_lease_deadline(tenant.get_lsn_lease_length());
}
// A transition from Attached to Attached in the same generation, we may
// take our fast path and just provide the updated configuration
// to the tenant.

View File

@@ -276,6 +276,16 @@ pub(crate) enum LayerId {
InMemoryLayerId(InMemoryLayerFileId),
}
/// Uniquely identify a layer visit by the layer
/// and LSN floor (or start LSN) of the reads.
/// The layer itself is not enough since we may
/// have different LSN lower bounds for delta layer reads.
#[derive(Debug, PartialEq, Eq, Clone, Hash)]
struct LayerToVisitId {
layer_id: LayerId,
lsn_floor: Lsn,
}
/// Layer wrapper for the read path. Note that it is valid
/// to use these layers even after external operations have
/// been performed on them (compaction, freeze, etc.).
@@ -287,9 +297,9 @@ pub(crate) enum ReadableLayer {
/// A partial description of a read to be done.
#[derive(Debug, Clone)]
struct ReadDesc {
struct LayerVisit {
/// An id used to resolve the readable layer within the fringe
layer_id: LayerId,
layer_to_visit_id: LayerToVisitId,
/// Lsn range for the read, used for selecting the next read
lsn_range: Range<Lsn>,
}
@@ -303,12 +313,12 @@ struct ReadDesc {
/// a two layer indexing scheme.
#[derive(Debug)]
pub(crate) struct LayerFringe {
planned_reads_by_lsn: BinaryHeap<ReadDesc>,
layers: HashMap<LayerId, LayerKeyspace>,
planned_visits_by_lsn: BinaryHeap<LayerVisit>,
visit_reads: HashMap<LayerToVisitId, LayerVisitReads>,
}
#[derive(Debug)]
struct LayerKeyspace {
struct LayerVisitReads {
layer: ReadableLayer,
target_keyspace: KeySpaceRandomAccum,
}
@@ -316,23 +326,23 @@ struct LayerKeyspace {
impl LayerFringe {
pub(crate) fn new() -> Self {
LayerFringe {
planned_reads_by_lsn: BinaryHeap::new(),
layers: HashMap::new(),
planned_visits_by_lsn: BinaryHeap::new(),
visit_reads: HashMap::new(),
}
}
pub(crate) fn next_layer(&mut self) -> Option<(ReadableLayer, KeySpace, Range<Lsn>)> {
let read_desc = match self.planned_reads_by_lsn.pop() {
let read_desc = match self.planned_visits_by_lsn.pop() {
Some(desc) => desc,
None => return None,
};
let removed = self.layers.remove_entry(&read_desc.layer_id);
let removed = self.visit_reads.remove_entry(&read_desc.layer_to_visit_id);
match removed {
Some((
_,
LayerKeyspace {
LayerVisitReads {
layer,
mut target_keyspace,
},
@@ -351,20 +361,24 @@ impl LayerFringe {
keyspace: KeySpace,
lsn_range: Range<Lsn>,
) {
let layer_id = layer.id();
let entry = self.layers.entry(layer_id.clone());
let layer_to_visit_id = LayerToVisitId {
layer_id: layer.id(),
lsn_floor: lsn_range.start,
};
let entry = self.visit_reads.entry(layer_to_visit_id.clone());
match entry {
Entry::Occupied(mut entry) => {
entry.get_mut().target_keyspace.add_keyspace(keyspace);
}
Entry::Vacant(entry) => {
self.planned_reads_by_lsn.push(ReadDesc {
self.planned_visits_by_lsn.push(LayerVisit {
lsn_range,
layer_id: layer_id.clone(),
layer_to_visit_id: layer_to_visit_id.clone(),
});
let mut accum = KeySpaceRandomAccum::new();
accum.add_keyspace(keyspace);
entry.insert(LayerKeyspace {
entry.insert(LayerVisitReads {
layer,
target_keyspace: accum,
});
@@ -379,7 +393,7 @@ impl Default for LayerFringe {
}
}
impl Ord for ReadDesc {
impl Ord for LayerVisit {
fn cmp(&self, other: &Self) -> Ordering {
let ord = self.lsn_range.end.cmp(&other.lsn_range.end);
if ord == std::cmp::Ordering::Equal {
@@ -390,19 +404,19 @@ impl Ord for ReadDesc {
}
}
impl PartialOrd for ReadDesc {
impl PartialOrd for LayerVisit {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl PartialEq for ReadDesc {
impl PartialEq for LayerVisit {
fn eq(&self, other: &Self) -> bool {
self.lsn_range == other.lsn_range
}
}
impl Eq for ReadDesc {}
impl Eq for LayerVisit {}
impl ReadableLayer {
pub(crate) fn id(&self) -> LayerId {

View File

@@ -38,7 +38,7 @@ use crate::tenant::timeline::GetVectoredError;
use crate::tenant::vectored_blob_io::{
BlobFlag, StreamingVectoredReadPlanner, VectoredBlobReader, VectoredRead, VectoredReadPlanner,
};
use crate::tenant::{PageReconstructError, Timeline};
use crate::tenant::PageReconstructError;
use crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt;
use crate::virtual_file::{self, VirtualFile};
use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION, TEMP_FILE_SUFFIX};
@@ -58,7 +58,6 @@ use std::io::SeekFrom;
use std::ops::Range;
use std::os::unix::prelude::FileExt;
use std::str::FromStr;
use std::sync::Arc;
use tokio::sync::OnceCell;
use tokio_stream::StreamExt;
use tracing::*;
@@ -70,9 +69,7 @@ use utils::{
};
use super::layer_name::ImageLayerName;
use super::{
AsLayerDesc, Layer, LayerName, PersistentLayerDesc, ResidentLayer, ValuesReconstructState,
};
use super::{AsLayerDesc, LayerName, PersistentLayerDesc, ValuesReconstructState};
///
/// Header stored in the beginning of the file
@@ -800,10 +797,9 @@ impl ImageLayerWriterInner {
///
async fn finish(
self,
timeline: &Arc<Timeline>,
ctx: &RequestContext,
end_key: Option<Key>,
) -> anyhow::Result<ResidentLayer> {
) -> anyhow::Result<(PersistentLayerDesc, Utf8PathBuf)> {
let index_start_blk =
((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32;
@@ -879,12 +875,9 @@ impl ImageLayerWriterInner {
// fsync the file
file.sync_all().await?;
// FIXME: why not carry the virtualfile here, it supports renaming?
let layer = Layer::finish_creating(self.conf, timeline, desc, &self.path)?;
trace!("created image layer {}", self.path);
info!("created image layer {}", layer.local_path());
Ok(layer)
Ok((desc, self.path))
}
}
@@ -963,24 +956,18 @@ impl ImageLayerWriter {
///
pub(crate) async fn finish(
mut self,
timeline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<super::ResidentLayer> {
self.inner.take().unwrap().finish(timeline, ctx, None).await
) -> anyhow::Result<(PersistentLayerDesc, Utf8PathBuf)> {
self.inner.take().unwrap().finish(ctx, None).await
}
/// Finish writing the image layer with an end key, used in [`super::split_writer::SplitImageLayerWriter`]. The end key determines the end of the image layer's covered range and is exclusive.
pub(super) async fn finish_with_end_key(
mut self,
timeline: &Arc<Timeline>,
end_key: Key,
ctx: &RequestContext,
) -> anyhow::Result<super::ResidentLayer> {
self.inner
.take()
.unwrap()
.finish(timeline, ctx, Some(end_key))
.await
) -> anyhow::Result<(PersistentLayerDesc, Utf8PathBuf)> {
self.inner.take().unwrap().finish(ctx, Some(end_key)).await
}
}
@@ -1084,7 +1071,7 @@ mod test {
tenant::{
config::TenantConf,
harness::{TenantHarness, TIMELINE_ID},
storage_layer::ResidentLayer,
storage_layer::{Layer, ResidentLayer},
vectored_blob_io::StreamingVectoredReadPlanner,
Tenant, Timeline,
},
@@ -1155,7 +1142,8 @@ mod test {
key = key.next();
}
writer.finish(&timeline, &ctx).await.unwrap()
let (desc, path) = writer.finish(&ctx).await.unwrap();
Layer::finish_creating(tenant.conf, &timeline, desc, &path).unwrap()
};
let original_size = resident.metadata().file_size;
@@ -1217,7 +1205,9 @@ mod test {
.await
.unwrap();
let replacement = if wrote_keys > 0 {
Some(filtered_writer.finish(&timeline, &ctx).await.unwrap())
let (desc, path) = filtered_writer.finish(&ctx).await.unwrap();
let resident = Layer::finish_creating(tenant.conf, &timeline, desc, &path).unwrap();
Some(resident)
} else {
None
};
@@ -1290,7 +1280,8 @@ mod test {
for (key, img) in images {
writer.put_image(key, img, ctx).await?;
}
let img_layer = writer.finish(tline, ctx).await?;
let (desc, path) = writer.finish(ctx).await?;
let img_layer = Layer::finish_creating(tenant.conf, tline, desc, &path)?;
Ok::<_, anyhow::Error>(img_layer)
}

View File

@@ -439,11 +439,30 @@ impl Layer {
fn record_access(&self, ctx: &RequestContext) {
if self.0.access_stats.record_access(ctx) {
// Visibility was modified to Visible
tracing::info!(
"Layer {} became visible as a result of access",
self.0.desc.key()
);
// Visibility was modified to Visible: maybe log about this
match ctx.task_kind() {
TaskKind::CalculateSyntheticSize
| TaskKind::GarbageCollector
| TaskKind::MgmtRequest => {
// This situation is expected in code paths do binary searches of the LSN space to resolve
// an LSN to a timestamp, which happens during GC, during GC cutoff calculations in synthetic size,
// and on-demand for certain HTTP API requests.
}
_ => {
// In all other contexts, it is unusual to do I/O involving layers which are not visible at
// some branch tip, so we log the fact that we are accessing something that the visibility
// calculation thought should not be visible.
//
// This case is legal in brief time windows: for example an in-flight getpage request can hold on to a layer object
// which was covered by a concurrent compaction.
tracing::info!(
"Layer {} became visible as a result of access",
self.0.desc.key()
);
}
}
// Update the timeline's visible bytes count
if let Some(tl) = self.0.timeline.upgrade() {
tl.metrics
.visible_physical_size_gauge

View File

@@ -1025,6 +1025,15 @@ fn access_stats() {
assert_eq!(access_stats.latest_activity(), lowres_time(atime));
access_stats.set_visibility(LayerVisibilityHint::Visible);
assert_eq!(access_stats.latest_activity(), lowres_time(atime));
// Recording access implicitly makes layer visible, if it wasn't already
let atime = UNIX_EPOCH + Duration::from_secs(2200000000);
access_stats.set_visibility(LayerVisibilityHint::Covered);
assert_eq!(access_stats.visibility(), LayerVisibilityHint::Covered);
assert!(access_stats.record_access_at(atime));
access_stats.set_visibility(LayerVisibilityHint::Visible);
assert!(!access_stats.record_access_at(atime));
access_stats.set_visibility(LayerVisibilityHint::Visible);
}
#[test]

View File

@@ -121,11 +121,11 @@ impl SplitImageLayerWriter {
self.generated_layers
.push(SplitWriterResult::Discarded(layer_key));
} else {
self.generated_layers.push(SplitWriterResult::Produced(
prev_image_writer
.finish_with_end_key(tline, key, ctx)
.await?,
));
let (desc, path) = prev_image_writer.finish_with_end_key(key, ctx).await?;
let layer = Layer::finish_creating(self.conf, tline, desc, &path)?;
self.generated_layers
.push(SplitWriterResult::Produced(layer));
}
}
self.inner.put_image(key, img, ctx).await
@@ -170,9 +170,9 @@ impl SplitImageLayerWriter {
if discard(&layer_key).await {
generated_layers.push(SplitWriterResult::Discarded(layer_key));
} else {
generated_layers.push(SplitWriterResult::Produced(
inner.finish_with_end_key(tline, end_key, ctx).await?,
));
let (desc, path) = inner.finish_with_end_key(end_key, ctx).await?;
let layer = Layer::finish_creating(self.conf, tline, desc, &path)?;
generated_layers.push(SplitWriterResult::Produced(layer));
}
Ok(generated_layers)
}

View File

@@ -163,8 +163,6 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
// How many errors we have seen consequtively
let mut error_run_count = 0;
let mut last_throttle_flag_reset_at = Instant::now();
TENANT_TASK_EVENTS.with_label_values(&["start"]).inc();
async {
let ctx = RequestContext::todo_child(TaskKind::Compaction, DownloadBehavior::Download);
@@ -191,8 +189,6 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
}
}
let sleep_duration;
if period == Duration::ZERO {
#[cfg(not(feature = "testing"))]
@@ -207,12 +203,18 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
};
// Run compaction
let IterationResult { output, elapsed } = iteration.run(tenant.compaction_iteration(&cancel, &ctx)).await;
let IterationResult { output, elapsed } = iteration
.run(tenant.compaction_iteration(&cancel, &ctx))
.await;
match output {
Ok(has_pending_task) => {
error_run_count = 0;
// schedule the next compaction immediately in case there is a pending compaction task
sleep_duration = if has_pending_task { Duration::ZERO } else { period };
sleep_duration = if has_pending_task {
Duration::ZERO
} else {
period
};
}
Err(e) => {
let wait_duration = backoff::exponential_backoff_duration_seconds(
@@ -233,38 +235,20 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
}
// the duration is recorded by performance tests by enabling debug in this function
tracing::debug!(elapsed_ms=elapsed.as_millis(), "compaction iteration complete");
tracing::debug!(
elapsed_ms = elapsed.as_millis(),
"compaction iteration complete"
);
};
// Perhaps we did no work and the walredo process has been idle for some time:
// give it a chance to shut down to avoid leaving walredo process running indefinitely.
// TODO: move this to a separate task (housekeeping loop) that isn't affected by the back-off,
// so we get some upper bound guarantee on when walredo quiesce / this throttling reporting here happens.
if let Some(walredo_mgr) = &tenant.walredo_mgr {
walredo_mgr.maybe_quiesce(period * 10);
}
// TODO: move this (and walredo quiesce) to a separate task that isn't affected by the back-off,
// so we get some upper bound guarantee on when walredo quiesce / this throttling reporting here happens.
info_span!(parent: None, "timeline_get_throttle", tenant_id=%tenant.tenant_shard_id, shard_id=%tenant.tenant_shard_id.shard_slug()).in_scope(|| {
let now = Instant::now();
let prev = std::mem::replace(&mut last_throttle_flag_reset_at, now);
let Stats { count_accounted, count_throttled, sum_throttled_usecs } = tenant.timeline_get_throttle.reset_stats();
if count_throttled == 0 {
return;
}
let allowed_rps = tenant.timeline_get_throttle.steady_rps();
let delta = now - prev;
info!(
n_seconds=%format_args!("{:.3}",
delta.as_secs_f64()),
count_accounted,
count_throttled,
sum_throttled_usecs,
allowed_rps=%format_args!("{allowed_rps:.0}"),
"shard was throttled in the last n_seconds"
);
});
// Sleep
if tokio::time::timeout(sleep_duration, cancel.cancelled())
.await
@@ -346,6 +330,7 @@ async fn gc_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
RequestContext::todo_child(TaskKind::GarbageCollector, DownloadBehavior::Download);
let mut first = true;
tenant.gc_block.set_lsn_lease_deadline(tenant.get_lsn_lease_length());
loop {
tokio::select! {
_ = cancel.cancelled() => {
@@ -363,7 +348,6 @@ async fn gc_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
first = false;
let delays = async {
delay_by_lease_length(tenant.get_lsn_lease_length(), &cancel).await?;
random_init_delay(period, &cancel).await?;
Ok::<_, Cancelled>(())
};
@@ -437,6 +421,7 @@ async fn gc_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
async fn ingest_housekeeping_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
TENANT_TASK_EVENTS.with_label_values(&["start"]).inc();
async {
let mut last_throttle_flag_reset_at = Instant::now();
loop {
tokio::select! {
_ = cancel.cancelled() => {
@@ -483,6 +468,29 @@ async fn ingest_housekeeping_loop(tenant: Arc<Tenant>, cancel: CancellationToken
kind: BackgroundLoopKind::IngestHouseKeeping,
};
iteration.run(tenant.ingest_housekeeping()).await;
// TODO: rename the background loop kind to something more generic, like, tenant housekeeping.
// Or just spawn another background loop for this throttle, it's not like it's super costly.
info_span!(parent: None, "timeline_get_throttle", tenant_id=%tenant.tenant_shard_id, shard_id=%tenant.tenant_shard_id.shard_slug()).in_scope(|| {
let now = Instant::now();
let prev = std::mem::replace(&mut last_throttle_flag_reset_at, now);
let Stats { count_accounted_start, count_accounted_finish, count_throttled, sum_throttled_usecs} = tenant.timeline_get_throttle.reset_stats();
if count_throttled == 0 {
return;
}
let allowed_rps = tenant.timeline_get_throttle.steady_rps();
let delta = now - prev;
info!(
n_seconds=%format_args!("{:.3}",
delta.as_secs_f64()),
count_accounted = count_accounted_finish, // don't break existing log scraping
count_throttled,
sum_throttled_usecs,
count_accounted_start, // log after pre-existing fields to not break existing log scraping
allowed_rps=%format_args!("{allowed_rps:.0}"),
"shard was throttled in the last n_seconds"
);
});
}
}
.await;
@@ -538,28 +546,12 @@ pub(crate) async fn random_init_delay(
let mut rng = rand::thread_rng();
rng.gen_range(Duration::ZERO..=period)
};
match tokio::time::timeout(d, cancel.cancelled()).await {
Ok(_) => Err(Cancelled),
Err(_) => Ok(()),
}
}
/// Delays GC by defaul lease length at restart.
///
/// We do this as the leases mapping are not persisted to disk. By delaying GC by default
/// length, we gurantees that all the leases we granted before the restart will expire
/// when we run GC for the first time after the restart.
pub(crate) async fn delay_by_lease_length(
length: Duration,
cancel: &CancellationToken,
) -> Result<(), Cancelled> {
match tokio::time::timeout(length, cancel.cancelled()).await {
Ok(_) => Err(Cancelled),
Err(_) => Ok(()),
}
}
struct Iteration {
started_at: Instant,
period: Duration,

View File

@@ -24,8 +24,10 @@ use crate::{context::RequestContext, task_mgr::TaskKind};
pub struct Throttle<M: Metric> {
inner: ArcSwap<Inner>,
metric: M,
/// will be turned into [`Stats::count_accounted`]
count_accounted: AtomicU64,
/// will be turned into [`Stats::count_accounted_start`]
count_accounted_start: AtomicU64,
/// will be turned into [`Stats::count_accounted_finish`]
count_accounted_finish: AtomicU64,
/// will be turned into [`Stats::count_throttled`]
count_throttled: AtomicU64,
/// will be turned into [`Stats::sum_throttled_usecs`]
@@ -43,17 +45,21 @@ pub struct Observation {
pub wait_time: Duration,
}
pub trait Metric {
fn accounting_start(&self);
fn accounting_finish(&self);
fn observe_throttling(&self, observation: &Observation);
}
/// See [`Throttle::reset_stats`].
pub struct Stats {
// Number of requests that were subject to throttling, i.e., requests of the configured [`Config::task_kinds`].
pub count_accounted: u64,
// Subset of the `accounted` requests that were actually throttled.
// Note that the numbers are stored as two independent atomics, so, there might be a slight drift.
/// Number of requests that started [`Throttle::throttle`] calls.
pub count_accounted_start: u64,
/// Number of requests that finished [`Throttle::throttle`] calls.
pub count_accounted_finish: u64,
/// Subset of the `accounted` requests that were actually throttled.
/// Note that the numbers are stored as two independent atomics, so, there might be a slight drift.
pub count_throttled: u64,
// Sum of microseconds that throttled requests spent waiting for throttling.
/// Sum of microseconds that throttled requests spent waiting for throttling.
pub sum_throttled_usecs: u64,
}
@@ -65,7 +71,8 @@ where
Self {
inner: ArcSwap::new(Arc::new(Self::new_inner(config))),
metric,
count_accounted: AtomicU64::new(0),
count_accounted_start: AtomicU64::new(0),
count_accounted_finish: AtomicU64::new(0),
count_throttled: AtomicU64::new(0),
sum_throttled_usecs: AtomicU64::new(0),
}
@@ -117,11 +124,13 @@ where
/// This method allows retrieving & resetting that flag.
/// Useful for periodic reporting.
pub fn reset_stats(&self) -> Stats {
let count_accounted = self.count_accounted.swap(0, Ordering::Relaxed);
let count_accounted_start = self.count_accounted_start.swap(0, Ordering::Relaxed);
let count_accounted_finish = self.count_accounted_finish.swap(0, Ordering::Relaxed);
let count_throttled = self.count_throttled.swap(0, Ordering::Relaxed);
let sum_throttled_usecs = self.sum_throttled_usecs.swap(0, Ordering::Relaxed);
Stats {
count_accounted,
count_accounted_start,
count_accounted_finish,
count_throttled,
sum_throttled_usecs,
}
@@ -139,9 +148,12 @@ where
};
let start = std::time::Instant::now();
self.metric.accounting_start();
self.count_accounted_start.fetch_add(1, Ordering::Relaxed);
let did_throttle = inner.rate_limiter.acquire(key_count).await;
self.count_accounted_finish.fetch_add(1, Ordering::Relaxed);
self.metric.accounting_finish();
self.count_accounted.fetch_add(1, Ordering::Relaxed);
if did_throttle {
self.count_throttled.fetch_add(1, Ordering::Relaxed);
let now = Instant::now();

View File

@@ -196,9 +196,8 @@ fn drop_wlock<T>(rlock: tokio::sync::RwLockWriteGuard<'_, T>) {
/// The outward-facing resources required to build a Timeline
pub struct TimelineResources {
pub remote_client: RemoteTimelineClient,
pub timeline_get_throttle: Arc<
crate::tenant::throttle::Throttle<&'static crate::metrics::tenant_throttling::TimelineGet>,
>,
pub timeline_get_throttle:
Arc<crate::tenant::throttle::Throttle<crate::metrics::tenant_throttling::TimelineGet>>,
pub l0_flush_global_state: l0_flush::L0FlushGlobalState,
}
@@ -406,9 +405,8 @@ pub struct Timeline {
gc_lock: tokio::sync::Mutex<()>,
/// Cloned from [`super::Tenant::timeline_get_throttle`] on construction.
timeline_get_throttle: Arc<
crate::tenant::throttle::Throttle<&'static crate::metrics::tenant_throttling::TimelineGet>,
>,
timeline_get_throttle:
Arc<crate::tenant::throttle::Throttle<crate::metrics::tenant_throttling::TimelineGet>>,
/// Keep aux directory cache to avoid it's reconstruction on each update
pub(crate) aux_files: tokio::sync::Mutex<AuxFilesState>,
@@ -4013,7 +4011,9 @@ impl Timeline {
if wrote_keys {
// Normal path: we have written some data into the new image layer for this
// partition, so flush it to disk.
let image_layer = image_layer_writer.finish(self, ctx).await?;
let (desc, path) = image_layer_writer.finish(ctx).await?;
let image_layer = Layer::finish_creating(self.conf, self, desc, &path)?;
info!("created image layer for rel {}", image_layer.local_path());
Ok(ImageLayerCreationOutcome {
image: Some(image_layer),
next_start_key: img_range.end,
@@ -4101,7 +4101,12 @@ impl Timeline {
if wrote_any_image {
// Normal path: we have written some data into the new image layer for this
// partition, so flush it to disk.
let image_layer = image_layer_writer.finish(self, ctx).await?;
let (desc, path) = image_layer_writer.finish(ctx).await?;
let image_layer = Layer::finish_creating(self.conf, self, desc, &path)?;
info!(
"created image layer for metadata {}",
image_layer.local_path()
);
Ok(ImageLayerCreationOutcome {
image: Some(image_layer),
next_start_key: img_range.end,
@@ -4309,7 +4314,9 @@ impl Timeline {
timer.stop_and_record();
// Creating image layers may have caused some previously visible layers to be covered
self.update_layer_visibility().await?;
if !image_layers.is_empty() {
self.update_layer_visibility().await?;
}
Ok(image_layers)
}
@@ -5371,7 +5378,8 @@ impl Timeline {
/// Force create an image layer and place it into the layer map.
///
/// DO NOT use this function directly. Use [`Tenant::branch_timeline_test_with_layers`]
/// or [`Tenant::create_test_timeline_with_layers`] to ensure all these layers are placed into the layer map in one run.
/// or [`Tenant::create_test_timeline_with_layers`] to ensure all these layers are
/// placed into the layer map in one run AND be validated.
#[cfg(test)]
pub(super) async fn force_create_image_layer(
self: &Arc<Timeline>,
@@ -5403,8 +5411,9 @@ impl Timeline {
for (key, img) in images {
image_layer_writer.put_image(key, img, ctx).await?;
}
let image_layer = image_layer_writer.finish(self, ctx).await?;
let (desc, path) = image_layer_writer.finish(ctx).await?;
let image_layer = Layer::finish_creating(self.conf, self, desc, &path)?;
info!("force created image layer {}", image_layer.local_path());
{
let mut guard = self.layers.write().await;
guard.open_mut().unwrap().force_insert_layer(image_layer);
@@ -5416,7 +5425,8 @@ impl Timeline {
/// Force create a delta layer and place it into the layer map.
///
/// DO NOT use this function directly. Use [`Tenant::branch_timeline_test_with_layers`]
/// or [`Tenant::create_test_timeline_with_layers`] to ensure all these layers are placed into the layer map in one run.
/// or [`Tenant::create_test_timeline_with_layers`] to ensure all these layers are
/// placed into the layer map in one run AND be validated.
#[cfg(test)]
pub(super) async fn force_create_delta_layer(
self: &Arc<Timeline>,
@@ -5442,33 +5452,6 @@ impl Timeline {
if let Some(check_start_lsn) = check_start_lsn {
assert!(deltas.lsn_range.start >= check_start_lsn);
}
// check if the delta layer does not violate the LSN invariant, the legacy compaction should always produce a batch of
// layers of the same start/end LSN, and so should the force inserted layer
{
/// Checks if a overlaps with b, assume a/b = [start, end).
pub fn overlaps_with<T: Ord>(a: &Range<T>, b: &Range<T>) -> bool {
!(a.end <= b.start || b.end <= a.start)
}
if deltas.key_range.start.next() != deltas.key_range.end {
let guard = self.layers.read().await;
let mut invalid_layers =
guard.layer_map()?.iter_historic_layers().filter(|layer| {
layer.is_delta()
&& overlaps_with(&layer.lsn_range, &deltas.lsn_range)
&& layer.lsn_range != deltas.lsn_range
// skip single-key layer files
&& layer.key_range.start.next() != layer.key_range.end
});
if let Some(layer) = invalid_layers.next() {
// If a delta layer overlaps with another delta layer AND their LSN range is not the same, panic
panic!(
"inserted layer violates delta layer LSN invariant: current_lsn_range={}..{}, conflict_lsn_range={}..{}",
deltas.lsn_range.start, deltas.lsn_range.end, layer.lsn_range.start, layer.lsn_range.end
);
}
}
}
let mut delta_layer_writer = DeltaLayerWriter::new(
self.conf,
self.timeline_id,
@@ -5483,7 +5466,7 @@ impl Timeline {
}
let (desc, path) = delta_layer_writer.finish(deltas.key_range.end, ctx).await?;
let delta_layer = Layer::finish_creating(self.conf, self, desc, &path)?;
info!("force created delta layer {}", delta_layer.local_path());
{
let mut guard = self.layers.write().await;
guard.open_mut().unwrap().force_insert_layer(delta_layer);

View File

@@ -29,6 +29,7 @@ use utils::id::TimelineId;
use crate::context::{AccessStatsBehavior, RequestContext, RequestContextBuilder};
use crate::page_cache;
use crate::tenant::checks::check_valid_layermap;
use crate::tenant::remote_timeline_client::WaitCompletionError;
use crate::tenant::storage_layer::merge_iterator::MergeIterator;
use crate::tenant::storage_layer::split_writer::{
@@ -563,10 +564,12 @@ impl Timeline {
.await?;
if keys_written > 0 {
let new_layer = image_layer_writer
.finish(self, ctx)
let (desc, path) = image_layer_writer
.finish(ctx)
.await
.map_err(CompactionError::Other)?;
let new_layer = Layer::finish_creating(self.conf, self, desc, &path)
.map_err(CompactionError::Other)?;
tracing::info!(layer=%new_layer, "Rewrote layer, {} -> {} bytes",
layer.metadata().file_size,
new_layer.metadata().file_size);
@@ -1786,20 +1789,12 @@ impl Timeline {
stat.visit_image_layer(desc.file_size());
}
}
for layer in &layer_selection {
let desc = layer.layer_desc();
let key_range = &desc.key_range;
if desc.is_delta() && key_range.start.next() != key_range.end {
let lsn_range = desc.lsn_range.clone();
let intersects = lsn_split_point.range(lsn_range).collect_vec();
if intersects.len() > 1 {
bail!(
"cannot run gc-compaction because it violates the layer map LSN split assumption: layer {} intersects with LSN [{}]",
desc.key(),
intersects.into_iter().map(|lsn| lsn.to_string()).join(", ")
);
}
}
let layer_names: Vec<crate::tenant::storage_layer::LayerName> = layer_selection
.iter()
.map(|layer| layer.layer_desc().layer_name())
.collect_vec();
if let Some(err) = check_valid_layermap(&layer_names) {
bail!("cannot run gc-compaction because {}", err);
}
// The maximum LSN we are processing in this compaction loop
let end_lsn = layer_selection

View File

@@ -135,25 +135,6 @@ async fn delete_remote_layers_and_index(timeline: &Timeline) -> anyhow::Result<(
.context("delete_all")
}
// This function removs remaining traces of a timeline on disk.
// Namely: metadata file, timeline directory, delete mark.
// Note: io::ErrorKind::NotFound are ignored for metadata and timeline dir.
// delete mark should be present because it is the last step during deletion.
// (nothing can fail after its deletion)
async fn cleanup_remaining_timeline_fs_traces(
conf: &PageServerConf,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
) -> anyhow::Result<()> {
// Remove delete mark
// TODO: once we are confident that no more exist in the field, remove this
// line. It cleans up a legacy marker file that might in rare cases be present.
tokio::fs::remove_file(conf.timeline_delete_mark_file_path(tenant_shard_id, timeline_id))
.await
.or_else(fs_ext::ignore_not_found)
.context("remove delete mark")
}
/// It is important that this gets called when DeletionGuard is being held.
/// For more context see comments in [`DeleteTimelineFlow::prepare`]
async fn remove_timeline_from_tenant(
@@ -194,12 +175,10 @@ async fn remove_timeline_from_tenant(
/// 7. Delete mark file
///
/// It is resumable from any step in case a crash/restart occurs.
/// There are three entrypoints to the process:
/// There are two entrypoints to the process:
/// 1. [`DeleteTimelineFlow::run`] this is the main one called by a management api handler.
/// 2. [`DeleteTimelineFlow::resume_deletion`] is called during restarts when local metadata is still present
/// and we possibly neeed to continue deletion of remote files.
/// 3. [`DeleteTimelineFlow::cleanup_remaining_timeline_fs_traces`] is used when we deleted remote
/// index but still have local metadata, timeline directory and delete mark.
///
/// Note the only other place that messes around timeline delete mark is the logic that scans directory with timelines during tenant load.
#[derive(Default)]
@@ -311,18 +290,6 @@ impl DeleteTimelineFlow {
Ok(())
}
#[instrument(skip_all, fields(%timeline_id))]
pub async fn cleanup_remaining_timeline_fs_traces(
tenant: &Tenant,
timeline_id: TimelineId,
) -> anyhow::Result<()> {
let r =
cleanup_remaining_timeline_fs_traces(tenant.conf, tenant.tenant_shard_id, timeline_id)
.await;
info!("Done");
r
}
fn prepare(
tenant: &Tenant,
timeline_id: TimelineId,

View File

@@ -35,6 +35,7 @@ use anyhow::Context;
use bytes::{Bytes, BytesMut};
use pageserver_api::models::{WalRedoManagerProcessStatus, WalRedoManagerStatus};
use pageserver_api::shard::TenantShardId;
use std::future::Future;
use std::sync::Arc;
use std::time::Duration;
use std::time::Instant;
@@ -296,6 +297,97 @@ impl PostgresRedoManager {
}
}
async fn do_with_walredo_process<
F: FnOnce(Arc<Process>) -> Fut,
Fut: Future<Output = Result<O, Error>>,
O,
>(
&self,
pg_version: u32,
closure: F,
) -> Result<O, Error> {
let proc: Arc<Process> = match self.redo_process.get_or_init_detached().await {
Ok(guard) => match &*guard {
ProcessOnceCell::Spawned(proc) => Arc::clone(proc),
ProcessOnceCell::ManagerShutDown => {
return Err(Error::Cancelled);
}
},
Err(permit) => {
let start = Instant::now();
// acquire guard before spawning process, so that we don't spawn new processes
// if the gate is already closed.
let _launched_processes_guard = match self.launched_processes.enter() {
Ok(guard) => guard,
Err(GateError::GateClosed) => unreachable!(
"shutdown sets the once cell to `ManagerShutDown` state before closing the gate"
),
};
let proc = Arc::new(Process {
process: process::WalRedoProcess::launch(
self.conf,
self.tenant_shard_id,
pg_version,
)
.context("launch walredo process")?,
_launched_processes_guard,
});
let duration = start.elapsed();
WAL_REDO_PROCESS_LAUNCH_DURATION_HISTOGRAM.observe(duration.as_secs_f64());
info!(
elapsed_ms = duration.as_millis(),
pid = proc.id(),
"launched walredo process"
);
self.redo_process
.set(ProcessOnceCell::Spawned(Arc::clone(&proc)), permit);
proc
}
};
// async closures are unstable, would support &Process
let result = closure(proc.clone()).await;
if result.is_err() {
// Avoid concurrent callers hitting the same issue by taking `proc` out of the rotation.
// Note that there may be other tasks concurrent with us that also hold `proc`.
// We have to deal with that here.
// Also read the doc comment on field `self.redo_process`.
//
// NB: there may still be other concurrent threads using `proc`.
// The last one will send SIGKILL when the underlying Arc reaches refcount 0.
//
// NB: the drop impl blocks the dropping thread with a wait() system call for
// the child process. In some ways the blocking is actually good: if we
// deferred the waiting into the background / to tokio if we used `tokio::process`,
// it could happen that if walredo always fails immediately, we spawn processes faster
// than we can SIGKILL & `wait` for them to exit. By doing it the way we do here,
// we limit this risk of run-away to at most $num_runtimes * $num_executor_threads.
// This probably needs revisiting at some later point.
match self.redo_process.get() {
None => (),
Some(guard) => {
match &*guard {
ProcessOnceCell::ManagerShutDown => {}
ProcessOnceCell::Spawned(guard_proc) => {
if Arc::ptr_eq(&proc, guard_proc) {
// We're the first to observe an error from `proc`, it's our job to take it out of rotation.
guard.take_and_deinit();
} else {
// Another task already spawned another redo process (further up in this method)
// and put it into `redo_process`. Do nothing, our view of the world is behind.
}
}
}
}
}
// The last task that does this `drop()` of `proc` will do a blocking `wait()` syscall.
drop(proc);
}
result
}
///
/// Process one request for WAL redo using wal-redo postgres
///
@@ -319,130 +411,63 @@ impl PostgresRedoManager {
const MAX_RETRY_ATTEMPTS: u32 = 1;
let mut n_attempts = 0u32;
loop {
let proc: Arc<Process> = match self.redo_process.get_or_init_detached().await {
Ok(guard) => match &*guard {
ProcessOnceCell::Spawned(proc) => Arc::clone(proc),
ProcessOnceCell::ManagerShutDown => {
return Err(Error::Cancelled);
}
},
Err(permit) => {
let start = Instant::now();
// acquire guard before spawning process, so that we don't spawn new processes
// if the gate is already closed.
let _launched_processes_guard = match self.launched_processes.enter() {
Ok(guard) => guard,
Err(GateError::GateClosed) => unreachable!(
"shutdown sets the once cell to `ManagerShutDown` state before closing the gate"
),
};
let proc = Arc::new(Process {
process: process::WalRedoProcess::launch(
self.conf,
self.tenant_shard_id,
pg_version,
)
.context("launch walredo process")?,
_launched_processes_guard,
});
let duration = start.elapsed();
WAL_REDO_PROCESS_LAUNCH_DURATION_HISTOGRAM.observe(duration.as_secs_f64());
info!(
duration_ms = duration.as_millis(),
pid = proc.id(),
"launched walredo process"
);
self.redo_process
.set(ProcessOnceCell::Spawned(Arc::clone(&proc)), permit);
proc
}
};
let base_img = &base_img;
let closure = |proc: Arc<Process>| async move {
let started_at = std::time::Instant::now();
let started_at = std::time::Instant::now();
// Relational WAL records are applied using wal-redo-postgres
let result = proc
.apply_wal_records(rel, blknum, base_img, records, wal_redo_timeout)
.await
.context("apply_wal_records");
// Relational WAL records are applied using wal-redo-postgres
let result = proc
.apply_wal_records(rel, blknum, &base_img, records, wal_redo_timeout)
.await
.context("apply_wal_records");
let duration = started_at.elapsed();
let duration = started_at.elapsed();
let len = records.len();
let nbytes = records.iter().fold(0, |acumulator, record| {
acumulator
+ match &record.1 {
NeonWalRecord::Postgres { rec, .. } => rec.len(),
_ => unreachable!("Only PostgreSQL records are accepted in this batch"),
}
});
WAL_REDO_TIME.observe(duration.as_secs_f64());
WAL_REDO_RECORDS_HISTOGRAM.observe(len as f64);
WAL_REDO_BYTES_HISTOGRAM.observe(nbytes as f64);
debug!(
"postgres applied {} WAL records ({} bytes) in {} us to reconstruct page image at LSN {}",
len,
nbytes,
duration.as_micros(),
lsn
);
// If something went wrong, don't try to reuse the process. Kill it, and
// next request will launch a new one.
if let Err(e) = result.as_ref() {
error!(
"error applying {} WAL records {}..{} ({} bytes) to key {key}, from base image with LSN {} to reconstruct page image at LSN {} n_attempts={}: {:?}",
records.len(),
records.first().map(|p| p.0).unwrap_or(Lsn(0)),
records.last().map(|p| p.0).unwrap_or(Lsn(0)),
nbytes,
base_img_lsn,
lsn,
n_attempts,
e,
);
// Avoid concurrent callers hitting the same issue by taking `proc` out of the rotation.
// Note that there may be other tasks concurrent with us that also hold `proc`.
// We have to deal with that here.
// Also read the doc comment on field `self.redo_process`.
//
// NB: there may still be other concurrent threads using `proc`.
// The last one will send SIGKILL when the underlying Arc reaches refcount 0.
//
// NB: the drop impl blocks the dropping thread with a wait() system call for
// the child process. In some ways the blocking is actually good: if we
// deferred the waiting into the background / to tokio if we used `tokio::process`,
// it could happen that if walredo always fails immediately, we spawn processes faster
// than we can SIGKILL & `wait` for them to exit. By doing it the way we do here,
// we limit this risk of run-away to at most $num_runtimes * $num_executor_threads.
// This probably needs revisiting at some later point.
match self.redo_process.get() {
None => (),
Some(guard) => {
match &*guard {
ProcessOnceCell::ManagerShutDown => {}
ProcessOnceCell::Spawned(guard_proc) => {
if Arc::ptr_eq(&proc, guard_proc) {
// We're the first to observe an error from `proc`, it's our job to take it out of rotation.
guard.take_and_deinit();
} else {
// Another task already spawned another redo process (further up in this method)
// and put it into `redo_process`. Do nothing, our view of the world is behind.
}
}
let len = records.len();
let nbytes = records.iter().fold(0, |acumulator, record| {
acumulator
+ match &record.1 {
NeonWalRecord::Postgres { rec, .. } => rec.len(),
_ => unreachable!("Only PostgreSQL records are accepted in this batch"),
}
}
});
WAL_REDO_TIME.observe(duration.as_secs_f64());
WAL_REDO_RECORDS_HISTOGRAM.observe(len as f64);
WAL_REDO_BYTES_HISTOGRAM.observe(nbytes as f64);
debug!(
"postgres applied {} WAL records ({} bytes) in {} us to reconstruct page image at LSN {}",
len,
nbytes,
duration.as_micros(),
lsn
);
if let Err(e) = result.as_ref() {
error!(
"error applying {} WAL records {}..{} ({} bytes) to key {key}, from base image with LSN {} to reconstruct page image at LSN {} n_attempts={}: {:?}",
records.len(),
records.first().map(|p| p.0).unwrap_or(Lsn(0)),
records.last().map(|p| p.0).unwrap_or(Lsn(0)),
nbytes,
base_img_lsn,
lsn,
n_attempts,
e,
);
}
// The last task that does this `drop()` of `proc` will do a blocking `wait()` syscall.
drop(proc);
} else if n_attempts != 0 {
result.map_err(Error::Other)
};
let result = self.do_with_walredo_process(pg_version, closure).await;
if result.is_ok() && n_attempts != 0 {
info!(n_attempts, "retried walredo succeeded");
}
n_attempts += 1;
if n_attempts > MAX_RETRY_ATTEMPTS || result.is_ok() {
return result.map_err(Error::Other);
return result;
}
}
}

View File

@@ -18,7 +18,6 @@ atomic-take.workspace = true
aws-config.workspace = true
aws-sdk-iam.workspace = true
aws-sigv4.workspace = true
aws-types.workspace = true
base64.workspace = true
bstr.workspace = true
bytes = { workspace = true, features = ["serde"] }
@@ -26,7 +25,6 @@ camino.workspace = true
chrono.workspace = true
clap.workspace = true
consumption_metrics.workspace = true
crossbeam-deque.workspace = true
dashmap.workspace = true
env_logger.workspace = true
framed-websockets.workspace = true
@@ -48,11 +46,9 @@ indexmap.workspace = true
ipnet.workspace = true
itertools.workspace = true
lasso = { workspace = true, features = ["multi-threaded"] }
md5.workspace = true
measured = { workspace = true, features = ["lasso"] }
metrics.workspace = true
once_cell.workspace = true
opentelemetry.workspace = true
parking_lot.workspace = true
parquet.workspace = true
parquet_derive.workspace = true
@@ -67,7 +63,6 @@ reqwest.workspace = true
reqwest-middleware = { workspace = true, features = ["json"] }
reqwest-retry.workspace = true
reqwest-tracing.workspace = true
routerify.workspace = true
rustc-hash.workspace = true
rustls-pemfile.workspace = true
rustls.workspace = true
@@ -79,7 +74,6 @@ smol_str.workspace = true
smallvec.workspace = true
socket2.workspace = true
subtle.workspace = true
task-local-extensions.workspace = true
thiserror.workspace = true
tikv-jemallocator.workspace = true
tikv-jemalloc-ctl = { workspace = true, features = ["use_std"] }
@@ -88,7 +82,6 @@ tokio-postgres-rustls.workspace = true
tokio-rustls.workspace = true
tokio-util.workspace = true
tokio = { workspace = true, features = ["signal"] }
tower-service.workspace = true
tracing-opentelemetry.workspace = true
tracing-subscriber.workspace = true
tracing-utils.workspace = true

View File

@@ -163,6 +163,7 @@ impl ComputeUserInfo {
}
pub(crate) enum ComputeCredentialKeys {
#[cfg(any(test, feature = "testing"))]
Password(Vec<u8>),
AuthKeys(AuthKeys),
None,
@@ -293,16 +294,10 @@ async fn auth_quirks(
// We now expect to see a very specific payload in the place of password.
let (info, unauthenticated_password) = match user_info.try_into() {
Err(info) => {
let res = hacks::password_hack_no_authentication(ctx, info, client).await?;
ctx.set_endpoint_id(res.info.endpoint.clone());
let password = match res.keys {
ComputeCredentialKeys::Password(p) => p,
ComputeCredentialKeys::AuthKeys(_) | ComputeCredentialKeys::None => {
unreachable!("password hack should return a password")
}
};
(res.info, Some(password))
let (info, password) =
hacks::password_hack_no_authentication(ctx, info, client).await?;
ctx.set_endpoint_id(info.endpoint.clone());
(info, Some(password))
}
Ok(info) => (info, None),
};

View File

@@ -1,6 +1,4 @@
use super::{
ComputeCredentialKeys, ComputeCredentials, ComputeUserInfo, ComputeUserInfoNoEndpoint,
};
use super::{ComputeCredentials, ComputeUserInfo, ComputeUserInfoNoEndpoint};
use crate::{
auth::{self, AuthFlow},
config::AuthenticationConfig,
@@ -63,7 +61,7 @@ pub(crate) async fn password_hack_no_authentication(
ctx: &RequestMonitoring,
info: ComputeUserInfoNoEndpoint,
client: &mut stream::PqStream<Stream<impl AsyncRead + AsyncWrite + Unpin>>,
) -> auth::Result<ComputeCredentials> {
) -> auth::Result<(ComputeUserInfo, Vec<u8>)> {
warn!("project not specified, resorting to the password hack auth flow");
ctx.set_auth_method(crate::context::AuthMethod::Cleartext);
@@ -79,12 +77,12 @@ pub(crate) async fn password_hack_no_authentication(
info!(project = &*payload.endpoint, "received missing parameter");
// Report tentative success; compute node will check the password anyway.
Ok(ComputeCredentials {
info: ComputeUserInfo {
Ok((
ComputeUserInfo {
user: info.user,
options: info.options,
endpoint: payload.endpoint,
},
keys: ComputeCredentialKeys::Password(payload.password),
})
payload.password,
))
}

View File

@@ -25,6 +25,8 @@ const MAX_JWK_BODY_SIZE: usize = 64 * 1024;
pub(crate) trait FetchAuthRules: Clone + Send + Sync + 'static {
fn fetch_auth_rules(
&self,
ctx: &RequestMonitoring,
endpoint: EndpointId,
role_name: RoleName,
) -> impl Future<Output = anyhow::Result<Vec<AuthRule>>> + Send;
}
@@ -101,7 +103,9 @@ impl JwkCacheEntryLock {
async fn renew_jwks<F: FetchAuthRules>(
&self,
_permit: JwkRenewalPermit<'_>,
ctx: &RequestMonitoring,
client: &reqwest::Client,
endpoint: EndpointId,
role_name: RoleName,
auth_rules: &F,
) -> anyhow::Result<Arc<JwkCacheEntry>> {
@@ -115,7 +119,9 @@ impl JwkCacheEntryLock {
}
}
let rules = auth_rules.fetch_auth_rules(role_name).await?;
let rules = auth_rules
.fetch_auth_rules(ctx, endpoint, role_name)
.await?;
let mut key_sets =
ahash::HashMap::with_capacity_and_hasher(rules.len(), ahash::RandomState::new());
// TODO(conrad): run concurrently
@@ -166,6 +172,7 @@ impl JwkCacheEntryLock {
self: &Arc<Self>,
ctx: &RequestMonitoring,
client: &reqwest::Client,
endpoint: EndpointId,
role_name: RoleName,
fetch: &F,
) -> Result<Arc<JwkCacheEntry>, anyhow::Error> {
@@ -176,7 +183,9 @@ impl JwkCacheEntryLock {
let Some(cached) = guard else {
let _paused = ctx.latency_timer_pause(crate::metrics::Waiting::Compute);
let permit = self.acquire_permit().await;
return self.renew_jwks(permit, client, role_name, fetch).await;
return self
.renew_jwks(permit, ctx, client, endpoint, role_name, fetch)
.await;
};
let last_update = now.duration_since(cached.last_retrieved);
@@ -187,7 +196,9 @@ impl JwkCacheEntryLock {
let permit = self.acquire_permit().await;
// it's been too long since we checked the keys. wait for them to update.
return self.renew_jwks(permit, client, role_name, fetch).await;
return self
.renew_jwks(permit, ctx, client, endpoint, role_name, fetch)
.await;
}
// every 5 minutes we should spawn a job to eagerly update the token.
@@ -198,8 +209,12 @@ impl JwkCacheEntryLock {
let entry = self.clone();
let client = client.clone();
let fetch = fetch.clone();
let ctx = ctx.clone();
tokio::spawn(async move {
if let Err(e) = entry.renew_jwks(permit, &client, role_name, &fetch).await {
if let Err(e) = entry
.renew_jwks(permit, &ctx, &client, endpoint, role_name, &fetch)
.await
{
tracing::warn!(error=?e, "could not fetch JWKs in background job");
}
});
@@ -216,6 +231,7 @@ impl JwkCacheEntryLock {
ctx: &RequestMonitoring,
jwt: &str,
client: &reqwest::Client,
endpoint: EndpointId,
role_name: RoleName,
fetch: &F,
) -> Result<(), anyhow::Error> {
@@ -242,7 +258,7 @@ impl JwkCacheEntryLock {
let kid = header.key_id.context("missing key id")?;
let mut guard = self
.get_or_update_jwk_cache(ctx, client, role_name.clone(), fetch)
.get_or_update_jwk_cache(ctx, client, endpoint.clone(), role_name.clone(), fetch)
.await?;
// get the key from the JWKs if possible. If not, wait for the keys to update.
@@ -254,7 +270,14 @@ impl JwkCacheEntryLock {
let permit = self.acquire_permit().await;
guard = self
.renew_jwks(permit, client, role_name.clone(), fetch)
.renew_jwks(
permit,
ctx,
client,
endpoint.clone(),
role_name.clone(),
fetch,
)
.await?;
}
_ => {
@@ -318,7 +341,7 @@ impl JwkCache {
jwt: &str,
) -> Result<(), anyhow::Error> {
// try with just a read lock first
let key = (endpoint, role_name.clone());
let key = (endpoint.clone(), role_name.clone());
let entry = self.map.get(&key).as_deref().map(Arc::clone);
let entry = entry.unwrap_or_else(|| {
// acquire a write lock after to insert.
@@ -327,7 +350,7 @@ impl JwkCache {
});
entry
.check_jwt(ctx, jwt, &self.client, role_name, fetch)
.check_jwt(ctx, jwt, &self.client, endpoint, role_name, fetch)
.await
}
}
@@ -688,6 +711,8 @@ X0n5X2/pBLJzxZc62ccvZYVnctBiFs6HbSnxpuMQCfkt/BcR/ttIepBQQIW86wHL
impl FetchAuthRules for Fetch {
async fn fetch_auth_rules(
&self,
_ctx: &RequestMonitoring,
_endpoint: EndpointId,
_role_name: RoleName,
) -> anyhow::Result<Vec<AuthRule>> {
Ok(vec![
@@ -706,6 +731,7 @@ X0n5X2/pBLJzxZc62ccvZYVnctBiFs6HbSnxpuMQCfkt/BcR/ttIepBQQIW86wHL
}
let role_name = RoleName::from("user");
let endpoint = EndpointId::from("ep");
let jwk_cache = Arc::new(JwkCacheEntryLock::default());
@@ -715,6 +741,7 @@ X0n5X2/pBLJzxZc62ccvZYVnctBiFs6HbSnxpuMQCfkt/BcR/ttIepBQQIW86wHL
&RequestMonitoring::test(),
&token,
&client,
endpoint.clone(),
role_name.clone(),
&Fetch(addr),
)

View File

@@ -9,8 +9,9 @@ use crate::{
messages::{ColdStartInfo, EndpointJwksResponse, MetricsAuxInfo},
NodeInfo,
},
context::RequestMonitoring,
intern::{BranchIdInt, BranchIdTag, EndpointIdTag, InternId, ProjectIdInt, ProjectIdTag},
RoleName,
EndpointId, RoleName,
};
use super::jwt::{AuthRule, FetchAuthRules, JwkCache};
@@ -57,7 +58,12 @@ pub struct JwksRoleSettings {
}
impl FetchAuthRules for StaticAuthRules {
async fn fetch_auth_rules(&self, role_name: RoleName) -> anyhow::Result<Vec<AuthRule>> {
async fn fetch_auth_rules(
&self,
_ctx: &RequestMonitoring,
_endpoint: EndpointId,
role_name: RoleName,
) -> anyhow::Result<Vec<AuthRule>> {
let mappings = JWKS_ROLE_MAP.load();
let role_mappings = mappings
.as_deref()

View File

@@ -92,6 +92,12 @@ struct SqlOverHttpArgs {
#[clap(long, default_value_t = 16)]
sql_over_http_cancel_set_shards: usize,
#[clap(long, default_value_t = 10 * 1024 * 1024)] // 10 MiB
sql_over_http_max_request_size_bytes: u64,
#[clap(long, default_value_t = 10 * 1024 * 1024)] // 10 MiB
sql_over_http_max_response_size_bytes: usize,
}
#[tokio::main]
@@ -208,6 +214,8 @@ fn build_config(args: &LocalProxyCliArgs) -> anyhow::Result<&'static ProxyConfig
},
cancel_set: CancelSet::new(args.sql_over_http.sql_over_http_cancel_set_shards),
client_conn_threshold: args.sql_over_http.sql_over_http_client_conn_threshold,
max_request_size_bytes: args.sql_over_http.sql_over_http_max_request_size_bytes,
max_response_size_bytes: args.sql_over_http.sql_over_http_max_response_size_bytes,
};
Ok(Box::leak(Box::new(ProxyConfig {

View File

@@ -62,12 +62,13 @@ static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
#[derive(Clone, Debug, ValueEnum)]
enum AuthBackendType {
Console,
#[cfg(feature = "testing")]
Postgres,
// clap only shows the name, not the alias, in usage text.
// TODO: swap name/alias and deprecate "link"
#[value(name("link"), alias("web"))]
Web,
#[cfg(feature = "testing")]
Postgres,
}
/// Neon proxy/router
@@ -268,6 +269,12 @@ struct SqlOverHttpArgs {
#[clap(long, default_value_t = 64)]
sql_over_http_cancel_set_shards: usize,
#[clap(long, default_value_t = 10 * 1024 * 1024)] // 10 MiB
sql_over_http_max_request_size_bytes: u64,
#[clap(long, default_value_t = 10 * 1024 * 1024)] // 10 MiB
sql_over_http_max_response_size_bytes: usize,
}
#[tokio::main]
@@ -633,17 +640,19 @@ fn build_config(args: &ProxyCliArgs) -> anyhow::Result<&'static ProxyConfig> {
let api = console::provider::ConsoleBackend::Console(api);
auth::Backend::Console(MaybeOwned::Owned(api), ())
}
#[cfg(feature = "testing")]
AuthBackendType::Postgres => {
let url = args.auth_endpoint.parse()?;
let api = console::provider::mock::Api::new(url);
let api = console::provider::ConsoleBackend::Postgres(api);
auth::Backend::Console(MaybeOwned::Owned(api), ())
}
AuthBackendType::Web => {
let url = args.uri.parse()?;
auth::Backend::Web(MaybeOwned::Owned(url), ())
}
#[cfg(feature = "testing")]
AuthBackendType::Postgres => {
let url = args.auth_endpoint.parse()?;
let api = console::provider::mock::Api::new(url, !args.is_private_access_proxy);
let api = console::provider::ConsoleBackend::Postgres(api);
auth::Backend::Console(MaybeOwned::Owned(api), ())
}
};
let config::ConcurrencyLockOptions {
@@ -679,6 +688,8 @@ fn build_config(args: &ProxyCliArgs) -> anyhow::Result<&'static ProxyConfig> {
},
cancel_set: CancelSet::new(args.sql_over_http.sql_over_http_cancel_set_shards),
client_conn_threshold: args.sql_over_http.sql_over_http_client_conn_threshold,
max_request_size_bytes: args.sql_over_http.sql_over_http_max_request_size_bytes,
max_response_size_bytes: args.sql_over_http.sql_over_http_max_response_size_bytes,
};
let authentication_config = AuthenticationConfig {
thread_pool,

View File

@@ -56,6 +56,8 @@ pub struct HttpConfig {
pub pool_options: GlobalConnPoolOptions,
pub cancel_set: CancelSet,
pub client_conn_threshold: u64,
pub max_request_size_bytes: u64,
pub max_response_size_bytes: usize,
}
pub struct AuthenticationConfig {

View File

@@ -303,6 +303,7 @@ impl NodeInfo {
pub(crate) fn set_keys(&mut self, keys: &ComputeCredentialKeys) {
match keys {
#[cfg(any(test, feature = "testing"))]
ComputeCredentialKeys::Password(password) => self.config.password(password),
ComputeCredentialKeys::AuthKeys(auth_keys) => self.config.auth_keys(*auth_keys),
ComputeCredentialKeys::None => &mut self.config,

View File

@@ -41,11 +41,15 @@ impl From<tokio_postgres::Error> for ApiError {
#[derive(Clone)]
pub struct Api {
endpoint: ApiUrl,
ip_allowlist_check_enabled: bool,
}
impl Api {
pub fn new(endpoint: ApiUrl) -> Self {
Self { endpoint }
pub fn new(endpoint: ApiUrl, ip_allowlist_check_enabled: bool) -> Self {
Self {
endpoint,
ip_allowlist_check_enabled,
}
}
pub(crate) fn url(&self) -> &str {
@@ -64,6 +68,7 @@ impl Api {
tokio_postgres::connect(self.endpoint.as_str(), tokio_postgres::NoTls).await?;
tokio::spawn(connection);
let secret = if let Some(entry) = get_execute_postgres_query(
&client,
"select rolpassword from pg_catalog.pg_authid where rolname = $1",
@@ -79,21 +84,26 @@ impl Api {
warn!("user '{}' does not exist", user_info.user);
None
};
let allowed_ips = match get_execute_postgres_query(
&client,
"select allowed_ips from neon_control_plane.endpoints where endpoint_id = $1",
&[&user_info.endpoint.as_str()],
"allowed_ips",
)
.await?
{
Some(s) => {
info!("got allowed_ips: {s}");
s.split(',')
.map(|s| IpPattern::from_str(s).unwrap())
.collect()
let allowed_ips = if self.ip_allowlist_check_enabled {
match get_execute_postgres_query(
&client,
"select allowed_ips from neon_control_plane.endpoints where endpoint_id = $1",
&[&user_info.endpoint.as_str()],
"allowed_ips",
)
.await?
{
Some(s) => {
info!("got allowed_ips: {s}");
s.split(',')
.map(|s| IpPattern::from_str(s).unwrap())
.collect()
}
None => vec![],
}
None => vec![],
} else {
vec![]
};
Ok((secret, allowed_ips))

View File

@@ -79,6 +79,40 @@ pub(crate) enum AuthMethod {
Cleartext,
}
impl Clone for RequestMonitoring {
fn clone(&self) -> Self {
let inner = self.0.try_lock().expect("should not deadlock");
let new = RequestMonitoringInner {
peer_addr: inner.peer_addr,
session_id: inner.session_id,
protocol: inner.protocol,
first_packet: inner.first_packet,
region: inner.region,
span: info_span!("background_task"),
project: inner.project,
branch: inner.branch,
endpoint_id: inner.endpoint_id.clone(),
dbname: inner.dbname.clone(),
user: inner.user.clone(),
application: inner.application.clone(),
error_kind: inner.error_kind,
auth_method: inner.auth_method.clone(),
success: inner.success,
rejected: inner.rejected,
cold_start_info: inner.cold_start_info,
pg_options: inner.pg_options.clone(),
sender: None,
disconnect_sender: None,
latency_timer: LatencyTimer::noop(inner.protocol),
disconnect_timestamp: inner.disconnect_timestamp,
};
Self(TryLock::new(new))
}
}
impl RequestMonitoring {
pub fn new(
session_id: Uuid,

View File

@@ -397,6 +397,8 @@ pub struct LatencyTimer {
protocol: Protocol,
cold_start_info: ColdStartInfo,
outcome: ConnectOutcome,
skip_reporting: bool,
}
impl LatencyTimer {
@@ -409,6 +411,20 @@ impl LatencyTimer {
cold_start_info: ColdStartInfo::Unknown,
// assume failed unless otherwise specified
outcome: ConnectOutcome::Failed,
skip_reporting: false,
}
}
pub(crate) fn noop(protocol: Protocol) -> Self {
Self {
start: time::Instant::now(),
stop: None,
accumulated: Accumulated::default(),
protocol,
cold_start_info: ColdStartInfo::Unknown,
// assume failed unless otherwise specified
outcome: ConnectOutcome::Failed,
skip_reporting: true,
}
}
@@ -443,6 +459,10 @@ pub enum ConnectOutcome {
impl Drop for LatencyTimer {
fn drop(&mut self) {
if self.skip_reporting {
return;
}
let duration = self
.stop
.unwrap_or_else(time::Instant::now)

View File

@@ -27,7 +27,7 @@ use crate::{
Host,
};
use super::conn_pool::{poll_client, AuthData, Client, ConnInfo, GlobalConnPool};
use super::conn_pool::{poll_client, Client, ConnInfo, GlobalConnPool};
pub(crate) struct PoolingBackend {
pub(crate) pool: Arc<GlobalConnPool<tokio_postgres::Client>>,
@@ -274,13 +274,6 @@ impl ConnectMechanism for TokioMechanism {
.dbname(&self.conn_info.dbname)
.connect_timeout(timeout);
match &self.conn_info.auth {
AuthData::Jwt(_) => {}
AuthData::Password(pw) => {
config.password(pw);
}
}
let pause = ctx.latency_timer_pause(crate::metrics::Waiting::Compute);
let res = config.connect(tokio_postgres::NoTls).await;
drop(pause);

View File

@@ -29,11 +29,16 @@ use tracing::{info, info_span, Instrument};
use super::backend::HttpConnError;
#[derive(Debug, Clone)]
pub(crate) struct ConnInfoWithAuth {
pub(crate) conn_info: ConnInfo,
pub(crate) auth: AuthData,
}
#[derive(Debug, Clone)]
pub(crate) struct ConnInfo {
pub(crate) user_info: ComputeUserInfo,
pub(crate) dbname: DbName,
pub(crate) auth: AuthData,
}
#[derive(Debug, Clone)]
@@ -776,6 +781,8 @@ mod tests {
},
cancel_set: CancelSet::new(0),
client_conn_threshold: u64::MAX,
max_request_size_bytes: u64::MAX,
max_response_size_bytes: usize::MAX,
}));
let pool = GlobalConnPool::new(config);
let conn_info = ConnInfo {
@@ -785,7 +792,6 @@ mod tests {
options: NeonOptions::default(),
},
dbname: "dbname".into(),
auth: AuthData::Password("password".as_bytes().into()),
};
let ep_pool = Arc::downgrade(
&pool.get_or_create_endpoint_pool(&conn_info.endpoint_cache_key().unwrap()),
@@ -843,7 +849,6 @@ mod tests {
options: NeonOptions::default(),
},
dbname: "dbname".into(),
auth: AuthData::Password("password".as_bytes().into()),
};
let ep_pool = Arc::downgrade(
&pool.get_or_create_endpoint_pool(&conn_info.endpoint_cache_key().unwrap()),

View File

@@ -60,6 +60,7 @@ use super::backend::PoolingBackend;
use super::conn_pool::AuthData;
use super::conn_pool::Client;
use super::conn_pool::ConnInfo;
use super::conn_pool::ConnInfoWithAuth;
use super::http_util::json_response;
use super::json::json_to_pg_text;
use super::json::pg_text_row_to_json;
@@ -87,9 +88,6 @@ enum Payload {
Batch(BatchQueryData),
}
const MAX_RESPONSE_SIZE: usize = 10 * 1024 * 1024; // 10 MiB
const MAX_REQUEST_SIZE: u64 = 10 * 1024 * 1024; // 10 MiB
static CONN_STRING: HeaderName = HeaderName::from_static("neon-connection-string");
static RAW_TEXT_OUTPUT: HeaderName = HeaderName::from_static("neon-raw-text-output");
static ARRAY_MODE: HeaderName = HeaderName::from_static("neon-array-mode");
@@ -151,7 +149,7 @@ fn get_conn_info(
ctx: &RequestMonitoring,
headers: &HeaderMap,
tls: Option<&TlsConfig>,
) -> Result<ConnInfo, ConnInfoError> {
) -> Result<ConnInfoWithAuth, ConnInfoError> {
// HTTP only uses cleartext (for now and likely always)
ctx.set_auth_method(crate::context::AuthMethod::Cleartext);
@@ -238,11 +236,8 @@ fn get_conn_info(
options: options.unwrap_or_default(),
};
Ok(ConnInfo {
user_info,
dbname,
auth,
})
let conn_info = ConnInfo { user_info, dbname };
Ok(ConnInfoWithAuth { conn_info, auth })
}
// TODO: return different http error codes
@@ -366,10 +361,10 @@ pub(crate) enum SqlOverHttpError {
ConnectCompute(#[from] HttpConnError),
#[error("{0}")]
ConnInfo(#[from] ConnInfoError),
#[error("request is too large (max is {MAX_REQUEST_SIZE} bytes)")]
RequestTooLarge,
#[error("response is too large (max is {MAX_RESPONSE_SIZE} bytes)")]
ResponseTooLarge,
#[error("request is too large (max is {0} bytes)")]
RequestTooLarge(u64),
#[error("response is too large (max is {0} bytes)")]
ResponseTooLarge(usize),
#[error("invalid isolation level")]
InvalidIsolationLevel,
#[error("{0}")]
@@ -386,8 +381,8 @@ impl ReportableError for SqlOverHttpError {
SqlOverHttpError::ReadPayload(e) => e.get_error_kind(),
SqlOverHttpError::ConnectCompute(e) => e.get_error_kind(),
SqlOverHttpError::ConnInfo(e) => e.get_error_kind(),
SqlOverHttpError::RequestTooLarge => ErrorKind::User,
SqlOverHttpError::ResponseTooLarge => ErrorKind::User,
SqlOverHttpError::RequestTooLarge(_) => ErrorKind::User,
SqlOverHttpError::ResponseTooLarge(_) => ErrorKind::User,
SqlOverHttpError::InvalidIsolationLevel => ErrorKind::User,
SqlOverHttpError::Postgres(p) => p.get_error_kind(),
SqlOverHttpError::JsonConversion(_) => ErrorKind::Postgres,
@@ -402,8 +397,8 @@ impl UserFacingError for SqlOverHttpError {
SqlOverHttpError::ReadPayload(p) => p.to_string(),
SqlOverHttpError::ConnectCompute(c) => c.to_string_client(),
SqlOverHttpError::ConnInfo(c) => c.to_string_client(),
SqlOverHttpError::RequestTooLarge => self.to_string(),
SqlOverHttpError::ResponseTooLarge => self.to_string(),
SqlOverHttpError::RequestTooLarge(_) => self.to_string(),
SqlOverHttpError::ResponseTooLarge(_) => self.to_string(),
SqlOverHttpError::InvalidIsolationLevel => self.to_string(),
SqlOverHttpError::Postgres(p) => p.to_string(),
SqlOverHttpError::JsonConversion(_) => "could not parse postgres response".to_string(),
@@ -526,7 +521,10 @@ async fn handle_inner(
// TLS config should be there.
let conn_info = get_conn_info(ctx, headers, config.tls_config.as_ref())?;
info!(user = conn_info.user_info.user.as_str(), "credentials");
info!(
user = conn_info.conn_info.user_info.user.as_str(),
"credentials"
);
// Allow connection pooling only if explicitly requested
// or if we have decided that http pool is no longer opt-in
@@ -537,7 +535,7 @@ async fn handle_inner(
let request_content_length = match request.body().size_hint().upper() {
Some(v) => v,
None => MAX_REQUEST_SIZE + 1,
None => config.http_config.max_request_size_bytes + 1,
};
info!(request_content_length, "request size in bytes");
Metrics::get()
@@ -547,8 +545,10 @@ async fn handle_inner(
// we don't have a streaming request support yet so this is to prevent OOM
// from a malicious user sending an extremely large request body
if request_content_length > MAX_REQUEST_SIZE {
return Err(SqlOverHttpError::RequestTooLarge);
if request_content_length > config.http_config.max_request_size_bytes {
return Err(SqlOverHttpError::RequestTooLarge(
config.http_config.max_request_size_bytes,
));
}
let fetch_and_process_request = Box::pin(
@@ -569,20 +569,20 @@ async fn handle_inner(
.authenticate_with_password(
ctx,
&config.authentication_config,
&conn_info.user_info,
&conn_info.conn_info.user_info,
pw,
)
.await?
}
AuthData::Jwt(jwt) => {
backend
.authenticate_with_jwt(ctx, &conn_info.user_info, jwt)
.authenticate_with_jwt(ctx, &conn_info.conn_info.user_info, jwt)
.await?
}
};
let client = backend
.connect_to_compute(ctx, conn_info, keys, !allow_pool)
.connect_to_compute(ctx, conn_info.conn_info, keys, !allow_pool)
.await?;
// not strictly necessary to mark success here,
// but it's just insurance for if we forget it somewhere else
@@ -612,7 +612,10 @@ async fn handle_inner(
// Now execute the query and return the result.
let json_output = match payload {
Payload::Single(stmt) => stmt.process(cancel, &mut client, parsed_headers).await?,
Payload::Single(stmt) => {
stmt.process(config, cancel, &mut client, parsed_headers)
.await?
}
Payload::Batch(statements) => {
if parsed_headers.txn_read_only {
response = response.header(TXN_READ_ONLY.clone(), &HEADER_VALUE_TRUE);
@@ -628,7 +631,7 @@ async fn handle_inner(
}
statements
.process(cancel, &mut client, parsed_headers)
.process(config, cancel, &mut client, parsed_headers)
.await?
}
};
@@ -656,6 +659,7 @@ async fn handle_inner(
impl QueryData {
async fn process(
self,
config: &'static ProxyConfig,
cancel: CancellationToken,
client: &mut Client<tokio_postgres::Client>,
parsed_headers: HttpHeaders,
@@ -664,7 +668,7 @@ impl QueryData {
let cancel_token = inner.cancel_token();
let res = match select(
pin!(query_to_json(&*inner, self, &mut 0, parsed_headers)),
pin!(query_to_json(config, &*inner, self, &mut 0, parsed_headers)),
pin!(cancel.cancelled()),
)
.await
@@ -727,6 +731,7 @@ impl QueryData {
impl BatchQueryData {
async fn process(
self,
config: &'static ProxyConfig,
cancel: CancellationToken,
client: &mut Client<tokio_postgres::Client>,
parsed_headers: HttpHeaders,
@@ -751,44 +756,52 @@ impl BatchQueryData {
discard.discard();
})?;
let json_output =
match query_batch(cancel.child_token(), &transaction, self, parsed_headers).await {
Ok(json_output) => {
info!("commit");
let status = transaction.commit().await.inspect_err(|_| {
// if we cannot commit - for now don't return connection to pool
// TODO: get a query status from the error
discard.discard();
})?;
discard.check_idle(status);
json_output
}
Err(SqlOverHttpError::Cancelled(_)) => {
if let Err(err) = cancel_token.cancel_query(NoTls).await {
tracing::error!(?err, "could not cancel query");
}
// TODO: after cancelling, wait to see if we can get a status. maybe the connection is still safe.
let json_output = match query_batch(
config,
cancel.child_token(),
&transaction,
self,
parsed_headers,
)
.await
{
Ok(json_output) => {
info!("commit");
let status = transaction.commit().await.inspect_err(|_| {
// if we cannot commit - for now don't return connection to pool
// TODO: get a query status from the error
discard.discard();
})?;
discard.check_idle(status);
json_output
}
Err(SqlOverHttpError::Cancelled(_)) => {
if let Err(err) = cancel_token.cancel_query(NoTls).await {
tracing::error!(?err, "could not cancel query");
}
// TODO: after cancelling, wait to see if we can get a status. maybe the connection is still safe.
discard.discard();
return Err(SqlOverHttpError::Cancelled(SqlOverHttpCancel::Postgres));
}
Err(err) => {
info!("rollback");
let status = transaction.rollback().await.inspect_err(|_| {
// if we cannot rollback - for now don't return connection to pool
// TODO: get a query status from the error
discard.discard();
})?;
discard.check_idle(status);
return Err(err);
}
};
return Err(SqlOverHttpError::Cancelled(SqlOverHttpCancel::Postgres));
}
Err(err) => {
info!("rollback");
let status = transaction.rollback().await.inspect_err(|_| {
// if we cannot rollback - for now don't return connection to pool
// TODO: get a query status from the error
discard.discard();
})?;
discard.check_idle(status);
return Err(err);
}
};
Ok(json_output)
}
}
async fn query_batch(
config: &'static ProxyConfig,
cancel: CancellationToken,
transaction: &Transaction<'_>,
queries: BatchQueryData,
@@ -798,6 +811,7 @@ async fn query_batch(
let mut current_size = 0;
for stmt in queries.queries {
let query = pin!(query_to_json(
config,
transaction,
stmt,
&mut current_size,
@@ -826,6 +840,7 @@ async fn query_batch(
}
async fn query_to_json<T: GenericClient>(
config: &'static ProxyConfig,
client: &T,
data: QueryData,
current_size: &mut usize,
@@ -846,8 +861,10 @@ async fn query_to_json<T: GenericClient>(
rows.push(row);
// we don't have a streaming response support yet so this is to prevent OOM
// from a malicious query (eg a cross join)
if *current_size > MAX_RESPONSE_SIZE {
return Err(SqlOverHttpError::ResponseTooLarge);
if *current_size > config.http_config.max_response_size_bytes {
return Err(SqlOverHttpError::ResponseTooLarge(
config.http_config.max_response_size_bytes,
));
}
}

View File

@@ -13,14 +13,12 @@ testing = ["fail/failpoints"]
[dependencies]
async-stream.workspace = true
anyhow.workspace = true
async-trait.workspace = true
byteorder.workspace = true
bytes.workspace = true
camino.workspace = true
camino-tempfile.workspace = true
chrono.workspace = true
clap = { workspace = true, features = ["derive"] }
const_format.workspace = true
crc32c.workspace = true
fail.workspace = true
git-version.workspace = true
@@ -38,8 +36,6 @@ scopeguard.workspace = true
reqwest = { workspace = true, features = ["json"] }
serde.workspace = true
serde_json.workspace = true
serde_with.workspace = true
signal-hook.workspace = true
strum.workspace = true
strum_macros.workspace = true
thiserror.workspace = true
@@ -48,7 +44,6 @@ tokio-util = { workspace = true }
tokio-io-timeout.workspace = true
tokio-postgres.workspace = true
tokio-tar.workspace = true
toml_edit.workspace = true
tracing.workspace = true
url.workspace = true
metrics.workspace = true

View File

@@ -17,6 +17,7 @@ use postgres_ffi::MAX_SEND_SIZE;
use serde::Deserialize;
use serde::Serialize;
use postgres_ffi::v14::xlog_utils::{IsPartialXLogFileName, IsXLogFileName};
use sha2::{Digest, Sha256};
use utils::id::NodeId;
use utils::id::TenantTimelineId;
@@ -51,6 +52,9 @@ pub struct Args {
/// Dump full term history. True by default.
pub dump_term_history: bool,
/// Dump last modified time of WAL segments. Uses value of `dump_all` by default.
pub dump_wal_last_modified: bool,
/// Filter timelines by tenant_id.
pub tenant_id: Option<TenantId>,
@@ -128,12 +132,19 @@ async fn build_from_tli_dump(
None
};
let wal_last_modified = if args.dump_wal_last_modified {
get_wal_last_modified(timeline_dir).ok().flatten()
} else {
None
};
Timeline {
tenant_id: timeline.ttid.tenant_id,
timeline_id: timeline.ttid.timeline_id,
control_file,
memory,
disk_content,
wal_last_modified,
}
}
@@ -156,6 +167,7 @@ pub struct Timeline {
pub control_file: Option<TimelinePersistentState>,
pub memory: Option<Memory>,
pub disk_content: Option<DiskContent>,
pub wal_last_modified: Option<DateTime<Utc>>,
}
#[derive(Debug, Serialize, Deserialize)]
@@ -302,6 +314,27 @@ fn build_file_info(entry: DirEntry) -> Result<FileInfo> {
})
}
/// Get highest modified time of WAL segments in the directory.
fn get_wal_last_modified(path: &Utf8Path) -> Result<Option<DateTime<Utc>>> {
let mut res = None;
for entry in fs::read_dir(path)? {
if entry.is_err() {
continue;
}
let entry = entry?;
/* Ignore files that are not XLOG segments */
let fname = entry.file_name();
if !IsXLogFileName(&fname) && !IsPartialXLogFileName(&fname) {
continue;
}
let metadata = entry.metadata()?;
let modified: DateTime<Utc> = DateTime::from(metadata.modified()?);
res = std::cmp::max(res, Some(modified));
}
Ok(res)
}
/// Converts SafeKeeperConf to Config, filtering out the fields that are not
/// supposed to be exposed.
fn build_config(config: SafeKeeperConf) -> Config {

View File

@@ -1,7 +1,11 @@
openapi: "3.0.2"
info:
title: Safekeeper control API
description: Neon Safekeeper API
version: "1.0"
license:
name: "Apache"
url: https://github.com/neondatabase/neon/blob/main/LICENSE
servers:
@@ -386,6 +390,12 @@ components:
msg:
type: string
NotFoundError:
type: object
properties:
msg:
type: string
responses:
#

View File

@@ -481,6 +481,7 @@ async fn dump_debug_handler(mut request: Request<Body>) -> Result<Response<Body>
let mut dump_memory: Option<bool> = None;
let mut dump_disk_content: Option<bool> = None;
let mut dump_term_history: Option<bool> = None;
let mut dump_wal_last_modified: Option<bool> = None;
let mut tenant_id: Option<TenantId> = None;
let mut timeline_id: Option<TimelineId> = None;
@@ -494,6 +495,7 @@ async fn dump_debug_handler(mut request: Request<Body>) -> Result<Response<Body>
"dump_memory" => dump_memory = Some(parse_kv_str(&k, &v)?),
"dump_disk_content" => dump_disk_content = Some(parse_kv_str(&k, &v)?),
"dump_term_history" => dump_term_history = Some(parse_kv_str(&k, &v)?),
"dump_wal_last_modified" => dump_wal_last_modified = Some(parse_kv_str(&k, &v)?),
"tenant_id" => tenant_id = Some(parse_kv_str(&k, &v)?),
"timeline_id" => timeline_id = Some(parse_kv_str(&k, &v)?),
_ => Err(ApiError::BadRequest(anyhow::anyhow!(
@@ -508,6 +510,7 @@ async fn dump_debug_handler(mut request: Request<Body>) -> Result<Response<Body>
let dump_memory = dump_memory.unwrap_or(dump_all);
let dump_disk_content = dump_disk_content.unwrap_or(dump_all);
let dump_term_history = dump_term_history.unwrap_or(true);
let dump_wal_last_modified = dump_wal_last_modified.unwrap_or(dump_all);
let args = debug_dump::Args {
dump_all,
@@ -515,6 +518,7 @@ async fn dump_debug_handler(mut request: Request<Body>) -> Result<Response<Body>
dump_memory,
dump_disk_content,
dump_term_history,
dump_wal_last_modified,
tenant_id,
timeline_id,
};

View File

@@ -278,7 +278,7 @@ impl WalResidentTimeline {
}
/// pull_timeline request body.
#[derive(Debug, Serialize, Deserialize)]
#[derive(Debug, Deserialize)]
pub struct Request {
pub tenant_id: TenantId,
pub timeline_id: TimelineId,
@@ -293,7 +293,7 @@ pub struct Response {
}
/// Response for debug dump request.
#[derive(Debug, Serialize, Deserialize)]
#[derive(Debug, Deserialize)]
pub struct DebugDumpResponse {
pub start_time: DateTime<Utc>,
pub finish_time: DateTime<Utc>,

View File

@@ -539,20 +539,17 @@ async fn remove_segments_from_disk(
while let Some(entry) = entries.next_entry().await? {
let entry_path = entry.path();
let fname = entry_path.file_name().unwrap();
if let Some(fname_str) = fname.to_str() {
/* Ignore files that are not XLOG segments */
if !IsXLogFileName(fname_str) && !IsPartialXLogFileName(fname_str) {
continue;
}
let (segno, _) = XLogFromFileName(fname_str, wal_seg_size);
if remove_predicate(segno) {
remove_file(entry_path).await?;
n_removed += 1;
min_removed = min(min_removed, segno);
max_removed = max(max_removed, segno);
REMOVED_WAL_SEGMENTS.inc();
}
/* Ignore files that are not XLOG segments */
if !IsXLogFileName(fname) && !IsPartialXLogFileName(fname) {
continue;
}
let (segno, _) = XLogFromFileName(fname, wal_seg_size)?;
if remove_predicate(segno) {
remove_file(entry_path).await?;
n_removed += 1;
min_removed = min(min_removed, segno);
max_removed = max(max_removed, segno);
REMOVED_WAL_SEGMENTS.inc();
}
}

View File

@@ -10,7 +10,6 @@ bench = []
[dependencies]
anyhow.workspace = true
async-stream.workspace = true
bytes.workspace = true
clap = { workspace = true, features = ["derive"] }
const_format.workspace = true
futures.workspace = true
@@ -24,7 +23,6 @@ parking_lot.workspace = true
prost.workspace = true
tonic.workspace = true
tokio = { workspace = true, features = ["rt-multi-thread"] }
tokio-stream.workspace = true
tracing.workspace = true
metrics.workspace = true
utils.workspace = true

View File

@@ -15,9 +15,7 @@ testing = []
[dependencies]
anyhow.workspace = true
aws-config.workspace = true
bytes.workspace = true
camino.workspace = true
chrono.workspace = true
clap.workspace = true
fail.workspace = true

View File

@@ -5,18 +5,7 @@ edition.workspace = true
license.workspace = true
[dependencies]
pageserver_api.workspace = true
pageserver_client.workspace = true
thiserror.workspace = true
reqwest.workspace = true
utils.workspace = true
serde.workspace = true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
tokio-postgres.workspace = true
tokio-stream.workspace = true
tokio.workspace = true
futures.workspace = true
tokio-util.workspace = true
anyhow.workspace = true
postgres.workspace = true
bytes.workspace = true

View File

@@ -3,6 +3,7 @@ use std::{
borrow::Cow,
cmp::Ordering,
collections::{BTreeMap, HashMap, HashSet},
error::Error,
ops::Deref,
path::PathBuf,
str::FromStr,
@@ -218,9 +219,16 @@ fn passthrough_api_error(node: &Node, e: mgmt_api::Error) -> ApiError {
format!("{node} error receiving error body: {str}").into(),
)
}
mgmt_api::Error::ReceiveBody(str) => {
// Presume errors receiving body are connectivity/availability issues
ApiError::ResourceUnavailable(format!("{node} error receiving body: {str}").into())
mgmt_api::Error::ReceiveBody(err) if err.is_decode() => {
// Return 500 for decoding errors.
ApiError::InternalServerError(anyhow::Error::from(err).context("error decoding body"))
}
mgmt_api::Error::ReceiveBody(err) => {
// Presume errors receiving body are connectivity/availability issues except for decoding errors
let src_str = err.source().map(|e| e.to_string()).unwrap_or_default();
ApiError::ResourceUnavailable(
format!("{node} error receiving error body: {err} {}", src_str).into(),
)
}
mgmt_api::Error::ApiError(StatusCode::NOT_FOUND, msg) => {
ApiError::NotFound(anyhow::anyhow!(format!("{node}: {msg}")).into())

View File

@@ -6,21 +6,13 @@ license.workspace = true
[dependencies]
aws-sdk-s3.workspace = true
aws-smithy-async.workspace = true
either.workspace = true
tokio-rustls.workspace = true
anyhow.workspace = true
git-version.workspace = true
hex.workspace = true
humantime.workspace = true
thiserror.workspace = true
rand.workspace = true
bytes.workspace = true
bincode.workspace = true
crc32c.workspace = true
serde.workspace = true
serde_json.workspace = true
serde_with.workspace = true
workspace_hack.workspace = true
utils.workspace = true
async-stream.workspace = true

View File

@@ -1,7 +1,8 @@
use std::collections::{BTreeSet, HashMap, HashSet};
use std::collections::{HashMap, HashSet};
use anyhow::Context;
use itertools::Itertools;
use pageserver::tenant::checks::check_valid_layermap;
use pageserver::tenant::layer_map::LayerMap;
use pageserver::tenant::remote_timeline_client::index::LayerFileMetadata;
use pageserver_api::shard::ShardIndex;
@@ -48,56 +49,6 @@ impl TimelineAnalysis {
}
}
/// Checks whether a layer map is valid (i.e., is a valid result of the current compaction algorithm if nothing goes wrong).
/// The function checks if we can split the LSN range of a delta layer only at the LSNs of the delta layers. For example,
///
/// ```plain
/// | | | |
/// | 1 | | 2 | | 3 |
/// | | | | | |
/// ```
///
/// This is not a valid layer map because the LSN range of layer 1 intersects with the LSN range of layer 2. 1 and 2 should have
/// the same LSN range.
///
/// The exception is that when layer 2 only contains a single key, it could be split over the LSN range. For example,
///
/// ```plain
/// | | | 2 | | |
/// | 1 | |-------| | 3 |
/// | | | 4 | | |
///
/// If layer 2 and 4 contain the same single key, this is also a valid layer map.
fn check_valid_layermap(metadata: &HashMap<LayerName, LayerFileMetadata>) -> Option<String> {
let mut lsn_split_point = BTreeSet::new(); // TODO: use a better data structure (range tree / range set?)
let mut all_delta_layers = Vec::new();
for (name, _) in metadata.iter() {
if let LayerName::Delta(layer) = name {
if layer.key_range.start.next() != layer.key_range.end {
all_delta_layers.push(layer.clone());
}
}
}
for layer in &all_delta_layers {
let lsn_range = &layer.lsn_range;
lsn_split_point.insert(lsn_range.start);
lsn_split_point.insert(lsn_range.end);
}
for layer in &all_delta_layers {
let lsn_range = layer.lsn_range.clone();
let intersects = lsn_split_point.range(lsn_range).collect_vec();
if intersects.len() > 1 {
let err = format!(
"layer violates the layer map LSN split assumption: layer {} intersects with LSN [{}]",
layer,
intersects.into_iter().map(|lsn| lsn.to_string()).join(", ")
);
return Some(err);
}
}
None
}
pub(crate) async fn branch_cleanup_and_check_errors(
remote_client: &GenericRemoteStorage,
id: &TenantShardTimelineId,
@@ -177,7 +128,8 @@ pub(crate) async fn branch_cleanup_and_check_errors(
}
}
if let Some(err) = check_valid_layermap(&index_part.layer_metadata) {
let layer_names = index_part.layer_metadata.keys().cloned().collect_vec();
if let Some(err) = check_valid_layermap(&layer_names) {
result.errors.push(format!(
"index_part.json contains invalid layer map structure: {err}"
));

View File

@@ -121,8 +121,6 @@ enum Command {
async fn main() -> anyhow::Result<()> {
let cli = Cli::parse();
tracing::info!("version: {}, build_tag {}", GIT_VERSION, BUILD_TAG);
let bucket_config = BucketConfig::from_env()?;
let command_log_name = match &cli.command {
@@ -142,6 +140,8 @@ async fn main() -> anyhow::Result<()> {
chrono::Utc::now().format("%Y_%m_%d__%H_%M_%S")
));
tracing::info!("version: {}, build_tag {}", GIT_VERSION, BUILD_TAG);
let controller_client = cli.controller_api.map(|controller_api| {
ControllerClientConfig {
controller_api,

View File

@@ -1,63 +0,0 @@
import subprocess
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional
from fixtures.log_helper import log
@dataclass
class NeonBroker:
"""An object managing storage_broker instance"""
logfile: Path
port: int
neon_binpath: Path
handle: Optional[subprocess.Popen[Any]] = None # handle of running daemon
def listen_addr(self):
return f"127.0.0.1:{self.port}"
def client_url(self):
return f"http://{self.listen_addr()}"
def check_status(self):
return True # TODO
def try_start(self):
if self.handle is not None:
log.debug(f"storage_broker is already running on port {self.port}")
return
listen_addr = self.listen_addr()
log.info(f'starting storage_broker to listen incoming connections at "{listen_addr}"')
with open(self.logfile, "wb") as logfile:
args = [
str(self.neon_binpath / "storage_broker"),
f"--listen-addr={listen_addr}",
]
self.handle = subprocess.Popen(args, stdout=logfile, stderr=logfile)
# wait for start
started_at = time.time()
while True:
try:
self.check_status()
except Exception as e:
elapsed = time.time() - started_at
if elapsed > 5:
raise RuntimeError(
f"timed out waiting {elapsed:.0f}s for storage_broker start: {e}"
) from e
time.sleep(0.5)
else:
break # success
def stop(self, immediate: bool = False):
if self.handle is not None:
if immediate:
self.handle.kill()
else:
self.handle.terminate()
self.handle.wait()

View File

@@ -102,6 +102,11 @@ def histogram(prefix_without_trailing_underscore: str) -> List[str]:
return [f"{prefix_without_trailing_underscore}_{x}" for x in ["bucket", "count", "sum"]]
def counter(name: str) -> str:
# the prometheus_client package appends _total to all counters client-side
return f"{name}_total"
PAGESERVER_PER_TENANT_REMOTE_TIMELINE_CLIENT_METRICS: Tuple[str, ...] = (
"pageserver_remote_timeline_client_calls_started_total",
"pageserver_remote_timeline_client_calls_finished_total",
@@ -132,9 +137,14 @@ PAGESERVER_GLOBAL_METRICS: Tuple[str, ...] = (
*histogram("pageserver_wait_lsn_seconds"),
*histogram("pageserver_remote_operation_seconds"),
*histogram("pageserver_io_operations_seconds"),
"pageserver_smgr_query_started_global_count_total",
"pageserver_tenant_states_count",
"pageserver_circuit_breaker_broken_total",
"pageserver_circuit_breaker_unbroken_total",
counter("pageserver_tenant_throttling_count_accounted_start_global"),
counter("pageserver_tenant_throttling_count_accounted_finish_global"),
counter("pageserver_tenant_throttling_wait_usecs_sum_global"),
counter("pageserver_tenant_throttling_count_global"),
)
PAGESERVER_PER_TENANT_METRICS: Tuple[str, ...] = (
@@ -146,6 +156,7 @@ PAGESERVER_PER_TENANT_METRICS: Tuple[str, ...] = (
"pageserver_smgr_query_seconds_bucket",
"pageserver_smgr_query_seconds_count",
"pageserver_smgr_query_seconds_sum",
"pageserver_smgr_query_started_count_total",
"pageserver_archive_size",
"pageserver_pitr_history_size",
"pageserver_layer_bytes",
@@ -157,6 +168,10 @@ PAGESERVER_PER_TENANT_METRICS: Tuple[str, ...] = (
"pageserver_evictions_with_low_residence_duration_total",
"pageserver_aux_file_estimated_size",
"pageserver_valid_lsn_lease_count",
counter("pageserver_tenant_throttling_count_accounted_start"),
counter("pageserver_tenant_throttling_count_accounted_finish"),
counter("pageserver_tenant_throttling_wait_usecs_sum"),
counter("pageserver_tenant_throttling_count"),
*PAGESERVER_PER_TENANT_REMOTE_TIMELINE_CLIENT_METRICS,
# "pageserver_directory_entries_count", -- only used if above a certain threshold
# "pageserver_broken_tenants_count" -- used only for broken

View File

@@ -60,7 +60,6 @@ from urllib3.util.retry import Retry
from fixtures import overlayfs
from fixtures.auth_tokens import AuthKeys, TokenScope
from fixtures.broker import NeonBroker
from fixtures.common_types import Lsn, NodeId, TenantId, TenantShardId, TimelineId
from fixtures.endpoint.http import EndpointHttpClient
from fixtures.log_helper import log
@@ -182,6 +181,17 @@ def top_output_dir(base_dir: Path) -> Iterator[Path]:
log.info(f"top_output_dir is {output_dir}")
yield output_dir
@pytest.fixture(scope="session", autouse=True)
def shared_initdb_cache_dir(top_output_dir: Path) -> Iterator[Path]:
log.info("Creating shared initdb cache directory")
cache_dir = top_output_dir / "shared_initdb_cache"
shutil.rmtree(cache_dir, ignore_errors=True)
cache_dir.mkdir(exist_ok=True)
yield cache_dir
@pytest.fixture(scope="session")
def neon_api_key() -> str:
@@ -230,21 +240,6 @@ def port_distributor(worker_base_port: int, worker_port_num: int) -> PortDistrib
return PortDistributor(base_port=worker_base_port, port_number=worker_port_num)
@pytest.fixture(scope="function")
def default_broker(
port_distributor: PortDistributor,
test_output_dir: Path,
neon_binpath: Path,
) -> Iterator[NeonBroker]:
# multiple pytest sessions could get launched in parallel, get them different ports/datadirs
client_port = port_distributor.get_port()
broker_logfile = test_output_dir / "repo" / "storage_broker.log"
broker = NeonBroker(logfile=broker_logfile, port=client_port, neon_binpath=neon_binpath)
yield broker
broker.stop()
@pytest.fixture(scope="session")
def run_id() -> Iterator[uuid.UUID]:
yield uuid.uuid4()
@@ -387,7 +382,6 @@ class NeonEnvBuilder:
self,
repo_dir: Path,
port_distributor: PortDistributor,
broker: NeonBroker,
run_id: uuid.UUID,
mock_s3_server: MockS3Server,
neon_binpath: Path,
@@ -418,6 +412,7 @@ class NeonEnvBuilder:
safekeeper_extra_opts: Optional[list[str]] = None,
storage_controller_port_override: Optional[int] = None,
pageserver_io_buffer_alignment: Optional[int] = None,
shared_initdb_cache_dir: Optional[Path] = None,
):
self.repo_dir = repo_dir
self.rust_log_override = rust_log_override
@@ -428,7 +423,6 @@ class NeonEnvBuilder:
# Safekeepers remote storage
self.safekeepers_remote_storage: Optional[RemoteStorage] = None
self.broker = broker
self.run_id = run_id
self.mock_s3_server: MockS3Server = mock_s3_server
self.pageserver_config_override = pageserver_config_override
@@ -450,6 +444,7 @@ class NeonEnvBuilder:
self.enable_scrub_on_exit = True
self.test_output_dir = test_output_dir
self.test_overlay_dir = test_overlay_dir
self.shared_initdb_cache_dir = shared_initdb_cache_dir
self.overlay_mounts_created_by_us: List[Tuple[str, Path]] = []
self.config_init_force: Optional[str] = None
self.top_output_dir = top_output_dir
@@ -940,6 +935,8 @@ class NeonEnvBuilder:
self.env.storage_controller.assert_no_errors()
self.env.broker.assert_no_errors()
try:
self.overlay_cleanup_teardown()
except Exception as e:
@@ -985,6 +982,7 @@ class NeonEnv:
def __init__(self, config: NeonEnvBuilder):
self.repo_dir = config.repo_dir
self.shared_initdb_cache_dir = config.shared_initdb_cache_dir
self.rust_log_override = config.rust_log_override
self.port_distributor = config.port_distributor
self.s3_mock_server = config.mock_s3_server
@@ -993,7 +991,7 @@ class NeonEnv:
self.endpoints = EndpointFactory(self)
self.safekeepers: List[Safekeeper] = []
self.pageservers: List[NeonPageserver] = []
self.broker = config.broker
self.broker = NeonBroker(self)
self.pageserver_remote_storage = config.pageserver_remote_storage
self.safekeepers_remote_storage = config.safekeepers_remote_storage
self.pg_version = config.pg_version
@@ -1087,6 +1085,10 @@ class NeonEnv:
# Default which can be overriden with `NeonEnvBuilder.pageserver_config_override`
"availability_zone": "us-east-2a",
}
if self.shared_initdb_cache_dir is not None:
ps_cfg["initdb_cache_dir"] = str(self.shared_initdb_cache_dir)
if self.pageserver_virtual_file_io_engine is not None:
ps_cfg["virtual_file_io_engine"] = self.pageserver_virtual_file_io_engine
if config.pageserver_default_tenant_config_compaction_algorithm is not None:
@@ -1168,7 +1170,7 @@ class NeonEnv:
max_workers=2 + len(self.pageservers) + len(self.safekeepers)
) as executor:
futs.append(
executor.submit(lambda: self.broker.try_start() or None)
executor.submit(lambda: self.broker.start() or None)
) # The `or None` is for the linter
for pageserver in self.pageservers:
@@ -1225,7 +1227,7 @@ class NeonEnv:
pageserver.stop(immediate=immediate)
except RuntimeError:
stop_later.append(pageserver)
self.broker.stop(immediate=immediate)
self.broker.stop()
# TODO: for nice logging we need python 3.11 ExceptionGroup
for ps in stop_later:
@@ -1339,7 +1341,6 @@ def neon_simple_env(
pytestconfig: Config,
port_distributor: PortDistributor,
mock_s3_server: MockS3Server,
default_broker: NeonBroker,
run_id: uuid.UUID,
top_output_dir: Path,
test_output_dir: Path,
@@ -1350,6 +1351,7 @@ def neon_simple_env(
pageserver_aux_file_policy: Optional[AuxFileStore],
pageserver_default_tenant_config_compaction_algorithm: Optional[Dict[str, Any]],
pageserver_io_buffer_alignment: Optional[int],
shared_initdb_cache_dir: Optional[Path],
) -> Iterator[NeonEnv]:
"""
Simple Neon environment, with no authentication and no safekeepers.
@@ -1364,7 +1366,6 @@ def neon_simple_env(
top_output_dir=top_output_dir,
repo_dir=repo_dir,
port_distributor=port_distributor,
broker=default_broker,
mock_s3_server=mock_s3_server,
neon_binpath=neon_binpath,
pg_distrib_dir=pg_distrib_dir,
@@ -1377,6 +1378,7 @@ def neon_simple_env(
pageserver_aux_file_policy=pageserver_aux_file_policy,
pageserver_default_tenant_config_compaction_algorithm=pageserver_default_tenant_config_compaction_algorithm,
pageserver_io_buffer_alignment=pageserver_io_buffer_alignment,
shared_initdb_cache_dir=shared_initdb_cache_dir
) as builder:
env = builder.init_start()
@@ -1392,7 +1394,6 @@ def neon_env_builder(
neon_binpath: Path,
pg_distrib_dir: Path,
pg_version: PgVersion,
default_broker: NeonBroker,
run_id: uuid.UUID,
request: FixtureRequest,
test_overlay_dir: Path,
@@ -1402,6 +1403,7 @@ def neon_env_builder(
pageserver_aux_file_policy: Optional[AuxFileStore],
record_property: Callable[[str, object], None],
pageserver_io_buffer_alignment: Optional[int],
shared_initdb_cache_dir: Optional[Path],
) -> Iterator[NeonEnvBuilder]:
"""
Fixture to create a Neon environment for test.
@@ -1428,7 +1430,6 @@ def neon_env_builder(
neon_binpath=neon_binpath,
pg_distrib_dir=pg_distrib_dir,
pg_version=pg_version,
broker=default_broker,
run_id=run_id,
preserve_database_files=cast(bool, pytestconfig.getoption("--preserve-database-files")),
pageserver_virtual_file_io_engine=pageserver_virtual_file_io_engine,
@@ -1438,6 +1439,7 @@ def neon_env_builder(
pageserver_aux_file_policy=pageserver_aux_file_policy,
pageserver_default_tenant_config_compaction_algorithm=pageserver_default_tenant_config_compaction_algorithm,
pageserver_io_buffer_alignment=pageserver_io_buffer_alignment,
shared_initdb_cache_dir=shared_initdb_cache_dir
) as builder:
yield builder
# Propogate `preserve_database_files` to make it possible to use in other fixtures,
@@ -1850,6 +1852,18 @@ class NeonCli(AbstractNeonCli):
args.extend(["-m", "immediate"])
return self.raw_cli(args)
def broker_start(
self, timeout_in_seconds: Optional[int] = None
) -> "subprocess.CompletedProcess[str]":
cmd = ["storage_broker", "start"]
if timeout_in_seconds is not None:
cmd.append(f"--start-timeout={timeout_in_seconds}s")
return self.raw_cli(cmd)
def broker_stop(self) -> "subprocess.CompletedProcess[str]":
cmd = ["storage_broker", "stop"]
return self.raw_cli(cmd)
def endpoint_create(
self,
branch_name: str,
@@ -3871,9 +3885,6 @@ def static_proxy(
dbname = vanilla_pg.default_options["dbname"]
auth_endpoint = f"postgres://proxy:password@{host}:{port}/{dbname}"
# require password for 'http_auth' user
vanilla_pg.edit_hba([f"host {dbname} http_auth {host} password"])
# For simplicity, we use the same user for both `--auth-endpoint` and `safe_psql`
vanilla_pg.start()
vanilla_pg.safe_psql("create user proxy with login superuser password 'password'")
@@ -4094,7 +4105,7 @@ class Endpoint(PgProtocol, LogUtils):
json.dump(dict(data_dict, **kwargs), file, indent=4)
# Please note: Migrations only run if pg_skip_catalog_updates is false
def wait_for_migrations(self, num_migrations: int = 10):
def wait_for_migrations(self, num_migrations: int = 11):
with self.cursor() as cur:
def check_migrations_done():
@@ -4573,6 +4584,40 @@ class Safekeeper(LogUtils):
wait_until(20, 0.5, paused)
class NeonBroker(LogUtils):
"""An object managing storage_broker instance"""
def __init__(self, env: NeonEnv):
super().__init__(logfile=env.repo_dir / "storage_broker.log")
self.env = env
self.port: int = self.env.port_distributor.get_port()
self.running = False
def start(
self,
timeout_in_seconds: Optional[int] = None,
):
assert not self.running
self.env.neon_cli.broker_start(timeout_in_seconds)
self.running = True
return self
def stop(self):
if self.running:
self.env.neon_cli.broker_stop()
self.running = False
return self
def listen_addr(self):
return f"127.0.0.1:{self.port}"
def client_url(self):
return f"http://{self.listen_addr()}"
def assert_no_errors(self):
assert_no_errors(self.logfile, "storage_controller", [])
# TODO: Replace with `StrEnum` when we upgrade to python 3.11
class NodeKind(str, Enum):
PAGESERVER = "pageserver"

View File

@@ -39,7 +39,7 @@ def single_timeline(
log.info("detach template tenant form pageserver")
env.pageserver.tenant_detach(template_tenant)
log.info(f"duplicating template tenant {ncopies} times in S3")
log.info(f"duplicating template tenant {ncopies} times in remote storage")
tenants = fixtures.pageserver.remote_storage.duplicate_tenant(env, template_tenant, ncopies)
# In theory we could just attach all the tenants, force on-demand downloads via mgmt API, and be done.

View File

@@ -142,6 +142,7 @@ def test_branch_creation_before_gc(neon_simple_env: NeonEnv):
"image_creation_threshold": "1",
# set PITR interval to be small, so we can do GC
"pitr_interval": "0 s",
"lsn_lease_length": "0s",
}
)

View File

@@ -11,7 +11,9 @@ from fixtures.utils import print_gc_result, query_scalar
#
def test_branch_behind(neon_env_builder: NeonEnvBuilder):
# Disable pitr, because here we want to test branch creation after GC
env = neon_env_builder.init_start(initial_tenant_conf={"pitr_interval": "0 sec"})
env = neon_env_builder.init_start(
initial_tenant_conf={"pitr_interval": "0 sec", "lsn_lease_length": "0s"}
)
error_regexes = [
".*invalid branch start lsn.*",

View File

@@ -419,7 +419,7 @@ def test_duplicate_creation(neon_env_builder: NeonEnvBuilder):
def test_branching_while_stuck_find_gc_cutoffs(neon_env_builder: NeonEnvBuilder):
env = neon_env_builder.init_start()
env = neon_env_builder.init_start(initial_tenant_conf={"lsn_lease_length": "0s"})
client = env.pageserver.http_client()

View File

@@ -240,6 +240,7 @@ def test_uploads_and_deletions(
"image_creation_threshold": "1",
"image_layer_creation_check_threshold": "0",
"compaction_algorithm": json.dumps({"kind": compaction_algorithm.value}),
"lsn_lease_length": "0s",
}
env = neon_env_builder.init_start(initial_tenant_conf=tenant_conf)

View File

@@ -222,7 +222,7 @@ def pgbench_accounts_initialized(ep):
# Without hs feedback enabled we'd see 'User query might have needed to see row
# versions that must be removed.' errors.
def test_hot_standby_feedback(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin):
env = neon_env_builder.init_start()
env = neon_env_builder.init_start(initial_tenant_conf={"lsn_lease_length": "0s"})
agressive_vacuum_conf = [
"log_autovacuum_min_duration = 0",
"autovacuum_naptime = 10s",

View File

@@ -173,6 +173,7 @@ def test_gc_of_remote_layers(neon_env_builder: NeonEnvBuilder):
# "image_creation_threshold": set at runtime
"compaction_target_size": f"{128 * (1024**2)}", # make it so that we only have 1 partition => image coverage for delta layers => enables gc of delta layers
"image_layer_creation_check_threshold": "0", # always check if a new image layer can be created
"lsn_lease_length": "0s",
}
def tenant_update_config(changes):

View File

@@ -134,6 +134,7 @@ def test_cli_start_stop(neon_env_builder: NeonEnvBuilder):
env.neon_cli.pageserver_stop(env.pageserver.id)
env.neon_cli.safekeeper_stop()
env.neon_cli.storage_controller_stop(False)
env.neon_cli.broker_stop()
# Keep NeonEnv state up to date, it usually owns starting/stopping services
env.pageserver.running = False
@@ -176,6 +177,7 @@ def test_cli_start_stop_multi(neon_env_builder: NeonEnvBuilder):
# Stop this to get out of the way of the following `start`
env.neon_cli.storage_controller_stop(False)
env.neon_cli.broker_stop()
# Default start
res = env.neon_cli.raw_cli(["start"])

Some files were not shown because too many files have changed in this diff Show More