Compare commits

..

14 Commits

Author SHA1 Message Date
Alex Chi Z
3fb6e258dc feat(pageserver): use vectored_get in collect_keyspace
Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-01-28 20:31:15 +01:00
Vlad Lazar
c54cd9e76a storcon: signal LSN wait to pageserver during live migration (#10452)
## Problem

We've seen the ingest connection manager get stuck shortly after a
migration.

## Summary of changes

A speculative mitigation is to use the same mechanism as get page
requests for kicking LSN ingest. The connection manager monitors
LSN waits and queries the broker if no updates are received for the
timeline.

Closes https://github.com/neondatabase/neon/issues/10351
2025-01-28 17:33:07 +00:00
Erik Grinaker
1010b8add4 pageserver: add l0_flush_wait_upload setting (#10534)
## Problem

We need a setting to disable the flush upload wait, to test L0 flush
backpressure in staging.

## Summary of changes

Add `l0_flush_wait_upload` setting.
2025-01-28 17:21:05 +00:00
Folke Behrens
ae4b2af299 fix(proxy): Use correct identifier for usage metrics upload (#10538)
## Problem

The request data and usage metrics S3 requests use the same identifier
shown in logs, causing confusion about what type of upload failed.

## Summary of changes

Use the correct identifier for usage metrics uploads.

neondatabase/cloud#23084
2025-01-28 17:08:17 +00:00
Tristan Partin
15fecb8474 Update axum to 0.8.1 (#10332)
Only a few things that needed updating:

- async_trait was removed
- Message::Text takes a Utf8Bytes object instead of a String

Signed-off-by: Tristan Partin <tristan@neon.tech>
Co-authored-by: Conrad Ludgate <connor@neon.tech>
2025-01-28 15:32:59 +00:00
Erik Grinaker
47677ba578 pageserver: disable L0 backpressure by default (#10535)
## Problem

We'll need further improvements to compaction before enabling L0 flush
backpressure by default. See:
https://neondb.slack.com/archives/C033RQ5SPDH/p1738066068960519?thread_ts=1737818888.474179&cid=C033RQ5SPDH.

Touches #5415.

## Summary of changes

Disable `l0_flush_delay_threshold` by default.
2025-01-28 14:51:30 +00:00
Arpad Müller
83b6bfa229 Re-download layer if its local and on-disk metadata diverge (#10529)
In #10308, we noticed many warnings about the local layer having
different sizes on-disk compared to the metadata.

However, the layer downloader would never redownload layer files if the
sizes or generation numbers change. This is obviously a bug, which we
aim to fix with this PR.

This change also moves the code deciding what to do about a layer to a
dedicated function: before we handled the "routing" via control flow,
but now it's become too complicated and it is nicer to have the
different verdicts for a layer spelled out in a list/match.
2025-01-28 13:39:53 +00:00
Erik Grinaker
ed942b05f7 Revert "pageserver: revert flush backpressure" (#10402)" (#10533)
This reverts commit 9e55d79803.

We'll still need this until we can tune L0 flush backpressure and
compaction. I'll add a setting to disable this separately.
2025-01-28 13:33:58 +00:00
Vlad Lazar
62a717a2ca pageserver: use PS node id for SK appname (#10522)
## Problem

This one is fairly embarrassing. Safekeeper node id was used in the
pageserver application name
when connecting to safekeepers.

## Summary of changes

Use the right node id.

Closes https://github.com/neondatabase/neon/issues/10461
2025-01-28 13:11:51 +00:00
Peter Bendel
c8fbbb9b65 Test ingest_benchmark with different stripe size and also PostgreSQL version 17 (#10510)
We want to verify if pageserver stripe size has an impact on ingest
performance.
We want to verify if ingest performance has improved or regressed with
postgres version 17.

## Summary of changes

- Allow to create new project with different postgres versions
- allow to pre-shard new project with different stripe sizes instead of
relying on storage manager to shard_split the project once a threshold
is exceeded

Replaces https://github.com/neondatabase/neon/pull/10509

Test run https://github.com/neondatabase/neon/actions/runs/12986410381
2025-01-27 21:06:05 +00:00
John Spray
d73f4a6470 pageserver: retry wrapper on manifest upload (#10524)
## Problem

On remote storage errors (e.g. I/O timeout) uploading tenant manifest,
all of compaction could fail. This is a problem IRL because we shouldn't
abort compaction on a single IO error, and in tests because it generates
spurious failures.

Related:
https://github.com/orgs/neondatabase/projects/51/views/2?sliceBy%5Bvalue%5D=jcsp&pane=issue&itemId=93692919&issue=neondatabase%7Cneon%7C10389

## Summary of changes

- Use `backoff::retry` when uploading tenant manifest
2025-01-27 21:02:25 +00:00
Heikki Linnakangas
5477d7db93 fast_import: fixes for Postgres v17 (#10414)
Now that the tests are run on v17, they're also run in debug mode, which
is slow. Increase statement_timeout in the test to work around that.
2025-01-27 19:47:49 +00:00
Arpad Müller
eb9832d846 Remove PQ_LIB_DIR env var (#10526)
We now don't need libpq any more for the build of the storage
controller, as we use `diesel-async` since #10280. Therefore, we remove
the env var that gave cargo/rustc the location for libpq.

Follow-up of #10280
2025-01-27 19:38:18 +00:00
Christian Schwarz
3d36dfe533 fix: noisy broker subscription failed error during storage broker deploys (#10521)
During broker deploys, pageservers log this noisy WARN en masse.

I can trivially reproduce the WARN message in neon_local by SIGKILLing
broker during e.g. `pgbench -i`.

I don't understand why tonic is not detecting the error as
`Code::Unavailable`.

Until we find time to understand that / fix upstream, this PR adds the
error message to the existing list of known error messages that get
demoted to INFO level.

Refs:
-  refs https://github.com/neondatabase/neon/issues/9562
2025-01-27 19:19:55 +00:00
45 changed files with 764 additions and 278 deletions

View File

@@ -17,6 +17,31 @@ inputs:
compute_units:
description: '[Min, Max] compute units'
default: '[1, 1]'
# settings below only needed if you want the project to be sharded from the beginning
shard_split_project:
description: 'by default new projects are not shard-split, specify true to shard-split'
required: false
default: 'false'
admin_api_key:
description: 'Admin API Key needed for shard-splitting. Must be specified if shard_split_project is true'
required: false
shard_count:
description: 'Number of shards to split the project into, only applies if shard_split_project is true'
required: false
default: '8'
stripe_size:
description: 'Stripe size, optional, in 8kiB pages. e.g. set 2048 for 16MB stripes. Default is 128 MiB, only applies if shard_split_project is true'
required: false
default: '32768'
psql_path:
description: 'Path to psql binary - it is caller responsibility to provision the psql binary'
required: false
default: '/tmp/neon/pg_install/v16/bin/psql'
libpq_lib_path:
description: 'Path to directory containing libpq library - it is caller responsibility to provision the libpq library'
required: false
default: '/tmp/neon/pg_install/v16/lib'
outputs:
dsn:
@@ -63,6 +88,23 @@ runs:
echo "project_id=${project_id}" >> $GITHUB_OUTPUT
echo "Project ${project_id} has been created"
if [ "${SHARD_SPLIT_PROJECT}" = "true" ]; then
# determine tenant ID
TENANT_ID=`${PSQL} ${dsn} -t -A -c "SHOW neon.tenant_id"`
echo "Splitting project ${project_id} with tenant_id ${TENANT_ID} into $((SHARD_COUNT)) shards with stripe size $((STRIPE_SIZE))"
echo "Sending PUT request to https://${API_HOST}/regions/${REGION_ID}/api/v1/admin/storage/proxy/control/v1/tenant/${TENANT_ID}/shard_split"
echo "with body {\"new_shard_count\": $((SHARD_COUNT)), \"new_stripe_size\": $((STRIPE_SIZE))}"
# we need an ADMIN API KEY to invoke storage controller API for shard splitting (bash -u above checks that the variable is set)
curl -X PUT \
"https://${API_HOST}/regions/${REGION_ID}/api/v1/admin/storage/proxy/control/v1/tenant/${TENANT_ID}/shard_split" \
-H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer ${ADMIN_API_KEY}" \
-d "{\"new_shard_count\": $SHARD_COUNT, \"new_stripe_size\": $STRIPE_SIZE}"
fi
env:
API_HOST: ${{ inputs.api_host }}
API_KEY: ${{ inputs.api_key }}
@@ -70,3 +112,9 @@ runs:
POSTGRES_VERSION: ${{ inputs.postgres_version }}
MIN_CU: ${{ fromJSON(inputs.compute_units)[0] }}
MAX_CU: ${{ fromJSON(inputs.compute_units)[1] }}
SHARD_SPLIT_PROJECT: ${{ inputs.shard_split_project }}
ADMIN_API_KEY: ${{ inputs.admin_api_key }}
SHARD_COUNT: ${{ inputs.shard_count }}
STRIPE_SIZE: ${{ inputs.stripe_size }}
PSQL: ${{ inputs.psql_path }}
LD_LIBRARY_PATH: ${{ inputs.libpq_lib_path }}

View File

@@ -158,8 +158,6 @@ jobs:
- name: Run cargo build
run: |
PQ_LIB_DIR=$(pwd)/pg_install/v16/lib
export PQ_LIB_DIR
${cov_prefix} mold -run cargo build $CARGO_FLAGS $CARGO_FEATURES --bins --tests
# Do install *before* running rust tests because they might recompile the
@@ -217,8 +215,6 @@ jobs:
env:
NEXTEST_RETRIES: 3
run: |
PQ_LIB_DIR=$(pwd)/pg_install/v16/lib
export PQ_LIB_DIR
LD_LIBRARY_PATH=$(pwd)/pg_install/v17/lib
export LD_LIBRARY_PATH

View File

@@ -235,7 +235,7 @@ jobs:
echo 'CPPFLAGS=-I/usr/local/opt/openssl@3/include' >> $GITHUB_ENV
- name: Run cargo build (only for v17)
run: PQ_LIB_DIR=$(pwd)/pg_install/v17/lib cargo build --all --release -j$(sysctl -n hw.ncpu)
run: cargo build --all --release -j$(sysctl -n hw.ncpu)
- name: Check that no warnings are produced (only for v17)
run: ./run_clippy.sh

View File

@@ -28,7 +28,24 @@ jobs:
strategy:
fail-fast: false # allow other variants to continue even if one fails
matrix:
target_project: [new_empty_project, large_existing_project]
include:
- target_project: new_empty_project_stripe_size_2048
stripe_size: 2048 # 16 MiB
postgres_version: 16
- target_project: new_empty_project_stripe_size_32768
stripe_size: 32768 # 256 MiB # note that this is different from null because using null will shard_split the project only if it reaches the threshold
# while here it is sharded from the beginning with a shard size of 256 MiB
postgres_version: 16
- target_project: new_empty_project
stripe_size: null # run with neon defaults which will shard split only when reaching the threshold
postgres_version: 16
- target_project: new_empty_project
stripe_size: null # run with neon defaults which will shard split only when reaching the threshold
postgres_version: 17
- target_project: large_existing_project
stripe_size: null # cannot re-shared or choose different stripe size for existing, already sharded project
postgres_version: 16
max-parallel: 1 # we want to run each stripe size sequentially to be able to compare the results
permissions:
contents: write
statuses: write
@@ -67,17 +84,21 @@ jobs:
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Create Neon Project
if: ${{ matrix.target_project == 'new_empty_project' }}
if: ${{ startsWith(matrix.target_project, 'new_empty_project') }}
id: create-neon-project-ingest-target
uses: ./.github/actions/neon-project-create
with:
region_id: aws-us-east-2
postgres_version: 16
postgres_version: ${{ matrix.postgres_version }}
compute_units: '[7, 7]' # we want to test large compute here to avoid compute-side bottleneck
api_key: ${{ secrets.NEON_STAGING_API_KEY }}
shard_split_project: ${{ matrix.stripe_size != null && 'true' || 'false' }}
admin_api_key: ${{ secrets.NEON_STAGING_ADMIN_API_KEY }}
shard_count: 8
stripe_size: ${{ matrix.stripe_size }}
- name: Initialize Neon project
if: ${{ matrix.target_project == 'new_empty_project' }}
if: ${{ startsWith(matrix.target_project, 'new_empty_project') }}
env:
BENCHMARK_INGEST_TARGET_CONNSTR: ${{ steps.create-neon-project-ingest-target.outputs.dsn }}
NEW_PROJECT_ID: ${{ steps.create-neon-project-ingest-target.outputs.project_id }}
@@ -130,7 +151,7 @@ jobs:
test_selection: performance/test_perf_ingest_using_pgcopydb.py
run_in_parallel: false
extra_params: -s -m remote_cluster --timeout 86400 -k test_ingest_performance_using_pgcopydb
pg_version: v16
pg_version: v${{ matrix.postgres_version }}
save_perf_report: true
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
@@ -146,7 +167,7 @@ jobs:
${PSQL} "${BENCHMARK_INGEST_TARGET_CONNSTR}" -c "\dt+"
- name: Delete Neon Project
if: ${{ always() && matrix.target_project == 'new_empty_project' }}
if: ${{ always() && startsWith(matrix.target_project, 'new_empty_project') }}
uses: ./.github/actions/neon-project-delete
with:
project_id: ${{ steps.create-neon-project-ingest-target.outputs.project_id }}

View File

@@ -114,7 +114,7 @@ jobs:
run: make walproposer-lib -j$(nproc)
- name: Produce the build stats
run: PQ_LIB_DIR=$(pwd)/pg_install/v17/lib cargo build --all --release --timings -j$(nproc)
run: cargo build --all --release --timings -j$(nproc)
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4

125
Cargo.lock generated
View File

@@ -179,7 +179,7 @@ dependencies = [
"nom",
"num-traits",
"rusticata-macros",
"thiserror",
"thiserror 1.0.69",
"time",
]
@@ -718,14 +718,14 @@ dependencies = [
[[package]]
name = "axum"
version = "0.7.9"
version = "0.8.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "edca88bc138befd0323b20752846e6587272d3b03b0343c8ea28a6f819e6e71f"
checksum = "6d6fd624c75e18b3b4c6b9caf42b1afe24437daaee904069137d8bab077be8b8"
dependencies = [
"async-trait",
"axum-core",
"base64 0.22.1",
"bytes",
"form_urlencoded",
"futures-util",
"http 1.1.0",
"http-body 1.0.0",
@@ -733,7 +733,7 @@ dependencies = [
"hyper 1.4.1",
"hyper-util",
"itoa",
"matchit 0.7.0",
"matchit",
"memchr",
"mime",
"percent-encoding",
@@ -746,7 +746,7 @@ dependencies = [
"sha1",
"sync_wrapper 1.0.1",
"tokio",
"tokio-tungstenite 0.24.0",
"tokio-tungstenite 0.26.1",
"tower 0.5.2",
"tower-layer",
"tower-service",
@@ -755,11 +755,10 @@ dependencies = [
[[package]]
name = "axum-core"
version = "0.4.5"
version = "0.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "09f2bd6146b97ae3359fa0cc6d6b376d9539582c7b4220f041a33ec24c226199"
checksum = "df1362f362fd16024ae199c1970ce98f9661bf5ef94b9808fee734bc3698b733"
dependencies = [
"async-trait",
"bytes",
"futures-util",
"http 1.1.0",
@@ -1130,7 +1129,7 @@ dependencies = [
"log",
"nix 0.25.1",
"regex",
"thiserror",
"thiserror 1.0.69",
]
[[package]]
@@ -1311,7 +1310,7 @@ dependencies = [
"serde_with",
"signal-hook",
"tar",
"thiserror",
"thiserror 1.0.69",
"tokio",
"tokio-postgres 0.7.7",
"tokio-stream",
@@ -1420,7 +1419,7 @@ dependencies = [
"serde",
"serde_json",
"storage_broker",
"thiserror",
"thiserror 1.0.69",
"tokio",
"tokio-postgres 0.7.7",
"tokio-util",
@@ -2264,7 +2263,7 @@ dependencies = [
"pin-project",
"rand 0.8.5",
"sha1",
"thiserror",
"thiserror 1.0.69",
"tokio",
"tokio-util",
]
@@ -3390,12 +3389,6 @@ dependencies = [
"regex-automata 0.1.10",
]
[[package]]
name = "matchit"
version = "0.7.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b87248edafb776e59e6ee64a79086f65890d3510f2c656c000bf2a7e8a0aea40"
[[package]]
name = "matchit"
version = "0.8.4"
@@ -3786,7 +3779,7 @@ dependencies = [
"serde_json",
"serde_path_to_error",
"sha2",
"thiserror",
"thiserror 1.0.69",
"url",
]
@@ -3836,7 +3829,7 @@ dependencies = [
"futures-sink",
"js-sys",
"pin-project-lite",
"thiserror",
"thiserror 1.0.69",
"tracing",
]
@@ -3868,7 +3861,7 @@ dependencies = [
"opentelemetry_sdk",
"prost",
"reqwest",
"thiserror",
"thiserror 1.0.69",
]
[[package]]
@@ -3904,7 +3897,7 @@ dependencies = [
"percent-encoding",
"rand 0.8.5",
"serde_json",
"thiserror",
"thiserror 1.0.69",
"tokio",
"tokio-stream",
"tracing",
@@ -4018,7 +4011,7 @@ dependencies = [
"remote_storage",
"serde_json",
"svg_fmt",
"thiserror",
"thiserror 1.0.69",
"tokio",
"tokio-util",
"utils",
@@ -4094,7 +4087,7 @@ dependencies = [
"strum_macros",
"sysinfo",
"tenant_size_model",
"thiserror",
"thiserror 1.0.69",
"tikv-jemallocator",
"tokio",
"tokio-epoll-uring",
@@ -4140,7 +4133,7 @@ dependencies = [
"storage_broker",
"strum",
"strum_macros",
"thiserror",
"thiserror 1.0.69",
"utils",
]
@@ -4155,7 +4148,7 @@ dependencies = [
"postgres",
"reqwest",
"serde",
"thiserror",
"thiserror 1.0.69",
"tokio",
"tokio-postgres 0.7.7",
"tokio-stream",
@@ -4559,7 +4552,7 @@ dependencies = [
"rustls 0.23.18",
"rustls-pemfile 2.1.1",
"serde",
"thiserror",
"thiserror 1.0.69",
"tokio",
"tokio-postgres 0.7.7",
"tokio-postgres-rustls",
@@ -4597,7 +4590,7 @@ dependencies = [
"pprof",
"regex",
"serde",
"thiserror",
"thiserror 1.0.69",
"tracing",
"utils",
]
@@ -4608,7 +4601,7 @@ version = "0.1.0"
dependencies = [
"anyhow",
"camino",
"thiserror",
"thiserror 1.0.69",
"tokio",
"workspace_hack",
]
@@ -4641,7 +4634,7 @@ dependencies = [
"smallvec",
"symbolic-demangle",
"tempfile",
"thiserror",
"thiserror 1.0.69",
]
[[package]]
@@ -4673,7 +4666,7 @@ dependencies = [
"postgres-protocol 0.6.4",
"rand 0.8.5",
"serde",
"thiserror",
"thiserror 1.0.69",
"tokio",
]
@@ -4744,7 +4737,7 @@ dependencies = [
"memchr",
"parking_lot 0.12.1",
"procfs",
"thiserror",
"thiserror 1.0.69",
]
[[package]]
@@ -4914,7 +4907,7 @@ dependencies = [
"strum",
"strum_macros",
"subtle",
"thiserror",
"thiserror 1.0.69",
"tikv-jemalloc-ctl",
"tikv-jemallocator",
"tokio",
@@ -5311,7 +5304,7 @@ dependencies = [
"http 1.1.0",
"reqwest",
"serde",
"thiserror",
"thiserror 1.0.69",
"tower-service",
]
@@ -5331,7 +5324,7 @@ dependencies = [
"reqwest",
"reqwest-middleware",
"retry-policies",
"thiserror",
"thiserror 1.0.69",
"tokio",
"tracing",
"wasm-timer",
@@ -5347,7 +5340,7 @@ dependencies = [
"async-trait",
"getrandom 0.2.11",
"http 1.1.0",
"matchit 0.8.4",
"matchit",
"opentelemetry",
"reqwest",
"reqwest-middleware",
@@ -5726,7 +5719,7 @@ dependencies = [
"storage_broker",
"strum",
"strum_macros",
"thiserror",
"thiserror 1.0.69",
"tikv-jemallocator",
"tokio",
"tokio-io-timeout",
@@ -5765,7 +5758,7 @@ dependencies = [
"reqwest",
"safekeeper_api",
"serde",
"thiserror",
"thiserror 1.0.69",
"utils",
"workspace_hack",
]
@@ -5974,7 +5967,7 @@ dependencies = [
"rand 0.8.5",
"serde",
"serde_json",
"thiserror",
"thiserror 1.0.69",
"time",
"url",
"uuid",
@@ -6046,7 +6039,7 @@ checksum = "c7715380eec75f029a4ef7de39a9200e0a63823176b759d055b613f5a87df6a6"
dependencies = [
"percent-encoding",
"serde",
"thiserror",
"thiserror 1.0.69",
]
[[package]]
@@ -6208,7 +6201,7 @@ checksum = "adc4e5204eb1910f40f9cfa375f6f05b68c3abac4b6fd879c8ff5e7ae8a0a085"
dependencies = [
"num-bigint",
"num-traits",
"thiserror",
"thiserror 1.0.69",
"time",
]
@@ -6353,7 +6346,7 @@ dependencies = [
"serde_json",
"strum",
"strum_macros",
"thiserror",
"thiserror 1.0.69",
"tokio",
"tokio-util",
"tracing",
@@ -6645,7 +6638,16 @@ version = "1.0.69"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b6aaf5339b578ea85b50e080feb250a3e8ae8cfcdff9a461c9ec2904bc923f52"
dependencies = [
"thiserror-impl",
"thiserror-impl 1.0.69",
]
[[package]]
name = "thiserror"
version = "2.0.11"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d452f284b73e6d76dd36758a0c8684b1d5be31f92b89d07fd5822175732206fc"
dependencies = [
"thiserror-impl 2.0.11",
]
[[package]]
@@ -6659,6 +6661,17 @@ dependencies = [
"syn 2.0.90",
]
[[package]]
name = "thiserror-impl"
version = "2.0.11"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "26afc1baea8a989337eeb52b6e72a039780ce45c3edfcc9c5b9d112feeb173c2"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
]
[[package]]
name = "thread_local"
version = "1.1.7"
@@ -6815,7 +6828,7 @@ dependencies = [
"nix 0.26.4",
"once_cell",
"scopeguard",
"thiserror",
"thiserror 1.0.69",
"tokio",
"tokio-util",
"tracing",
@@ -6998,14 +7011,14 @@ dependencies = [
[[package]]
name = "tokio-tungstenite"
version = "0.24.0"
version = "0.26.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "edc5f74e248dc973e0dbb7b74c7e0d6fcc301c694ff50049504004ef4d0cdcd9"
checksum = "be4bf6fecd69fcdede0ec680aaf474cdab988f9de6bc73d3758f0160e3b7025a"
dependencies = [
"futures-util",
"log",
"tokio",
"tungstenite 0.24.0",
"tungstenite 0.26.1",
]
[[package]]
@@ -7315,16 +7328,16 @@ dependencies = [
"log",
"rand 0.8.5",
"sha1",
"thiserror",
"thiserror 1.0.69",
"url",
"utf-8",
]
[[package]]
name = "tungstenite"
version = "0.24.0"
version = "0.26.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "18e5b8366ee7a95b16d32197d0b2604b43a0be89dc5fac9f8e96ccafbaedda8a"
checksum = "413083a99c579593656008130e29255e54dcaae495be556cc26888f211648c24"
dependencies = [
"byteorder",
"bytes",
@@ -7334,7 +7347,7 @@ dependencies = [
"log",
"rand 0.8.5",
"sha1",
"thiserror",
"thiserror 2.0.11",
"utf-8",
]
@@ -7529,7 +7542,7 @@ dependencies = [
"signal-hook",
"strum",
"strum_macros",
"thiserror",
"thiserror 1.0.69",
"tokio",
"tokio-stream",
"tokio-tar",
@@ -7629,7 +7642,7 @@ dependencies = [
"remote_storage",
"serde",
"serde_json",
"thiserror",
"thiserror 1.0.69",
"tikv-jemallocator",
"tokio",
"tokio-util",
@@ -8158,7 +8171,7 @@ dependencies = [
"ring",
"signature 2.2.0",
"spki 0.7.3",
"thiserror",
"thiserror 1.0.69",
"zeroize",
]
@@ -8175,7 +8188,7 @@ dependencies = [
"nom",
"oid-registry",
"rusticata-macros",
"thiserror",
"thiserror 1.0.69",
"time",
]

View File

@@ -65,7 +65,7 @@ aws-smithy-types = "1.2"
aws-credential-types = "1.2.0"
aws-sigv4 = { version = "1.2", features = ["sign-http"] }
aws-types = "1.3"
axum = { version = "0.7.9", features = ["ws"] }
axum = { version = "0.8.1", features = ["ws"] }
base64 = "0.13.0"
bincode = "1.3"
bindgen = "0.70"

View File

@@ -45,7 +45,7 @@ COPY --chown=nonroot . .
ARG ADDITIONAL_RUSTFLAGS
RUN set -e \
&& PQ_LIB_DIR=$(pwd)/pg_install/v${STABLE_PG_VERSION}/lib RUSTFLAGS="-Clinker=clang -Clink-arg=-fuse-ld=mold -Clink-arg=-Wl,--no-rosegment -Cforce-frame-pointers=yes ${ADDITIONAL_RUSTFLAGS}" cargo build \
&& RUSTFLAGS="-Clinker=clang -Clink-arg=-fuse-ld=mold -Clink-arg=-Wl,--no-rosegment -Cforce-frame-pointers=yes ${ADDITIONAL_RUSTFLAGS}" cargo build \
--bin pg_sni_router \
--bin pageserver \
--bin pagectl \

View File

@@ -1486,7 +1486,6 @@ impl ComputeNode {
// First, create control files for all availale extensions
extension_server::create_control_files(remote_extensions, &self.pgbin);
// Second, preload all remote extensions specified in the shared_preload_libraries
let library_load_start_time = Utc::now();
let remote_ext_metrics = self.prepare_preload_libraries(&pspec.spec)?;
@@ -1909,9 +1908,8 @@ LIMIT 100",
.as_ref()
.ok_or(anyhow::anyhow!("Remote extensions are not configured"))?;
let mut libs_vec = Vec::new();
info!("parse shared_preload_libraries from spec.cluster.settings");
let mut libs_vec = Vec::new();
if let Some(libs) = spec.cluster.settings.find("shared_preload_libraries") {
libs_vec = libs
.split(&[',', '\'', ' '])
@@ -1919,9 +1917,9 @@ LIMIT 100",
.map(str::to_string)
.collect();
}
// This is used in neon_local and python tests
info!("parse shared_preload_libraries from provided postgresql.conf");
// that is used in neon_local and python tests
if let Some(conf) = &spec.cluster.postgresql_conf {
let conf_lines = conf.split('\n').collect::<Vec<&str>>();
let mut shared_preload_libraries_line = "";
@@ -1945,10 +1943,7 @@ LIMIT 100",
// Assume that they are already present locally.
libs_vec.retain(|lib| remote_extensions.library_index.contains_key(lib));
info!(
"Downloading extensions specified in shared_preload_libraries: {:?}",
&libs_vec
);
info!("Downloading to shared preload libraries: {:?}", &libs_vec);
let mut download_tasks = Vec::new();
for library in &libs_vec {

View File

@@ -148,18 +148,18 @@ fn parse_pg_version(human_version: &str) -> PostgresMajorVersion {
},
_ => {}
}
panic!("Unsupported Postgres version {human_version}");
panic!("Unsuported postgres version {human_version}");
}
/// Download the archive for a given extension,
/// unzip it, and place files in the appropriate locations (share/lib)
// download the archive for a given extension,
// unzip it, and place files in the appropriate locations (share/lib)
pub async fn download_extension(
ext_name: &str,
ext_path: &RemotePath,
ext_remote_storage: &str,
pgbin: &str,
) -> Result<u64> {
info!("Downloading extension {:?} from {:?}", ext_name, ext_path);
info!("Download extension {:?} from {:?}", ext_name, ext_path);
// TODO add retry logic
let download_buffer =
@@ -200,23 +200,23 @@ pub async fn download_extension(
// move contents of the libdir / sharedir in unzipped archive to the correct local paths
for paths in [sharedir_paths, libdir_paths] {
let (zip_dir, real_dir) = paths;
info!("Moving {zip_dir:?}/* to {real_dir:?}");
info!("mv {zip_dir:?}/* {real_dir:?}");
for file in std::fs::read_dir(zip_dir)? {
let old_file = file?.path();
let new_file =
Path::new(&real_dir).join(old_file.file_name().context("error parsing file")?);
info!("Moving {old_file:?} to {new_file:?}");
info!("moving {old_file:?} to {new_file:?}");
// extension download failed: Directory not empty (os error 39)
match std::fs::rename(old_file, new_file) {
Ok(()) => info!("Move succeeded"),
Ok(()) => info!("move succeeded"),
Err(e) => {
warn!("Move failed, probably because the extension already exists: {e}")
warn!("move failed, probably because the extension already exists: {e}")
}
}
}
}
info!("Done moving extension {ext_name}");
info!("done moving extension {ext_name}");
Ok(download_size)
}
@@ -239,16 +239,10 @@ pub fn create_control_files(remote_extensions: &RemoteExtSpec, pgbin: &str) {
for (control_name, control_content) in &ext_data.control_data {
let control_path = local_sharedir.join(control_name);
if !control_path.exists() {
info!(
"Writing control file content {:?}: {:?}",
control_path, control_content
);
info!("writing file {:?}{:?}", control_path, control_content);
std::fs::write(control_path, control_content).unwrap();
} else {
warn!(
"Control file {:?} exists locally. Ignoring the version from the spec.",
control_path
);
warn!("control file {:?} exists both locally and remotely. ignoring the remote version.", control_path);
}
}
}
@@ -256,7 +250,9 @@ pub fn create_control_files(remote_extensions: &RemoteExtSpec, pgbin: &str) {
// Do request to extension storage proxy, i.e.
// curl http://pg-ext-s3-gateway/latest/v15/extensions/anon.tar.zst
// using HTTP GET and return the response body as bytes.
// using HHTP GET
// and return the response body as bytes
//
async fn download_extension_tar(ext_remote_storage: &str, ext_path: &str) -> Result<Bytes> {
let uri = format!("{}/{}", ext_remote_storage, ext_path);

View File

@@ -1,9 +1,6 @@
use std::ops::{Deref, DerefMut};
use axum::{
async_trait,
extract::{rejection::JsonRejection, FromRequest, Request},
};
use axum::extract::{rejection::JsonRejection, FromRequest, Request};
use compute_api::responses::GenericAPIError;
use http::StatusCode;
@@ -12,7 +9,6 @@ use http::StatusCode;
#[derive(Debug, Clone, Copy, Default)]
pub(crate) struct Json<T>(pub T);
#[async_trait]
impl<S, T> FromRequest<S> for Json<T>
where
axum::Json<T>: FromRequest<S, Rejection = JsonRejection>,

View File

@@ -1,9 +1,6 @@
use std::ops::{Deref, DerefMut};
use axum::{
async_trait,
extract::{rejection::PathRejection, FromRequestParts},
};
use axum::extract::{rejection::PathRejection, FromRequestParts};
use compute_api::responses::GenericAPIError;
use http::{request::Parts, StatusCode};
@@ -12,7 +9,6 @@ use http::{request::Parts, StatusCode};
#[derive(Debug, Clone, Copy, Default)]
pub(crate) struct Path<T>(pub T);
#[async_trait]
impl<S, T> FromRequestParts<S> for Path<T>
where
axum::extract::Path<T>: FromRequestParts<S, Rejection = PathRejection>,

View File

@@ -1,9 +1,6 @@
use std::ops::{Deref, DerefMut};
use axum::{
async_trait,
extract::{rejection::QueryRejection, FromRequestParts},
};
use axum::extract::{rejection::QueryRejection, FromRequestParts};
use compute_api::responses::GenericAPIError;
use http::{request::Parts, StatusCode};
@@ -12,7 +9,6 @@ use http::{request::Parts, StatusCode};
#[derive(Debug, Clone, Copy, Default)]
pub(crate) struct Query<T>(pub T);
#[async_trait]
impl<S, T> FromRequestParts<S> for Query<T>
where
axum::extract::Query<T>: FromRequestParts<S, Rejection = QueryRejection>,

View File

@@ -55,7 +55,7 @@ async fn serve(port: u16, compute: Arc<ComputeNode>) {
.route("/database_schema", get(database_schema::get_schema_dump))
.route("/dbs_and_roles", get(dbs_and_roles::get_catalog_objects))
.route(
"/extension_server/*filename",
"/extension_server/{*filename}",
post(extension_server::download_extension),
)
.route("/extensions", post(extensions::install_extension))

View File

@@ -357,6 +357,11 @@ impl PageServerNode {
.map(|x| x.parse::<usize>())
.transpose()
.context("Failed to parse 'l0_flush_delay_threshold' as an integer")?,
l0_flush_wait_upload: settings
.remove("l0_flush_wait_upload")
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'l0_flush_wait_upload' as a boolean")?,
l0_flush_stall_threshold: settings
.remove("l0_flush_stall_threshold")
.map(|x| x.parse::<usize>())

View File

@@ -41,8 +41,8 @@ allow = [
"MIT",
"MPL-2.0",
"OpenSSL",
"Unicode-DFS-2016",
"Unicode-3.0",
"Zlib",
]
confidence-threshold = 0.8
exceptions = [

View File

@@ -260,12 +260,15 @@ pub struct TenantConfigToml {
/// Level0 delta layer threshold at which to delay layer flushes for compaction backpressure,
/// such that they take 2x as long, and start waiting for layer flushes during ephemeral layer
/// rolls. This helps compaction keep up with WAL ingestion, and avoids read amplification
/// blowing up. Should be >compaction_threshold. If None, defaults to 2 * compaction_threshold.
/// 0 to disable.
/// blowing up. Should be >compaction_threshold. 0 to disable. Disabled by default.
pub l0_flush_delay_threshold: Option<usize>,
/// Level0 delta layer threshold at which to stall layer flushes. 0 to disable. If None,
/// defaults to 4 * compaction_threshold. Must be >compaction_threshold to avoid deadlock.
/// Level0 delta layer threshold at which to stall layer flushes. Must be >compaction_threshold
/// to avoid deadlock. 0 to disable. Disabled by default.
pub l0_flush_stall_threshold: Option<usize>,
/// If true, Level0 delta layer flushes will wait for S3 upload before flushing the next
/// layer. This is a temporary backpressure mechanism which should be removed once
/// l0_flush_{delay,stall}_threshold is fully enabled.
pub l0_flush_wait_upload: bool,
// Determines how much history is retained, to allow
// branching and read replicas at an older point in time.
// The unit is #of bytes of WAL.
@@ -523,6 +526,8 @@ pub mod tenant_conf_defaults {
pub const DEFAULT_COMPACTION_ALGORITHM: crate::models::CompactionAlgorithm =
crate::models::CompactionAlgorithm::Legacy;
pub const DEFAULT_L0_FLUSH_WAIT_UPLOAD: bool = true;
pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024;
// Large DEFAULT_GC_PERIOD is fine as long as PITR_INTERVAL is larger.
@@ -563,6 +568,7 @@ impl Default for TenantConfigToml {
},
l0_flush_delay_threshold: None,
l0_flush_stall_threshold: None,
l0_flush_wait_upload: DEFAULT_L0_FLUSH_WAIT_UPLOAD,
gc_horizon: DEFAULT_GC_HORIZON,
gc_period: humantime::parse_duration(DEFAULT_GC_PERIOD)
.expect("cannot parse default gc period"),

View File

@@ -464,6 +464,18 @@ pub fn rel_size_to_key(rel: RelTag) -> Key {
}
}
#[inline(always)]
pub fn rel_size_key_to_rel(key: Key) -> RelTag {
assert_eq!(key.field1, 0x00);
assert_eq!(key.field6, 0xffff_ffff);
RelTag {
forknum: key.field5,
spcnode: key.field2,
dbnode: key.field3,
relnode: key.field4,
}
}
impl Key {
#[inline(always)]
pub fn is_rel_size_key(&self) -> bool {
@@ -559,6 +571,15 @@ pub fn slru_segment_size_to_key(kind: SlruKind, segno: u32) -> Key {
}
}
#[inline(always)]
pub fn slru_segment_size_key_to_segno(key: Key) -> u32 {
assert_eq!(key.field1, 0x01);
assert_eq!(key.field3, 1);
assert_eq!(key.field5, 0);
assert_eq!(key.field6, 0xffff_ffff);
key.field4
}
impl Key {
pub fn is_slru_segment_size_key(&self) -> bool {
self.field1 == 0x01

View File

@@ -466,6 +466,8 @@ pub struct TenantConfigPatch {
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub l0_flush_stall_threshold: FieldPatch<usize>,
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub l0_flush_wait_upload: FieldPatch<bool>,
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub gc_horizon: FieldPatch<u64>,
#[serde(skip_serializing_if = "FieldPatch::is_noop")]
pub gc_period: FieldPatch<String>,
@@ -524,6 +526,7 @@ pub struct TenantConfig {
pub compaction_algorithm: Option<CompactionAlgorithmSettings>,
pub l0_flush_delay_threshold: Option<usize>,
pub l0_flush_stall_threshold: Option<usize>,
pub l0_flush_wait_upload: Option<bool>,
pub gc_horizon: Option<u64>,
pub gc_period: Option<String>,
pub image_creation_threshold: Option<usize>,
@@ -559,6 +562,7 @@ impl TenantConfig {
mut compaction_algorithm,
mut l0_flush_delay_threshold,
mut l0_flush_stall_threshold,
mut l0_flush_wait_upload,
mut gc_horizon,
mut gc_period,
mut image_creation_threshold,
@@ -597,6 +601,7 @@ impl TenantConfig {
patch
.l0_flush_stall_threshold
.apply(&mut l0_flush_stall_threshold);
patch.l0_flush_wait_upload.apply(&mut l0_flush_wait_upload);
patch.gc_horizon.apply(&mut gc_horizon);
patch.gc_period.apply(&mut gc_period);
patch
@@ -651,6 +656,7 @@ impl TenantConfig {
compaction_algorithm,
l0_flush_delay_threshold,
l0_flush_stall_threshold,
l0_flush_wait_upload,
gc_horizon,
gc_period,
image_creation_threshold,
@@ -1015,6 +1021,13 @@ pub struct TenantConfigPatchRequest {
pub config: TenantConfigPatch, // as we have a flattened field, we should reject all unknown fields in it
}
#[derive(Serialize, Deserialize, Debug)]
pub struct TenantWaitLsnRequest {
#[serde(flatten)]
pub timelines: HashMap<TimelineId, Lsn>,
pub timeout: Duration,
}
/// See [`TenantState::attachment_status`] and the OpenAPI docs for context.
#[derive(Serialize, Deserialize, Clone)]
#[serde(tag = "slug", content = "data", rename_all = "snake_case")]

View File

@@ -7,7 +7,7 @@
//! (notifying it of upscale).
use anyhow::{bail, Context};
use axum::extract::ws::{Message, WebSocket};
use axum::extract::ws::{Message, Utf8Bytes, WebSocket};
use futures::{
stream::{SplitSink, SplitStream},
SinkExt, StreamExt,
@@ -82,21 +82,21 @@ impl Dispatcher {
let highest_shared_version = match monitor_range.highest_shared_version(&agent_range) {
Ok(version) => {
sink.send(Message::Text(
sink.send(Message::Text(Utf8Bytes::from(
serde_json::to_string(&ProtocolResponse::Version(version)).unwrap(),
))
)))
.await
.context("failed to notify agent of negotiated protocol version")?;
version
}
Err(e) => {
sink.send(Message::Text(
sink.send(Message::Text(Utf8Bytes::from(
serde_json::to_string(&ProtocolResponse::Error(format!(
"Received protocol version range {} which does not overlap with {}",
agent_range, monitor_range
)))
.unwrap(),
))
)))
.await
.context("failed to notify agent of no overlap between protocol version ranges")?;
Err(e).context("error determining suitable protocol version range")?
@@ -126,7 +126,7 @@ impl Dispatcher {
let json = serde_json::to_string(&message).context("failed to serialize message")?;
self.sink
.send(Message::Text(json))
.send(Message::Text(Utf8Bytes::from(json)))
.await
.context("stream error sending message")
}

View File

@@ -763,4 +763,19 @@ impl Client {
.await
.map_err(Error::ReceiveBody)
}
pub async fn wait_lsn(
&self,
tenant_shard_id: TenantShardId,
request: TenantWaitLsnRequest,
) -> Result<StatusCode> {
let uri = format!(
"{}/v1/tenant/{tenant_shard_id}/wait_lsn",
self.mgmt_api_endpoint,
);
self.request_noerror(Method::POST, uri, request)
.await
.map(|resp| resp.status())
}
}

View File

@@ -160,9 +160,12 @@ pub fn draw_history<W: std::io::Write>(history: &[LayerTraceEvent], mut output:
// Fill in and thicken rectangle if it's an
// image layer so that we can see it.
let mut style = Style::default();
style.fill = Fill::Color(rgb(0x80, 0x80, 0x80));
style.stroke = Stroke::Color(rgb(0, 0, 0), 0.5);
let mut style = Style {
fill: Fill::Color(rgb(0x80, 0x80, 0x80)),
stroke: Stroke::Color(rgb(0, 0, 0), 0.5),
opacity: 1.0,
stroke_opacity: 1.0,
};
let y_start = lsn_max - lsn_start;
let y_end = lsn_max - lsn_end;
@@ -214,10 +217,6 @@ pub fn draw_history<W: std::io::Write>(history: &[LayerTraceEvent], mut output:
files_seen.insert(f);
}
let mut record_style = Style::default();
record_style.fill = Fill::Color(rgb(0x80, 0x80, 0x80));
record_style.stroke = Stroke::None;
writeln!(svg, "{}", EndSvg)?;
let mut layer_events_str = String::new();

View File

@@ -10,6 +10,7 @@ use std::time::Duration;
use anyhow::{anyhow, Context, Result};
use enumset::EnumSet;
use futures::future::join_all;
use futures::StreamExt;
use futures::TryFutureExt;
use humantime::format_rfc3339;
@@ -40,6 +41,7 @@ use pageserver_api::models::TenantShardSplitRequest;
use pageserver_api::models::TenantShardSplitResponse;
use pageserver_api::models::TenantSorting;
use pageserver_api::models::TenantState;
use pageserver_api::models::TenantWaitLsnRequest;
use pageserver_api::models::TimelineArchivalConfigRequest;
use pageserver_api::models::TimelineCreateRequestMode;
use pageserver_api::models::TimelineCreateRequestModeImportPgdata;
@@ -95,6 +97,8 @@ use crate::tenant::timeline::CompactOptions;
use crate::tenant::timeline::CompactRequest;
use crate::tenant::timeline::CompactionError;
use crate::tenant::timeline::Timeline;
use crate::tenant::timeline::WaitLsnTimeout;
use crate::tenant::timeline::WaitLsnWaiter;
use crate::tenant::GetTimelineError;
use crate::tenant::OffloadedTimeline;
use crate::tenant::{LogicalSizeCalculationCause, PageReconstructError};
@@ -2790,6 +2794,63 @@ async fn secondary_download_handler(
json_response(status, progress)
}
async fn wait_lsn_handler(
mut request: Request<Body>,
cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
let wait_lsn_request: TenantWaitLsnRequest = json_request(&mut request).await?;
let state = get_state(&request);
let tenant = state
.tenant_manager
.get_attached_tenant_shard(tenant_shard_id)?;
let mut wait_futures = Vec::default();
for timeline in tenant.list_timelines() {
let Some(lsn) = wait_lsn_request.timelines.get(&timeline.timeline_id) else {
continue;
};
let fut = {
let timeline = timeline.clone();
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Error);
async move {
timeline
.wait_lsn(
*lsn,
WaitLsnWaiter::HttpEndpoint,
WaitLsnTimeout::Custom(wait_lsn_request.timeout),
&ctx,
)
.await
}
};
wait_futures.push(fut);
}
if wait_futures.is_empty() {
return json_response(StatusCode::NOT_FOUND, ());
}
let all_done = tokio::select! {
results = join_all(wait_futures) => {
results.iter().all(|res| res.is_ok())
},
_ = cancel.cancelled() => {
return Err(ApiError::Cancelled);
}
};
let status = if all_done {
StatusCode::OK
} else {
StatusCode::ACCEPTED
};
json_response(status, ())
}
async fn secondary_status_handler(
request: Request<Body>,
_cancel: CancellationToken,
@@ -3577,6 +3638,9 @@ pub fn make_router(
.post("/v1/tenant/:tenant_shard_id/secondary/download", |r| {
api_handler(r, secondary_download_handler)
})
.post("/v1/tenant/:tenant_shard_id/wait_lsn", |r| {
api_handler(r, wait_lsn_handler)
})
.put("/v1/tenant/:tenant_shard_id/break", |r| {
testing_api_handler("set tenant state to broken", r, handle_tenant_break)
})

View File

@@ -3,7 +3,7 @@ use metrics::{
register_counter_vec, register_gauge_vec, register_histogram, register_histogram_vec,
register_int_counter, register_int_counter_pair_vec, register_int_counter_vec,
register_int_gauge, register_int_gauge_vec, register_uint_gauge, register_uint_gauge_vec,
Counter, CounterVec, GaugeVec, Histogram, HistogramVec, IntCounter, IntCounterPair,
Counter, CounterVec, Gauge, GaugeVec, Histogram, HistogramVec, IntCounter, IntCounterPair,
IntCounterPairVec, IntCounterVec, IntGauge, IntGaugeVec, UIntGauge, UIntGaugeVec,
};
use once_cell::sync::Lazy;
@@ -398,6 +398,15 @@ pub(crate) static WAIT_LSN_TIME: Lazy<Histogram> = Lazy::new(|| {
.expect("failed to define a metric")
});
static FLUSH_WAIT_UPLOAD_TIME: Lazy<GaugeVec> = Lazy::new(|| {
register_gauge_vec!(
"pageserver_flush_wait_upload_seconds",
"Time spent waiting for preceding uploads during layer flush",
&["tenant_id", "shard_id", "timeline_id"]
)
.expect("failed to define a metric")
});
static LAST_RECORD_LSN: Lazy<IntGaugeVec> = Lazy::new(|| {
register_int_gauge_vec!(
"pageserver_last_record_lsn",
@@ -2569,6 +2578,7 @@ pub(crate) struct TimelineMetrics {
timeline_id: String,
pub flush_time_histo: StorageTimeMetrics,
pub flush_delay_histo: StorageTimeMetrics,
pub flush_wait_upload_time_gauge: Gauge,
pub compact_time_histo: StorageTimeMetrics,
pub create_images_time_histo: StorageTimeMetrics,
pub logical_size_histo: StorageTimeMetrics,
@@ -2620,6 +2630,9 @@ impl TimelineMetrics {
&shard_id,
&timeline_id,
);
let flush_wait_upload_time_gauge = FLUSH_WAIT_UPLOAD_TIME
.get_metric_with_label_values(&[&tenant_id, &shard_id, &timeline_id])
.unwrap();
let compact_time_histo = StorageTimeMetrics::new(
StorageTimeOperation::Compact,
&tenant_id,
@@ -2766,6 +2779,7 @@ impl TimelineMetrics {
timeline_id,
flush_time_histo,
flush_delay_histo,
flush_wait_upload_time_gauge,
compact_time_histo,
create_images_time_histo,
logical_size_histo,
@@ -2815,6 +2829,14 @@ impl TimelineMetrics {
self.resident_physical_size_gauge.get()
}
pub(crate) fn flush_wait_upload_time_gauge_add(&self, duration: f64) {
self.flush_wait_upload_time_gauge.add(duration);
crate::metrics::FLUSH_WAIT_UPLOAD_TIME
.get_metric_with_label_values(&[&self.tenant_id, &self.shard_id, &self.timeline_id])
.unwrap()
.add(duration);
}
pub(crate) fn shutdown(&self) {
let was_shutdown = self
.shutdown
@@ -2832,6 +2854,7 @@ impl TimelineMetrics {
let shard_id = &self.shard_id;
let _ = LAST_RECORD_LSN.remove_label_values(&[tenant_id, shard_id, timeline_id]);
let _ = DISK_CONSISTENT_LSN.remove_label_values(&[tenant_id, shard_id, timeline_id]);
let _ = FLUSH_WAIT_UPLOAD_TIME.remove_label_values(&[tenant_id, shard_id, timeline_id]);
let _ = STANDBY_HORIZON.remove_label_values(&[tenant_id, shard_id, timeline_id]);
{
RESIDENT_PHYSICAL_SIZE_GLOBAL.sub(self.resident_physical_size_get());

View File

@@ -1708,6 +1708,7 @@ impl PageServerHandler {
.wait_lsn(
not_modified_since,
crate::tenant::timeline::WaitLsnWaiter::PageService,
timeline::WaitLsnTimeout::Default,
ctx,
)
.await?;
@@ -2044,6 +2045,7 @@ impl PageServerHandler {
.wait_lsn(
lsn,
crate::tenant::timeline::WaitLsnWaiter::PageService,
crate::tenant::timeline::WaitLsnTimeout::Default,
ctx,
)
.await?;

View File

@@ -17,20 +17,21 @@ use crate::span::{
debug_assert_current_span_has_tenant_and_timeline_id,
debug_assert_current_span_has_tenant_and_timeline_id_no_shard_id,
};
use crate::tenant::storage_layer::IoConcurrency;
use crate::tenant::storage_layer::{IoConcurrency, ValuesReconstructState};
use crate::tenant::timeline::GetVectoredError;
use anyhow::{ensure, Context};
use bytes::{Buf, Bytes, BytesMut};
use enum_map::Enum;
use itertools::Itertools;
use pageserver_api::key::Key;
use pageserver_api::key::{
dbdir_key_range, rel_block_to_key, rel_dir_to_key, rel_key_range, rel_size_to_key,
relmap_file_key, repl_origin_key, repl_origin_key_range, slru_block_to_key, slru_dir_to_key,
slru_segment_key_range, slru_segment_size_to_key, twophase_file_key, twophase_key_range,
CompactKey, AUX_FILES_KEY, CHECKPOINT_KEY, CONTROLFILE_KEY, DBDIR_KEY, TWOPHASEDIR_KEY,
slru_segment_key_range, slru_segment_size_key_to_segno, slru_segment_size_to_key,
twophase_file_key, twophase_key_range, CompactKey, AUX_FILES_KEY, CHECKPOINT_KEY,
CONTROLFILE_KEY, DBDIR_KEY, TWOPHASEDIR_KEY,
};
use pageserver_api::keyspace::SparseKeySpace;
use pageserver_api::key::{rel_size_key_to_rel, Key};
use pageserver_api::keyspace::{KeySpaceRandomAccum, SparseKeySpace};
use pageserver_api::record::NeonWalRecord;
use pageserver_api::reltag::{BlockNumber, RelTag, SlruKind};
use pageserver_api::shard::ShardIdentity;
@@ -110,10 +111,21 @@ pub(crate) enum CollectKeySpaceError {
Decode(#[from] DeserializeError),
#[error(transparent)]
PageRead(PageReconstructError),
#[error(transparent)]
GetVectored(GetVectoredError),
#[error("cancelled")]
Cancelled,
}
impl From<GetVectoredError> for CollectKeySpaceError {
fn from(err: GetVectoredError) -> Self {
match err {
GetVectoredError::Cancelled => Self::Cancelled,
err => Self::GetVectored(err),
}
}
}
impl From<PageReconstructError> for CollectKeySpaceError {
fn from(err: PageReconstructError) -> Self {
match err {
@@ -1071,11 +1083,30 @@ impl Timeline {
.into_iter()
.collect();
rels.sort_unstable();
let mut relsize_keys_to_collect = KeySpaceRandomAccum::new();
for rel in rels {
let relsize_key = rel_size_to_key(rel);
let mut buf = self.get(relsize_key, lsn, ctx).await?;
relsize_keys_to_collect.add_key(relsize_key);
}
// Skip the vectored-read max key check by using `get_vectored_impl`.
let io_concurrency = IoConcurrency::spawn_from_conf(
self.conf,
self.gate
.enter()
.map_err(|_| CollectKeySpaceError::Cancelled)?,
);
let res = self
.get_vectored_impl(
relsize_keys_to_collect.consume_keyspace(),
lsn,
&mut ValuesReconstructState::new(io_concurrency),
ctx,
)
.await?;
for (relsize_key, buf) in res {
let mut buf = buf?;
let relsize = buf.get_u32_le();
let rel = rel_size_key_to_rel(relsize_key);
result.add_range(rel_block_to_key(rel, 0)..rel_block_to_key(rel, relsize));
result.add_key(relsize_key);
}
@@ -1094,11 +1125,30 @@ impl Timeline {
let dir = SlruSegmentDirectory::des(&buf)?;
let mut segments: Vec<u32> = dir.segments.iter().cloned().collect();
segments.sort_unstable();
let mut segsize_keys_to_collect = KeySpaceRandomAccum::new();
for segno in segments {
let segsize_key = slru_segment_size_to_key(kind, segno);
let mut buf = self.get(segsize_key, lsn, ctx).await?;
segsize_keys_to_collect.add_key(segsize_key);
}
// Skip the vectored-read max key check by using `get_vectored_impl`.
let io_concurrency = IoConcurrency::spawn_from_conf(
self.conf,
self.gate
.enter()
.map_err(|_| CollectKeySpaceError::Cancelled)?,
);
let res = self
.get_vectored_impl(
segsize_keys_to_collect.consume_keyspace(),
lsn,
&mut ValuesReconstructState::new(io_concurrency),
ctx,
)
.await?;
for (segsize_key, buf) in res {
let mut buf = buf?;
let segsize = buf.get_u32_le();
let segno = slru_segment_size_key_to_segno(segsize_key);
result.add_range(
slru_block_to_key(kind, segno, 0)..slru_block_to_key(kind, segno, segsize),
);

View File

@@ -37,6 +37,8 @@ use remote_timeline_client::manifest::{
OffloadedTimelineManifest, TenantManifest, LATEST_TENANT_MANIFEST_VERSION,
};
use remote_timeline_client::UploadQueueNotReadyError;
use remote_timeline_client::FAILED_REMOTE_OP_RETRIES;
use remote_timeline_client::FAILED_UPLOAD_WARN_THRESHOLD;
use std::collections::BTreeMap;
use std::fmt;
use std::future::Future;
@@ -2558,7 +2560,12 @@ impl Tenant {
// sizes etc. and that would get confused if the previous page versions
// are not in the repository yet.
ancestor_timeline
.wait_lsn(*lsn, timeline::WaitLsnWaiter::Tenant, ctx)
.wait_lsn(
*lsn,
timeline::WaitLsnWaiter::Tenant,
timeline::WaitLsnTimeout::Default,
ctx,
)
.await
.map_err(|e| match e {
e @ (WaitLsnError::Timeout(_) | WaitLsnError::BadState { .. }) => {
@@ -5308,27 +5315,37 @@ impl Tenant {
return Ok(());
}
upload_tenant_manifest(
&self.remote_storage,
&self.tenant_shard_id,
self.generation,
&manifest,
// Remote storage does no retries internally, so wrap it
match backoff::retry(
|| async {
upload_tenant_manifest(
&self.remote_storage,
&self.tenant_shard_id,
self.generation,
&manifest,
&self.cancel,
)
.await
},
|_e| self.cancel.is_cancelled(),
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"uploading tenant manifest",
&self.cancel,
)
.await
.map_err(|e| {
if self.cancel.is_cancelled() {
TenantManifestError::Cancelled
} else {
TenantManifestError::RemoteStorage(e)
{
None => Err(TenantManifestError::Cancelled),
Some(Err(_)) if self.cancel.is_cancelled() => Err(TenantManifestError::Cancelled),
Some(Err(e)) => Err(TenantManifestError::RemoteStorage(e)),
Some(Ok(_)) => {
// Store the successfully uploaded manifest, so that future callers can avoid
// re-uploading the same thing.
*guard = Some(manifest);
Ok(())
}
})?;
// Store the successfully uploaded manifest, so that future callers can avoid
// re-uploading the same thing.
*guard = Some(manifest);
Ok(())
}
}
}
@@ -5455,6 +5472,7 @@ pub(crate) mod harness {
compaction_algorithm: Some(tenant_conf.compaction_algorithm),
l0_flush_delay_threshold: tenant_conf.l0_flush_delay_threshold,
l0_flush_stall_threshold: tenant_conf.l0_flush_stall_threshold,
l0_flush_wait_upload: Some(tenant_conf.l0_flush_wait_upload),
gc_horizon: Some(tenant_conf.gc_horizon),
gc_period: Some(tenant_conf.gc_period),
image_creation_threshold: Some(tenant_conf.image_creation_threshold),

View File

@@ -289,6 +289,10 @@ pub struct TenantConfOpt {
#[serde(default)]
pub l0_flush_stall_threshold: Option<usize>,
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(default)]
pub l0_flush_wait_upload: Option<bool>,
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(default)]
pub gc_horizon: Option<u64>,
@@ -408,6 +412,9 @@ impl TenantConfOpt {
l0_flush_stall_threshold: self
.l0_flush_stall_threshold
.or(global_conf.l0_flush_stall_threshold),
l0_flush_wait_upload: self
.l0_flush_wait_upload
.unwrap_or(global_conf.l0_flush_wait_upload),
gc_horizon: self.gc_horizon.unwrap_or(global_conf.gc_horizon),
gc_period: self.gc_period.unwrap_or(global_conf.gc_period),
image_creation_threshold: self
@@ -474,6 +481,7 @@ impl TenantConfOpt {
mut compaction_algorithm,
mut l0_flush_delay_threshold,
mut l0_flush_stall_threshold,
mut l0_flush_wait_upload,
mut gc_horizon,
mut gc_period,
mut image_creation_threshold,
@@ -518,6 +526,7 @@ impl TenantConfOpt {
patch
.l0_flush_stall_threshold
.apply(&mut l0_flush_stall_threshold);
patch.l0_flush_wait_upload.apply(&mut l0_flush_wait_upload);
patch.gc_horizon.apply(&mut gc_horizon);
patch
.gc_period
@@ -590,6 +599,7 @@ impl TenantConfOpt {
compaction_algorithm,
l0_flush_delay_threshold,
l0_flush_stall_threshold,
l0_flush_wait_upload,
gc_horizon,
gc_period,
image_creation_threshold,
@@ -649,6 +659,7 @@ impl From<TenantConfOpt> for models::TenantConfig {
compaction_threshold: value.compaction_threshold,
l0_flush_delay_threshold: value.l0_flush_delay_threshold,
l0_flush_stall_threshold: value.l0_flush_stall_threshold,
l0_flush_wait_upload: value.l0_flush_wait_upload,
gc_horizon: value.gc_horizon,
gc_period: value.gc_period.map(humantime),
image_creation_threshold: value.image_creation_threshold,

View File

@@ -1643,6 +1643,7 @@ impl TenantManager {
.wait_lsn(
*target_lsn,
crate::tenant::timeline::WaitLsnWaiter::Tenant,
crate::tenant::timeline::WaitLsnTimeout::Default,
ctx,
)
.await

View File

@@ -222,6 +222,10 @@ impl LayerFileMetadata {
shard,
}
}
/// Helper to get both generation and file size in a tuple
pub fn generation_file_size(&self) -> (Generation, u64) {
(self.generation, self.file_size)
}
}
/// Limited history of earlier ancestors.

View File

@@ -559,6 +559,13 @@ impl JobGenerator<PendingDownload, RunningDownload, CompleteDownload, DownloadCo
}
}
enum LayerAction {
Download,
NoAction,
Skip,
Touch,
}
/// This type is a convenience to group together the various functions involved in
/// freshening a secondary tenant.
struct TenantDownloader<'a> {
@@ -1008,69 +1015,17 @@ impl<'a> TenantDownloader<'a> {
return (Err(UpdateError::Restart), touched);
}
// Existing on-disk layers: just update their access time.
if let Some(on_disk) = timeline_state.on_disk_layers.get(&layer.name) {
tracing::debug!("Layer {} is already on disk", layer.name);
if cfg!(debug_assertions) {
// Debug for https://github.com/neondatabase/neon/issues/6966: check that the files we think
// are already present on disk are really there.
match tokio::fs::metadata(&on_disk.local_path).await {
Ok(meta) => {
tracing::debug!(
"Layer {} present at {}, size {}",
layer.name,
on_disk.local_path,
meta.len(),
);
}
Err(e) => {
tracing::warn!(
"Layer {} not found at {} ({})",
layer.name,
on_disk.local_path,
e
);
debug_assert!(false);
}
}
}
if on_disk.metadata != layer.metadata || on_disk.access_time != layer.access_time {
// We already have this layer on disk. Update its access time.
tracing::debug!(
"Access time updated for layer {}: {} -> {}",
layer.name,
strftime(&on_disk.access_time),
strftime(&layer.access_time)
);
touched.push(layer);
}
continue;
} else {
tracing::debug!("Layer {} not present on disk yet", layer.name);
}
// Eviction: if we evicted a layer, then do not re-download it unless it was accessed more
// recently than it was evicted.
if let Some(evicted_at) = timeline_state.evicted_at.get(&layer.name) {
if &layer.access_time > evicted_at {
tracing::info!(
"Re-downloading evicted layer {}, accessed at {}, evicted at {}",
layer.name,
strftime(&layer.access_time),
strftime(evicted_at)
);
} else {
tracing::trace!(
"Not re-downloading evicted layer {}, accessed at {}, evicted at {}",
layer.name,
strftime(&layer.access_time),
strftime(evicted_at)
);
match self.layer_action(&timeline_state, &layer).await {
LayerAction::Download => (),
LayerAction::NoAction => continue,
LayerAction::Skip => {
self.skip_layer(layer);
continue;
}
LayerAction::Touch => {
touched.push(layer);
continue;
}
}
match self
@@ -1091,6 +1046,86 @@ impl<'a> TenantDownloader<'a> {
(Ok(()), touched)
}
async fn layer_action(
&self,
timeline_state: &SecondaryDetailTimeline,
layer: &HeatMapLayer,
) -> LayerAction {
// Existing on-disk layers: just update their access time.
if let Some(on_disk) = timeline_state.on_disk_layers.get(&layer.name) {
tracing::debug!("Layer {} is already on disk", layer.name);
if cfg!(debug_assertions) {
// Debug for https://github.com/neondatabase/neon/issues/6966: check that the files we think
// are already present on disk are really there.
match tokio::fs::metadata(&on_disk.local_path).await {
Ok(meta) => {
tracing::debug!(
"Layer {} present at {}, size {}",
layer.name,
on_disk.local_path,
meta.len(),
);
}
Err(e) => {
tracing::warn!(
"Layer {} not found at {} ({})",
layer.name,
on_disk.local_path,
e
);
debug_assert!(false);
}
}
}
if on_disk.metadata.generation_file_size() != on_disk.metadata.generation_file_size() {
tracing::info!(
"Re-downloading layer {} with changed size or generation: {:?}->{:?}",
layer.name,
on_disk.metadata.generation_file_size(),
on_disk.metadata.generation_file_size()
);
return LayerAction::Download;
}
if on_disk.metadata != layer.metadata || on_disk.access_time != layer.access_time {
// We already have this layer on disk. Update its access time.
tracing::debug!(
"Access time updated for layer {}: {} -> {}",
layer.name,
strftime(&on_disk.access_time),
strftime(&layer.access_time)
);
return LayerAction::Touch;
}
return LayerAction::NoAction;
} else {
tracing::debug!("Layer {} not present on disk yet", layer.name);
}
// Eviction: if we evicted a layer, then do not re-download it unless it was accessed more
// recently than it was evicted.
if let Some(evicted_at) = timeline_state.evicted_at.get(&layer.name) {
if &layer.access_time > evicted_at {
tracing::info!(
"Re-downloading evicted layer {}, accessed at {}, evicted at {}",
layer.name,
strftime(&layer.access_time),
strftime(evicted_at)
);
} else {
tracing::trace!(
"Not re-downloading evicted layer {}, accessed at {}, evicted at {}",
layer.name,
strftime(&layer.access_time),
strftime(evicted_at)
);
return LayerAction::Skip;
}
}
LayerAction::Download
}
async fn download_timeline(
&self,
timeline: HeatMapTimeline,

View File

@@ -144,15 +144,19 @@ use self::layer_manager::LayerManager;
use self::logical_size::LogicalSize;
use self::walreceiver::{WalReceiver, WalReceiverConf};
use super::config::TenantConf;
use super::remote_timeline_client::index::IndexPart;
use super::remote_timeline_client::RemoteTimelineClient;
use super::secondary::heatmap::{HeatMapLayer, HeatMapTimeline};
use super::storage_layer::{LayerFringe, LayerVisibilityHint, ReadableLayer};
use super::upload_queue::NotInitialized;
use super::GcError;
use super::{
debug_assert_current_span_has_tenant_and_timeline_id, AttachedTenantConf, MaybeOffloaded,
config::TenantConf, storage_layer::LayerVisibilityHint, upload_queue::NotInitialized,
MaybeOffloaded,
};
use super::{debug_assert_current_span_has_tenant_and_timeline_id, AttachedTenantConf};
use super::{remote_timeline_client::index::IndexPart, storage_layer::LayerFringe};
use super::{
remote_timeline_client::RemoteTimelineClient, remote_timeline_client::WaitCompletionError,
storage_layer::ReadableLayer,
};
use super::{
secondary::heatmap::{HeatMapLayer, HeatMapTimeline},
GcError,
};
#[cfg(test)]
@@ -897,10 +901,17 @@ impl From<GetReadyAncestorError> for PageReconstructError {
}
}
pub(crate) enum WaitLsnTimeout {
Custom(Duration),
// Use the [`PageServerConf::wait_lsn_timeout`] default
Default,
}
pub(crate) enum WaitLsnWaiter<'a> {
Timeline(&'a Timeline),
Tenant,
PageService,
HttpEndpoint,
}
/// Argument to [`Timeline::shutdown`].
@@ -1146,7 +1157,7 @@ impl Timeline {
vectored_res
}
pub(super) async fn get_vectored_impl(
pub(crate) async fn get_vectored_impl(
&self,
keyspace: KeySpace,
lsn: Lsn,
@@ -1297,6 +1308,7 @@ impl Timeline {
&self,
lsn: Lsn,
who_is_waiting: WaitLsnWaiter<'_>,
timeout: WaitLsnTimeout,
ctx: &RequestContext, /* Prepare for use by cancellation */
) -> Result<(), WaitLsnError> {
let state = self.current_state();
@@ -1313,7 +1325,7 @@ impl Timeline {
| TaskKind::WalReceiverConnectionPoller => {
let is_myself = match who_is_waiting {
WaitLsnWaiter::Timeline(waiter) => Weak::ptr_eq(&waiter.myself, &self.myself),
WaitLsnWaiter::Tenant | WaitLsnWaiter::PageService => unreachable!("tenant or page_service context are not expected to have task kind {:?}", ctx.task_kind()),
WaitLsnWaiter::Tenant | WaitLsnWaiter::PageService | WaitLsnWaiter::HttpEndpoint => unreachable!("tenant or page_service context are not expected to have task kind {:?}", ctx.task_kind()),
};
if is_myself {
if let Err(current) = self.last_record_lsn.would_wait_for(lsn) {
@@ -1329,13 +1341,14 @@ impl Timeline {
}
}
let timeout = match timeout {
WaitLsnTimeout::Custom(t) => t,
WaitLsnTimeout::Default => self.conf.wait_lsn_timeout,
};
let _timer = crate::metrics::WAIT_LSN_TIME.start_timer();
match self
.last_record_lsn
.wait_for_timeout(lsn, self.conf.wait_lsn_timeout)
.await
{
match self.last_record_lsn.wait_for_timeout(lsn, timeout).await {
Ok(()) => Ok(()),
Err(e) => {
use utils::seqwait::SeqWaitError::*;
@@ -2168,8 +2181,8 @@ impl Timeline {
}
fn get_l0_flush_delay_threshold(&self) -> Option<usize> {
// Default to delay L0 flushes at 3x compaction threshold.
const DEFAULT_L0_FLUSH_DELAY_FACTOR: usize = 3;
// Disable L0 flushes by default. This and compaction needs further tuning.
const DEFAULT_L0_FLUSH_DELAY_FACTOR: usize = 0; // TODO: default to e.g. 3
// If compaction is disabled, don't delay.
if self.get_compaction_period() == Duration::ZERO {
@@ -2197,10 +2210,9 @@ impl Timeline {
}
fn get_l0_flush_stall_threshold(&self) -> Option<usize> {
// Default to stall L0 flushes at 5x compaction threshold.
// TODO: stalls are temporarily disabled by default, see below.
#[allow(unused)]
const DEFAULT_L0_FLUSH_STALL_FACTOR: usize = 5;
// Disable L0 stalls by default. In ingest benchmarks, we see image compaction take >10
// minutes, blocking L0 compaction, and we can't stall L0 flushes for that long.
const DEFAULT_L0_FLUSH_STALL_FACTOR: usize = 0; // TODO: default to e.g. 5
// If compaction is disabled, don't stall.
if self.get_compaction_period() == Duration::ZERO {
@@ -2232,13 +2244,8 @@ impl Timeline {
return None;
}
// Disable stalls by default. In ingest benchmarks, we see image compaction take >10
// minutes, blocking L0 compaction, and we can't stall L0 flushes for that long.
//
// TODO: fix this.
// let l0_flush_stall_threshold = l0_flush_stall_threshold
// .unwrap_or(DEFAULT_L0_FLUSH_STALL_FACTOR * compaction_threshold);
let l0_flush_stall_threshold = l0_flush_stall_threshold?;
let l0_flush_stall_threshold = l0_flush_stall_threshold
.unwrap_or(DEFAULT_L0_FLUSH_STALL_FACTOR * compaction_threshold);
// 0 disables backpressure.
if l0_flush_stall_threshold == 0 {
@@ -2252,6 +2259,14 @@ impl Timeline {
Some(max(l0_flush_stall_threshold, compaction_threshold))
}
fn get_l0_flush_wait_upload(&self) -> bool {
let tenant_conf = self.tenant_conf.load();
tenant_conf
.tenant_conf
.l0_flush_wait_upload
.unwrap_or(self.conf.default_tenant_conf.l0_flush_wait_upload)
}
fn get_image_creation_threshold(&self) -> usize {
let tenant_conf = self.tenant_conf.load();
tenant_conf
@@ -3584,7 +3599,12 @@ impl Timeline {
}
}
ancestor
.wait_lsn(self.ancestor_lsn, WaitLsnWaiter::Timeline(self), ctx)
.wait_lsn(
self.ancestor_lsn,
WaitLsnWaiter::Timeline(self),
WaitLsnTimeout::Default,
ctx,
)
.await
.map_err(|e| match e {
e @ WaitLsnError::Timeout(_) => GetReadyAncestorError::AncestorLsnTimeout(e),
@@ -4034,6 +4054,27 @@ impl Timeline {
// release lock on 'layers'
};
// Backpressure mechanism: wait with continuation of the flush loop until we have uploaded all layer files.
// This makes us refuse ingest until the new layers have been persisted to the remote
// TODO: remove this, and rely on l0_flush_{delay,stall}_threshold instead.
if self.get_l0_flush_wait_upload() {
let start = Instant::now();
self.remote_client
.wait_completion()
.await
.map_err(|e| match e {
WaitCompletionError::UploadQueueShutDownOrStopped
| WaitCompletionError::NotInitialized(
NotInitialized::ShuttingDown | NotInitialized::Stopped,
) => FlushLayerError::Cancelled,
WaitCompletionError::NotInitialized(NotInitialized::Uninitialized) => {
FlushLayerError::Other(anyhow!(e).into())
}
})?;
let duration = start.elapsed().as_secs_f64();
self.metrics.flush_wait_upload_time_gauge_add(duration);
}
// FIXME: between create_delta_layer and the scheduling of the upload in `update_metadata_file`,
// a compaction can delete the file and then it won't be available for uploads any more.
// We still schedule the upload, resulting in an error, but ideally we'd somehow avoid this

View File

@@ -113,7 +113,7 @@ pub async fn doit(
match res {
Ok(_) => break,
Err(err) => {
info!(?err, "indefintely waiting for pgdata to finish");
info!(?err, "indefinitely waiting for pgdata to finish");
if tokio::time::timeout(std::time::Duration::from_secs(10), cancel.cancelled())
.await
.is_ok()

View File

@@ -308,7 +308,7 @@ impl ControlFile {
202107181 => 14,
202209061 => 15,
202307071 => 16,
/* XXX pg17 */
202406281 => 17,
catversion => {
anyhow::bail!("unrecognized catalog version {catversion}")
}

View File

@@ -164,9 +164,10 @@ pub(super) async fn connection_manager_loop_step(
Ok(Some(broker_update)) => connection_manager_state.register_timeline_update(broker_update),
Err(status) => {
match status.code() {
Code::Unknown if status.message().contains("stream closed because of a broken pipe") || status.message().contains("connection reset") => {
Code::Unknown if status.message().contains("stream closed because of a broken pipe") || status.message().contains("connection reset") || status.message().contains("error reading a body from connection") => {
// tonic's error handling doesn't provide a clear code for disconnections: we get
// "h2 protocol error: error reading a body from connection: stream closed because of a broken pipe"
// => https://github.com/neondatabase/neon/issues/9562
info!("broker disconnected: {status}");
},
_ => {
@@ -273,7 +274,7 @@ pub(super) async fn connection_manager_loop_step(
};
last_discovery_ts = Some(std::time::Instant::now());
debug!("No active connection and no candidates, sending discovery request to the broker");
info!("No active connection and no candidates, sending discovery request to the broker");
// Cancellation safety: we want to send a message to the broker, but publish_one()
// function can get cancelled by the other select! arm. This is absolutely fine, because

View File

@@ -118,7 +118,7 @@ pub(super) async fn handle_walreceiver_connection(
cancellation: CancellationToken,
connect_timeout: Duration,
ctx: RequestContext,
node: NodeId,
safekeeper_node: NodeId,
ingest_batch_size: u64,
) -> Result<(), WalReceiverError> {
debug_assert_current_span_has_tenant_and_timeline_id();
@@ -140,7 +140,7 @@ pub(super) async fn handle_walreceiver_connection(
let (replication_client, connection) = {
let mut config = wal_source_connconf.to_tokio_postgres_config();
config.application_name(format!("pageserver-{}", node.0).as_str());
config.application_name(format!("pageserver-{}", timeline.conf.id.0).as_str());
config.replication_mode(tokio_postgres::config::ReplicationMode::Physical);
match time::timeout(connect_timeout, config.connect(postgres::NoTls)).await {
Ok(client_and_conn) => client_and_conn?,
@@ -162,7 +162,7 @@ pub(super) async fn handle_walreceiver_connection(
latest_wal_update: Utc::now().naive_utc(),
streaming_lsn: None,
commit_lsn: None,
node,
node: safekeeper_node,
};
if let Err(e) = events_sender.send(TaskStateUpdate::Progress(connection_status)) {
warn!("Wal connection event listener dropped right after connection init, aborting the connection: {e}");

View File

@@ -423,11 +423,11 @@ async fn upload_parquet(
.await
.ok_or_else(|| anyhow::Error::new(TimeoutOrCancel::Cancel))
.and_then(|x| x)
.context("request_data_upload")
.with_context(|| format!("request_data_upload: path={path}"))
.err();
if let Some(err) = maybe_err {
tracing::error!(%id, error = ?err, "failed to upload request data");
tracing::error!(%id, %path, error = ?err, "failed to upload request data");
}
Ok(buffer.writer())

View File

@@ -396,13 +396,13 @@ async fn upload_backup_events(
TimeoutOrCancel::caused_by_cancel,
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_UPLOAD_MAX_RETRIES,
"request_data_upload",
"usage_metrics_upload",
cancel,
)
.await
.ok_or_else(|| anyhow::Error::new(TimeoutOrCancel::Cancel))
.and_then(|x| x)
.context("request_data_upload")?;
.with_context(|| format!("usage_metrics_upload: path={remote_path}"))?;
Ok(())
}

View File

@@ -2,8 +2,9 @@ use pageserver_api::{
models::{
detach_ancestor::AncestorDetached, LocationConfig, LocationConfigListResponse,
PageserverUtilization, SecondaryProgress, TenantScanRemoteStorageResponse,
TenantShardSplitRequest, TenantShardSplitResponse, TimelineArchivalConfigRequest,
TimelineCreateRequest, TimelineInfo, TopTenantShardsRequest, TopTenantShardsResponse,
TenantShardSplitRequest, TenantShardSplitResponse, TenantWaitLsnRequest,
TimelineArchivalConfigRequest, TimelineCreateRequest, TimelineInfo, TopTenantShardsRequest,
TopTenantShardsResponse,
},
shard::TenantShardId,
};
@@ -299,4 +300,17 @@ impl PageserverClient {
self.inner.top_tenant_shards(request).await
)
}
pub(crate) async fn wait_lsn(
&self,
tenant_shard_id: TenantShardId,
request: TenantWaitLsnRequest,
) -> Result<StatusCode> {
measured_request!(
"wait_lsn",
crate::metrics::Method::Post,
&self.node_id_label,
self.inner.wait_lsn(tenant_shard_id, request).await
)
}
}

View File

@@ -3,7 +3,7 @@ use crate::persistence::Persistence;
use crate::{compute_hook, service};
use pageserver_api::controller_api::{AvailabilityZone, PlacementPolicy};
use pageserver_api::models::{
LocationConfig, LocationConfigMode, LocationConfigSecondary, TenantConfig,
LocationConfig, LocationConfigMode, LocationConfigSecondary, TenantConfig, TenantWaitLsnRequest,
};
use pageserver_api::shard::{ShardIdentity, TenantShardId};
use pageserver_client::mgmt_api;
@@ -348,6 +348,32 @@ impl Reconciler {
Ok(())
}
async fn wait_lsn(
&self,
node: &Node,
tenant_shard_id: TenantShardId,
timelines: HashMap<TimelineId, Lsn>,
) -> Result<StatusCode, ReconcileError> {
const TIMEOUT: Duration = Duration::from_secs(10);
let client = PageserverClient::new(
node.get_id(),
node.base_url(),
self.service_config.jwt_token.as_deref(),
);
client
.wait_lsn(
tenant_shard_id,
TenantWaitLsnRequest {
timelines,
timeout: TIMEOUT,
},
)
.await
.map_err(|e| e.into())
}
async fn get_lsns(
&self,
tenant_shard_id: TenantShardId,
@@ -461,6 +487,39 @@ impl Reconciler {
node: &Node,
baseline: HashMap<TimelineId, Lsn>,
) -> anyhow::Result<()> {
// Signal to the pageserver that it should ingest up to the baseline LSNs.
loop {
match self.wait_lsn(node, tenant_shard_id, baseline.clone()).await {
Ok(StatusCode::OK) => {
// Everything is caught up
return Ok(());
}
Ok(StatusCode::ACCEPTED) => {
// Some timelines are not caught up yet.
// They'll be polled below.
break;
}
Ok(StatusCode::NOT_FOUND) => {
// None of the timelines are present on the pageserver.
// This is correct if they've all been deleted, but
// let let the polling loop below cross check.
break;
}
Ok(status_code) => {
tracing::warn!(
"Unexpected status code ({status_code}) returned by wait_lsn endpoint"
);
break;
}
Err(e) => {
tracing::info!("🕑 Can't trigger LSN wait on {node} yet, waiting ({e})",);
tokio::time::sleep(Duration::from_millis(500)).await;
continue;
}
}
}
// Poll the LSNs until they catch up
loop {
let latest = match self.get_lsns(tenant_shard_id, node).await {
Ok(l) => l,

View File

@@ -165,6 +165,7 @@ PAGESERVER_PER_TENANT_METRICS: tuple[str, ...] = (
"pageserver_evictions_with_low_residence_duration_total",
"pageserver_aux_file_estimated_size",
"pageserver_valid_lsn_lease_count",
"pageserver_flush_wait_upload_seconds",
counter("pageserver_tenant_throttling_count_accounted_start"),
counter("pageserver_tenant_throttling_count_accounted_finish"),
counter("pageserver_tenant_throttling_wait_usecs_sum"),

View File

@@ -141,6 +141,7 @@ def test_fully_custom_config(positive_env: NeonEnv):
"compaction_threshold": 13,
"l0_flush_delay_threshold": 25,
"l0_flush_stall_threshold": 42,
"l0_flush_wait_upload": True,
"compaction_target_size": 1048576,
"checkpoint_distance": 10000,
"checkpoint_timeout": "13m",

View File

@@ -19,7 +19,6 @@ from fixtures.pageserver.utils import wait_until_tenant_active
from fixtures.utils import query_scalar
from performance.test_perf_pgbench import get_scales_matrix
from requests import RequestException
from requests.exceptions import RetryError
# Test branch creation
@@ -177,8 +176,11 @@ def test_cannot_create_endpoint_on_non_uploaded_timeline(neon_env_builder: NeonE
env.neon_cli.mappings_map_branch(initial_branch, env.initial_tenant, env.initial_timeline)
with pytest.raises(RuntimeError, match="is not active, state: Loading"):
env.endpoints.create_start(initial_branch, tenant_id=env.initial_tenant)
with pytest.raises(RuntimeError, match="ERROR: Not found: Timeline"):
env.endpoints.create_start(
initial_branch, tenant_id=env.initial_tenant, basebackup_request_tries=2
)
ps_http.configure_failpoints(("before-upload-index-pausable", "off"))
finally:
env.pageserver.stop(immediate=True)
@@ -219,7 +221,10 @@ def test_cannot_branch_from_non_uploaded_branch(neon_env_builder: NeonEnvBuilder
branch_id = TimelineId.generate()
with pytest.raises(RetryError, match="too many 503 error responses"):
with pytest.raises(
PageserverApiException,
match="Cannot branch off the timeline that's not present in pageserver",
):
ps_http.timeline_create(
env.pg_version,
env.initial_tenant,

View File

@@ -14,10 +14,8 @@ from fixtures.pageserver.http import (
ImportPgdataIdemptencyKey,
PageserverApiException,
)
from fixtures.pg_version import PgVersion
from fixtures.port_distributor import PortDistributor
from fixtures.remote_storage import RemoteStorageKind
from fixtures.utils import run_only_on_postgres
from pytest_httpserver import HTTPServer
from werkzeug.wrappers.request import Request
from werkzeug.wrappers.response import Response
@@ -39,10 +37,6 @@ smoke_params = [
]
@run_only_on_postgres(
[PgVersion.V14, PgVersion.V15, PgVersion.V16],
"newer control file catalog version and struct format isn't supported",
)
@pytest.mark.parametrize("shard_count,stripe_size,rel_block_size", smoke_params)
def test_pgdata_import_smoke(
vanilla_pg: VanillaPostgres,
@@ -117,13 +111,15 @@ def test_pgdata_import_smoke(
# TODO: would be nicer to just compare pgdump
# Enable IO concurrency for batching on large sequential scan, to avoid making
# this test unnecessarily onerous on CPU
# this test unnecessarily onerous on CPU. Especially on debug mode, it's still
# pretty onerous though, so increase statement_timeout to avoid timeouts.
assert ep.safe_psql_many(
[
"set effective_io_concurrency=32;",
"SET statement_timeout='300s';",
"select count(*), sum(data::bigint)::bigint from t",
]
) == [[], [(expect_nrows, expect_sum)]]
) == [[], [], [(expect_nrows, expect_sum)]]
validate_vanilla_equivalence(vanilla_pg)
@@ -317,10 +313,6 @@ def test_pgdata_import_smoke(
br_initdb_endpoint.safe_psql("select * from othertable")
@run_only_on_postgres(
[PgVersion.V14, PgVersion.V15, PgVersion.V16],
"newer control file catalog version and struct format isn't supported",
)
def test_fast_import_binary(
test_output_dir,
vanilla_pg: VanillaPostgres,

View File

@@ -786,6 +786,54 @@ def test_empty_branch_remote_storage_upload_on_restart(neon_env_builder: NeonEnv
create_thread.join()
def test_paused_upload_stalls_checkpoint(
neon_env_builder: NeonEnvBuilder,
):
"""
This test checks that checkpoints block on uploads to remote storage.
"""
neon_env_builder.enable_pageserver_remote_storage(RemoteStorageKind.LOCAL_FS)
env = neon_env_builder.init_start(
initial_tenant_conf={
# Set a small compaction threshold
"compaction_threshold": "3",
# Disable GC
"gc_period": "0s",
# disable PITR
"pitr_interval": "0s",
}
)
env.pageserver.allowed_errors.append(
f".*PUT.* path=/v1/tenant/{env.initial_tenant}/timeline.* request was dropped before completing"
)
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
client = env.pageserver.http_client()
layers_at_creation = client.layer_map_info(tenant_id, timeline_id)
deltas_at_creation = len(layers_at_creation.delta_layers())
assert (
deltas_at_creation == 1
), "are you fixing #5863? make sure we end up with 2 deltas at the end of endpoint lifecycle"
# Make new layer uploads get stuck.
# Note that timeline creation waits for the initial layers to reach remote storage.
# So at this point, the `layers_at_creation` are in remote storage.
client.configure_failpoints(("before-upload-layer-pausable", "pause"))
with env.endpoints.create_start("main", tenant_id=tenant_id) as endpoint:
# Build two tables with some data inside
endpoint.safe_psql("CREATE TABLE foo AS SELECT x FROM generate_series(1, 10000) g(x)")
wait_for_last_flush_lsn(env, endpoint, tenant_id, timeline_id)
with pytest.raises(ReadTimeout):
client.timeline_checkpoint(tenant_id, timeline_id, timeout=5)
client.configure_failpoints(("before-upload-layer-pausable", "off"))
def wait_upload_queue_empty(
client: PageserverHttpClient, tenant_id: TenantId, timeline_id: TimelineId
):