Compare commits

...

77 Commits

Author SHA1 Message Date
Tristan Partin
f879c66f14 BLAH 2025-03-13 15:12:35 -05:00
Tristan Partin
387b5c585e Commit 6 2025-03-13 15:12:35 -05:00
Tristan Partin
ad83156e2f Commit 5 2025-03-13 15:12:35 -05:00
Tristan Partin
3bcfaeb936 Commit 4 2025-03-13 15:12:35 -05:00
Tristan Partin
7c102b92fb Commit 3 2025-03-13 15:12:35 -05:00
Tristan Partin
2e59a41c98 Commit 2 2025-03-13 15:12:35 -05:00
Tristan Partin
358b59a5d7 Commit 1
Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-03-13 15:12:35 -05:00
Arpad Müller
b1a1be6a4c switch pytests and neon_local to control_plane_hooks_api (#11195)
We want to switch away from and deprecate the `--compute-hook-url` param
for the storcon in favour of `--control-plane-url` because it allows us
to construct urls with `notify-safekeepers`.

This PR switches the pytests and neon_local from a
`control_plane_compute_hook_api` to a new param named
`control_plane_hooks_api` which is supposed to point to the parent of
the `notify-attach` URL.

We still support reading the old url from disk to not be too disruptive
with existing deployments, but we just ignore it.

Also add docs for the `notify-safekeepers` upcall API.

Follow-up of #11173
Part of https://github.com/neondatabase/neon/issues/11163
2025-03-13 19:50:52 +00:00
Erik Grinaker
8afae9d03c pageserver: enable l0_flush_delay_threshold by default (#11214)
## Problem

`l0_flush_delay_threshold` has already been set to 30 in production for
a couple of weeks. Let's harmonize the default.

## Summary of changes

Update `DEFAULT_L0_FLUSH_DELAY_FACTOR` to 3 such that the default
`l0_flush_delay_threshold` is `3 * compaction_threshold`.

This differs from the production setting, which is hardcoded to 30 (with
`compaction_threshold` at 10), and is more appropriate for any tenants
that have custom `compaction_threshold` overrides.
2025-03-13 19:15:22 +00:00
JC Grünhage
066b0a1be9 fix(ci): correctly push neon-test-extensions in releases and to ghcr (#11225)
## Problem
ef0d4a48a adjusted how we build container images and how we push them,
and the neon-test-extensions image was overlooked. Additionally, is was
also missed in 1f0dea9a1, which pushed our container images to GHCR.

## Summary of changes
Push neon-test-extensions to GHCR and also push release tags for it.
2025-03-13 18:18:55 +00:00
Konstantin Knizhnik
398d2794eb Handle DEBUG_COMPARE_LOCAL mode in neon_zeroextend (#11220)
## Problem

DEBUG_COMPARE_LOCAL is not supported in neon_zeroextend added in PG16

## Summary of changes

Add support of DEBUG_COMPARE_LOCAL in neon_zeroextend

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-03-13 16:30:32 +00:00
Erik Grinaker
3c3b9dc919 pageserver: enable image_creation_preempt_threshold by default (#11216)
## Problem

This is already set in production, we should harmonize the default.

## Summary of changes

Default `image_creation_preempt_threshold` to 3.
2025-03-13 16:28:21 +00:00
Christian Schwarz
ed31dd2a3c pageserver: better observability for slow wait_lsn (#11176)
# Problem

We leave too few observability breadcrumbs in the case where wait_lsn is
exceptionally slow.

# Changes

- refactor: extract the monitoring logic out of `log_slow` into
`monitor_slow_future`
- add global + per-timeline counter for time spent waiting for wait_lsn
- It is updated while we're still waiting, similar to what we do for
page_service response flush.
- add per-timeline counterpair for started & finished wait_lsn count
- add slow-logging to leave breadcrumbs in logs, not just metrics

For the slow-logging, we need to consider not flooding the logs during a
broker or network outage/blip.
The solution is a "log-streak-level" concurrency limit per timeline.
At any given time, there is at most one slow wait_lsn that is logging
the "still running" and "completed" sequence of logs.
Other concurrent slow wait_lsn's don't log at all.
This leaves at least one breadcrumb in each timeline's logs if some
wait_lsn was exceptionally slow during a given period.
The full degree of slowness can then be determined by looking at the
per-timeline metric.

# Performance

Reran the `bench_log_slow` benchmark, no difference, so, existing call
sites are fine.

We do use a Semaphore, but only try_acquire it _after_ things have
already been determined to be slow. So, no baseline overhead
anticipated.

# Refs

-
https://github.com/neondatabase/cloud/issues/23486#issuecomment-2711587222
2025-03-13 15:03:53 +00:00
Conrad Ludgate
3dec117572 feat(compute_ctl): use TLS if configured (#10972)
Closes: https://github.com/neondatabase/cloud/issues/22998

If control-plane reports that TLS should be used, load the certificates
(and watch for updates), make sure postgres use them, and detects
updates.

Procedure:
1. Load certificates
2. Reconfigure postgres/pgbouncer
3. Loop on a timer until certificates have loaded
4. Go to 1

Notes:
1. We only run this procedure if requested on startup by control plane.
2. We needed to compile pgbouncer with openssl enabled
3. Postgres doesn't allow tls keys to be globally accessible - must be
read only to the postgres user. I couldn't convince the autoscaling team
to let me put this logic into the VM settings, so instead compute_ctl
will copy the keys to be read-only by postgres.
4. To mitigate a race condition, we also verify that the key matches the
cert.
2025-03-13 15:03:22 +00:00
Alex Chi Z.
b2286f5bcb fix(pageserver): don't panic if gc-compaction find no keys (#11200)
## Problem

There was a panic on staging that compaction didn't find any keys. This
is possible if all layers selected for compaction does not contain any
keys within the current shard.

## Summary of changes

Make panic an error. In the future, we can try creating an empty image
layer so that GC can clean up those layers. Otherwise, for now, we can
only rely on shard ancestor compaction to remove these data.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-13 14:38:45 +00:00
Erik Grinaker
c036fec065 pageserver: enable compaction_l0_first by default (#11212)
## Problem

`compaction_l0_first` has already been enabled in production for a
couple of weeks.

## Summary of changes

Enable `compaction_l0_first` by default.

Also set `CompactFlags::NoYield` in `timeline_checkpoint_handler`, to
ensure explicitly requested compaction runs to completion. This endpoint
is mainly used in tests, and caused some flakiness where tests expected
compaction to complete.
2025-03-13 14:28:42 +00:00
JC Grünhage
89c7e4e917 fix(ci): use paranthesis for error handling in jq when fetching release PRs (#11217)
## Problem
#11061 introduced code fetching previous releases. #11151 introduced jq
error handling, which has also been applied in #11061, but parenthesis
have been missed.

## Summary of changes
Add parenthesis around error handling code.
2025-03-13 13:40:43 +00:00
Erik Grinaker
5a245a837d storcon: retain stripe size when autosplitting sharded tenants (#11194)
## Problem

Autosplits always request `DEFAULT_STRIPE_SIZE` for splits. However,
splits do not allow changing the stripe size of already-sharded tenants,
and will error out if it differs.

In #11168, we are changing the stripe size, which could hit this when
attempting to autosplit already sharded tenants.

Touches #11168.

## Summary of changes

Pass `new_stripe_size: None` when autosplitting already sharded tenants.
Otherwise, pass `DEFAULT_STRIPE_SIZE` instead of the shard identity's
stripe size, since we want to use the current default rather than an
old, persisted default.
2025-03-13 13:28:10 +00:00
devin-ai-integration[bot]
efb1df4362 fix: Change metric_unit from 'microseconds' to 'μs' in test_compute_ctl_api.py (#11209)
# Fix metric_unit length in test_compute_ctl_api.py

## Description
This PR changes the metric_unit from "microseconds" to "μs" in
test_compute_ctl_api.py to fix the issue where perf test results were
not being stored in the database due to the string exceeding the 10
character limit of the metric_unit column in the perf_test_results
table.

## Problem
As reported in Slack, the perf test results were not being uploaded to
the database because the "microseconds" string (12 characters) exceeds
the 10 character limit of the metric_unit column in the
perf_test_results table.

## Solution
Replace "microseconds" with "μs" in all metric_unit parameters in the
test_compute_ctl_api.py file.

## Testing
The changes have been committed and pushed. The PR is ready for review.

Link to Devin run:
https://app.devin.ai/sessions/e29edd672bd34114b059915820e8a853
Requested by: Peter Bendel

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: peterbendel@neon.tech <peterbendel@neon.tech>
2025-03-13 10:17:01 +00:00
JC Grünhage
803e6f908a fix(ci): fix syntax of lint-release-pr (#11208)
## Problem
A small adjustment in #11061 broke the lint-release-pr.sh script, and
the new version was neither tested nor linted. This has been done now,
the script is once again tested and passing `shellcheck`.

## Summary of changes
Add missing `el` of `elif` condition chain.
2025-03-13 09:42:38 +00:00
JC Grünhage
afc9524bc7 fix(ci): run lint-release-pr on head-ref (#11206)
## Problem
#11061 changed release pr creation, and I missed that the workflow will
checkout a would-be-merge of the rc branch and the release branch
instead of the head ref, unless explicitly instructed otherwise.

## Summary of changes
Check out head ref for linting the release PRs.
2025-03-13 08:17:33 +00:00
JC Grünhage
507353404c fix(ci): pass emtpy body when creating release PRs (#11203)
## Problem
#11061 changed release pr creation, and I missed that creating PRs using
`gh` in non-interactive environments *requires* `--body` instead of
defaulting to an empty body.

## Summary of changes
Explicitly set an empty body when creating release PRs.
2025-03-12 23:54:43 +00:00
JC Grünhage
48be4df3f3 fix(ci): fetch all refs in release PR creation (#11201)
## Problem
#11061 changed release PR creation, and I missed that we need to
explicitly fetch the whole history so that the relevant git refs and
objects are available.

## Summary of changes
- Fetch all git refs including history by setting fetch-depth to 0
- Reference release branch as a remote branch, because we haven't
checked it out locally
2025-03-12 22:32:38 +00:00
Alex Chi Z.
c3b3b507f7 feat(pageserver): support detaching behavior v2 (#11158)
## Problem

close https://github.com/neondatabase/neon/issues/10310

## Summary of changes

This patch adds a new behavior for the detach_ancestor API: detach with
multi-level ancestor and no reparenting. Though we can potentially
support multi-level + do reparenting / single-level + no-reparenting in
the future, as it's not required for the recovery/snapshot epic, I'd
prefer keeping things simple now that we only handle the old one and the
new one instead of supporting the full feature matrix.

I only added a test case of successful detaching instead of testing
failures. I'd like to make this into staging and add more tests in the
future.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-12 22:27:23 +00:00
JC Grünhage
ef0d4a48a8 Reuse artifacts from release PRs (#11061)
## Problem
When we release our components, we perform builds in the release PR,
then test the components, then merge the PR, and then build everything
*again*, run tests *again*, and only then start deployments.

To speed things up, we want to perform builds and run tests in the PR,
and start deployments using the existing artifacts from the release PR.

To make that possible, we need to have both CI pipelines running on the
same commit hash, which requires fast forwarding release. That only
works, if we have a commit in the PR that has the current release branch
state as an ancestor.

## Summary of changes
- Changes to release PR creation:
- Remove templates and automatic bodies for release PRs. The previous
template wasn't used anymore, and the automatic body we created in the
pipeline didn't contain any useful content anymore after the changees
here.
- Make it possible to select the source branch. For releases that aren't
cut from `main`, like https://github.com/neondatabase/neon/pull/11051,
we need a way to trigger the new flow from a different branch.
- Determine `release-branch` automatically from the component name
instead of passing that as well.
- Changes to the merge queue job:
- Rename `get-changed-files` to `meta` in preparation of additional data
being fetched as part of that job
- Fail the merge queue if we're trying to merge into a branch other than
main - this is to prevent non-fast-forward merges.
- Label PRs to branches other than main as `fast-forward`, to trigger
the fast-forward job
- Add a fast-forward job that can be triggered with the `fast-forward`
label that performs a fast-forward merge. This only happens if the PR
has `mergeable_state == clean`, so CI having passed.
- Build and Test on releases now skips building images, skips testing
images and skips triggering e2e tests. We add new tags to the images
from the release PR to tag them as release images, and we push them to
the prod registries.
2025-03-12 21:00:59 +00:00
Alex Chi Z.
8a5a739af0 test(pageserver): add small tenant compaction (#11049)
## Problem

close https://github.com/neondatabase/neon/issues/10881

## Summary of changes

Mock a tenant with very small amount of data.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-12 20:34:19 +00:00
Tristan Partin
5eed0e4b94 Add docs to performance/test_logical_replication.py on how to run the suite (#10175)
These docs are in tandem with what was recently published on the
internal docs site.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-03-12 17:31:09 +00:00
Tristan Partin
bb3c0ff251 Make collecting the installed extensions metric async (#11071)
If the goal is to make compute_ctl completely asynchronous, then this is
one step to getting there.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-03-12 16:09:02 +00:00
Conrad Ludgate
7aec1364dd chore(proxy): remove enum and composite type queries (#11178)
In our json encoding, we only need to know about array types.
Information about composites or enums are not actually used.

Enums are quite popular, needing to type query them when not needed can
add some latency cost for no gain.
2025-03-12 15:47:17 +00:00
Tristan Partin
40672b739e Move maybe_add_request_id_header middleware into middleware module (#11187)
This matches the authorization middleware.

---------

Signed-off-by: Tristan Partin <tristan@neon.tech>
Co-authored-by: Mikhail Kot <mikhail@neon.tech>
2025-03-12 15:34:46 +00:00
Vlad Lazar
02a83913ec storcon: do not update observed state on node activation (#11155)
## Problem

When a node becomes active, we query its locations and update the
observed state in-place.
This can race with the observed state updates done when processing
reconcile results.

## Summary of changes

The argument for this reconciliation step is that is reduces the need
for background reconciliations.
I don't think is actually true anymore. There's two cases.

1. Restart of node after drain. Usually the node does not go through the
offline state here, so observed locations
were not marked as none. In any case, there should be a handful of
shards max on the node since we've just drained it.
2. Node comes back online after failure or network partition. When the
node is marked offline, we reschedule everything away from it. When it
later becomes active, the previous observed location is extraneous and
requires a reconciliation anyway.

Closes https://github.com/neondatabase/neon/issues/11148
2025-03-12 15:31:28 +00:00
Erik Grinaker
c7717c85c7 storcon,pageserver: use persisted stripe size when loading unsharded tenants (#11193)
## Problem

When the storage controller and Pageserver loads tenants from persisted
storage, it uses `ShardIdentity::unsharded()` for unsharded tenants.
However, this replaces the persisted stripe size of unsharded tenants
with the default stripe size.

This doesn't really matter for practical purposes, since the stripe size
is meaningless for unsharded tenants anyway, but can cause consistency
check failures if the persisted stripe size differs from the default.
This was seen in #11168, where we change the default stripe size.

Touches #11168.

## Summary of changes

Carry over the persisted stripe size from `TenantShardPersistence` for
unsharded tenants, and from `LocationConf` on Pageservers.

Also add bounds checks for type casts when loading persisted shard
metadata.
2025-03-12 15:16:54 +00:00
Erik Grinaker
1436b8469c pageserver: appease unused lint on macOS (#11192)
## Problem

`info_span!` is only used in a `linux` branch, causing the unused lint
to fire on macOS.

## Summary of changes

Fully qualify the `info_span!` use.
2025-03-12 14:34:29 +00:00
JC Grünhage
fc515e7be2 chore(deps): bump env_logger to 0.11.7 (#11188)
## Problem
`humantime` is unmaintained, we want to migrate to `jiff`, see
https://github.com/neondatabase/neon/issues/11179.

`env_logger` in older versions depend on `humantime`, and newer versions
depend on `jiff`, so we need to update it.

## Summary of changes
Update `env_logger` to the most recent release, which does not depend on
`humantime` anymore.
2025-03-12 14:26:52 +00:00
John Spray
7015dbbdf0 storcon_cli: remove pre-warm helper (#11183)
## Problem

This command was used when onboarding tenants to the storage controller.
We no longer do that, so the command can go.

## Summary of changes

- Remove `storcon_cli tenant-warmup` command
2025-03-12 14:02:11 +00:00
Dmitrii Kovalkov
73e37ae388 Suppress "request was dropped" errors in test_timeline_archive (#11190)
## Problem

Test `test_timeline_archive` is flaky because it makes requests that are
intended to fail. It sometimes leads to warning in pageserver's logs.
More details are in the issue.

- Closes: https://github.com/neondatabase/neon/issues/11177

## Summary of changes
- Suppress such errors.
2025-03-12 13:23:31 +00:00
Vlad Lazar
1c0ff3c04d utils: explicit OTEL export config and OTEL enablement via common entry point (#11139)
We want to export performance traces from the pageserver in OTEL format.
End goal is to see them in Grafana.

To this end, there are two changes here:
1. Update the `tracing-utils` crate to allow for explicitly specifying
the export configuration. Pageserver configuration is loaded from a file
on start-up. This allows us to use the same flow for export configs
there.
2. Update the `utils::logging::init` common entry point to set up OTEL
tracing infrastructure if requested. Note that an entirely different
tracing subscriber is used. This is to avoid interference with the
existing tracing set-up. For now, no service uses this functionality.

PR to plug this into the pageserver is
[here](https://github.com/neondatabase/neon/pull/11140).

Related https://github.com/neondatabase/neon/issues/9873
2025-03-12 11:07:49 +00:00
John Spray
7bf6397334 pageserver: remove legacy TimelineInfo::latest_gc_cutoff field (1/2) (#11149)
## Problem

This field was retained for backward compat only in
https://github.com/neondatabase/neon/pull/10707.

Once https://github.com/neondatabase/cloud/pull/25233 is released,
nothing external will be reading this field.

Internally, this was a mandatory field so storage controller is still
trying to decode it, so we must do this removal in two steps: this PR
makes the field optional, and after one release we can fully remove it.

Related: https://github.com/neondatabase/cloud/issues/24250

## Summary of changes

- Rename field to `_unused`
- Remove field from swagger
- Make field optional
2025-03-12 10:23:41 +00:00
Konstantin Knizhnik
f60ffe3021 Rebase compare local debug mode (#11174)
## Problem

DEBUG_COMPARE_LOCAL mode is broken

See
https://neondb.slack.com/archives/C03QLRH7PPD/p1732862608323269?thread_ts=1732711054.862919&cid=C03QLRH7PPD

## Summary of changes

Fix compile errors and unlogged build issues.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-03-12 05:52:18 +00:00
Arpad Müller
da2431f11f storcon: add --control-plane-url config option (#11173)
Adds the `--control-plane-url` config option to the storcon, which we
want to migrate to instead of using `notify-attach`.

Part of #11163
2025-03-12 02:30:56 +00:00
JC Grünhage
e8396034ac fix(ci): fail meta using jq halt_error if data is unexpectedly missing (#11151)
## Problem
When the githb API is having problems, we might not get data back, and
are happily setting vars as empty. This causes problems down the line.
See
https://github.com/neondatabase/neon/actions/runs/13718859397/job/38381946590?pr=11132#step:5:1
for example.

## Summary of changes
Fail the `meta` job if we don't get expected data back from github.
2025-03-11 22:59:30 +00:00
Tristan Partin
decd265c99 Revert notify to 6.0.0 (#11162)
The upgrade to 8.0.0 caused severe performance regressions in the
start_postgres_ms metric, which measures the time it takes from execing
Postgres to the time Postgres marks itself as ready in the
postmaster.pid file. We use the notify crate to watch for changes in the
pgdata directory and the postmaster.pid file.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-03-11 22:18:09 +00:00
Christian Schwarz
158db414bf buffered writer: handle write errors by retrying all write IO errors indefinitely (#10993)
# Problem

If the Pageserver ingest path
(InMemoryLayer=>EphemeralFile=>BufferedWriter)
encounters ENOSPC or any other write IO error when flushing the mutable
buffer
of the BufferedWriter, the buffered writer is left in a state where
subsequent _reads_
from the InMemoryLayer it will cause a `must not use after we returned
an error` panic.

The reason is that
1. the flush background task bails on flush failure, 
2. causing the `FlushHandle::flush` function to fail at channel.recv()
and
3. causing the `FlushHandle::flush` function to bail with the flush
error,
4. leaving its caller `BufferedWriter::flush` with
`BufferedWriter::mutable = None`,
5. once the InMemoryLayer's RwLock::write guard is dropped, subsequent
reads can enter,
6. those reads find `mutable = None` and cause the panic.

# Context

It has always been the contract that writes against the BufferedWriter
API
must not be retried because the writer/stream-style/append-only
interface makes no atomicity guarantees ("On error, did nothing or a
piece of the buffer get appended?").

The idea was that the error would bubble up to upper layers that can
throw away the buffered writer and create a new one. (See our [internal
error handling policy document on how to handle e.g.
`ENOSPC`](c870a50bc0/src/storage/handling_io_and_logical_errors.md (L36-L43))).

That _might_ be true for delta/image layer writers, I haven't checked.

But it's certainly not true for the ingest path: there are no provisions
to throw away an InMemoryLayer that encountered a write error an
reingest the WAL already written to it.

Adding such higher-level retries would involve either resetting
last_record_lsn to a lower value and restarting walreceiver. The code
isn't flexible enough to do that, and such complexity likely isn't worth
it given that write errors are rare.

# Solution

The solution in this PR is to retry _any_ failing write operation
_indefinitely_ inside the buffered writer flush task, except of course
those that are fatal as per `maybe_fatal_err`.

Retrying indefinitely ensures that `BufferedWriter::mutable` is never
left `None` in the case of IO errors, thereby solving the problem
described above.

It's a clear improvement over the status quo.

However, while we're retrying, we build up backpressure because the
`flush` is only double-buffered, not infinitely buffered.

Backpressure here is generally good to avoid resource exhaustion, **but
blocks reads** and hence stalls GetPage requests because InMemoryLayer
reads and writes are mutually exclusive.
That's orthogonal to the problem that is solved here, though.

## Caveats

Note that there are some remaining conditions in the flush background
task where it can bail with an error. I have annotated one of them with
a TODO comment.
Hence the `FlushHandle::flush` is still fallible and hence the overall
scenario of leaving `mutable = None` on the bail path is still possible.
We can clean that up in a later commit.

Note also that retrying indefinitely is great for temporary errors like
ENOSPC but likely undesirable in case the `std::io::Error` we get is
really due to higher-level logic bugs.
For example, we could fail to flush because the timeline or tenant
directory got deleted and VirtualFile's reopen fails with ENOENT.

Note finally that cancellation is not respected while we're retrying.
This means we will block timeline/tenant/pageserver shutdown.
The reason is that the existing cancellation story for the buffered
writer background task was to recv from flush op channel until the
sending side (FlushHandle) is explicitly shut down or dropped.

Failing to handle cancellation carries the operational risk that even if
a single timeline gets stuck because of a logic bug such as the one laid
out above, we must still restart the whole pageserver process.

# Alternatives Considered

As pointed out in the `Context` section, throwing away a InMemoryLayer
that encountered an error and reingesting the WAL is a lot of complexity
that IMO isn't justified for such an edge case.
Also, it's wasteful.
I think it's a local optimum.

A more general and simpler solution for ENOSPC is to `abort()` the
process and run eviction on startup before bringing up the rest of
pageserver.
I argued for it in the past, the pro arguments are still valid and
complete:
https://neondb.slack.com/archives/C033RQ5SPDH/p1716896265296329
The trouble at the time was implementing eviction on startup.
However, maybe things are simpler now that we are fully storcon-managed
and all tenants have secondaries.
For example, if pageserver `abort()`s on ENOSPC and then simply don't
respond to storcon heartbeats while we're running eviction on startup,
storcon will fail tenants over to the secondary anyway, giving us all
the time we need to clean up.

The downside is that if there's a systemic space management bug, above
proposal will just propagate the problem to other nodes. But I imagine
that because of the delays involved with filling up disks, the system
might reach a half-stable state, providing operators more time to react.

# Demo

Intermediary commit `a03f335121480afc0171b0f34606bdf929e962c5` is demoed
in this (internal) screen recording:
https://drive.google.com/file/d/1nBC6lFV2himQ8vRXDXrY30yfWmI2JL5J/view?usp=drive_link

# Perf Testing

Ran `bench_ingest` on tmpfs, no measurable difference.

Spans are uniquely owned by the flush task, and the span stack isn't too
deep, so, enter and exit should be cheap.
Plus, each flush takes ~150us with direct IO enabled, so, not _that_
high frequency event anyways.

# Refs
- fixes https://github.com/neondatabase/neon/issues/10856
2025-03-11 20:40:23 +00:00
Christian Schwarz
083a30b1e2 storage broker: disable deploy by default (#11172)
context
-
https://github.com/neondatabase/cloud/issues/23486#issuecomment-2711587222
- companion infra.git PR:
https://github.com/neondatabase/infra/pull/3249
2025-03-11 19:45:06 +00:00
Alex Chi Z.
7d221214bb feat(pageserver): support no-yield for gc-compaction (#11184)
## Problem

This should also resolve the test flakiness of `test_gc_feedback`.

close https://github.com/neondatabase/neon/issues/11144

## Summary of changes

If `NoYield` is set, do not yield in gc-compaction.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-11 19:13:52 +00:00
Dmitrii Kovalkov
8983677f29 Ignore cargo deny advisory RUSTSEC-2025-0014 for humantime (#11180)
## Problem

`humantime` is not maintained and `cargo deny check` fails

- Will be addressed in https://github.com/neondatabase/neon/issues/11179

## Summary of changes

Ignore RUSTSEC-2025-0014 advisory for now
2025-03-11 19:09:32 +00:00
Ivan Efremov
011f7c21a3 fix(proxy): Add testodrome query id HTTP header (#11167)
Handle "X-Neon-Query-ID" header to glue data with testodrome queries.

Relates to the #22486
2025-03-11 17:17:30 +00:00
Alex Chi Z.
7588983168 fix(scrubber): log even if no refs are found (#11160)
## Problem

Investigate https://github.com/neondatabase/neon/issues/11159

## Summary of changes

This doesn't fix the issue, but at least we can narrow down the cause
next time it happens by logging ancestor referenced layer cnt even if
it's 0.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-11 14:33:35 +00:00
Arseny Sher
359c64c779 walproposer: pre generations refactoring (#11060)
## Problem

https://github.com/neondatabase/neon/issues/10851

## Summary of changes

Do some refactoring before making walproposer generations aware.

- Rename SS_VOTING to SS_WAIT_VOTING, SS_IDLE to SS_WAIT_ELECTED
- Continue to get rid of epochs: rename GetEpoch to GetLastLogTerm,
donorEpoch to donorLastLogTerm
- Instead of counting n_votes, n_connected, introduce explicit
WalProposerState (collecting terms / voting / elected). Refactor out
TermsCollected and VotesCollected; they will determine state transition
differently depending whether generations are enabled or not.

There is no new logic in this PR and thus no new tests.
2025-03-11 14:01:00 +00:00
Erik Grinaker
f466c01995 pageserver: add max_logical_size_per_shard for get_top_tenants (#11157)
## Problem

In #11122, we want to split shards once the logical size of the largest
timeline exceeds a split threshold. However, `get_top_tenants` currently
only returns `max_logical_size`, which tracks the max _total_ logical
size of a timeline across all shards.

This is problematic, because the storage controller needs to fetch a
list of N tenants that are eligible for splits, but the API doesn't
currently have a way to express this. For example, with a split
threshold of 1 GB, a tenant with `max_logical_size` of 4 GB is eligible
to split if it has 1 or 2 shards, but not if it already has 4 shards. We
need to express this in per-shard terms, otherwise the `get_top_tenants`
endpoint may end up only returning tenants that can't be split, blocking
splits entirely.

Touches https://github.com/neondatabase/neon/pull/11122.
Touches https://github.com/neondatabase/cloud/issues/22532.

## Summary of changes

Add `TenantShardItem::max_logical_size_per_shard` containing
`max_logical_size / shard_count`, and
`TenantSorting::MaxLogicalSizePerShard` to order and filter by it.
2025-03-11 11:43:55 +00:00
Conrad Ludgate
d1b60fa0b6 fix(proxy): delete prepared statements when discarding (#11165)
Fixes https://github.com/neondatabase/serverless/issues/144

When tables have enums, we need to perform type queries for that data.
We cache these query statements for performance reasons. In Neon RLS, we
run "discard all" for security reasons, which discards all the
statements. When we need to type check again, the statements are no
longer valid.

This fixes it to discard the statements as well.

I've also added some new logs and error types to monitor this. Currently
we don't see the prepared statement errors in our logs.
2025-03-11 10:48:50 +00:00
Christian Schwarz
7c462b3417 impr: propagate VirtualFile metrics via RequestContext (#7202)
# Refs

- fixes https://github.com/neondatabase/neon/issues/6107

# Problem

`VirtualFile` currently parses the path it is opened with to identify
the `tenant,shard,timeline` labels to be used for the `STORAGE_IO_SIZE`
metric.

Further, for each read or write call to VirtualFile, it uses
`with_label_values` to retrieve the correct metrics object, which under
the hood is a global hashmap guarded by a parking_lot mutex.

We perform tens of thousands of reads and writes per second on every
pageserver instance; thus, doing the mutex lock + hashmap lookup is
wasteful.

# Changes

Apply the technique we use for all other timeline-scoped metrics to
avoid the repeat `with_label_values`: add it to `TimelineMetrics`.

Wrap `TimelineMetrics` into an `Arc`.

Propagate the `Arc<TimelineMetrics>` down do `VirtualFile`, and use
`Timeline::metrics::storage_io_size`.

To avoid contention on the `Arc<TimelineMetrics>`'s refcount atomics
between different connection handlers for the same timeline, we wrap it
into another Arc.

To avoid frequent allocations, we store that Arc<Arc<TimelineMetrics>>
inside the per-connection timeline cache.

Preliminary refactorings to enable this change:
- https://github.com/neondatabase/neon/pull/11001
- https://github.com/neondatabase/neon/pull/11030


# Performance

I ran the benchmarks in
`test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py`
on an `i3en.3xlarge` because that's what we currently run them on.

None of the benchmarks shows a meaningful difference in latency or
throughput or CPU utilization.

I would have expected some improvement in the
many-tenants-one-client-each workload because they all hit that hashmap
constantly, and clone the same `UintCounter` / `Arc` inside of it.

But apparently the overhead is miniscule compared to the remaining work
we do per getpage.

Yet, since the changes are already made, the added complexity is
manageable, and the perf overhead of `with_label_values` demonstrable in
micro-benchmarks, let's have this change anyway.
Also, propagating TimelineMetrics through RequestContext might come in
handy down the line.

The micro-benchmark that demonstrates perf impact of
`with_label_values`, along with other pitfalls and mitigation techniques
around the `metrics`/`prometheus` crate:
- https://github.com/neondatabase/neon/pull/11019

# Alternative Designs

An earlier iteration of this PR stored an `Arc<Arc<Timeline>>` inside
`RequestContext`.
The problem is that this risks reference cycles if the RequestContext
gets stored in an object that is owned directly or indirectly by
`Timeline`.

Ideally, we wouldn't be using this mess of Arc's at all and propagate
Rust references instead.
But tokio requires tasks to be `'static`, and so, we wouldn't be able to
propagate references across task boundaries, which is incompatible with
any sort of fan-out code we already have (e.g. concurrent IO) or future
code (parallel compaction).
So, opt for Arc for now.
2025-03-11 07:23:06 +00:00
Christian Schwarz
420f7b07b4 add benchmark demonstrating metrics/prometheus crate multicore scalability pitfalls & workarounds (#11019)
We use the `metrics` / `prometheus` crate in the Pageserver code base.

This PR demonstrates
- typical performance pitfalls with that crate
- our current set of techniques to avoid most of these pitfalls.

refs
- https://github.com/neondatabase/neon/issues/10948
- https://github.com/neondatabase/neon/pull/7202 
- I applied the `label_values__cache_label_values_lookup` technique
there.
  - It didn't yield measurable results in high-level benchmarks though.
2025-03-11 07:22:56 +00:00
Arpad Müller
4d3c477689 storcon: timetime table, creation and deletion (#11058)
This PR extends the storcon with basic safekeeper management of
timelines, mainly timeline creation and deletion. We want to make the
storcon manage safekeepers in the future. Timeline creation is
controlled by the `--timelines-onto-safekeepers` flag.

1. it adds the `timelines` and `safekeeper_timeline_pending_ops` tables
to the storcon db
2. extend code for the timeline creation and deletion
4. it adds per-safekeeper reconciler tasks 

TODO:

* maybe not immediately schedule reconciliations for deletions but have
a prior manual step
* tenant deletions
* add exclude API definitions (probably separate PR)
* how to choose safekeeper to do exclude on vs deletion? this can be a
bit hairy because the safekeeper might go offline in the meantime.
* error/failure case handling
* tests (cc test_explicit_timeline_creation from #11002)
* single safekeeper mode: we often only have one SK (in tests for
example)
* `notify-safekeepers` hook:
https://github.com/neondatabase/neon/issues/11163

TODOs implemented:

* cancellations of enqueued reconciliations on a per-timeline basis,
helpful if there is an ongoing deletion
* implement pending ops overwrite behavior
* load pending operations from db

RFC section for important reading:
[link](https://github.com/neondatabase/neon/blob/main/docs/rfcs/035-safekeeper-dynamic-membership-change.md#storage_controller-implementation)

Implements the bulk of #9011
Successor of #10440.

---------

Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2025-03-11 02:31:22 +00:00
Alex Chi Z.
3451bdd3d2 fix(test): force L0 compaction before gc-compaction (#11143)
## Problem

Fix test flakyness of `test_gc_feedback`

Closes: https://github.com/neondatabase/neon/issues/11153

## Summary of changes

Looking at the log, gc-compaction is interrupted by L0 compaction.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-10 20:03:49 +00:00
Konstantin Knizhnik
fb1957936c Fix caclulation of LFC used_pages (#11095)
## Problem

Async prefetch in LFC PR cause incorrect calculation of LFC
`used_pages`when page is overwritten

## Summary of changes

Decrement `used_pages` is page is overwritten.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Matthias van de Meent <matthias@neon.tech>
2025-03-10 18:28:55 +00:00
Matthias van de Meent
bc052fd0fc Add configuration options to disable prevlink checks (#11138)
This allows for improved decoding of otherwise broken WAL.

## Problem

Currently, if (or when) a WAL record has a wrong prevptr, that breaks
decoding. With this, we don't have to break on that if we decide it's OK
to proceed after that.

## Summary of changes

Use a Neon GUC to allow the system to enable the NEON-specific
skip_lsn_checks option in XLogReader.
2025-03-10 17:02:30 +00:00
Vlad Lazar
8c553297cb safekeeper: use max end lsn as start of next batch (#11152)
## Problem

Partial reads are still problematic. They are stored in the buffer of
the wal decoder and result in gaps being reported too eagerly on the
pageserver side.

## Summary of changes

Previously, we always used the start LSN of the chunk of WAL that was
just read. This patch switches to using the end LSN of the last record
that was decoded in the previous iteration.
2025-03-10 16:33:28 +00:00
Dmitrii Kovalkov
63b22d3fb1 pageserver: https for management API (#11025)
## Problem

Storage controller uses unencrypted HTTP requests for pageserver
management API.

Closes: https://github.com/neondatabase/cloud/issues/24283


## Summary of changes
- Implement `http_utils::server::Server` with TLS support.
- Replace `hyper0::server::Server` with `http_utils::server::Server` in
pageserver.
- Add HTTPS handler for pageserver management API.
- Generate local SSL certificates in neon local.
2025-03-10 15:07:59 +00:00
JC Grünhage
f17931870f fix(ci): use <!subteam^ID> syntax for pinging groups on slack (#11135)
## Problem
Pinging groups on slack didn't work, because I didn't use the correct
syntax.

## Summary of changes
Use the correct syntax for pinging groups.
2025-03-10 13:27:23 +00:00
Arpad Müller
33c3c34c95 Appease cargo deny errors (#11142)
* pprof can also use `prost` as a backend, switch to it as `protobuf`
has no update available but a security issue.
* `paste` is a build time dependency, so add the unmaintained warning as
an exception.
2025-03-10 13:24:14 +00:00
Ivan Efremov
5d38fd6c43 fix(proxy): Use testodrome query id for latency measurement (#11150)
Add a new neon option "neon_query_id" to glue data with testodrome
queries. Log latency in microseconds always.

Relates to the #22486
2025-03-10 12:55:16 +00:00
Dmitrii Kovalkov
66881b4394 storcon: update scheduler stats when changing node's preferred az (#11147)
## Problem

`home_shard_count` is not updated on the preferred AZ change.

Closes: https://github.com/neondatabase/neon/issues/10493

## Summary of changes
- Update scheduler stats (node ref counts) on preferred AZ change.
2025-03-10 11:33:00 +00:00
Konstantin Knizhnik
c87d307e8c Print state of connection buffer when no response is receioved from PS for a long time (#11145)
## Problem

See https://neondb.slack.com/archives/C08DE6Q9C3B

Sometimes compute is not able to receive responses from PS for a long
time (minutes).
I do not think that the problem is at compute side, but in order to
exclude this possibility I wan to see more information about connection
state at compute side, particularly amount of data cached in connection
buffer.

## Summary of changes

Right now we are dumping state of socket buffer.
This PR adds printing state of connection buffer.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-03-09 18:36:36 +00:00
Tristan Partin
1b8c4286c4 Fetch remote extension in ALTER EXTENSION UPDATE statements (#11102)
Previously, remote extensions were not fetched unless they were used in
some other manner. For instance, loading a BM25 index in pg_search
fetches the pg_search extension. However, if on a fresh compute with
pg_search 0.15.5 installed, the user ran `ALTER EXTENSION pg_search
UPDATE TO '0.15.6'` without first using the pg_search extension, we
would not fetch the extension and fail to find an update path.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-03-09 17:29:44 +00:00
Tristan Partin
3fe5650039 Fix dropping role with table privileges granted by non-neon_superuser (#10964)
We were previously only revoking privileges granted by neon_superuser.
However, we need to do it for all grantors.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-03-07 19:00:11 +00:00
Alex Chi Z.
cd438406fb feat(pageserver): add force patch index_part API (#11119)
## Problem

As part of the disaster recovery tool. Partly for
https://github.com/neondatabase/neon/issues/9114.

## Summary of changes

* Add a new pageserver API to force patch the fields in index_part and
modify the timeline internal structures.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-07 17:42:52 +00:00
Dmitrii Kovalkov
e876794ce5 storcon: use https safekeeper api (#11065)
## Problem

Storage controller uses http for requests to safekeeper management API.

Closes: https://github.com/neondatabase/cloud/issues/24835

## Summary of changes
- Add `use_https_safekeeper_api` option to storcon to use https api
- Use https for requests to safekeeper management API if this option is
enabled
- Add `ssl_ca_file` option to storcon for ability to specify custom root
CA certificate
2025-03-07 17:22:47 +00:00
John Spray
87e6117dfd storage controller: API-driven graceful migrations (#10913)
## Problem

The current migration API does a live migration, but if the destination
doesn't already have a secondary, that live migration is unlikely to be
able to warm up a tenant properly within its timeout (full warmup of a
big tenant can take tens of minutes).

Background optimisation code knows how to do this gracefully by creating
a secondary first, but we don't currently give a human a way to trigger
that.

Closes: https://github.com/neondatabase/neon/issues/10540

## Summary of changes

- Add `prefererred_node` parameter to TenantShard, which is respected by
optimize_attachment
- Modify migration API to have optional prewarm=true mode, in which we
set preferred_node and call optimize_attachment, rather than directly
modifying intentstate
- Require override_scheduler=true flag if migrating somewhere that is a
less-than-optimal scheduling location (e.g. wrong AZ)
- Add `origin_node_id` to migration API so that callers can ensure
they're moving from where they think they're moving from
- Add tests for the above

The storcon_cli wrapper for this has a 'watch' mode that waits for
eventual cutover. This doesn't show the warmth of the secondary evolve
because we don't currently have an API for that in the controller, as
the passthrough API only targets attached locations, not secondaries. It
would be straightforward to add later as a dedicated endpoint for
getting secondary status, then extend the storcon_cli to consume that
and print a nice progress indicator.
2025-03-07 17:02:38 +00:00
Vlad Lazar
084fc4a757 pageserver: enable previous heatmaps by default (#11132)
We add the off by default configs in
https://github.com/neondatabase/neon/pull/11088 because
the unarchival heatmap was causing oversized secondary locations. That
was fixed in https://github.com/neondatabase/neon/pull/11098, so let's
turn them on by default.
2025-03-07 16:05:31 +00:00
Vlad Lazar
937876cbe2 safekeeper: don't skip empty records for shard zero (#11137)
## Problem

Shard zero needs to track the start LSN of the latest record
in adition to the LSN of the next record to ingest. The former
is included in basebackup persisted by the compute in WAL.

Previously, empty records were skipped for all shards. This caused
the prev LSN tracking on the PS to fall behind and led to logical
replication
issues.

## Summary of changes

Shard zero now receives emtpy interpreted records for LSN tracking
purposes.
A test is included too.
2025-03-07 15:52:01 +00:00
Alexander Lakhin
a4ce20db5c Support workflow_dispatch event in _meta.yml (#11133)
## Problem
Allow for using _meta.yml with workflow_dispatch event.

## Summary of changes
Handle this event in the run-kind step; fix and update the description
of the run-kind output.
2025-03-07 15:00:06 +00:00
Erik Grinaker
eedd179f0c storcon: initial autosplit tweaks (#11134)
## Problem

This patch makes some initial tweaks as preparation for
https://github.com/neondatabase/cloud/issues/22532, where we will be
introducing additional autosplit logic.

The plan is outlined in
https://github.com/neondatabase/cloud/issues/22532#issuecomment-2706215907.

## Summary of changes

Minor code cleanups and behavioral changes:

* Decide that we'll split based on `max_logical_size` (later possibly
`total_logical_size`).
* Fix a bug that would split the smallest candidate rather than the
largest.
* Pick the largest candidate by `max_logical_size` rather than
`resident_size`, for consistency (debatable).
* Split out `get_top_tenant_shards()` to fetch split candidates.
* Fetch candidates concurrently from all nodes.
* Make `TenantShard.get_scheduling_policy()` return a copy instead of a
reference.
2025-03-07 14:38:01 +00:00
Arpad Müller
f1b18874c3 storcon: require safekeeper jwt's in strict mode (#11116)
We have introduced the ability to specify safekeeper JWTs for the
storage controller. It now does heartbeats. We now want to also require
the presence of those JWTs.

Let's merge this PR shortly after the release cutoff.

Part of / follow-up of
https://github.com/neondatabase/cloud/issues/24727
2025-03-07 13:29:48 +00:00
Fedor Dikarev
db77896e92 remove CODEOWNER assignement for the test_runner/ (#11130)
## Problem
That adds an extra unnecessary load to the PerfCorr team

## Summary of changes
Remove `CODEOWNERS` assignment for the `test_runner/` folder
2025-03-07 12:38:27 +00:00
Alexey Kondratov
f5aa8c3eac feat(compute_ctl): Add a basic HTTP API benchmark (#11123)
## Problem

We just had a regression reported at
https://neondb.slack.com/archives/C08EXUJF554/p1741102467515599, which
clearly came with one of the releases. It's not a huge problem yet, but
it's annoying that we cannot quickly attribute it to a specific commit.

## Summary of changes

Add a very simple `compute_ctl` HTTP API benchmark that does 10k
requests to `/status` and `metrics.json` and reports p50 and p99.

---------

Co-authored-by: Peter Bendel <peterbendel@neon.tech>
2025-03-07 12:35:42 +00:00
Arpad Müller
cea67fc062 update ring to 0.17.13 (#11131)
Update ring from 0.17.6 to 0.17.13. Addresses the advisory:
https://rustsec.org/advisories/RUSTSEC-2025-0009
2025-03-07 12:17:04 +00:00
193 changed files with 6989 additions and 1858 deletions

View File

@@ -1,21 +0,0 @@
## Release 202Y-MM-DD
**NB: this PR must be merged only by 'Create a merge commit'!**
### Checklist when preparing for release
- [ ] Read or refresh [the release flow guide](https://www.notion.so/neondatabase/Release-general-flow-61f2e39fd45d4d14a70c7749604bd70b)
- [ ] Ask in the [cloud Slack channel](https://neondb.slack.com/archives/C033A2WE6BZ) that you are going to rollout the release. Any blockers?
- [ ] Does this release contain any db migrations? Destructive ones? What is the rollback plan?
<!-- List everything that should be done **before** release, any issues / setting changes / etc -->
### Checklist after release
- [ ] Make sure instructions from PRs included in this release and labeled `manual_release_instructions` are executed (either by you or by people who wrote them).
- [ ] Based on the merged commits write release notes and open a PR into `website` repo ([example](https://github.com/neondatabase/website/pull/219/files))
- [ ] Check [#dev-production-stream](https://neondb.slack.com/archives/C03F5SM1N02) Slack channel
- [ ] Check [stuck projects page](https://console.neon.tech/admin/projects?sort=last_active&order=desc&stuck=true)
- [ ] Check [recent operation failures](https://console.neon.tech/admin/operations?action=create_timeline%2Cstart_compute%2Cstop_compute%2Csuspend_compute%2Capply_config%2Cdelete_timeline%2Cdelete_tenant%2Ccreate_branch%2Ccheck_availability&sort=updated_at&order=desc&had_retries=some)
- [ ] Check [cloud SLO dashboard](https://neonprod.grafana.net/d/_oWcBMJ7k/cloud-slos?orgId=1)
- [ ] Check [compute startup metrics dashboard](https://neonprod.grafana.net/d/5OkYJEmVz/compute-startup-time)
<!-- List everything that should be done **after** release, any admin UI configuration / Grafana dashboard / alert changes / setting changes / etc -->

View File

@@ -33,3 +33,5 @@ config-variables:
- NEON_PROD_AWS_ACCOUNT_ID
- AWS_ECR_REGION
- BENCHMARK_LARGE_OLTP_PROJECTID
- SLACK_ON_CALL_DEVPROD_STREAM
- SLACK_RUST_CHANNEL_ID

View File

@@ -1,14 +1,16 @@
import itertools
import json
import os
import sys
build_tag = os.environ["BUILD_TAG"]
branch = os.environ["BRANCH"]
dev_acr = os.environ["DEV_ACR"]
prod_acr = os.environ["PROD_ACR"]
dev_aws = os.environ["DEV_AWS"]
prod_aws = os.environ["PROD_AWS"]
aws_region = os.environ["AWS_REGION"]
source_tag = os.getenv("SOURCE_TAG")
target_tag = os.getenv("TARGET_TAG")
branch = os.getenv("BRANCH")
dev_acr = os.getenv("DEV_ACR")
prod_acr = os.getenv("PROD_ACR")
dev_aws = os.getenv("DEV_AWS")
prod_aws = os.getenv("PROD_AWS")
aws_region = os.getenv("AWS_REGION")
components = {
"neon": ["neon"],
@@ -39,24 +41,23 @@ registries = {
outputs: dict[str, dict[str, list[str]]] = {}
target_tags = [build_tag, "latest"] if branch == "main" else [build_tag]
target_stages = ["dev", "prod"] if branch.startswith("release") else ["dev"]
target_tags = [target_tag, "latest"] if branch == "main" else [target_tag]
target_stages = (
["dev", "prod"] if branch in ["release", "release-proxy", "release-compute"] else ["dev"]
)
for component_name, component_images in components.items():
for stage in target_stages:
outputs[f"{component_name}-{stage}"] = dict(
[
(
f"docker.io/neondatabase/{component_image}:{build_tag}",
[
f"{combo[0]}/{component_image}:{combo[1]}"
for combo in itertools.product(registries[stage], target_tags)
],
)
for component_image in component_images
outputs[f"{component_name}-{stage}"] = {
f"docker.io/neondatabase/{component_image}:{source_tag}": [
f"{registry}/{component_image}:{tag}"
for registry, tag in itertools.product(registries[stage], target_tags)
if not (registry == "docker.io/neondatabase" and tag == source_tag)
]
)
for component_image in component_images
}
with open(os.environ["GITHUB_OUTPUT"], "a") as f:
with open(os.getenv("GITHUB_OUTPUT", "/dev/null"), "a") as f:
for key, value in outputs.items():
f.write(f"{key}={json.dumps(value)}\n")
print(f"Image map for {key}:\n{json.dumps(value, indent=2)}\n\n", file=sys.stderr)

110
.github/scripts/lint-release-pr.sh vendored Executable file
View File

@@ -0,0 +1,110 @@
#!/usr/bin/env bash
set -euo pipefail
DOCS_URL="https://docs.neon.build/overview/repositories/neon.html"
message() {
if [[ -n "${GITHUB_PR_NUMBER:-}" ]]; then
gh pr comment --repo "${GITHUB_REPOSITORY}" "${GITHUB_PR_NUMBER}" --edit-last --body "$1" \
|| gh pr comment --repo "${GITHUB_REPOSITORY}" "${GITHUB_PR_NUMBER}" --body "$1"
fi
echo "$1"
}
report_error() {
message "$1
For more details, see the documentation: ${DOCS_URL}"
exit 1
}
case "$RELEASE_BRANCH" in
"release") COMPONENT="Storage" ;;
"release-proxy") COMPONENT="Proxy" ;;
"release-compute") COMPONENT="Compute" ;;
*)
report_error "Unknown release branch: ${RELEASE_BRANCH}"
;;
esac
# Identify main and release branches
MAIN_BRANCH="origin/main"
REMOTE_RELEASE_BRANCH="origin/${RELEASE_BRANCH}"
# Find merge base
MERGE_BASE=$(git merge-base "${MAIN_BRANCH}" "${REMOTE_RELEASE_BRANCH}")
echo "Merge base of ${MAIN_BRANCH} and ${RELEASE_BRANCH}: ${MERGE_BASE}"
# Get the HEAD commit (last commit in PR, expected to be the merge commit)
LAST_COMMIT=$(git rev-parse HEAD)
MERGE_COMMIT_MESSAGE=$(git log -1 --format=%s "${LAST_COMMIT}")
EXPECTED_MESSAGE_REGEX="^$COMPONENT release [0-9]{4}-[0-9]{2}-[0-9]{2}$"
if ! [[ "${MERGE_COMMIT_MESSAGE}" =~ ${EXPECTED_MESSAGE_REGEX} ]]; then
report_error "Merge commit message does not match expected pattern: '<component> release YYYY-MM-DD'
Expected component: ${COMPONENT}
Found: '${MERGE_COMMIT_MESSAGE}'"
fi
echo "✅ Merge commit message is correctly formatted: '${MERGE_COMMIT_MESSAGE}'"
LAST_COMMIT_PARENTS=$(git cat-file -p "${LAST_COMMIT}" | jq -sR '[capture("parent (?<parent>[0-9a-f]{40})"; "g") | .parent]')
if [[ "$(echo "${LAST_COMMIT_PARENTS}" | jq 'length')" -ne 2 ]]; then
report_error "Last commit must be a merge commit with exactly two parents"
fi
EXPECTED_RELEASE_HEAD=$(git rev-parse "${REMOTE_RELEASE_BRANCH}")
if echo "${LAST_COMMIT_PARENTS}" | jq -e --arg rel "${EXPECTED_RELEASE_HEAD}" 'index($rel) != null' > /dev/null; then
LINEAR_HEAD=$(echo "${LAST_COMMIT_PARENTS}" | jq -r '[.[] | select(. != $rel)][0]' --arg rel "${EXPECTED_RELEASE_HEAD}")
else
report_error "Last commit must merge the release branch (${RELEASE_BRANCH})"
fi
echo "✅ Last commit correctly merges the previous commit and the release branch"
echo "Top commit of linear history: ${LINEAR_HEAD}"
MERGE_COMMIT_TREE=$(git rev-parse "${LAST_COMMIT}^{tree}")
LINEAR_HEAD_TREE=$(git rev-parse "${LINEAR_HEAD}^{tree}")
if [[ "${MERGE_COMMIT_TREE}" != "${LINEAR_HEAD_TREE}" ]]; then
report_error "Tree of merge commit (${MERGE_COMMIT_TREE}) does not match tree of linear history head (${LINEAR_HEAD_TREE})
This indicates that the merge of ${RELEASE_BRANCH} into this branch was not performed using the merge strategy 'ours'"
fi
echo "✅ Merge commit tree matches the linear history head"
EXPECTED_PREVIOUS_COMMIT="${LINEAR_HEAD}"
# Now traverse down the history, ensuring each commit has exactly one parent
CURRENT_COMMIT="${EXPECTED_PREVIOUS_COMMIT}"
while [[ "${CURRENT_COMMIT}" != "${MERGE_BASE}" && "${CURRENT_COMMIT}" != "${EXPECTED_RELEASE_HEAD}" ]]; do
CURRENT_COMMIT_PARENTS=$(git cat-file -p "${CURRENT_COMMIT}" | jq -sR '[capture("parent (?<parent>[0-9a-f]{40})"; "g") | .parent]')
if [[ "$(echo "${CURRENT_COMMIT_PARENTS}" | jq 'length')" -ne 1 ]]; then
report_error "Commit ${CURRENT_COMMIT} must have exactly one parent"
fi
NEXT_COMMIT=$(echo "${CURRENT_COMMIT_PARENTS}" | jq -r '.[0]')
if [[ "${NEXT_COMMIT}" == "${MERGE_BASE}" ]]; then
echo "✅ Reached merge base (${MERGE_BASE})"
PR_BASE="${MERGE_BASE}"
elif [[ "${NEXT_COMMIT}" == "${EXPECTED_RELEASE_HEAD}" ]]; then
echo "✅ Reached release branch (${EXPECTED_RELEASE_HEAD})"
PR_BASE="${EXPECTED_RELEASE_HEAD}"
elif [[ -z "${NEXT_COMMIT}" ]]; then
report_error "Unexpected end of commit history before reaching merge base"
fi
# Move to the next commit in the chain
CURRENT_COMMIT="${NEXT_COMMIT}"
done
echo "✅ All commits are properly ordered and linear"
echo "✅ Release PR structure is valid"
echo
message "Commits that are part of this release:
$(git log --oneline "${PR_BASE}..${LINEAR_HEAD}")"

View File

@@ -17,6 +17,12 @@
({};
.[$entry.component] |= (if . == null or $entry.version > .version then $entry else . end))
# Ensure that each component exists, or fail
| (["storage", "compute", "proxy"] - (keys)) as $missing
| if ($missing | length) > 0 then
"Error: Found no release for \($missing | join(", "))!\n" | halt_error(1)
else . end
# Convert the resulting object into an array of formatted strings
| to_entries
| map("\(.key)=\(.value.full)")

View File

@@ -7,8 +7,8 @@ on:
description: 'Component name'
required: true
type: string
release-branch:
description: 'Release branch'
source-branch:
description: 'Source branch'
required: true
type: string
secrets:
@@ -30,17 +30,25 @@ jobs:
steps:
- uses: actions/checkout@v4
with:
ref: main
ref: ${{ inputs.source-branch }}
fetch-depth: 0
- name: Set variables
id: vars
env:
COMPONENT_NAME: ${{ inputs.component-name }}
RELEASE_BRANCH: ${{ inputs.release-branch }}
RELEASE_BRANCH: >-
${{
false
|| inputs.component-name == 'Storage' && 'release'
|| inputs.component-name == 'Proxy' && 'release-proxy'
|| inputs.component-name == 'Compute' && 'release-compute'
}}
run: |
today=$(date +'%Y-%m-%d')
echo "title=${COMPONENT_NAME} release ${today}" | tee -a ${GITHUB_OUTPUT}
echo "rc-branch=rc/${RELEASE_BRANCH}/${today}" | tee -a ${GITHUB_OUTPUT}
echo "release-branch=${RELEASE_BRANCH}" | tee -a ${GITHUB_OUTPUT}
- name: Configure git
run: |
@@ -49,31 +57,36 @@ jobs:
- name: Create RC branch
env:
RELEASE_BRANCH: ${{ steps.vars.outputs.release-branch }}
RC_BRANCH: ${{ steps.vars.outputs.rc-branch }}
TITLE: ${{ steps.vars.outputs.title }}
run: |
git checkout -b "${RC_BRANCH}"
git switch -c "${RC_BRANCH}"
# create an empty commit to distinguish workflow runs
# from other possible releases from the same commit
git commit --allow-empty -m "${TITLE}"
# Manually create a merge commit on the current branch, keeping the
# tree and setting the parents to the current HEAD and the HEAD of the
# release branch. This commit is what we'll fast-forward the release
# branch to when merging the release branch.
# For details on why, look at
# https://docs.neon.build/overview/repositories/neon.html#background-on-commit-history-of-release-prs
current_tree=$(git rev-parse 'HEAD^{tree}')
release_head=$(git rev-parse "origin/${RELEASE_BRANCH}")
current_head=$(git rev-parse HEAD)
merge_commit=$(git commit-tree -p "${current_head}" -p "${release_head}" -m "${TITLE}" "${current_tree}")
# Fast-forward the current branch to the newly created merge_commit
git merge --ff-only ${merge_commit}
git push origin "${RC_BRANCH}"
- name: Create a PR into ${{ inputs.release-branch }}
- name: Create a PR into ${{ steps.vars.outputs.release-branch }}
env:
GH_TOKEN: ${{ secrets.ci-access-token }}
RC_BRANCH: ${{ steps.vars.outputs.rc-branch }}
RELEASE_BRANCH: ${{ inputs.release-branch }}
RELEASE_BRANCH: ${{ steps.vars.outputs.release-branch }}
TITLE: ${{ steps.vars.outputs.title }}
run: |
cat << EOF > body.md
## ${TITLE}
**Please merge this Pull Request using 'Create a merge commit' button**
EOF
gh pr create --title "${TITLE}" \
--body-file "body.md" \
--body "" \
--head "${RC_BRANCH}" \
--base "${RELEASE_BRANCH}"

View File

@@ -19,11 +19,18 @@ on:
description: "Tag of the last compute release"
value: ${{ jobs.tags.outputs.compute }}
run-kind:
description: "The kind of run we're currently in. Will be one of `pr`, `push-main`, `storage-rc`, `storage-release`, `proxy-rc`, `proxy-release`, `compute-rc`, `compute-release` or `merge_queue`"
description: "The kind of run we're currently in. Will be one of `push-main`, `storage-release`, `compute-release`, `proxy-release`, `storage-rc-pr`, `compute-rc-pr`, `proxy-rc-pr`, `pr`, or `workflow-dispatch`"
value: ${{ jobs.tags.outputs.run-kind }}
release-pr-run-id:
description: "Only available if `run-kind in [storage-release, proxy-release, compute-release]`. Contains the run ID of the `Build and Test` workflow, assuming one with the current commit can be found."
value: ${{ jobs.tags.outputs.release-pr-run-id }}
permissions: {}
defaults:
run:
shell: bash -euo pipefail {0}
jobs:
tags:
runs-on: ubuntu-22.04
@@ -33,6 +40,7 @@ jobs:
proxy: ${{ steps.previous-releases.outputs.proxy }}
storage: ${{ steps.previous-releases.outputs.storage }}
run-kind: ${{ steps.run-kind.outputs.run-kind }}
release-pr-run-id: ${{ steps.release-pr-run-id.outputs.release-pr-run-id }}
permissions:
contents: read
steps:
@@ -55,6 +63,7 @@ jobs:
|| (inputs.github-event-name == 'pull_request' && github.base_ref == 'release-compute') && 'compute-rc-pr'
|| (inputs.github-event-name == 'pull_request' && github.base_ref == 'release-proxy') && 'proxy-rc-pr'
|| (inputs.github-event-name == 'pull_request') && 'pr'
|| (inputs.github-event-name == 'workflow_dispatch') && 'workflow-dispatch'
|| 'unknown'
}}
run: |
@@ -82,9 +91,16 @@ jobs:
echo "tag=release-compute-$(git rev-list --count HEAD)" | tee -a $GITHUB_OUTPUT
;;
pr|storage-rc-pr|compute-rc-pr|proxy-rc-pr)
BUILD_AND_TEST_RUN_ID=$(gh run list -b $CURRENT_BRANCH -c $CURRENT_SHA -w 'Build and Test' -L 1 --json databaseId --jq '.[].databaseId')
BUILD_AND_TEST_RUN_ID=$(gh api --paginate \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"/repos/${GITHUB_REPOSITORY}/actions/runs?head_sha=${CURRENT_SHA}&branch=${CURRENT_BRANCH}" \
| jq '[.workflow_runs[] | select(.name == "Build and Test")][0].id // ("Error: No matching workflow run found." | halt_error(1))')
echo "tag=$BUILD_AND_TEST_RUN_ID" | tee -a $GITHUB_OUTPUT
;;
workflow-dispatch)
echo "tag=$GITHUB_RUN_ID" | tee -a $GITHUB_OUTPUT
;;
*)
echo "Unexpected RUN_KIND ('${RUN_KIND}'), failing to assign build-tag!"
exit 1
@@ -101,3 +117,13 @@ jobs:
"/repos/${GITHUB_REPOSITORY}/releases" \
| jq -f .github/scripts/previous-releases.jq -r \
| tee -a "${GITHUB_OUTPUT}"
- name: Get the release PR run ID
id: release-pr-run-id
if: ${{ contains(fromJson('["storage-release", "compute-release", "proxy-release"]'), steps.run-kind.outputs.run-kind) }}
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
CURRENT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
run: |
RELEASE_PR_RUN_ID=$(gh api "/repos/${GITHUB_REPOSITORY}/actions/runs?head_sha=$CURRENT_SHA" | jq '[.workflow_runs[] | select(.name == "Build and Test") | select(.head_branch | test("^rc/release(-(proxy)|(compute))?/[0-9]{4}-[0-9]{2}-[0-9]{2}$"; "s"))] | first | .id // ("Falied to find Build and Test run from RC PR!" | halt_error(1))')
echo "release-pr-run-id=$RELEASE_PR_RUN_ID" | tee -a $GITHUB_OUTPUT

View File

@@ -476,7 +476,7 @@ jobs:
(
!github.event.pull_request.draft
|| contains( github.event.pull_request.labels.*.name, 'run-e2e-tests-in-draft')
|| contains(fromJSON('["push-main", "storage-release", "proxy-release", "compute-release"]'), needs.meta.outputs.run-kind)
|| needs.meta.outputs.run-kind == 'push-main'
) && !failure() && !cancelled()
}}
needs: [ check-permissions, push-neon-image-dev, push-compute-image-dev, meta ]
@@ -487,7 +487,7 @@ jobs:
neon-image-arch:
needs: [ check-permissions, build-build-tools-image, meta ]
if: ${{ contains(fromJSON('["push-main", "pr", "storage-release", "storage-rc-pr", "proxy-release", "proxy-rc-pr"]'), needs.meta.outputs.run-kind) }}
if: ${{ contains(fromJSON('["push-main", "pr", "storage-rc-pr", "proxy-rc-pr"]'), needs.meta.outputs.run-kind) }}
strategy:
matrix:
arch: [ x64, arm64 ]
@@ -537,7 +537,7 @@ jobs:
neon-image:
needs: [ neon-image-arch, meta ]
if: ${{ contains(fromJSON('["push-main", "pr", "storage-release", "storage-rc-pr", "proxy-release", "proxy-rc-pr"]'), needs.meta.outputs.run-kind) }}
if: ${{ contains(fromJSON('["push-main", "pr", "storage-rc-pr", "proxy-rc-pr"]'), needs.meta.outputs.run-kind) }}
runs-on: ubuntu-22.04
permissions:
id-token: write # aws-actions/configure-aws-credentials
@@ -559,7 +559,7 @@ jobs:
compute-node-image-arch:
needs: [ check-permissions, build-build-tools-image, meta ]
if: ${{ contains(fromJSON('["push-main", "pr", "compute-release", "compute-rc-pr"]'), needs.meta.outputs.run-kind) }}
if: ${{ contains(fromJSON('["push-main", "pr", "compute-rc-pr"]'), needs.meta.outputs.run-kind) }}
permissions:
id-token: write # aws-actions/configure-aws-credentials
statuses: write
@@ -651,7 +651,7 @@ jobs:
compute-node-image:
needs: [ compute-node-image-arch, meta ]
if: ${{ contains(fromJSON('["push-main", "pr", "compute-release", "compute-rc-pr"]'), needs.meta.outputs.run-kind) }}
if: ${{ contains(fromJSON('["push-main", "pr", "compute-rc-pr"]'), needs.meta.outputs.run-kind) }}
permissions:
id-token: write # aws-actions/configure-aws-credentials
statuses: write
@@ -694,7 +694,7 @@ jobs:
vm-compute-node-image-arch:
needs: [ check-permissions, meta, compute-node-image ]
if: ${{ contains(fromJSON('["push-main", "pr", "compute-release", "compute-rc-pr"]'), needs.meta.outputs.run-kind) }}
if: ${{ contains(fromJSON('["push-main", "pr", "compute-rc-pr"]'), needs.meta.outputs.run-kind) }}
runs-on: ${{ fromJson(format('["self-hosted", "{0}"]', matrix.arch == 'arm64' && 'large-arm64' || 'large')) }}
strategy:
fail-fast: false
@@ -747,7 +747,7 @@ jobs:
vm-compute-node-image:
needs: [ vm-compute-node-image-arch, meta ]
if: ${{ contains(fromJSON('["push-main", "pr", "compute-release", "compute-rc-pr"]'), needs.meta.outputs.run-kind) }}
if: ${{ contains(fromJSON('["push-main", "pr", "compute-rc-pr"]'), needs.meta.outputs.run-kind) }}
runs-on: ubuntu-22.04
strategy:
matrix:
@@ -773,7 +773,12 @@ jobs:
test-images:
needs: [ check-permissions, meta, neon-image, compute-node-image ]
# Depends on jobs that can get skipped
if: "!failure() && !cancelled()"
if: >-
${{
!failure()
&& !cancelled()
&& contains(fromJSON('["push-main", "pr", "storage-rc-pr", "proxy-rc-pr", "compute-rc-pr"]'), needs.meta.outputs.run-kind)
}}
strategy:
fail-fast: false
matrix:
@@ -800,7 +805,7 @@ jobs:
# Ensure that we don't have bad versions.
- name: Verify image versions
shell: bash # ensure no set -e for better error messages
if: ${{ contains(fromJSON('["push-main", "pr", "storage-release", "storage-rc-pr", "proxy-release", "proxy-rc-pr"]'), needs.meta.outputs.run-kind) }}
if: ${{ contains(fromJSON('["push-main", "pr", "storage-rc-pr", "proxy-rc-pr"]'), needs.meta.outputs.run-kind) }}
run: |
pageserver_version=$(docker run --rm neondatabase/neon:${{ needs.meta.outputs.build-tag }} "/bin/sh" "-c" "/usr/local/bin/pageserver --version")
@@ -821,19 +826,19 @@ jobs:
env:
TAG: >-
${{
contains(fromJSON('["compute-release", "compute-rc-pr"]'), needs.meta.outputs.run-kind)
needs.meta.outputs.run-kind == 'compute-rc-pr'
&& needs.meta.outputs.previous-storage-release
|| needs.meta.outputs.build-tag
}}
COMPUTE_TAG: >-
${{
contains(fromJSON('["storage-release", "storage-rc-pr", "proxy-release", "proxy-rc-pr"]'), needs.meta.outputs.run-kind)
contains(fromJSON('["storage-rc-pr", "proxy-rc-pr"]'), needs.meta.outputs.run-kind)
&& needs.meta.outputs.previous-compute-release
|| needs.meta.outputs.build-tag
}}
TEST_EXTENSIONS_TAG: >-
${{
contains(fromJSON('["storage-release", "storage-rc-pr", "proxy-release", "proxy-rc-pr"]'), needs.meta.outputs.run-kind)
contains(fromJSON('["storage-rc-pr", "proxy-rc-pr"]'), needs.meta.outputs.run-kind)
&& 'latest'
|| needs.meta.outputs.build-tag
}}
@@ -885,7 +890,13 @@ jobs:
id: generate
run: python3 .github/scripts/generate_image_maps.py
env:
BUILD_TAG: "${{ needs.meta.outputs.build-tag }}"
SOURCE_TAG: >-
${{
contains(fromJson('["storage-release", "compute-release", "proxy-release"]'), needs.meta.outputs.run-kind)
&& needs.meta.outputs.release-pr-run-id
|| needs.meta.outputs.build-tag
}}
TARGET_TAG: ${{ needs.meta.outputs.build-tag }}
BRANCH: "${{ github.ref_name }}"
DEV_ACR: "${{ vars.AZURE_DEV_REGISTRY_NAME }}"
PROD_ACR: "${{ vars.AZURE_PROD_REGISTRY_NAME }}"
@@ -895,7 +906,7 @@ jobs:
push-neon-image-dev:
needs: [ meta, generate-image-maps, neon-image ]
if: ${{ contains(fromJSON('["push-main", "pr", "storage-release", "storage-rc-pr", "proxy-release", "proxy-rc-pr"]'), needs.meta.outputs.run-kind) }}
if: ${{ !failure() && !cancelled() && contains(fromJSON('["push-main", "pr", "storage-release", "storage-rc-pr", "proxy-release", "proxy-rc-pr"]'), needs.meta.outputs.run-kind) }}
uses: ./.github/workflows/_push-to-container-registry.yml
permissions:
id-token: write # Required for aws/azure login
@@ -913,7 +924,7 @@ jobs:
push-compute-image-dev:
needs: [ meta, generate-image-maps, vm-compute-node-image ]
if: ${{ contains(fromJSON('["push-main", "pr", "compute-release", "compute-rc-pr"]'), needs.meta.outputs.run-kind) }}
if: ${{ !failure() && !cancelled() && contains(fromJSON('["push-main", "pr", "compute-release", "compute-rc-pr"]'), needs.meta.outputs.run-kind) }}
uses: ./.github/workflows/_push-to-container-registry.yml
permissions:
id-token: write # Required for aws/azure login
@@ -967,16 +978,55 @@ jobs:
acr-registry-name: ${{ vars.AZURE_PROD_REGISTRY_NAME }}
secrets: inherit
# This is a bit of a special case so we're not using a generated image map.
add-latest-tag-to-neon-extensions-test-image:
if: github.ref_name == 'main'
push-neon-test-extensions-image-ghcr:
if: ${{ contains(fromJSON('["push-main", "pr", "compute-rc-pr"]'), needs.meta.outputs.run-kind) }}
needs: [ meta, compute-node-image ]
uses: ./.github/workflows/_push-to-container-registry.yml
with:
image-map: |
{
"docker.io/neondatabase/neon-test-extensions-v16:${{ needs.meta.outputs.build-tag }}": ["docker.io/neondatabase/neon-test-extensions-v16:latest"],
"docker.io/neondatabase/neon-test-extensions-v17:${{ needs.meta.outputs.build-tag }}": ["docker.io/neondatabase/neon-test-extensions-v17:latest"]
"docker.io/neondatabase/neon-test-extensions-v16:${{ needs.meta.outputs.build-tag }}": [
"ghcr.io/neondatabase/neon-test-extensions-v16:${{ needs.meta.outputs.build-tag }}"
],
"docker.io/neondatabase/neon-test-extensions-v17:${{ needs.meta.outputs.build-tag }}": [
"ghcr.io/neondatabase/neon-test-extensions-v17:${{ needs.meta.outputs.build-tag }}"
]
}
secrets: inherit
add-latest-tag-to-neon-test-extensions-image:
if: ${{ needs.meta.outputs.run-kind == 'push-main' }}
needs: [ meta, compute-node-image ]
uses: ./.github/workflows/_push-to-container-registry.yml
with:
image-map: |
{
"docker.io/neondatabase/neon-test-extensions-v16:${{ needs.meta.outputs.build-tag }}": [
"docker.io/neondatabase/neon-test-extensions-v16:latest",
"ghcr.io/neondatabase/neon-test-extensions-v16:latest"
],
"docker.io/neondatabase/neon-test-extensions-v17:${{ needs.meta.outputs.build-tag }}": [
"docker.io/neondatabase/neon-test-extensions-v17:latest",
"ghcr.io/neondatabase/neon-test-extensions-v17:latest"
]
}
secrets: inherit
add-release-tag-to-neon-test-extensions-image:
if: ${{ needs.meta.outputs.run-kind == 'compute-release' }}
needs: [ meta, compute-node-image ]
uses: ./.github/workflows/_push-to-container-registry.yml
with:
image-map: |
{
"docker.io/neondatabase/neon-test-extensions-v16:${{ needs.meta.outputs.release-pr-run-id }}": [
"docker.io/neondatabase/neon-test-extensions-v16:${{ needs.meta.outputs.build-tag }}",
"ghcr.io/neondatabase/neon-test-extensions-v16:${{ needs.meta.outputs.build-tag }}"
],
"docker.io/neondatabase/neon-test-extensions-v17:${{ needs.meta.outputs.release-pr-run-id }}": [
"docker.io/neondatabase/neon-test-extensions-v17:${{ needs.meta.outputs.build-tag }}",
"ghcr.io/neondatabase/neon-test-extensions-v17:${{ needs.meta.outputs.build-tag }}"
]
}
secrets: inherit
@@ -1175,7 +1225,7 @@ jobs:
-f deployPgSniRouter=false \
-f deployProxy=false \
-f deployStorage=true \
-f deployStorageBroker=true \
-f deployStorageBroker=false \
-f deployStorageController=true \
-f branch=main \
-f dockerTag=${{needs.meta.outputs.build-tag}} \
@@ -1183,7 +1233,7 @@ jobs:
gh workflow --repo neondatabase/infra run deploy-prod.yml --ref main \
-f deployStorage=true \
-f deployStorageBroker=true \
-f deployStorageBroker=false \
-f deployStorageController=true \
-f branch=main \
-f dockerTag=${{needs.meta.outputs.build-tag}}
@@ -1231,11 +1281,11 @@ jobs:
payload: |
channel: ${{ vars.SLACK_STORAGE_CHANNEL_ID }}
text: |
🔴 @oncall-storage: deploy job on release branch had unexpected status "${{ needs.deploy.result }}" <${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|GitHub Run>.
🔴 <!subteam^S06CJ87UMNY|@oncall-storage>: deploy job on release branch had unexpected status "${{ needs.deploy.result }}" <${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|GitHub Run>.
# The job runs on `release` branch and copies compatibility data and Neon artifact from the last *release PR* to the latest directory
promote-compatibility-data:
needs: [ deploy ]
needs: [ meta, deploy ]
permissions:
id-token: write # aws-actions/configure-aws-credentials
statuses: write
@@ -1245,37 +1295,6 @@ jobs:
runs-on: ubuntu-22.04
steps:
- name: Fetch GITHUB_RUN_ID and COMMIT_SHA for the last merged release PR
id: fetch-last-release-pr-info
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
branch_name_and_pr_number=$(gh pr list \
--repo "${GITHUB_REPOSITORY}" \
--base release \
--state merged \
--limit 10 \
--json mergeCommit,headRefName,number \
--jq ".[] | select(.mergeCommit.oid==\"${GITHUB_SHA}\") | { branch_name: .headRefName, pr_number: .number }")
branch_name=$(echo "${branch_name_and_pr_number}" | jq -r '.branch_name')
pr_number=$(echo "${branch_name_and_pr_number}" | jq -r '.pr_number')
run_id=$(gh run list \
--repo "${GITHUB_REPOSITORY}" \
--workflow build_and_test.yml \
--branch "${branch_name}" \
--json databaseId \
--limit 1 \
--jq '.[].databaseId')
last_commit_sha=$(gh pr view "${pr_number}" \
--repo "${GITHUB_REPOSITORY}" \
--json commits \
--jq '.commits[-1].oid')
echo "run-id=${run_id}" | tee -a ${GITHUB_OUTPUT}
echo "commit-sha=${last_commit_sha}" | tee -a ${GITHUB_OUTPUT}
- uses: aws-actions/configure-aws-credentials@v4
with:
aws-region: eu-central-1
@@ -1286,8 +1305,8 @@ jobs:
env:
BUCKET: neon-github-public-dev
AWS_REGION: eu-central-1
COMMIT_SHA: ${{ steps.fetch-last-release-pr-info.outputs.commit-sha }}
RUN_ID: ${{ steps.fetch-last-release-pr-info.outputs.run-id }}
COMMIT_SHA: ${{ github.sha }}
RUN_ID: ${{ needs.meta.outputs.release-pr-run-id }}
run: |
old_prefix="artifacts/${COMMIT_SHA}/${RUN_ID}"
new_prefix="artifacts/latest"
@@ -1376,5 +1395,5 @@ jobs:
|| needs.files-changed.result == 'skipped'
|| (needs.push-compute-image-dev.result == 'skipped' && contains(fromJSON('["push-main", "pr", "compute-release", "compute-rc-pr"]'), needs.meta.outputs.run-kind))
|| (needs.push-neon-image-dev.result == 'skipped' && contains(fromJSON('["push-main", "pr", "storage-release", "storage-rc-pr", "proxy-release", "proxy-rc-pr"]'), needs.meta.outputs.run-kind))
|| needs.test-images.result == 'skipped'
|| (needs.test-images.result == 'skipped' && contains(fromJSON('["push-main", "pr", "storage-rc-pr", "proxy-rc-pr", "compute-rc-pr"]'), needs.meta.outputs.run-kind))
|| (needs.trigger-custom-extensions-build-and-wait.result == 'skipped' && contains(fromJSON('["push-main", "pr", "compute-release", "compute-rc-pr"]'), needs.meta.outputs.run-kind))

View File

@@ -7,7 +7,7 @@ on:
required: false
type: string
schedule:
- cron: '0 0 * * *'
- cron: '0 10 * * *'
jobs:
cargo-deny:
@@ -50,8 +50,9 @@ jobs:
method: chat.postMessage
token: ${{ secrets.SLACK_BOT_TOKEN }}
payload: |
channel: ${{ vars.SLACK_CICD_CHANNEL_ID }}
channel: ${{ vars.SLACK_ON_CALL_DEVPROD_STREAM }}
text: |
Periodic cargo-deny on ${{ matrix.ref }}: ${{ job.status }}
<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|GitHub Run>
Pinging @oncall-devprod.
Fixing the problem should be fairly straight forward from the logs. If not, <#${{ vars.SLACK_RUST_CHANNEL_ID }}> is there to help.
Pinging <!subteam^S0838JPSH32|@oncall-devprod>.

36
.github/workflows/fast-forward.yml vendored Normal file
View File

@@ -0,0 +1,36 @@
name: Fast forward merge
on:
pull_request:
types: [labeled]
branches:
- release
- release-proxy
- release-compute
jobs:
fast-forward:
if: ${{ github.event.label.name == 'fast-forward' }}
runs-on: ubuntu-22.04
steps:
- name: Remove fast-forward label to PR
env:
GH_TOKEN: ${{ secrets.CI_ACCESS_TOKEN }}
run: |
gh pr edit ${{ github.event.pull_request.number }} --repo "${GITHUB_REPOSITORY}" --remove-label "fast-forward"
- name: Fast forwarding
uses: sequoia-pgp/fast-forward@ea7628bedcb0b0b96e94383ada458d812fca4979
# See https://docs.github.com/en/graphql/reference/enums#mergestatestatus
if: ${{ github.event.pull_request.mergeable_state == 'clean' }}
with:
merge: true
comment: on-error
github_token: ${{ secrets.CI_ACCESS_TOKEN }}
- name: Comment if mergeable_state is not clean
if: ${{ github.event.pull_request.mergeable_state != 'clean' }}
run: |
gh pr comment ${{ github.event.pull_request.number }} \
--repo "${GITHUB_REPOSITORY}" \
--body "Not trying to forward pull-request, because \`mergeable_state\` is \`${{ github.event.pull_request.mergeable_state }}\`, not \`clean\`."

24
.github/workflows/lint-release-pr.yml vendored Normal file
View File

@@ -0,0 +1,24 @@
name: Lint Release PR
on:
pull_request:
branches:
- release
- release-proxy
- release-compute
jobs:
lint-release-pr:
runs-on: ubuntu-22.04
steps:
- name: Checkout PR branch
uses: actions/checkout@v4
with:
fetch-depth: 0 # Fetch full history for git operations
ref: ${{ github.event.pull_request.head.ref }}
- name: Run lint script
env:
RELEASE_BRANCH: ${{ github.base_ref }}
run: |
./.github/scripts/lint-release-pr.sh

View File

@@ -8,8 +8,6 @@ on:
- .github/workflows/build-build-tools-image.yml
- .github/workflows/pre-merge-checks.yml
merge_group:
branches:
- main
defaults:
run:
@@ -19,11 +17,13 @@ defaults:
permissions: {}
jobs:
get-changed-files:
meta:
runs-on: ubuntu-22.04
outputs:
python-changed: ${{ steps.python-src.outputs.any_changed }}
rust-changed: ${{ steps.rust-src.outputs.any_changed }}
branch: ${{ steps.group-metadata.outputs.branch }}
pr-number: ${{ steps.group-metadata.outputs.pr-number }}
steps:
- uses: actions/checkout@v4
@@ -58,12 +58,20 @@ jobs:
echo "${PYTHON_CHANGED_FILES}"
echo "${RUST_CHANGED_FILES}"
- name: Merge group metadata
if: ${{ github.event_name == 'merge_group' }}
id: group-metadata
env:
MERGE_QUEUE_REF: ${{ github.event.merge_group.head_ref }}
run: |
echo $MERGE_QUEUE_REF | jq -Rr 'capture("refs/heads/gh-readonly-queue/(?<branch>.*)/pr-(?<pr_number>[0-9]+)-[0-9a-f]{40}") | ["branch=" + .branch, "pr-number=" + .pr_number] | .[]' | tee -a "${GITHUB_OUTPUT}"
build-build-tools-image:
if: |
false
|| needs.get-changed-files.outputs.python-changed == 'true'
|| needs.get-changed-files.outputs.rust-changed == 'true'
needs: [ get-changed-files ]
|| needs.meta.outputs.python-changed == 'true'
|| needs.meta.outputs.rust-changed == 'true'
needs: [ meta ]
uses: ./.github/workflows/build-build-tools-image.yml
with:
# Build only one combination to save time
@@ -72,8 +80,8 @@ jobs:
secrets: inherit
check-codestyle-python:
if: needs.get-changed-files.outputs.python-changed == 'true'
needs: [ get-changed-files, build-build-tools-image ]
if: needs.meta.outputs.python-changed == 'true'
needs: [ meta, build-build-tools-image ]
uses: ./.github/workflows/_check-codestyle-python.yml
with:
# `-bookworm-x64` suffix should match the combination in `build-build-tools-image`
@@ -81,8 +89,8 @@ jobs:
secrets: inherit
check-codestyle-rust:
if: needs.get-changed-files.outputs.rust-changed == 'true'
needs: [ get-changed-files, build-build-tools-image ]
if: needs.meta.outputs.rust-changed == 'true'
needs: [ meta, build-build-tools-image ]
uses: ./.github/workflows/_check-codestyle-rust.yml
with:
# `-bookworm-x64` suffix should match the combination in `build-build-tools-image`
@@ -101,7 +109,7 @@ jobs:
statuses: write # for `github.repos.createCommitStatus(...)`
contents: write
needs:
- get-changed-files
- meta
- check-codestyle-python
- check-codestyle-rust
runs-on: ubuntu-22.04
@@ -129,7 +137,20 @@ jobs:
run: exit 1
if: |
false
|| (needs.check-codestyle-python.result == 'skipped' && needs.get-changed-files.outputs.python-changed == 'true')
|| (needs.check-codestyle-rust.result == 'skipped' && needs.get-changed-files.outputs.rust-changed == 'true')
|| (github.event_name == 'merge_group' && needs.meta.outputs.branch != 'main')
|| (needs.check-codestyle-python.result == 'skipped' && needs.meta.outputs.python-changed == 'true')
|| (needs.check-codestyle-rust.result == 'skipped' && needs.meta.outputs.rust-changed == 'true')
|| contains(needs.*.result, 'failure')
|| contains(needs.*.result, 'cancelled')
- name: Add fast-forward label to PR to trigger fast-forward merge
if: >-
${{
always()
&& github.event_name == 'merge_group'
&& contains(fromJson('["release", "release-proxy", "release-compute"]'), github.base_ref)
}}
env:
GH_TOKEN: ${{ secrets.CI_ACCESS_TOKEN }}
run: >-
gh pr edit ${{ needs.meta.outputs.pr-number }} --repo "${GITHUB_REPOSITORY}" --add-label "fast-forward"

View File

@@ -38,7 +38,7 @@ jobs:
uses: ./.github/workflows/_create-release-pr.yml
with:
component-name: 'Storage'
release-branch: 'release'
source-branch: ${{ github.ref_name }}
secrets:
ci-access-token: ${{ secrets.CI_ACCESS_TOKEN }}
@@ -51,7 +51,7 @@ jobs:
uses: ./.github/workflows/_create-release-pr.yml
with:
component-name: 'Proxy'
release-branch: 'release-proxy'
source-branch: ${{ github.ref_name }}
secrets:
ci-access-token: ${{ secrets.CI_ACCESS_TOKEN }}
@@ -64,6 +64,6 @@ jobs:
uses: ./.github/workflows/_create-release-pr.yml
with:
component-name: 'Compute'
release-branch: 'release-compute'
source-branch: ${{ github.ref_name }}
secrets:
ci-access-token: ${{ secrets.CI_ACCESS_TOKEN }}

View File

@@ -3,7 +3,6 @@
# DevProd & PerfCorr
/.github/ @neondatabase/developer-productivity @neondatabase/performance-correctness
/test_runner/ @neondatabase/performance-correctness
# Compute
/pgxn/ @neondatabase/compute

383
Cargo.lock generated
View File

@@ -191,7 +191,7 @@ checksum = "965c2d33e53cb6b267e148a4cb0760bc01f4904c1cd4bb4002a085bb016d1490"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
"synstructure",
]
@@ -203,7 +203,7 @@ checksum = "7b18050c2cd6fe86c3a76584ef5e0baf286d038cda203eb6223df2cc413565f7"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -272,7 +272,7 @@ checksum = "16e62a023e7c117e27523144c5d2459f4397fcc3cab0085af8e2224f643a0193"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -283,7 +283,7 @@ checksum = "b9ccdd8f2a161be9bd5c023df56f1b2a0bd1d83872ae53b71a84a12c9bf6e842"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -1021,7 +1021,7 @@ dependencies = [
"regex",
"rustc-hash 2.1.1",
"shlex",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -1127,9 +1127,9 @@ checksum = "37b2a672a2cb129a2e41c10b1224bb368f9f37a2b16b612598138befd7b37eb5"
[[package]]
name = "cc"
version = "1.1.30"
version = "1.2.16"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b16803a61b81d9eabb7eae2588776c4c1e584b738ede45fdbb4c972cec1e9945"
checksum = "be714c154be609ec7f5dad223a33bf1482fff90472de28f7362806e6d4832b8c"
dependencies = [
"jobserver",
"libc",
@@ -1248,7 +1248,7 @@ dependencies = [
"heck",
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -1309,6 +1309,7 @@ version = "0.1.0"
dependencies = [
"anyhow",
"chrono",
"indexmap 2.0.1",
"jsonwebtoken",
"regex",
"remote_storage",
@@ -1339,6 +1340,7 @@ dependencies = [
"flate2",
"futures",
"http 1.1.0",
"indexmap 2.0.1",
"jsonwebtoken",
"metrics",
"nix 0.27.1",
@@ -1347,17 +1349,20 @@ dependencies = [
"once_cell",
"opentelemetry",
"opentelemetry_sdk",
"p256 0.13.2",
"postgres",
"postgres_initdb",
"regex",
"remote_storage",
"reqwest",
"ring",
"rlimit",
"rust-ini",
"serde",
"serde_json",
"serde_with",
"signal-hook",
"spki 0.7.3",
"tar",
"thiserror 1.0.69",
"tokio",
@@ -1377,6 +1382,7 @@ dependencies = [
"vm_monitor",
"walkdir",
"workspace_hack",
"x509-cert",
"zstd",
]
@@ -1703,7 +1709,7 @@ checksum = "f46882e17999c6cc590af592290432be3bce0428cb0d5f8b6715e4dc7b383eb3"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -1727,7 +1733,7 @@ dependencies = [
"proc-macro2",
"quote",
"strsim 0.10.0",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -1738,7 +1744,7 @@ checksum = "29a358ff9f12ec09c3e61fef9b5a9902623a695a46a917b07f269bff1445611a"
dependencies = [
"darling_core",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -1801,6 +1807,8 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fffa369a668c8af7dbf8b5e56c9f744fbd399949ed171606040001947de40b1c"
dependencies = [
"const-oid",
"der_derive",
"flagset",
"pem-rfc7468",
"zeroize",
]
@@ -1819,6 +1827,17 @@ dependencies = [
"rusticata-macros",
]
[[package]]
name = "der_derive"
version = "0.7.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8034092389675178f570469e6c3b0465d3d30b4505c294a6550db47f3c17ad18"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "deranged"
version = "0.3.11"
@@ -1888,7 +1907,7 @@ dependencies = [
"dsl_auto_type",
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -1908,7 +1927,7 @@ version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "209c735641a413bc68c4923a9d6ad4bcb3ca306b794edaa7eb0b3228a99ffb25"
dependencies = [
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -1937,7 +1956,7 @@ checksum = "487585f4d0c6655fe74905e2504d8ad6908e4db67f744eb140876906c2f3175d"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -1960,7 +1979,7 @@ dependencies = [
"heck",
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -2105,7 +2124,7 @@ dependencies = [
"darling",
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -2115,28 +2134,19 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "186e05a59d4c50738528153b83b0b0194d3a29507dfec16eccd4b342903397d0"
dependencies = [
"log",
]
[[package]]
name = "env_logger"
version = "0.10.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4cd405aab171cb85d6735e5c8d9db038c17d3ca007a4d2c25f337935c3d90580"
dependencies = [
"humantime",
"is-terminal",
"log",
"regex",
"termcolor",
]
[[package]]
name = "env_logger"
version = "0.11.2"
version = "0.11.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6c012a26a7f605efc424dd53697843a72be7dc86ad2d01f7814337794a12231d"
checksum = "c3716d7a920fb4fac5d84e9d4bce8ceb321e9414b4409da61b07b75c1e3d0697"
dependencies = [
"anstream",
"anstyle",
"env_filter",
"jiff",
"log",
]
@@ -2157,7 +2167,7 @@ checksum = "3bf679796c0322556351f287a51b49e48f7c4986e727b5dd78c972d30e2e16cc"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -2291,6 +2301,12 @@ version = "0.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0ce7134b9999ecaf8bcd65542e436736ef32ddca1b3e06094cb6ec5755203b80"
[[package]]
name = "flagset"
version = "0.4.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b3ea1ec5f8307826a5b71094dd91fc04d4ae75d5709b20ad351c7fb4815c86ec"
[[package]]
name = "flate2"
version = "1.0.26"
@@ -2417,7 +2433,7 @@ checksum = "162ee34ebcb7c64a8abebc059ce0fee27c2262618d7b60ed8faf72fef13c3650"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -2530,7 +2546,7 @@ checksum = "53010ccb100b96a67bc32c0175f0ed1426b31b655d562898e57325f81c023ac0"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -2848,6 +2864,7 @@ dependencies = [
"anyhow",
"bytes",
"fail",
"futures",
"hyper 0.14.30",
"itertools 0.10.5",
"jemalloc_pprof",
@@ -2861,6 +2878,7 @@ dependencies = [
"serde_path_to_error",
"thiserror 1.0.69",
"tokio",
"tokio-rustls 0.26.0",
"tokio-stream",
"tokio-util",
"tracing",
@@ -3146,7 +3164,7 @@ checksum = "1ec89e9337638ecdc08744df490b221a7399bf8d164eb52a665454e60e075ad6"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -3239,7 +3257,7 @@ dependencies = [
"crossbeam-channel",
"crossbeam-utils",
"dashmap 6.1.0",
"env_logger 0.11.2",
"env_logger",
"indexmap 2.0.1",
"itoa",
"log",
@@ -3362,6 +3380,30 @@ dependencies = [
"tracing",
]
[[package]]
name = "jiff"
version = "0.2.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d699bc6dfc879fb1bf9bdff0d4c56f0884fc6f0d0eb0fba397a6d00cd9a6b85e"
dependencies = [
"jiff-static",
"log",
"portable-atomic",
"portable-atomic-util",
"serde",
]
[[package]]
name = "jiff-static"
version = "0.2.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8d16e75759ee0aa64c57a56acbf43916987b20c77373cb7e808979e02b93c9f9"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "jobserver"
version = "0.1.32"
@@ -3533,9 +3575,9 @@ dependencies = [
[[package]]
name = "log"
version = "0.4.20"
version = "0.4.26"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b5e6163cb8c49088c2c36f57875e58ccd8c87c7427f7fbd50ea6710b2f3f2e8f"
checksum = "30bde2b3dc3671ae49d8e2e9f044c7c005836e7a023ee57cffa25ab82764bb9e"
[[package]]
name = "lru"
@@ -3616,7 +3658,7 @@ dependencies = [
"heck",
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -4062,7 +4104,7 @@ dependencies = [
"opentelemetry-http",
"opentelemetry-proto",
"opentelemetry_sdk",
"prost",
"prost 0.13.3",
"reqwest",
"thiserror 1.0.69",
]
@@ -4075,7 +4117,7 @@ checksum = "a6e05acbfada5ec79023c85368af14abd0b307c015e9064d249b2a950ef459a6"
dependencies = [
"opentelemetry",
"opentelemetry_sdk",
"prost",
"prost 0.13.3",
"tonic",
]
@@ -4189,6 +4231,7 @@ dependencies = [
"pageserver_api",
"pageserver_client",
"rand 0.8.5",
"reqwest",
"serde",
"serde_json",
"tokio",
@@ -4278,6 +4321,9 @@ dependencies = [
"remote_storage",
"reqwest",
"rpds",
"rustls 0.23.18",
"rustls-pemfile 2.1.1",
"rustls-pki-types",
"scopeguard",
"send-future",
"serde",
@@ -4296,11 +4342,13 @@ dependencies = [
"tokio-epoll-uring",
"tokio-io-timeout",
"tokio-postgres",
"tokio-rustls 0.26.0",
"tokio-stream",
"tokio-tar",
"tokio-util",
"toml_edit",
"tracing",
"tracing-utils",
"url",
"utils",
"uuid",
@@ -4478,7 +4526,7 @@ dependencies = [
"parquet",
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -4580,7 +4628,7 @@ checksum = "f6e859e6e5bd50440ab63c47e3ebabc90f26251f7c73c3d3e837b74a1cc3fa67"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -4676,6 +4724,15 @@ version = "1.10.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "280dc24453071f1b63954171985a0b0d30058d287960968b9b2aca264c8d4ee6"
[[package]]
name = "portable-atomic-util"
version = "0.2.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d8a2f0d8d040d7848a709caf78912debcc3f33ee4b3cac47d73d1e1069e83507"
dependencies = [
"portable-atomic",
]
[[package]]
name = "postgres"
version = "0.19.7"
@@ -4783,7 +4840,7 @@ dependencies = [
"bytes",
"crc32c",
"criterion",
"env_logger 0.10.2",
"env_logger",
"log",
"memoffset 0.9.0",
"once_cell",
@@ -4830,8 +4887,10 @@ dependencies = [
"nix 0.26.4",
"once_cell",
"parking_lot 0.12.1",
"protobuf",
"protobuf-codegen-pure",
"prost 0.12.6",
"prost-build 0.12.6",
"prost-derive 0.12.6",
"sha2",
"smallvec",
"symbolic-demangle",
"tempfile",
@@ -4850,7 +4909,7 @@ dependencies = [
"inferno 0.12.0",
"num",
"paste",
"prost",
"prost 0.13.3",
]
[[package]]
@@ -4880,7 +4939,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8d3928fb5db768cb86f891ff014f0144589297e3c6a1aba6ed7cecfdace270c7"
dependencies = [
"proc-macro2",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -4894,9 +4953,9 @@ dependencies = [
[[package]]
name = "proc-macro2"
version = "1.0.92"
version = "1.0.94"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "37d3544b3f2748c54e147655edb5025752e2303145b5aefb3c3ea2c78b973bb0"
checksum = "a31971752e70b8b2686d7e46ec17fb38dad4051d94024c88df49b667caea9c84"
dependencies = [
"unicode-ident",
]
@@ -4943,6 +5002,16 @@ dependencies = [
"thiserror 1.0.69",
]
[[package]]
name = "prost"
version = "0.12.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "deb1435c188b76130da55f17a466d252ff7b1418b2ad3e037d127b94e3411f29"
dependencies = [
"bytes",
"prost-derive 0.12.6",
]
[[package]]
name = "prost"
version = "0.13.3"
@@ -4950,7 +5019,28 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7b0487d90e047de87f984913713b85c601c05609aad5b0df4b4573fbf69aa13f"
dependencies = [
"bytes",
"prost-derive",
"prost-derive 0.13.3",
]
[[package]]
name = "prost-build"
version = "0.12.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "22505a5c94da8e3b7c2996394d1c933236c4d743e81a410bcca4e6989fc066a4"
dependencies = [
"bytes",
"heck",
"itertools 0.12.1",
"log",
"multimap",
"once_cell",
"petgraph",
"prettyplease",
"prost 0.12.6",
"prost-types 0.12.6",
"regex",
"syn 2.0.100",
"tempfile",
]
[[package]]
@@ -4967,13 +5057,26 @@ dependencies = [
"once_cell",
"petgraph",
"prettyplease",
"prost",
"prost-types",
"prost 0.13.3",
"prost-types 0.13.3",
"regex",
"syn 2.0.90",
"syn 2.0.100",
"tempfile",
]
[[package]]
name = "prost-derive"
version = "0.12.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "81bddcdb20abf9501610992b6759a4c888aef7d1a7247ef75e2404275ac24af1"
dependencies = [
"anyhow",
"itertools 0.12.1",
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "prost-derive"
version = "0.13.3"
@@ -4984,7 +5087,16 @@ dependencies = [
"itertools 0.12.1",
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
name = "prost-types"
version = "0.12.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9091c90b0a32608e984ff2fa4091273cbdd755d54935c51d520887f4a1dbd5b0"
dependencies = [
"prost 0.12.6",
]
[[package]]
@@ -4993,32 +5105,7 @@ version = "0.13.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4759aa0d3a6232fb8dbdb97b61de2c20047c68aca932c7ed76da9d788508d670"
dependencies = [
"prost",
]
[[package]]
name = "protobuf"
version = "2.28.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "106dd99e98437432fed6519dedecfade6a06a73bb7b2a1e019fdd2bee5778d94"
[[package]]
name = "protobuf-codegen"
version = "2.28.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "033460afb75cf755fcfc16dfaed20b86468082a2ea24e05ac35ab4a099a017d6"
dependencies = [
"protobuf",
]
[[package]]
name = "protobuf-codegen-pure"
version = "2.28.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "95a29399fc94bcd3eeaa951c715f7bea69409b2445356b00519740bcd6ddd865"
dependencies = [
"protobuf",
"protobuf-codegen",
"prost 0.13.3",
]
[[package]]
@@ -5047,7 +5134,7 @@ dependencies = [
"consumption_metrics",
"ecdsa 0.16.9",
"ed25519-dalek",
"env_logger 0.10.2",
"env_logger",
"fallible-iterator",
"flate2",
"framed-websockets",
@@ -5184,9 +5271,9 @@ dependencies = [
[[package]]
name = "quote"
version = "1.0.37"
version = "1.0.39"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b5b9d34b8991d19d98081b46eacdd8eb58c6f2b201139f7c5f643cc155a633af"
checksum = "c1f1914ce909e1658d9907913b4b91947430c7d9be598b15a1912935b8c04801"
dependencies = [
"proc-macro2",
]
@@ -5627,16 +5714,16 @@ dependencies = [
[[package]]
name = "ring"
version = "0.17.6"
version = "0.17.13"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "684d5e6e18f669ccebf64a92236bb7db9a34f07be010e3627368182027180866"
checksum = "70ac5d832aa16abd7d1def883a8545280c20a60f523a370aa3a9617c2b8550ee"
dependencies = [
"cc",
"cfg-if",
"getrandom 0.2.11",
"libc",
"spin",
"untrusted",
"windows-sys 0.48.0",
"windows-sys 0.52.0",
]
[[package]]
@@ -5715,7 +5802,7 @@ dependencies = [
"regex",
"relative-path",
"rustc_version",
"syn 2.0.90",
"syn 2.0.100",
"unicode-ident",
]
@@ -5878,9 +5965,9 @@ dependencies = [
[[package]]
name = "rustls-pki-types"
version = "1.10.0"
version = "1.11.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "16f1201b3c9a7ee8039bcadc17b7e605e2945b27eee7631788c1bd2b0643674b"
checksum = "917ce264624a4b4db1c364dcc35bfca9ded014d0a958cd47ad3e960e988ea51c"
[[package]]
name = "rustls-webpki"
@@ -5930,7 +6017,7 @@ dependencies = [
"crc32c",
"criterion",
"desim",
"env_logger 0.10.2",
"env_logger",
"fail",
"futures",
"hex",
@@ -6261,7 +6348,7 @@ checksum = "ad1e866f866923f252f05c889987993144fb74e722403468a4ebd70c3cd756c0"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -6343,7 +6430,7 @@ dependencies = [
"darling",
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -6358,9 +6445,9 @@ dependencies = [
[[package]]
name = "sha1"
version = "0.10.5"
version = "0.10.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f04293dc80c3993519f2d7f6f511707ee7094fe0c6d3406feb330cdb3540eba3"
checksum = "e3bf829a2d51ab4a5ddf1352d8470c140cadc8301b2ae1789db023f01cedd6ba"
dependencies = [
"cfg-if",
"cpufeatures",
@@ -6566,7 +6653,7 @@ dependencies = [
"metrics",
"once_cell",
"parking_lot 0.12.1",
"prost",
"prost 0.13.3",
"rustls 0.23.18",
"tokio",
"tonic",
@@ -6584,6 +6671,7 @@ dependencies = [
"bytes",
"chrono",
"clap",
"clashmap",
"control_plane",
"cron",
"diesel",
@@ -6744,7 +6832,7 @@ dependencies = [
"proc-macro2",
"quote",
"rustversion",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -6795,9 +6883,9 @@ dependencies = [
[[package]]
name = "syn"
version = "2.0.90"
version = "2.0.100"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "919d3b74a5dd0ccd15aeb8f93e7006bd9e14c295087c9896a110f490752bcf31"
checksum = "b09a44accad81e1ba1cd74a32461ba89dee89095ba17b32f5d03683b1b1fc2a0"
dependencies = [
"proc-macro2",
"quote",
@@ -6827,7 +6915,7 @@ checksum = "c8af7666ab7b6390ab78131fb5b0fce11d6b7a6951602017c35fa82800708971"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -6878,15 +6966,6 @@ dependencies = [
"serde_json",
]
[[package]]
name = "termcolor"
version = "1.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "be55cf8942feac5c765c2c993422806843c9a9a45d4d5c407ad6dd2ea95eb9b6"
dependencies = [
"winapi-util",
]
[[package]]
name = "test-context"
version = "0.3.0"
@@ -6905,7 +6984,7 @@ checksum = "78ea17a2dc368aeca6f554343ced1b1e31f76d63683fa8016e5844bd7a5144a1"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -6934,7 +7013,7 @@ checksum = "4fee6c4efc90059e10f81e6d42c60a18f76588c3d74cb83a0b242a2b6c7504c1"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -6945,7 +7024,7 @@ checksum = "26afc1baea8a989337eeb52b6e72a039780ce45c3edfcc9c5b9d112feeb173c2"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -7076,6 +7155,27 @@ version = "0.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1f3ccbac311fea05f86f61904b462b55fb3df8837a366dfc601a0161d0532f20"
[[package]]
name = "tls_codec"
version = "0.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0de2e01245e2bb89d6f05801c564fa27624dbd7b1846859876c7dad82e90bf6b"
dependencies = [
"tls_codec_derive",
"zeroize",
]
[[package]]
name = "tls_codec_derive"
version = "0.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2d2e76690929402faae40aebdda620a2c0e25dd6d3b9afe48867dfd95991f4bd"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "tokio"
version = "1.43.0"
@@ -7128,7 +7228,7 @@ checksum = "6e06d43f1345a3bcd39f6a56dbb7dcab2ba47e68e8ac134855e7e2bdbaf8cab8"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -7338,7 +7438,7 @@ dependencies = [
"hyper-util",
"percent-encoding",
"pin-project",
"prost",
"prost 0.13.3",
"rustls-native-certs 0.8.0",
"rustls-pemfile 2.1.1",
"tokio",
@@ -7358,10 +7458,10 @@ checksum = "9557ce109ea773b399c9b9e5dca39294110b74f1f342cb347a80d1fce8c26a11"
dependencies = [
"prettyplease",
"proc-macro2",
"prost-build",
"prost-types",
"prost-build 0.13.3",
"prost-types 0.13.3",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -7476,7 +7576,7 @@ checksum = "395ae124c09f9e6918a2310af6038fba074bcf474ac352496d5910dd59a2226d"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -7807,6 +7907,7 @@ dependencies = [
"tracing",
"tracing-error",
"tracing-subscriber",
"tracing-utils",
"walkdir",
]
@@ -7870,7 +7971,7 @@ dependencies = [
"anyhow",
"camino-tempfile",
"clap",
"env_logger 0.10.2",
"env_logger",
"log",
"postgres",
"postgres_ffi",
@@ -7892,7 +7993,7 @@ dependencies = [
"pageserver_api",
"postgres_ffi",
"pprof",
"prost",
"prost 0.13.3",
"remote_storage",
"serde",
"serde_json",
@@ -7975,7 +8076,7 @@ dependencies = [
"once_cell",
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
"wasm-bindgen-shared",
]
@@ -8009,7 +8110,7 @@ checksum = "e94f17b526d0a461a191c78ea52bbce64071ed5c04c9ffe424dcb38f74171bb7"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
"wasm-bindgen-backend",
"wasm-bindgen-shared",
]
@@ -8316,6 +8417,7 @@ name = "workspace_hack"
version = "0.1.0"
dependencies = [
"ahash",
"anstream",
"anyhow",
"base64 0.13.1",
"base64 0.21.7",
@@ -8326,12 +8428,17 @@ dependencies = [
"chrono",
"clap",
"clap_builder",
"const-oid",
"crypto-bigint 0.5.5",
"der 0.7.8",
"deranged",
"digest",
"displaydoc",
"ecdsa 0.16.9",
"either",
"elliptic-curve 0.13.8",
"env_filter",
"env_logger",
"fail",
"form_urlencoded",
"futures-channel",
@@ -8364,10 +8471,11 @@ dependencies = [
"num-rational",
"num-traits",
"once_cell",
"p256 0.13.2",
"parquet",
"prettyplease",
"proc-macro2",
"prost",
"prost 0.13.3",
"quote",
"rand 0.8.5",
"regex",
@@ -8376,6 +8484,7 @@ dependencies = [
"reqwest",
"rustls 0.23.18",
"scopeguard",
"sec1 0.7.3",
"serde",
"serde_json",
"sha2",
@@ -8384,7 +8493,7 @@ dependencies = [
"spki 0.7.3",
"stable_deref_trait",
"subtle",
"syn 2.0.90",
"syn 2.0.100",
"sync_wrapper 0.1.2",
"tikv-jemalloc-ctl",
"tikv-jemalloc-sys",
@@ -8421,6 +8530,18 @@ version = "0.5.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1e9df38ee2d2c3c5948ea468a8406ff0db0b29ae1ffde1bcf20ef305bcc95c51"
[[package]]
name = "x509-cert"
version = "0.2.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1301e935010a701ae5f8655edc0ad17c44bad3ac5ce8c39185f75453b720ae94"
dependencies = [
"const-oid",
"der 0.7.8",
"spki 0.7.3",
"tls_codec",
]
[[package]]
name = "x509-certificate"
version = "0.23.1"
@@ -8501,7 +8622,7 @@ checksum = "2380878cad4ac9aac1e2435f3eb4020e8374b5f13c296cb75b4620ff8e229154"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
"synstructure",
]
@@ -8523,7 +8644,7 @@ checksum = "b3c129550b3e6de3fd0ba67ba5c81818f9805e58b8d7fee80a3a59d2c9fc601a"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -8543,15 +8664,15 @@ checksum = "595eed982f7d355beb85837f651fa22e90b3c044842dc7f2c2842c086f295808"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
"synstructure",
]
[[package]]
name = "zeroize"
version = "1.7.0"
version = "1.8.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "525b4ec142c6b68a2d10f01f7bbf6755599ca3f81ea53b8431b7dd348f5fdb2d"
checksum = "ced3678a2879b30306d323f4542626697a464a97c0a07c9aebf7ebca65cd4dde"
dependencies = [
"serde",
"zeroize_derive",
@@ -8565,7 +8686,7 @@ checksum = "ce36e65b0d2999d2aafac989fb249189a141aee1f53c612c1f37d72631959f69"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]
@@ -8587,7 +8708,7 @@ checksum = "6eafa6dfb17584ea3e2bd6e76e0cc15ad7af12b09abdd1ca55961bed9b1063c6"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
"syn 2.0.100",
]
[[package]]

View File

@@ -112,7 +112,7 @@ hyper0 = { package = "hyper", version = "0.14" }
hyper = "1.4"
hyper-util = "0.1"
tokio-tungstenite = "0.21.0"
indexmap = "2"
indexmap = { version = "2", features = ["serde"] }
indoc = "2"
ipnet = "2.10.0"
itertools = "0.10"
@@ -139,7 +139,7 @@ parquet = { version = "53", default-features = false, features = ["zstd"] }
parquet_derive = "53"
pbkdf2 = { version = "0.12.1", features = ["simple", "std"] }
pin-project-lite = "0.2"
pprof = { version = "0.14", features = ["criterion", "flamegraph", "frame-pointer", "protobuf", "protobuf-codec"] }
pprof = { version = "0.14", features = ["criterion", "flamegraph", "frame-pointer", "prost-codec"] }
procfs = "0.16"
prometheus = {version = "0.13", default-features=false, features = ["process"]} # removes protobuf dependency
prost = "0.13"
@@ -155,6 +155,7 @@ rpds = "0.13"
rustc-hash = "1.1.0"
rustls = { version = "0.23.16", default-features = false }
rustls-pemfile = "2"
rustls-pki-types = "1.11"
scopeguard = "1.1"
sysinfo = "0.29.2"
sd-notify = "0.4.1"
@@ -218,7 +219,7 @@ zerocopy = { version = "0.7", features = ["derive"] }
json-structural-diff = { version = "0.2.0" }
## TODO replace this with tracing
env_logger = "0.10"
env_logger = "0.11"
log = "0.4"
## Libraries from neondatabase/ git forks, ideally with changes to be upstreamed

View File

@@ -1735,6 +1735,8 @@ RUN set -e \
libevent-dev \
libtool \
pkg-config \
libcurl4-openssl-dev \
libssl-dev \
&& apt clean && rm -rf /var/lib/apt/lists/*
# Use `dist_man_MANS=` to skip manpage generation (which requires python3/pandoc)
@@ -1743,7 +1745,7 @@ RUN set -e \
&& git clone --recurse-submodules --depth 1 --branch ${PGBOUNCER_TAG} https://github.com/pgbouncer/pgbouncer.git pgbouncer \
&& cd pgbouncer \
&& ./autogen.sh \
&& ./configure --prefix=/usr/local/pgbouncer --without-openssl \
&& ./configure --prefix=/usr/local/pgbouncer \
&& make -j $(nproc) dist_man_MANS= \
&& make install dist_man_MANS=

View File

@@ -26,6 +26,7 @@ fail.workspace = true
flate2.workspace = true
futures.workspace = true
http.workspace = true
indexmap.workspace = true
jsonwebtoken.workspace = true
metrics.workspace = true
nix.workspace = true
@@ -34,16 +35,19 @@ num_cpus.workspace = true
once_cell.workspace = true
opentelemetry.workspace = true
opentelemetry_sdk.workspace = true
p256 = { version = "0.13", features = ["pem"] }
postgres.workspace = true
regex.workspace = true
reqwest = { workspace = true, features = ["json"] }
ring = "0.17"
serde.workspace = true
serde_with.workspace = true
serde_json.workspace = true
signal-hook.workspace = true
spki = { version = "0.7.3", features = ["std"] }
tar.workspace = true
tower.workspace = true
tower-http.workspace = true
reqwest = { workspace = true, features = ["json"] }
tokio = { workspace = true, features = ["rt", "rt-multi-thread"] }
tokio-postgres.workspace = true
tokio-util.workspace = true
@@ -57,6 +61,7 @@ thiserror.workspace = true
url.workspace = true
uuid.workspace = true
walkdir.workspace = true
x509-cert = { version = "0.2.5" }
postgres_initdb.workspace = true
compute_api.workspace = true

View File

@@ -41,6 +41,7 @@ use crate::rsyslog::configure_audit_rsyslog;
use crate::spec::*;
use crate::swap::resize_swap;
use crate::sync_sk::{check_if_synced, ping_safekeeper};
use crate::tls::watch_cert_for_changes;
use crate::{config, extension_server, local_proxy};
pub static SYNC_SAFEKEEPERS_PID: AtomicU32 = AtomicU32::new(0);
@@ -112,6 +113,7 @@ pub struct ComputeNode {
// key: ext_archive_name, value: started download time, download_completed?
pub ext_download_progress: RwLock<HashMap<String, (DateTime<Utc>, bool)>>,
pub compute_ctl_config: ComputeCtlConfig,
}
// store some metrics about download size that might impact startup time
@@ -135,8 +137,6 @@ pub struct ComputeState {
/// passed by the control plane with a /configure HTTP request.
pub pspec: Option<ParsedSpec>,
pub compute_ctl_config: ComputeCtlConfig,
/// If the spec is passed by a /configure request, 'startup_span' is the
/// /configure request's tracing span. The main thread enters it when it
/// processes the compute startup, so that the compute startup is considered
@@ -160,7 +160,6 @@ impl ComputeState {
last_active: None,
error: None,
pspec: None,
compute_ctl_config: ComputeCtlConfig::default(),
startup_span: None,
metrics: ComputeMetrics::default(),
}
@@ -314,7 +313,6 @@ impl ComputeNode {
let pspec = ParsedSpec::try_from(cli_spec).map_err(|msg| anyhow::anyhow!(msg))?;
new_state.pspec = Some(pspec);
}
new_state.compute_ctl_config = compute_ctl_config;
Ok(ComputeNode {
params,
@@ -323,6 +321,7 @@ impl ComputeNode {
state: Mutex::new(new_state),
state_changed: Condvar::new(),
ext_download_progress: RwLock::new(HashMap::new()),
compute_ctl_config,
})
}
@@ -345,7 +344,7 @@ impl ComputeNode {
// requests while configuration is still in progress.
crate::http::server::Server::External {
port: this.params.external_http_port,
jwks: this.state.lock().unwrap().compute_ctl_config.jwks.clone(),
config: this.compute_ctl_config.clone(),
compute_id: this.params.compute_id.clone(),
}
.launch(&this);
@@ -524,6 +523,16 @@ impl ComputeNode {
// Collect all the tasks that must finish here
let mut pre_tasks = tokio::task::JoinSet::new();
// Make sure TLS certificates are properly loaded and in the right place.
if self.compute_ctl_config.tls.is_some() {
let this = self.clone();
pre_tasks.spawn(async move {
this.watch_cert_for_changes().await;
Ok::<(), anyhow::Error>(())
});
}
// If there are any remote extensions in shared_preload_libraries, start downloading them
if pspec.spec.remote_extensions.is_some() {
let (this, spec) = (self.clone(), pspec.spec.clone());
@@ -579,11 +588,13 @@ impl ComputeNode {
if let Some(pgbouncer_settings) = &pspec.spec.pgbouncer_settings {
info!("tuning pgbouncer");
let pgbouncer_settings = pgbouncer_settings.clone();
let tls_config = self.compute_ctl_config.tls.clone();
// Spawn a background task to do the tuning,
// so that we don't block the main thread that starts Postgres.
let pgbouncer_settings = pgbouncer_settings.clone();
let _handle = tokio::spawn(async move {
let res = tune_pgbouncer(pgbouncer_settings).await;
let res = tune_pgbouncer(pgbouncer_settings, tls_config).await;
if let Err(err) = res {
error!("error while tuning pgbouncer: {err:?}");
// Continue with the startup anyway
@@ -645,9 +656,9 @@ impl ComputeNode {
if pspec.spec.mode == ComputeMode::Primary {
self.configure_as_primary(&compute_state)?;
let conf = self.get_conn_conf(None);
tokio::task::spawn_blocking(|| {
let res = get_installed_extensions(conf);
let conf = self.get_tokio_conn_conf(None);
tokio::task::spawn(async {
let res = get_installed_extensions(conf).await;
match res {
Ok(extensions) => {
info!(
@@ -1105,9 +1116,10 @@ impl ComputeNode {
// Remove/create an empty pgdata directory and put configuration there.
self.create_pgdata()?;
config::write_postgres_conf(
&pgdata_path.join("postgresql.conf"),
pgdata_path,
&pspec.spec,
self.params.internal_http_port,
&self.compute_ctl_config.tls,
)?;
// Syncing safekeepers is only safe with primary nodes: if a primary
@@ -1489,11 +1501,13 @@ impl ComputeNode {
if let Some(ref pgbouncer_settings) = spec.pgbouncer_settings {
info!("tuning pgbouncer");
let pgbouncer_settings = pgbouncer_settings.clone();
let tls_config = self.compute_ctl_config.tls.clone();
// Spawn a background task to do the tuning,
// so that we don't block the main thread that starts Postgres.
let pgbouncer_settings = pgbouncer_settings.clone();
tokio::spawn(async move {
let res = tune_pgbouncer(pgbouncer_settings).await;
let res = tune_pgbouncer(pgbouncer_settings, tls_config).await;
if let Err(err) = res {
error!("error while tuning pgbouncer: {err:?}");
}
@@ -1505,7 +1519,8 @@ impl ComputeNode {
// Spawn a background task to do the configuration,
// so that we don't block the main thread that starts Postgres.
let local_proxy = local_proxy.clone();
let mut local_proxy = local_proxy.clone();
local_proxy.tls = self.compute_ctl_config.tls.clone();
tokio::spawn(async move {
if let Err(err) = local_proxy::configure(&local_proxy) {
error!("error while configuring local_proxy: {err:?}");
@@ -1515,8 +1530,12 @@ impl ComputeNode {
// Write new config
let pgdata_path = Path::new(&self.params.pgdata);
let postgresql_conf_path = pgdata_path.join("postgresql.conf");
config::write_postgres_conf(&postgresql_conf_path, &spec, self.params.internal_http_port)?;
config::write_postgres_conf(
pgdata_path,
&spec,
self.params.internal_http_port,
&self.compute_ctl_config.tls,
)?;
if !spec.skip_pg_catalog_updates {
let max_concurrent_connections = spec.reconfigure_concurrency;
@@ -1587,6 +1606,56 @@ impl ComputeNode {
Ok(())
}
pub async fn watch_cert_for_changes(self: Arc<Self>) {
// update status on cert renewal
if let Some(tls_config) = &self.compute_ctl_config.tls {
let tls_config = tls_config.clone();
// wait until the cert exists.
let mut cert_watch = watch_cert_for_changes(tls_config.cert_path.clone()).await;
tokio::task::spawn_blocking(move || {
let handle = tokio::runtime::Handle::current();
'cert_update: loop {
// let postgres/pgbouncer/local_proxy know the new cert/key exists.
// we need to wait until it's configurable first.
let mut state = self.state.lock().unwrap();
'status_update: loop {
match state.status {
// let's update the state to config pending
ComputeStatus::ConfigurationPending | ComputeStatus::Running => {
state.set_status(
ComputeStatus::ConfigurationPending,
&self.state_changed,
);
break 'status_update;
}
// exit loop
ComputeStatus::Failed
| ComputeStatus::TerminationPending
| ComputeStatus::Terminated => break 'cert_update,
// wait
ComputeStatus::Init
| ComputeStatus::Configuration
| ComputeStatus::Empty => {
state = self.state_changed.wait(state).unwrap();
}
}
}
drop(state);
// wait for a new certificate update
if handle.block_on(cert_watch.changed()).is_err() {
break;
}
}
});
}
}
/// Update the `last_active` in the shared state, but ensure that it's a more recent one.
pub fn update_last_active(&self, last_active: Option<DateTime<Utc>>) {
let mut state = self.state.lock().unwrap();

View File

@@ -6,11 +6,13 @@ use std::io::Write;
use std::io::prelude::*;
use std::path::Path;
use compute_api::responses::TlsConfig;
use compute_api::spec::{ComputeAudit, ComputeMode, ComputeSpec, GenericOption};
use crate::pg_helpers::{
GenericOptionExt, GenericOptionsSearch, PgOptionsSerialize, escape_conf_value,
};
use crate::tls::{self, SERVER_CRT, SERVER_KEY};
/// Check that `line` is inside a text file and put it there if it is not.
/// Create file if it doesn't exist.
@@ -38,10 +40,12 @@ pub fn line_in_file(path: &Path, line: &str) -> Result<bool> {
/// Create or completely rewrite configuration file specified by `path`
pub fn write_postgres_conf(
path: &Path,
pgdata_path: &Path,
spec: &ComputeSpec,
extension_server_port: u16,
tls_config: &Option<TlsConfig>,
) -> Result<()> {
let path = pgdata_path.join("postgresql.conf");
// File::create() destroys the file content if it exists.
let mut file = File::create(path)?;
@@ -86,6 +90,20 @@ pub fn write_postgres_conf(
)?;
}
// tls
if let Some(tls_config) = tls_config {
writeln!(file, "ssl = on")?;
// postgres requires the keyfile to be in a secure file,
// currently too complicated to ensure that at the VM level,
// so we just copy them to another file instead. :shrug:
tls::update_key_path_blocking(pgdata_path, tls_config);
// these are the default, but good to be explicit.
writeln!(file, "ssl_cert_file = '{}'", SERVER_CRT)?;
writeln!(file, "ssl_key_file = '{}'", SERVER_KEY)?;
}
// Locales
if cfg!(target_os = "macos") {
writeln!(file, "lc_messages='C'")?;

View File

@@ -202,8 +202,24 @@ pub async fn download_extension(
// move contents of the libdir / sharedir in unzipped archive to the correct local paths
for paths in [sharedir_paths, libdir_paths] {
let (zip_dir, real_dir) = paths;
let dir = match std::fs::read_dir(&zip_dir) {
Ok(dir) => dir,
Err(e) => match e.kind() {
// In the event of a SQL-only extension, there would be nothing
// to move from the lib/ directory, so note that in the log and
// move on.
std::io::ErrorKind::NotFound => {
info!("nothing to move from {}", zip_dir);
continue;
}
_ => return Err(anyhow::anyhow!(e)),
},
};
info!("mv {zip_dir:?}/* {real_dir:?}");
for file in std::fs::read_dir(zip_dir)? {
for file in dir {
let old_file = file?.path();
let new_file =
Path::new(&real_dir).join(old_file.file_name().context("error parsing file")?);

View File

@@ -1 +1,2 @@
pub(in crate::http) mod authorize;
pub(in crate::http) mod request_id;

View File

@@ -0,0 +1,16 @@
use axum::{extract::Request, middleware::Next, response::Response};
use uuid::Uuid;
use crate::http::headers::X_REQUEST_ID;
/// This middleware function allows compute_ctl to generate its own request ID
/// if one isn't supplied. The control plane will always send one as a UUID. The
/// neon Postgres extension on the other hand does not send one.
pub async fn maybe_add_request_id_header(mut request: Request, next: Next) -> Response {
let headers = request.headers_mut();
if !headers.contains_key(X_REQUEST_ID) {
headers.append(X_REQUEST_ID, Uuid::new_v4().to_string().parse().unwrap());
}
next.run(request).await
}

View File

@@ -5,20 +5,19 @@ use std::time::Duration;
use anyhow::Result;
use axum::Router;
use axum::extract::Request;
use axum::middleware::{self, Next};
use axum::response::{IntoResponse, Response};
use axum::middleware::{self};
use axum::response::IntoResponse;
use axum::routing::{get, post};
use compute_api::responses::ComputeCtlConfig;
use http::StatusCode;
use jsonwebtoken::jwk::JwkSet;
use tokio::net::TcpListener;
use tower::ServiceBuilder;
use tower_http::{
auth::AsyncRequireAuthorizationLayer, request_id::PropagateRequestIdLayer, trace::TraceLayer,
};
use tracing::{Span, error, info};
use uuid::Uuid;
use super::middleware::request_id::maybe_add_request_id_header;
use super::{
headers::X_REQUEST_ID,
middleware::authorize::Authorize,
@@ -42,7 +41,7 @@ pub enum Server {
},
External {
port: u16,
jwks: JwkSet,
config: ComputeCtlConfig,
compute_id: String,
},
}
@@ -80,7 +79,7 @@ impl From<&Server> for Router<Arc<ComputeNode>> {
router
}
Server::External {
jwks, compute_id, ..
config, compute_id, ..
} => {
let unauthenticated_router =
Router::<Arc<ComputeNode>>::new().route("/metrics", get(metrics::get_metrics));
@@ -96,7 +95,7 @@ impl From<&Server> for Router<Arc<ComputeNode>> {
.route("/terminate", post(terminate::terminate))
.layer(AsyncRequireAuthorizationLayer::new(Authorize::new(
compute_id.clone(),
jwks.clone(),
config.jwks.clone(),
)));
router
@@ -219,15 +218,3 @@ impl Server {
tokio::spawn(self.serve(state));
}
}
/// This middleware function allows compute_ctl to generate its own request ID
/// if one isn't supplied. The control plane will always send one as a UUID. The
/// neon Postgres extension on the other hand does not send one.
async fn maybe_add_request_id_header(mut request: Request, next: Next) -> Response {
let headers = request.headers_mut();
if headers.get(X_REQUEST_ID).is_none() {
headers.append(X_REQUEST_ID, Uuid::new_v4().to_string().parse().unwrap());
}
next.run(request).await
}

View File

@@ -2,7 +2,7 @@ use std::collections::HashMap;
use anyhow::Result;
use compute_api::responses::{InstalledExtension, InstalledExtensions};
use postgres::{Client, NoTls};
use tokio_postgres::{Client, Config, NoTls};
use crate::metrics::INSTALLED_EXTENSIONS;
@@ -10,7 +10,7 @@ use crate::metrics::INSTALLED_EXTENSIONS;
/// and to make database listing query here more explicit.
///
/// Limit the number of databases to 500 to avoid excessive load.
fn list_dbs(client: &mut Client) -> Result<Vec<String>> {
async fn list_dbs(client: &mut Client) -> Result<Vec<String>> {
// `pg_database.datconnlimit = -2` means that the database is in the
// invalid state
let databases = client
@@ -20,7 +20,8 @@ fn list_dbs(client: &mut Client) -> Result<Vec<String>> {
AND datconnlimit <> - 2
LIMIT 500",
&[],
)?
)
.await?
.iter()
.map(|row| {
let db: String = row.get("datname");
@@ -36,20 +37,36 @@ fn list_dbs(client: &mut Client) -> Result<Vec<String>> {
/// Same extension can be installed in multiple databases with different versions,
/// so we report a separate metric (number of databases where it is installed)
/// for each extension version.
pub fn get_installed_extensions(mut conf: postgres::config::Config) -> Result<InstalledExtensions> {
pub async fn get_installed_extensions(mut conf: Config) -> Result<InstalledExtensions> {
conf.application_name("compute_ctl:get_installed_extensions");
let mut client = conf.connect(NoTls)?;
let databases: Vec<String> = list_dbs(&mut client)?;
let databases: Vec<String> = {
let (mut client, connection) = conf.connect(NoTls).await?;
tokio::spawn(async move {
if let Err(e) = connection.await {
eprintln!("connection error: {}", e);
}
});
list_dbs(&mut client).await?
};
let mut extensions_map: HashMap<(String, String, String), InstalledExtension> = HashMap::new();
for db in databases.iter() {
conf.dbname(db);
let mut db_client = conf.connect(NoTls)?;
let extensions: Vec<(String, String, i32)> = db_client
let (client, connection) = conf.connect(NoTls).await?;
tokio::spawn(async move {
if let Err(e) = connection.await {
eprintln!("connection error: {}", e);
}
});
let extensions: Vec<(String, String, i32)> = client
.query(
"SELECT extname, extversion, extowner::integer FROM pg_catalog.pg_extension",
&[],
)?
)
.await?
.iter()
.map(|row| {
(

View File

@@ -26,3 +26,4 @@ pub mod spec;
mod spec_apply;
pub mod swap;
pub mod sync_sk;
pub mod tls;

View File

@@ -24,7 +24,8 @@ pub async fn init_tracing_and_logging(default_log_level: &str) -> anyhow::Result
.with_writer(std::io::stderr);
// Initialize OpenTelemetry
let otlp_layer = tracing_utils::init_tracing("compute_ctl").await;
let otlp_layer =
tracing_utils::init_tracing("compute_ctl", tracing_utils::ExportConfig::default()).await;
// Put it all together
tracing_subscriber::registry()

View File

@@ -2,7 +2,7 @@ use std::collections::HashMap;
use std::fmt::Write;
use std::fs;
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::io::{BufRead, BufReader, Seek};
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::Child;
@@ -10,8 +10,10 @@ use std::str::FromStr;
use std::time::{Duration, Instant};
use anyhow::{Result, bail};
use compute_api::responses::TlsConfig;
use compute_api::spec::{Database, GenericOption, GenericOptions, PgIdent, Role};
use futures::StreamExt;
use indexmap::IndexMap;
use ini::Ini;
use notify::{RecursiveMode, Watcher};
use postgres::config::Config;
@@ -333,10 +335,25 @@ pub fn wait_for_postgres(pg: &mut Child, pgdata: &Path) -> Result<()> {
}
};
watcher.watch(pgdata, RecursiveMode::NonRecursive)?;
// You cannot actually watch a file before it exists, so let's create the
// the postmaster.pid file for Postgres, so we can watch it even before
// Postgres actually starts. In the event that it already exists, just open
// the file for reading. Remember that we are racing Postgres here and that
// it doesn't matter who creates the postmaster.pid.
let mut file = match File::create(&pid_path) {
Ok(file) => file,
Err(e) => {
if e.kind() != std::io::ErrorKind::AlreadyExists {
return Err(anyhow::anyhow!(e));
}
File::open(&pid_path)?
}
};
watcher.watch(&pid_path, RecursiveMode::NonRecursive)?;
let started_at = Instant::now();
let mut postmaster_pid_seen = false;
loop {
if let Ok(Some(status)) = pg.try_wait() {
// Postgres exited, that is not what we expected, bail out earlier.
@@ -353,31 +370,18 @@ pub fn wait_for_postgres(pg: &mut Child, pgdata: &Path) -> Result<()> {
debug!("swallowing extra event: {res:?}");
}
// Check that we can open pid file first.
if let Ok(file) = File::open(&pid_path) {
if !postmaster_pid_seen {
debug!("postmaster.pid appeared");
watcher
.unwatch(pgdata)
.expect("Failed to remove pgdata dir watch");
watcher
.watch(&pid_path, RecursiveMode::NonRecursive)
.expect("Failed to add postmaster.pid file watch");
postmaster_pid_seen = true;
}
file.seek(std::io::SeekFrom::Start(0)).unwrap();
let reader = BufReader::new(&file);
let last_line = reader.lines().last();
let file = BufReader::new(file);
let last_line = file.lines().last();
// Pid file could be there and we could read it, but it could be empty, for example.
if let Some(Ok(line)) = last_line {
let status = line.trim();
debug!("last line of postmaster.pid: {status:?}");
// Pid file could be there and we could read it, but it could be empty, for example.
if let Some(Ok(line)) = last_line {
let status = line.trim();
debug!("last line of postmaster.pid: {status:?}");
// Now Postgres is ready to accept connections
if status == "ready" {
break;
}
// Now Postgres is ready to accept connections
if status == "ready" {
break;
}
}
@@ -406,7 +410,7 @@ pub fn create_pgdata(pgdata: &str) -> Result<()> {
/// Update pgbouncer.ini with provided options
fn update_pgbouncer_ini(
pgbouncer_config: HashMap<String, String>,
pgbouncer_config: IndexMap<String, String>,
pgbouncer_ini_path: &str,
) -> Result<()> {
let mut conf = Ini::load_from_file(pgbouncer_ini_path)?;
@@ -427,7 +431,10 @@ fn update_pgbouncer_ini(
/// Tune pgbouncer.
/// 1. Apply new config using pgbouncer admin console
/// 2. Add new values to pgbouncer.ini to preserve them after restart
pub async fn tune_pgbouncer(pgbouncer_config: HashMap<String, String>) -> Result<()> {
pub async fn tune_pgbouncer(
mut pgbouncer_config: IndexMap<String, String>,
tls_config: Option<TlsConfig>,
) -> Result<()> {
let pgbouncer_connstr = if std::env::var_os("AUTOSCALING").is_some() {
// for VMs use pgbouncer specific way to connect to
// pgbouncer admin console without password
@@ -473,19 +480,21 @@ pub async fn tune_pgbouncer(pgbouncer_config: HashMap<String, String>) -> Result
}
};
// Apply new config
for (option_name, value) in pgbouncer_config.iter() {
let query = format!("SET {}={}", option_name, value);
// keep this log line for debugging purposes
info!("Applying pgbouncer setting change: {}", query);
if let Some(tls_config) = tls_config {
// pgbouncer starts in a half-ok state if it cannot find these files.
// It will default to client_tls_sslmode=deny, which causes proxy to error.
// There is a small window at startup where these files don't yet exist in the VM.
// Best to wait until it exists.
loop {
if let Ok(true) = tokio::fs::try_exists(&tls_config.key_path).await {
break;
}
tokio::time::sleep(Duration::from_millis(500)).await
}
if let Err(err) = client.simple_query(&query).await {
// Don't fail on error, just print it into log
error!(
"Failed to apply pgbouncer setting change: {}, {}",
query, err
);
};
pgbouncer_config.insert("client_tls_cert_file".to_string(), tls_config.cert_path);
pgbouncer_config.insert("client_tls_key_file".to_string(), tls_config.key_path);
pgbouncer_config.insert("client_tls_sslmode".to_string(), "allow".to_string());
}
// save values to pgbouncer.ini
@@ -501,6 +510,13 @@ pub async fn tune_pgbouncer(pgbouncer_config: HashMap<String, String>) -> Result
};
update_pgbouncer_ini(pgbouncer_config, &pgbouncer_ini_path)?;
info!("Applying pgbouncer setting change");
if let Err(err) = client.simple_query("RELOAD").await {
// Don't fail on error, just print it into log
error!("Failed to apply pgbouncer setting change, {err}",);
};
Ok(())
}

View File

@@ -1,8 +1,7 @@
SET SESSION ROLE neon_superuser;
DO ${outer_tag}$
DECLARE
schema TEXT;
grantor TEXT;
revoke_query TEXT;
BEGIN
FOR schema IN
@@ -15,16 +14,25 @@ BEGIN
-- ii) it's easy to add more schemas to the list if needed.
WHERE schema_name IN ('public')
LOOP
revoke_query := format(
'REVOKE ALL PRIVILEGES ON ALL TABLES IN SCHEMA %I FROM %I GRANTED BY neon_superuser;',
schema,
-- N.B. this has to be properly dollar-escaped with `pg_quote_dollar()`
{role_name}
);
FOR grantor IN EXECUTE
format(
'SELECT DISTINCT rtg.grantor FROM information_schema.role_table_grants AS rtg WHERE grantee = %s',
-- N.B. this has to be properly dollar-escaped with `pg_quote_dollar()`
quote_literal({role_name})
)
LOOP
EXECUTE format('SET LOCAL ROLE %I', grantor);
EXECUTE revoke_query;
revoke_query := format(
'REVOKE ALL PRIVILEGES ON ALL TABLES IN SCHEMA %I FROM %I GRANTED BY %I',
schema,
-- N.B. this has to be properly dollar-escaped with `pg_quote_dollar()`
{role_name},
grantor
);
EXECUTE revoke_query;
END LOOP;
END LOOP;
END;
${outer_tag}$;
RESET ROLE;

118
compute_tools/src/tls.rs Normal file
View File

@@ -0,0 +1,118 @@
use std::{io::Write, os::unix::fs::OpenOptionsExt, path::Path, time::Duration};
use anyhow::{Context, Result, bail};
use compute_api::responses::TlsConfig;
use ring::digest;
use spki::ObjectIdentifier;
use spki::der::{Decode, PemReader};
use x509_cert::Certificate;
#[derive(Clone, Copy)]
pub struct CertDigest(digest::Digest);
pub async fn watch_cert_for_changes(cert_path: String) -> tokio::sync::watch::Receiver<CertDigest> {
let mut digest = compute_digest(&cert_path).await;
let (tx, rx) = tokio::sync::watch::channel(digest);
tokio::spawn(async move {
while !tx.is_closed() {
let new_digest = compute_digest(&cert_path).await;
if digest.0.as_ref() != new_digest.0.as_ref() {
digest = new_digest;
_ = tx.send(digest);
}
tokio::time::sleep(Duration::from_secs(60)).await
}
});
rx
}
async fn compute_digest(cert_path: &str) -> CertDigest {
loop {
match try_compute_digest(cert_path).await {
Ok(d) => break d,
Err(e) => {
tracing::error!("could not read cert file {e:?}");
tokio::time::sleep(Duration::from_secs(1)).await
}
}
}
}
async fn try_compute_digest(cert_path: &str) -> Result<CertDigest> {
let data = tokio::fs::read(cert_path).await?;
// sha256 is extremely collision resistent. can safely assume the digest to be unique
Ok(CertDigest(digest::digest(&digest::SHA256, &data)))
}
pub const SERVER_CRT: &str = "server.crt";
pub const SERVER_KEY: &str = "server.key";
pub fn update_key_path_blocking(pg_data: &Path, tls_config: &TlsConfig) {
loop {
match try_update_key_path_blocking(pg_data, tls_config) {
Ok(()) => break,
Err(e) => {
tracing::error!("could not create key file {e:?}");
std::thread::sleep(Duration::from_secs(1))
}
}
}
}
// Postgres requires the keypath be "secure". This means
// 1. Owned by the postgres user.
// 2. Have permission 600.
fn try_update_key_path_blocking(pg_data: &Path, tls_config: &TlsConfig) -> Result<()> {
let key = std::fs::read_to_string(&tls_config.key_path)?;
let crt = std::fs::read_to_string(&tls_config.cert_path)?;
// to mitigate a race condition during renewal.
verify_key_cert(&key, &crt)?;
let mut key_file = std::fs::OpenOptions::new()
.write(true)
.create(true)
.truncate(true)
.mode(0o600)
.open(pg_data.join(SERVER_KEY))?;
let mut crt_file = std::fs::OpenOptions::new()
.write(true)
.create(true)
.truncate(true)
.mode(0o600)
.open(pg_data.join(SERVER_CRT))?;
key_file.write_all(key.as_bytes())?;
crt_file.write_all(crt.as_bytes())?;
Ok(())
}
fn verify_key_cert(key: &str, cert: &str) -> Result<()> {
const ECDSA_WITH_SHA256: ObjectIdentifier = ObjectIdentifier::new_unwrap("1.2.840.10045.4.3.2");
let cert = Certificate::decode(&mut PemReader::new(cert.as_bytes()).context("pem reader")?)
.context("decode cert")?;
match cert.signature_algorithm.oid {
ECDSA_WITH_SHA256 => {
let key = p256::SecretKey::from_sec1_pem(key).context("parse key")?;
let a = key.public_key().to_sec1_bytes();
let b = cert
.tbs_certificate
.subject_public_key_info
.subject_public_key
.raw_bytes();
if *a != *b {
bail!("private key file does not match certificate")
}
}
_ => bail!("unknown TLS key type"),
}
Ok(())
}

View File

@@ -36,7 +36,9 @@ use pageserver_api::config::{
use pageserver_api::controller_api::{
NodeAvailabilityWrapper, PlacementPolicy, TenantCreateRequest,
};
use pageserver_api::models::{ShardParameters, TimelineCreateRequest, TimelineInfo};
use pageserver_api::models::{
ShardParameters, TenantConfigRequest, TimelineCreateRequest, TimelineInfo,
};
use pageserver_api::shard::{ShardCount, ShardStripeSize, TenantShardId};
use postgres_backend::AuthType;
use postgres_connection::parse_host_port;
@@ -963,6 +965,7 @@ fn handle_init(args: &InitCmdArgs) -> anyhow::Result<LocalEnv> {
id: pageserver_id,
listen_pg_addr: format!("127.0.0.1:{pg_port}"),
listen_http_addr: format!("127.0.0.1:{http_port}"),
listen_https_addr: None,
pg_auth_type: AuthType::Trust,
http_auth_type: AuthType::Trust,
other: Default::default(),
@@ -976,7 +979,8 @@ fn handle_init(args: &InitCmdArgs) -> anyhow::Result<LocalEnv> {
neon_distrib_dir: None,
default_tenant_id: TenantId::from_array(std::array::from_fn(|_| 0)),
storage_controller: None,
control_plane_compute_hook_api: None,
control_plane_hooks_api: None,
generate_local_ssl_certs: false,
}
};
@@ -1127,12 +1131,16 @@ async fn handle_tenant(subcmd: &TenantCmd, env: &mut local_env::LocalEnv) -> any
let tenant_id = get_tenant_id(args.tenant_id, env)?;
let tenant_conf: HashMap<_, _> =
args.config.iter().flat_map(|c| c.split_once(':')).collect();
let config = PageServerNode::parse_config(tenant_conf)?;
pageserver
.tenant_config(tenant_id, tenant_conf)
let req = TenantConfigRequest { tenant_id, config };
let storage_controller = StorageController::from_env(env);
storage_controller
.set_tenant_config(&req)
.await
.with_context(|| format!("Tenant config failed for tenant with id {tenant_id}"))?;
println!("tenant {tenant_id} successfully configured on the pageserver");
println!("tenant {tenant_id} successfully configured via storcon");
}
}
Ok(())

View File

@@ -72,15 +72,19 @@ pub struct LocalEnv {
// be propagated into each pageserver's configuration.
pub control_plane_api: Url,
// Control plane upcall API for storage controller. If set, this will be propagated into the
// Control plane upcall APIs for storage controller. If set, this will be propagated into the
// storage controller's configuration.
pub control_plane_compute_hook_api: Option<Url>,
pub control_plane_hooks_api: Option<Url>,
/// Keep human-readable aliases in memory (and persist them to config), to hide ZId hex strings from the user.
// A `HashMap<String, HashMap<TenantId, TimelineId>>` would be more appropriate here,
// but deserialization into a generic toml object as `toml::Value::try_from` fails with an error.
// https://toml.io/en/v1.0.0 does not contain a concept of "a table inside another table".
pub branch_name_mappings: HashMap<String, Vec<(TenantId, TimelineId)>>,
/// Flag to generate SSL certificates for components that need it.
/// Also generates root CA certificate that is used to sign all other certificates.
pub generate_local_ssl_certs: bool,
}
/// On-disk state stored in `.neon/config`.
@@ -100,8 +104,13 @@ pub struct OnDiskConfig {
pub pageservers: Vec<PageServerConf>,
pub safekeepers: Vec<SafekeeperConf>,
pub control_plane_api: Option<Url>,
pub control_plane_hooks_api: Option<Url>,
pub control_plane_compute_hook_api: Option<Url>,
branch_name_mappings: HashMap<String, Vec<(TenantId, TimelineId)>>,
// Note: skip serializing because in compat tests old storage controller fails
// to load new config file. May be removed after this field is in release branch.
#[serde(skip_serializing_if = "std::ops::Not::not")]
pub generate_local_ssl_certs: bool,
}
fn fail_if_pageservers_field_specified<'de, D>(_: D) -> Result<Vec<PageServerConf>, D::Error>
@@ -128,7 +137,8 @@ pub struct NeonLocalInitConf {
pub pageservers: Vec<NeonLocalInitPageserverConf>,
pub safekeepers: Vec<SafekeeperConf>,
pub control_plane_api: Option<Url>,
pub control_plane_compute_hook_api: Option<Option<Url>>,
pub control_plane_hooks_api: Option<Url>,
pub generate_local_ssl_certs: bool,
}
/// Broker config for cluster internal communication.
@@ -165,6 +175,11 @@ pub struct NeonStorageControllerConf {
#[serde(with = "humantime_serde")]
pub long_reconcile_threshold: Option<Duration>,
#[serde(default)]
pub use_https_pageserver_api: bool,
pub timelines_onto_safekeepers: bool,
}
impl NeonStorageControllerConf {
@@ -188,6 +203,8 @@ impl Default for NeonStorageControllerConf {
max_secondary_lag_bytes: None,
heartbeat_interval: Self::DEFAULT_HEARTBEAT_INTERVAL,
long_reconcile_threshold: None,
use_https_pageserver_api: false,
timelines_onto_safekeepers: false,
}
}
}
@@ -217,6 +234,7 @@ pub struct PageServerConf {
pub id: NodeId,
pub listen_pg_addr: String,
pub listen_http_addr: String,
pub listen_https_addr: Option<String>,
pub pg_auth_type: AuthType,
pub http_auth_type: AuthType,
pub no_sync: bool,
@@ -228,6 +246,7 @@ impl Default for PageServerConf {
id: NodeId(0),
listen_pg_addr: String::new(),
listen_http_addr: String::new(),
listen_https_addr: None,
pg_auth_type: AuthType::Trust,
http_auth_type: AuthType::Trust,
no_sync: false,
@@ -243,6 +262,7 @@ pub struct NeonLocalInitPageserverConf {
pub id: NodeId,
pub listen_pg_addr: String,
pub listen_http_addr: String,
pub listen_https_addr: Option<String>,
pub pg_auth_type: AuthType,
pub http_auth_type: AuthType,
#[serde(default, skip_serializing_if = "std::ops::Not::not")]
@@ -257,6 +277,7 @@ impl From<&NeonLocalInitPageserverConf> for PageServerConf {
id,
listen_pg_addr,
listen_http_addr,
listen_https_addr,
pg_auth_type,
http_auth_type,
no_sync,
@@ -266,6 +287,7 @@ impl From<&NeonLocalInitPageserverConf> for PageServerConf {
id: *id,
listen_pg_addr: listen_pg_addr.clone(),
listen_http_addr: listen_http_addr.clone(),
listen_https_addr: listen_https_addr.clone(),
pg_auth_type: *pg_auth_type,
http_auth_type: *http_auth_type,
no_sync: *no_sync,
@@ -410,6 +432,41 @@ impl LocalEnv {
}
}
pub fn ssl_ca_cert_path(&self) -> Option<PathBuf> {
if self.generate_local_ssl_certs {
Some(self.base_data_dir.join("rootCA.crt"))
} else {
None
}
}
pub fn ssl_ca_key_path(&self) -> Option<PathBuf> {
if self.generate_local_ssl_certs {
Some(self.base_data_dir.join("rootCA.key"))
} else {
None
}
}
pub fn generate_ssl_ca_cert(&self) -> anyhow::Result<()> {
let cert_path = self.ssl_ca_cert_path().unwrap();
let key_path = self.ssl_ca_key_path().unwrap();
if !fs::exists(cert_path.as_path())? {
generate_ssl_ca_cert(cert_path.as_path(), key_path.as_path())?;
}
Ok(())
}
pub fn generate_ssl_cert(&self, cert_path: &Path, key_path: &Path) -> anyhow::Result<()> {
self.generate_ssl_ca_cert()?;
generate_ssl_cert(
cert_path,
key_path,
self.ssl_ca_cert_path().unwrap().as_path(),
self.ssl_ca_key_path().unwrap().as_path(),
)
}
/// Inspect the base data directory and extract the instance id and instance directory path
/// for all storage controller instances
pub async fn storage_controller_instances(&self) -> std::io::Result<Vec<(u8, PathBuf)>> {
@@ -517,8 +574,10 @@ impl LocalEnv {
pageservers,
safekeepers,
control_plane_api,
control_plane_compute_hook_api,
control_plane_hooks_api,
control_plane_compute_hook_api: _,
branch_name_mappings,
generate_local_ssl_certs,
} = on_disk_config;
LocalEnv {
base_data_dir: repopath.to_owned(),
@@ -531,8 +590,9 @@ impl LocalEnv {
pageservers,
safekeepers,
control_plane_api: control_plane_api.unwrap(),
control_plane_compute_hook_api,
control_plane_hooks_api,
branch_name_mappings,
generate_local_ssl_certs,
}
};
@@ -568,6 +628,7 @@ impl LocalEnv {
struct PageserverConfigTomlSubset {
listen_pg_addr: String,
listen_http_addr: String,
listen_https_addr: Option<String>,
pg_auth_type: AuthType,
http_auth_type: AuthType,
#[serde(default)]
@@ -592,6 +653,7 @@ impl LocalEnv {
let PageserverConfigTomlSubset {
listen_pg_addr,
listen_http_addr,
listen_https_addr,
pg_auth_type,
http_auth_type,
no_sync,
@@ -609,6 +671,7 @@ impl LocalEnv {
},
listen_pg_addr,
listen_http_addr,
listen_https_addr,
pg_auth_type,
http_auth_type,
no_sync,
@@ -634,8 +697,10 @@ impl LocalEnv {
pageservers: vec![], // it's skip_serializing anyway
safekeepers: self.safekeepers.clone(),
control_plane_api: Some(self.control_plane_api.clone()),
control_plane_compute_hook_api: self.control_plane_compute_hook_api.clone(),
control_plane_hooks_api: self.control_plane_hooks_api.clone(),
control_plane_compute_hook_api: None,
branch_name_mappings: self.branch_name_mappings.clone(),
generate_local_ssl_certs: self.generate_local_ssl_certs,
},
)
}
@@ -717,7 +782,8 @@ impl LocalEnv {
pageservers,
safekeepers,
control_plane_api,
control_plane_compute_hook_api,
generate_local_ssl_certs,
control_plane_hooks_api,
} = conf;
// Find postgres binaries.
@@ -764,10 +830,15 @@ impl LocalEnv {
pageservers: pageservers.iter().map(Into::into).collect(),
safekeepers,
control_plane_api: control_plane_api.unwrap(),
control_plane_compute_hook_api: control_plane_compute_hook_api.unwrap_or_default(),
control_plane_hooks_api,
branch_name_mappings: Default::default(),
generate_local_ssl_certs,
};
if generate_local_ssl_certs {
env.generate_ssl_ca_cert()?;
}
// create endpoints dir
fs::create_dir_all(env.endpoints_path())?;
@@ -851,3 +922,80 @@ fn generate_auth_keys(private_key_path: &Path, public_key_path: &Path) -> anyhow
}
Ok(())
}
fn generate_ssl_ca_cert(cert_path: &Path, key_path: &Path) -> anyhow::Result<()> {
// openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=Neon Local CA" -days 36500 \
// -out rootCA.crt -keyout rootCA.key
let keygen_output = Command::new("openssl")
.args([
"req", "-x509", "-newkey", "rsa:2048", "-nodes", "-days", "36500",
])
.args(["-subj", "/CN=Neon Local CA"])
.args(["-out", cert_path.to_str().unwrap()])
.args(["-keyout", key_path.to_str().unwrap()])
.output()
.context("failed to generate CA certificate")?;
if !keygen_output.status.success() {
bail!(
"openssl failed: '{}'",
String::from_utf8_lossy(&keygen_output.stderr)
);
}
Ok(())
}
fn generate_ssl_cert(
cert_path: &Path,
key_path: &Path,
ca_cert_path: &Path,
ca_key_path: &Path,
) -> anyhow::Result<()> {
// Generate Certificate Signing Request (CSR).
let mut csr_path = cert_path.to_path_buf();
csr_path.set_extension(".csr");
// openssl req -new -nodes -newkey rsa:2048 -keyout server.key -out server.csr \
// -subj "/CN=localhost" -addext "subjectAltName=DNS:localhost,IP:127.0.0.1"
let keygen_output = Command::new("openssl")
.args(["req", "-new", "-nodes"])
.args(["-newkey", "rsa:2048"])
.args(["-subj", "/CN=localhost"])
.args(["-addext", "subjectAltName=DNS:localhost,IP:127.0.0.1"])
.args(["-keyout", key_path.to_str().unwrap()])
.args(["-out", csr_path.to_str().unwrap()])
.output()
.context("failed to generate CSR")?;
if !keygen_output.status.success() {
bail!(
"openssl failed: '{}'",
String::from_utf8_lossy(&keygen_output.stderr)
);
}
// Sign CSR with CA key.
//
// openssl x509 -req -in server.csr -CA rootCA.crt -CAkey rootCA.key -CAcreateserial \
// -out server.crt -days 36500 -copy_extensions copyall
let keygen_output = Command::new("openssl")
.args(["x509", "-req"])
.args(["-in", csr_path.to_str().unwrap()])
.args(["-CA", ca_cert_path.to_str().unwrap()])
.args(["-CAkey", ca_key_path.to_str().unwrap()])
.arg("-CAcreateserial")
.args(["-out", cert_path.to_str().unwrap()])
.args(["-days", "36500"])
.args(["-copy_extensions", "copyall"])
.output()
.context("failed to sign CSR")?;
if !keygen_output.status.success() {
bail!(
"openssl failed: '{}'",
String::from_utf8_lossy(&keygen_output.stderr)
);
}
// Remove CSR file as it's not needed anymore.
fs::remove_file(csr_path)?;
Ok(())
}

View File

@@ -21,6 +21,7 @@ use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api;
use postgres_backend::AuthType;
use postgres_connection::{PgConnectionConfig, parse_host_port};
use reqwest::Certificate;
use utils::auth::{Claims, Scope};
use utils::id::{NodeId, TenantId, TimelineId};
use utils::lsn::Lsn;
@@ -49,12 +50,29 @@ impl PageServerNode {
let (host, port) =
parse_host_port(&conf.listen_pg_addr).expect("Unable to parse listen_pg_addr");
let port = port.unwrap_or(5432);
let ssl_ca_cert = env.ssl_ca_cert_path().map(|ssl_ca_file| {
let buf = std::fs::read(ssl_ca_file).expect("SSL root CA file should exist");
Certificate::from_pem(&buf).expect("CA certificate should be valid")
});
let endpoint = if env.storage_controller.use_https_pageserver_api {
format!(
"https://{}",
conf.listen_https_addr.as_ref().expect(
"listen https address should be specified if use_https_pageserver_api is on"
)
)
} else {
format!("http://{}", conf.listen_http_addr)
};
Self {
pg_connection_config: PgConnectionConfig::new_host_port(host, port),
conf: conf.clone(),
env: env.clone(),
http_client: mgmt_api::Client::new(
format!("http://{}", conf.listen_http_addr),
endpoint,
{
match conf.http_auth_type {
AuthType::Trust => None,
@@ -65,7 +83,9 @@ impl PageServerNode {
}
}
.as_deref(),
),
ssl_ca_cert,
)
.expect("Client constructs with no errors"),
}
}
@@ -220,6 +240,13 @@ impl PageServerNode {
.context("write identity toml")?;
drop(identity_toml);
if self.env.generate_local_ssl_certs {
self.env.generate_ssl_cert(
datadir.join("server.crt").as_path(),
datadir.join("server.key").as_path(),
)?;
}
// TODO: invoke a TBD config-check command to validate that pageserver will start with the written config
// Write metadata file, used by pageserver on startup to register itself with
@@ -230,6 +257,15 @@ impl PageServerNode {
parse_host_port(&self.conf.listen_http_addr).expect("Unable to parse listen_http_addr");
let http_port = http_port.unwrap_or(9898);
let https_port = match self.conf.listen_https_addr.as_ref() {
Some(https_addr) => {
let (_https_host, https_port) =
parse_host_port(https_addr).expect("Unable to parse listen_https_addr");
Some(https_port.unwrap_or(9899))
}
None => None,
};
// Intentionally hand-craft JSON: this acts as an implicit format compat test
// in case the pageserver-side structure is edited, and reflects the real life
// situation: the metadata is written by some other script.
@@ -240,6 +276,7 @@ impl PageServerNode {
postgres_port: self.pg_connection_config.port(),
http_host: "localhost".to_string(),
http_port,
https_port,
other: HashMap::from([(
"availability_zone_id".to_string(),
serde_json::json!(az_id),

View File

@@ -12,13 +12,10 @@ use hyper0::Uri;
use nix::unistd::Pid;
use pageserver_api::controller_api::{
NodeConfigureRequest, NodeDescribeResponse, NodeRegisterRequest, TenantCreateRequest,
TenantCreateResponse, TenantLocateResponse, TenantShardMigrateRequest,
TenantShardMigrateResponse,
TenantCreateResponse, TenantLocateResponse,
};
use pageserver_api::models::{
TenantShardSplitRequest, TenantShardSplitResponse, TimelineCreateRequest, TimelineInfo,
};
use pageserver_api::shard::{ShardStripeSize, TenantShardId};
use pageserver_api::models::{TenantConfigRequest, TimelineCreateRequest, TimelineInfo};
use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api::ResponseErrorMessageExt;
use postgres_backend::AuthType;
use reqwest::Method;
@@ -537,6 +534,14 @@ impl StorageController {
args.push("--start-as-candidate".to_string());
}
if self.config.use_https_pageserver_api {
args.push("--use-https-pageserver-api".to_string());
}
if let Some(ssl_ca_file) = self.env.ssl_ca_cert_path() {
args.push(format!("--ssl-ca-file={}", ssl_ca_file.to_str().unwrap()));
}
if let Some(private_key) = &self.private_key {
let claims = Claims::new(None, Scope::PageServerApi);
let jwt_token =
@@ -553,10 +558,8 @@ impl StorageController {
args.push(format!("--public-key=\"{public_key}\""));
}
if let Some(control_plane_compute_hook_api) = &self.env.control_plane_compute_hook_api {
args.push(format!(
"--compute-hook-url={control_plane_compute_hook_api}"
));
if let Some(control_plane_hooks_api) = &self.env.control_plane_hooks_api {
args.push(format!("--control-plane-url={control_plane_hooks_api}"));
}
if let Some(split_threshold) = self.config.split_threshold.as_ref() {
@@ -579,6 +582,10 @@ impl StorageController {
self.env.base_data_dir.display()
));
if self.config.timelines_onto_safekeepers {
args.push("--timelines-onto-safekeepers".to_string());
}
background_process::start_process(
COMMAND,
&instance_dir,
@@ -825,41 +832,6 @@ impl StorageController {
.await
}
#[instrument(skip(self))]
pub async fn tenant_migrate(
&self,
tenant_shard_id: TenantShardId,
node_id: NodeId,
) -> anyhow::Result<TenantShardMigrateResponse> {
self.dispatch(
Method::PUT,
format!("control/v1/tenant/{tenant_shard_id}/migrate"),
Some(TenantShardMigrateRequest {
node_id,
migration_config: None,
}),
)
.await
}
#[instrument(skip(self), fields(%tenant_id, %new_shard_count))]
pub async fn tenant_split(
&self,
tenant_id: TenantId,
new_shard_count: u8,
new_stripe_size: Option<ShardStripeSize>,
) -> anyhow::Result<TenantShardSplitResponse> {
self.dispatch(
Method::PUT,
format!("control/v1/tenant/{tenant_id}/shard_split"),
Some(TenantShardSplitRequest {
new_shard_count,
new_stripe_size,
}),
)
.await
}
#[instrument(skip_all, fields(node_id=%req.node_id))]
pub async fn node_register(&self, req: NodeRegisterRequest) -> anyhow::Result<()> {
self.dispatch::<_, ()>(Method::POST, "control/v1/node".to_string(), Some(req))
@@ -904,4 +876,9 @@ impl StorageController {
)
.await
}
pub async fn set_tenant_config(&self, req: &TenantConfigRequest) -> anyhow::Result<()> {
self.dispatch(Method::PUT, "v1/tenant/config".to_string(), Some(req))
.await
}
}

View File

@@ -1,20 +1,21 @@
use std::collections::{HashMap, HashSet};
use std::path::PathBuf;
use std::str::FromStr;
use std::time::Duration;
use clap::{Parser, Subcommand};
use futures::StreamExt;
use pageserver_api::controller_api::{
AvailabilityZone, NodeAvailabilityWrapper, NodeConfigureRequest, NodeDescribeResponse,
NodeRegisterRequest, NodeSchedulingPolicy, NodeShardResponse, PlacementPolicy,
SafekeeperDescribeResponse, SafekeeperSchedulingPolicyRequest, ShardSchedulingPolicy,
ShardsPreferredAzsRequest, ShardsPreferredAzsResponse, SkSchedulingPolicy, TenantCreateRequest,
TenantDescribeResponse, TenantPolicyRequest, TenantShardMigrateRequest,
TenantShardMigrateResponse,
AvailabilityZone, MigrationConfig, NodeAvailabilityWrapper, NodeConfigureRequest,
NodeDescribeResponse, NodeRegisterRequest, NodeSchedulingPolicy, NodeShardResponse,
PlacementPolicy, SafekeeperDescribeResponse, SafekeeperSchedulingPolicyRequest,
ShardSchedulingPolicy, ShardsPreferredAzsRequest, ShardsPreferredAzsResponse,
SkSchedulingPolicy, TenantCreateRequest, TenantDescribeResponse, TenantPolicyRequest,
TenantShardMigrateRequest, TenantShardMigrateResponse,
};
use pageserver_api::models::{
EvictionPolicy, EvictionPolicyLayerAccessThreshold, LocationConfigSecondary, ShardParameters,
TenantConfig, TenantConfigPatchRequest, TenantConfigRequest, TenantShardSplitRequest,
EvictionPolicy, EvictionPolicyLayerAccessThreshold, ShardParameters, TenantConfig,
TenantConfigPatchRequest, TenantConfigRequest, TenantShardSplitRequest,
TenantShardSplitResponse,
};
use pageserver_api::shard::{ShardStripeSize, TenantShardId};
@@ -112,6 +113,15 @@ enum Command {
tenant_shard_id: TenantShardId,
#[arg(long)]
node: NodeId,
#[arg(long, default_value_t = true, action = clap::ArgAction::Set)]
prewarm: bool,
#[arg(long, default_value_t = false, action = clap::ArgAction::Set)]
override_scheduler: bool,
},
/// Watch the location of a tenant shard evolve, e.g. while expecting it to migrate
TenantShardWatch {
#[arg(long)]
tenant_shard_id: TenantShardId,
},
/// Migrate the secondary location for a tenant shard to a specific pageserver.
TenantShardMigrateSecondary {
@@ -148,12 +158,6 @@ enum Command {
#[arg(long)]
tenant_id: TenantId,
},
/// For a tenant which hasn't been onboarded to the storage controller yet, add it in secondary
/// mode so that it can warm up content on a pageserver.
TenantWarmup {
#[arg(long)]
tenant_id: TenantId,
},
TenantSetPreferredAz {
#[arg(long)]
tenant_id: TenantId,
@@ -269,6 +273,10 @@ struct Cli {
/// a token with both scopes to use with this tool.
jwt: Option<String>,
#[arg(long)]
/// Trusted root CA certificate to use in https APIs.
ssl_ca_file: Option<PathBuf>,
#[command(subcommand)]
command: Command,
}
@@ -379,9 +387,17 @@ async fn main() -> anyhow::Result<()> {
let storcon_client = Client::new(cli.api.clone(), cli.jwt.clone());
let ssl_ca_cert = match &cli.ssl_ca_file {
Some(ssl_ca_file) => {
let buf = tokio::fs::read(ssl_ca_file).await?;
Some(reqwest::Certificate::from_pem(&buf)?)
}
None => None,
};
let mut trimmed = cli.api.to_string();
trimmed.pop();
let vps_client = mgmt_api::Client::new(trimmed, cli.jwt.as_deref());
let vps_client = mgmt_api::Client::new(trimmed, cli.jwt.as_deref(), ssl_ca_cert)?;
match cli.command {
Command::NodeRegister {
@@ -619,19 +635,43 @@ async fn main() -> anyhow::Result<()> {
Command::TenantShardMigrate {
tenant_shard_id,
node,
prewarm,
override_scheduler,
} => {
let req = TenantShardMigrateRequest {
node_id: node,
migration_config: None,
let migration_config = MigrationConfig {
prewarm,
override_scheduler,
..Default::default()
};
storcon_client
let req = TenantShardMigrateRequest {
node_id: node,
origin_node_id: None,
migration_config,
};
match storcon_client
.dispatch::<TenantShardMigrateRequest, TenantShardMigrateResponse>(
Method::PUT,
format!("control/v1/tenant/{tenant_shard_id}/migrate"),
Some(req),
)
.await?;
.await
{
Err(mgmt_api::Error::ApiError(StatusCode::PRECONDITION_FAILED, msg)) => {
anyhow::bail!(
"Migration to {node} rejected, may require `--force` ({}) ",
msg
);
}
Err(e) => return Err(e.into()),
Ok(_) => {}
}
watch_tenant_shard(storcon_client, tenant_shard_id, Some(node)).await?;
}
Command::TenantShardWatch { tenant_shard_id } => {
watch_tenant_shard(storcon_client, tenant_shard_id, None).await?;
}
Command::TenantShardMigrateSecondary {
tenant_shard_id,
@@ -639,7 +679,8 @@ async fn main() -> anyhow::Result<()> {
} => {
let req = TenantShardMigrateRequest {
node_id: node,
migration_config: None,
origin_node_id: None,
migration_config: MigrationConfig::default(),
};
storcon_client
@@ -824,94 +865,6 @@ async fn main() -> anyhow::Result<()> {
)
.await?;
}
Command::TenantWarmup { tenant_id } => {
let describe_response = storcon_client
.dispatch::<(), TenantDescribeResponse>(
Method::GET,
format!("control/v1/tenant/{tenant_id}"),
None,
)
.await;
match describe_response {
Ok(describe) => {
if matches!(describe.policy, PlacementPolicy::Secondary) {
// Fine: it's already known to controller in secondary mode: calling
// again to put it into secondary mode won't cause problems.
} else {
anyhow::bail!("Tenant already present with policy {:?}", describe.policy);
}
}
Err(mgmt_api::Error::ApiError(StatusCode::NOT_FOUND, _)) => {
// Fine: this tenant isn't know to the storage controller yet.
}
Err(e) => {
// Unexpected API error
return Err(e.into());
}
}
vps_client
.location_config(
TenantShardId::unsharded(tenant_id),
pageserver_api::models::LocationConfig {
mode: pageserver_api::models::LocationConfigMode::Secondary,
generation: None,
secondary_conf: Some(LocationConfigSecondary { warm: true }),
shard_number: 0,
shard_count: 0,
shard_stripe_size: ShardParameters::DEFAULT_STRIPE_SIZE.0,
tenant_conf: TenantConfig::default(),
},
None,
true,
)
.await?;
let describe_response = storcon_client
.dispatch::<(), TenantDescribeResponse>(
Method::GET,
format!("control/v1/tenant/{tenant_id}"),
None,
)
.await?;
let secondary_ps_id = describe_response
.shards
.first()
.unwrap()
.node_secondary
.first()
.unwrap();
println!("Tenant {tenant_id} warming up on pageserver {secondary_ps_id}");
loop {
let (status, progress) = vps_client
.tenant_secondary_download(
TenantShardId::unsharded(tenant_id),
Some(Duration::from_secs(10)),
)
.await?;
println!(
"Progress: {}/{} layers, {}/{} bytes",
progress.layers_downloaded,
progress.layers_total,
progress.bytes_downloaded,
progress.bytes_total
);
match status {
StatusCode::OK => {
println!("Download complete");
break;
}
StatusCode::ACCEPTED => {
// Loop
}
_ => {
anyhow::bail!("Unexpected download status: {status}");
}
}
}
}
Command::TenantDrop { tenant_id, unclean } => {
if !unclean {
anyhow::bail!(
@@ -1105,7 +1058,8 @@ async fn main() -> anyhow::Result<()> {
format!("control/v1/tenant/{}/migrate", mv.tenant_shard_id),
Some(TenantShardMigrateRequest {
node_id: mv.to,
migration_config: None,
origin_node_id: Some(mv.from),
migration_config: MigrationConfig::default(),
}),
)
.await
@@ -1284,3 +1238,68 @@ async fn main() -> anyhow::Result<()> {
Ok(())
}
static WATCH_INTERVAL: Duration = Duration::from_secs(5);
async fn watch_tenant_shard(
storcon_client: Client,
tenant_shard_id: TenantShardId,
until_migrated_to: Option<NodeId>,
) -> anyhow::Result<()> {
if let Some(until_migrated_to) = until_migrated_to {
println!(
"Waiting for tenant shard {} to be migrated to node {}",
tenant_shard_id, until_migrated_to
);
}
loop {
let desc = storcon_client
.dispatch::<(), TenantDescribeResponse>(
Method::GET,
format!("control/v1/tenant/{}", tenant_shard_id.tenant_id),
None,
)
.await?;
// Output the current state of the tenant shard
let shard = desc
.shards
.iter()
.find(|s| s.tenant_shard_id == tenant_shard_id)
.ok_or(anyhow::anyhow!("Tenant shard not found"))?;
let summary = format!(
"attached: {} secondary: {} {}",
shard
.node_attached
.map(|n| format!("{}", n))
.unwrap_or("none".to_string()),
shard
.node_secondary
.iter()
.map(|n| n.to_string())
.collect::<Vec<_>>()
.join(","),
if shard.is_reconciling {
"(reconciler active)"
} else {
"(reconciler idle)"
}
);
println!("{}", summary);
// Maybe drop out if we finished migration
if let Some(until_migrated_to) = until_migrated_to {
if shard.node_attached == Some(until_migrated_to) && !shard.is_reconciling {
println!(
"Tenant shard {} is now on node {}",
tenant_shard_id, until_migrated_to
);
break;
}
}
tokio::time::sleep(WATCH_INTERVAL).await;
}
Ok(())
}

View File

@@ -27,6 +27,14 @@ yanked = "warn"
id = "RUSTSEC-2023-0071"
reason = "the marvin attack only affects private key decryption, not public key signature verification"
[[advisories.ignore]]
id = "RUSTSEC-2024-0436"
reason = "The paste crate is a build-only dependency with no runtime components. It is unlikely to have any security impact."
[[advisories.ignore]]
id = "RUSTSEC-2025-0014"
reason = "The humantime is widely used and is not easy to replace right now. It is unmaintained, but it has no known vulnerabilities to care about. #11179"
# This section is considered when running `cargo deny check licenses`
# More documentation for the licenses section can be found here:
# https://embarkstudios.github.io/cargo-deny/checks/licenses/cfg.html

View File

@@ -101,15 +101,25 @@ changes such as a pageserver node becoming unavailable, or the tenant's shard co
postgres clients to handle such changes, the storage controller calls an API hook when a tenant's pageserver
location changes.
The hook is configured using the storage controller's `--compute-hook-url` CLI option. If the hook requires
JWT auth, the token may be provided with `--control-plane-jwt-token`. The hook will be invoked with a `PUT` request.
The hook is configured using the storage controller's `--control-plane-url` CLI option, from which the hook URL is computed.
In the Neon cloud service, this hook is implemented by Neon's internal cloud control plane. In `neon_local` systems
Currently, there is two hooks, each computed by appending the name to the provided control plane URL prefix:
- `notify-attach`, called whenever attachment for pageservers changes
- `notify-safekeepers`, called whenever attachment for safekeepers changes
If the hooks require JWT auth, the token may be provided with `--control-plane-jwt-token`.
The hooks will be invoked with a `PUT` request.
In the Neon cloud service, these hooks are implemented by Neon's internal cloud control plane. In `neon_local` systems,
the storage controller integrates directly with neon_local to reconfigure local postgres processes instead of calling
the compute hook.
When implementing an on-premise Neon deployment, you must implement a service that handles the compute hook. This is not complicated:
the request body has format of the `ComputeHookNotifyRequest` structure, provided below for convenience.
When implementing an on-premise Neon deployment, you must implement a service that handles the compute hooks. This is not complicated.
### `notify-attach` body
The `notify-attach` request body follows the format of the `ComputeHookNotifyRequest` structure, provided below for convenience.
```
struct ComputeHookNotifyRequestShard {
@@ -128,15 +138,15 @@ When a notification is received:
1. Modify postgres configuration for this tenant:
- set `neon.pageserver_connstr` to a comma-separated list of postgres connection strings to pageservers according to the `shards` list. The
- set `neon.pageserver_connstring` to a comma-separated list of postgres connection strings to pageservers according to the `shards` list. The
shards identified by `NodeId` must be converted to the address+port of the node.
- if stripe_size is not None, set `neon.stripe_size` to this value
- if stripe_size is not None, set `neon.shard_stripe_size` to this value
2. Send SIGHUP to postgres to reload configuration
3. Respond with 200 to the notification request. Do not return success if postgres was not updated: if an error is returned, the controller
will retry the notification until it succeeds..
### Example notification body
Example body:
```
{
@@ -148,3 +158,34 @@ When a notification is received:
],
}
```
### `notify-safekeepers` body
The `notify-safekeepers` request body forllows the format of the `SafekeepersNotifyRequest` structure, provided below for convenience.
```
pub struct SafekeeperInfo {
pub id: NodeId,
pub hostname: String,
}
pub struct SafekeepersNotifyRequest {
pub tenant_id: TenantId,
pub timeline_id: TimelineId,
pub generation: u32,
pub safekeepers: Vec<SafekeeperInfo>,
}
```
When a notification is received:
1. Modify postgres configuration for this tenant:
- set `neon.safekeeper_connstrings` to an array of postgres connection strings to safekeepers according to the `safekeepers` list. The
safekeepers identified by `NodeId` must be converted to the address+port of the respective safekeeper.
The hostname is provided for debugging purposes, so we reserve changes to how we pass it.
- set `neon.safekeepers_generation` to the provided `generation` value.
2. Send SIGHUP to postgres to reload configuration
3. Respond with 200 to the notification request. Do not return success if postgres was not updated: if an error is returned, the controller
will retry the notification until it succeeds..

View File

@@ -7,6 +7,7 @@ license.workspace = true
[dependencies]
anyhow.workspace = true
chrono.workspace = true
indexmap.workspace = true
jsonwebtoken.workspace = true
serde.workspace = true
serde_json.workspace = true

View File

@@ -139,6 +139,7 @@ pub struct ComputeCtlConfig {
/// Set of JSON web keys that the compute can use to authenticate
/// communication from the control plane.
pub jwks: JwkSet,
pub tls: Option<TlsConfig>,
}
impl Default for ComputeCtlConfig {
@@ -147,10 +148,17 @@ impl Default for ComputeCtlConfig {
jwks: JwkSet {
keys: Vec::default(),
},
tls: None,
}
}
}
#[derive(Clone, Debug, Deserialize, Serialize)]
pub struct TlsConfig {
pub key_path: String,
pub cert_path: String,
}
/// Response of the `/computes/{compute_id}/spec` control-plane API.
#[derive(Deserialize, Debug)]
pub struct ControlPlaneSpecResponse {

View File

@@ -5,12 +5,15 @@
//! and connect it to the storage nodes.
use std::collections::HashMap;
use indexmap::IndexMap;
use regex::Regex;
use remote_storage::RemotePath;
use serde::{Deserialize, Serialize};
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
use crate::responses::TlsConfig;
/// String type alias representing Postgres identifier and
/// intended to be used for DB / role names.
pub type PgIdent = String;
@@ -125,7 +128,7 @@ pub struct ComputeSpec {
// information about available remote extensions
pub remote_extensions: Option<RemoteExtSpec>,
pub pgbouncer_settings: Option<HashMap<String, String>>,
pub pgbouncer_settings: Option<IndexMap<String, String>>,
// Stripe size for pageserver sharding, in pages
#[serde(default)]
@@ -357,6 +360,9 @@ pub struct LocalProxySpec {
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub jwks: Option<Vec<JwksSettings>>,
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub tls: Option<TlsConfig>,
}
#[derive(Clone, Debug, Deserialize, Serialize)]

View File

@@ -8,6 +8,7 @@ license.workspace = true
anyhow.workspace = true
bytes.workspace = true
fail.workspace = true
futures.workspace = true
hyper0.workspace = true
itertools.workspace = true
jemalloc_pprof.workspace = true
@@ -21,6 +22,7 @@ serde_path_to_error.workspace = true
thiserror.workspace = true
tracing.workspace = true
tokio.workspace = true
tokio-rustls.workspace = true
tokio-util.workspace = true
url.workspace = true
uuid.workspace = true

View File

@@ -399,12 +399,10 @@ pub async fn profile_cpu_handler(req: Request<Body>) -> Result<Response<Body>, A
// Return the report in the requested format.
match format {
Format::Pprof => {
let mut body = Vec::new();
report
let body = report
.pprof()
.map_err(|err| ApiError::InternalServerError(err.into()))?
.write_to_vec(&mut body)
.map_err(|err| ApiError::InternalServerError(err.into()))?;
.encode_to_vec();
Response::builder()
.status(200)

View File

@@ -3,9 +3,10 @@ pub mod error;
pub mod failpoints;
pub mod json;
pub mod request;
pub mod server;
extern crate hyper0 as hyper;
/// Current fast way to apply simple http routing in various Neon binaries.
/// Re-exported for sake of uniform approach, that could be later replaced with better alternatives, if needed.
pub use routerify::{RouterBuilder, RouterService, ext::RequestExt};
pub use routerify::{RequestServiceBuilder, RouterBuilder, RouterService, ext::RequestExt};

View File

@@ -0,0 +1,155 @@
use std::{error::Error, sync::Arc};
use futures::StreamExt;
use futures::stream::FuturesUnordered;
use hyper0::Body;
use hyper0::server::conn::Http;
use routerify::{RequestService, RequestServiceBuilder};
use tokio::io::{AsyncRead, AsyncWrite};
use tokio_rustls::TlsAcceptor;
use tokio_util::sync::CancellationToken;
use tracing::{error, info};
use crate::error::ApiError;
/// A simple HTTP server over hyper library.
/// You may want to use it instead of [`hyper0::server::Server`] because:
/// 1. hyper0's Server was removed from hyper v1.
/// It's recommended to replace hyepr0's Server with a manual loop, which is done here.
/// 2. hyper0's Server doesn't support TLS out of the box, and there is no way
/// to support it efficiently with the Accept trait that hyper0's Server uses.
/// That's one of the reasons why it was removed from v1.
/// <https://github.com/hyperium/hyper/blob/115339d3df50f20c8717680aa35f48858e9a6205/docs/ROADMAP.md#higher-level-client-and-server-problems>
pub struct Server {
request_service: Arc<RequestServiceBuilder<Body, ApiError>>,
listener: tokio::net::TcpListener,
tls_acceptor: Option<TlsAcceptor>,
}
impl Server {
pub fn new(
request_service: Arc<RequestServiceBuilder<Body, ApiError>>,
listener: std::net::TcpListener,
tls_acceptor: Option<TlsAcceptor>,
) -> anyhow::Result<Self> {
// Note: caller of from_std is responsible for setting nonblocking mode.
listener.set_nonblocking(true)?;
let listener = tokio::net::TcpListener::from_std(listener)?;
Ok(Self {
request_service,
listener,
tls_acceptor,
})
}
pub async fn serve(self, cancel: CancellationToken) -> anyhow::Result<()> {
fn suppress_io_error(err: &std::io::Error) -> bool {
use std::io::ErrorKind::*;
matches!(err.kind(), ConnectionReset | ConnectionAborted | BrokenPipe)
}
fn suppress_hyper_error(err: &hyper0::Error) -> bool {
if err.is_incomplete_message() || err.is_closed() || err.is_timeout() {
return true;
}
if let Some(inner) = err.source() {
if let Some(io) = inner.downcast_ref::<std::io::Error>() {
return suppress_io_error(io);
}
}
false
}
let mut connections = FuturesUnordered::new();
loop {
tokio::select! {
stream = self.listener.accept() => {
let (tcp_stream, remote_addr) = match stream {
Ok(stream) => stream,
Err(err) => {
if !suppress_io_error(&err) {
info!("Failed to accept TCP connection: {err:#}");
}
continue;
}
};
let service = self.request_service.build(remote_addr);
let tls_acceptor = self.tls_acceptor.clone();
let cancel = cancel.clone();
connections.push(tokio::spawn(
async move {
match tls_acceptor {
Some(tls_acceptor) => {
// Handle HTTPS connection.
let tls_stream = tokio::select! {
tls_stream = tls_acceptor.accept(tcp_stream) => tls_stream,
_ = cancel.cancelled() => return,
};
let tls_stream = match tls_stream {
Ok(tls_stream) => tls_stream,
Err(err) => {
if !suppress_io_error(&err) {
info!("Failed to accept TLS connection: {err:#}");
}
return;
}
};
if let Err(err) = Self::serve_connection(tls_stream, service, cancel).await {
if !suppress_hyper_error(&err) {
info!("Failed to serve HTTPS connection: {err:#}");
}
}
}
None => {
// Handle HTTP connection.
if let Err(err) = Self::serve_connection(tcp_stream, service, cancel).await {
if !suppress_hyper_error(&err) {
info!("Failed to serve HTTP connection: {err:#}");
}
}
}
};
}));
}
Some(conn) = connections.next() => {
if let Err(err) = conn {
error!("Connection panicked: {err:#}");
}
}
_ = cancel.cancelled() => {
// Wait for graceful shutdown of all connections.
while let Some(conn) = connections.next().await {
if let Err(err) = conn {
error!("Connection panicked: {err:#}");
}
}
break;
}
}
}
Ok(())
}
/// Serves HTTP connection with graceful shutdown.
async fn serve_connection<I>(
io: I,
service: RequestService<Body, ApiError>,
cancel: CancellationToken,
) -> Result<(), hyper0::Error>
where
I: AsyncRead + AsyncWrite + Unpin + Send + 'static,
{
let mut conn = Http::new().serve_connection(io, service).with_upgrades();
tokio::select! {
res = &mut conn => res,
_ = cancel.cancelled() => {
Pin::new(&mut conn).graceful_shutdown();
// Note: connection should still be awaited for graceful shutdown to complete.
conn.await
}
}
}
}

View File

@@ -35,6 +35,7 @@ pub struct NodeMetadata {
pub postgres_port: u16,
pub http_host: String,
pub http_port: u16,
pub https_port: Option<u16>,
// Deployment tools may write fields to the metadata file beyond what we
// use in this type: this type intentionally only names fields that require.
@@ -57,6 +58,9 @@ pub struct ConfigToml {
// types mapped 1:1 into the runtime PageServerConfig type
pub listen_pg_addr: String,
pub listen_http_addr: String,
pub listen_https_addr: Option<String>,
pub ssl_key_file: Utf8PathBuf,
pub ssl_cert_file: Utf8PathBuf,
pub availability_zone: Option<String>,
#[serde(with = "humantime_serde")]
pub wait_lsn_timeout: Duration,
@@ -268,15 +272,16 @@ pub struct TenantConfigToml {
/// size exceeds `compaction_upper_limit * checkpoint_distance`.
pub compaction_upper_limit: usize,
pub compaction_algorithm: crate::models::CompactionAlgorithmSettings,
/// If true, compact down L0 across all tenant timelines before doing regular compaction.
/// If true, compact down L0 across all tenant timelines before doing regular compaction. L0
/// compaction must be responsive to avoid read amp during heavy ingestion. Defaults to true.
pub compaction_l0_first: bool,
/// If true, use a separate semaphore (i.e. concurrency limit) for the L0 compaction pass. Only
/// has an effect if `compaction_l0_first` is `true`.
/// has an effect if `compaction_l0_first` is true. Defaults to true.
pub compaction_l0_semaphore: bool,
/// Level0 delta layer threshold at which to delay layer flushes for compaction backpressure,
/// such that they take 2x as long, and start waiting for layer flushes during ephemeral layer
/// rolls. This helps compaction keep up with WAL ingestion, and avoids read amplification
/// blowing up. Should be >compaction_threshold. 0 to disable. Disabled by default.
/// Level0 delta layer threshold at which to delay layer flushes such that they take 2x as long,
/// and block on layer flushes during ephemeral layer rolls, for compaction backpressure. This
/// helps compaction keep up with WAL ingestion, and avoids read amplification blowing up.
/// Should be >compaction_threshold. 0 to disable. Defaults to 3x compaction_threshold.
pub l0_flush_delay_threshold: Option<usize>,
/// Level0 delta layer threshold at which to stall layer flushes. Must be >compaction_threshold
/// to avoid deadlock. 0 to disable. Disabled by default.
@@ -421,6 +426,9 @@ pub mod defaults {
pub const DEFAULT_WAL_RECEIVER_PROTOCOL: utils::postgres_client::PostgresClientProtocol =
utils::postgres_client::PostgresClientProtocol::Vanilla;
pub const DEFAULT_SSL_KEY_FILE: &str = "server.key";
pub const DEFAULT_SSL_CERT_FILE: &str = "server.crt";
}
impl Default for ConfigToml {
@@ -430,6 +438,9 @@ impl Default for ConfigToml {
Self {
listen_pg_addr: (DEFAULT_PG_LISTEN_ADDR.to_string()),
listen_http_addr: (DEFAULT_HTTP_LISTEN_ADDR.to_string()),
listen_https_addr: (None),
ssl_key_file: Utf8PathBuf::from(DEFAULT_SSL_KEY_FILE),
ssl_cert_file: Utf8PathBuf::from(DEFAULT_SSL_CERT_FILE),
availability_zone: (None),
wait_lsn_timeout: (humantime::parse_duration(DEFAULT_WAIT_LSN_TIMEOUT)
.expect("cannot parse default wait lsn timeout")),
@@ -557,7 +568,9 @@ pub mod tenant_conf_defaults {
// be reduced later by optimizing L0 hole calculation to avoid loading all keys into memory). So
// with this config, we can get a maximum peak compaction usage of 9 GB.
pub const DEFAULT_COMPACTION_UPPER_LIMIT: usize = 20;
pub const DEFAULT_COMPACTION_L0_FIRST: bool = false;
// Enable L0 compaction pass and semaphore by default. L0 compaction must be responsive to avoid
// read amp.
pub const DEFAULT_COMPACTION_L0_FIRST: bool = true;
pub const DEFAULT_COMPACTION_L0_SEMAPHORE: bool = true;
pub const DEFAULT_COMPACTION_ALGORITHM: crate::models::CompactionAlgorithm =
@@ -574,9 +587,8 @@ pub mod tenant_conf_defaults {
pub const DEFAULT_GC_PERIOD: &str = "1 hr";
pub const DEFAULT_IMAGE_CREATION_THRESHOLD: usize = 3;
// If there are more than threshold * compaction_threshold (that is 3 * 10 in the default config) L0 layers, image
// layer creation will end immediately. Set to 0 to disable. The target default will be 3 once we
// want to enable this feature.
pub const DEFAULT_IMAGE_CREATION_PREEMPT_THRESHOLD: usize = 0;
// layer creation will end immediately. Set to 0 to disable.
pub const DEFAULT_IMAGE_CREATION_PREEMPT_THRESHOLD: usize = 3;
pub const DEFAULT_PITR_INTERVAL: &str = "7 days";
pub const DEFAULT_WALRECEIVER_CONNECT_TIMEOUT: &str = "10 seconds";
pub const DEFAULT_WALRECEIVER_LAGGING_WAL_TIMEOUT: &str = "10 seconds";

View File

@@ -16,6 +16,30 @@ fn test_node_metadata_v1_backward_compatibilty() {
postgres_port: 23,
http_host: "localhost".to_string(),
http_port: 42,
https_port: None,
other: HashMap::new(),
}
)
}
#[test]
fn test_node_metadata_v2_backward_compatibilty() {
let v2 = serde_json::to_vec(&serde_json::json!({
"host": "localhost",
"port": 23,
"http_host": "localhost",
"http_port": 42,
"https_port": 123,
}));
assert_eq!(
serde_json::from_slice::<NodeMetadata>(&v2.unwrap()).unwrap(),
NodeMetadata {
postgres_host: "localhost".to_string(),
postgres_port: 23,
http_host: "localhost".to_string(),
http_port: 42,
https_port: Some(123),
other: HashMap::new(),
}
)

View File

@@ -182,20 +182,66 @@ pub struct TenantDescribeResponseShard {
#[derive(Serialize, Deserialize, Debug)]
pub struct TenantShardMigrateRequest {
pub node_id: NodeId,
/// Optionally, callers may specify the node they are migrating _from_, and the server will
/// reject the request if the shard is no longer attached there: this enables writing safer
/// clients that don't risk fighting with some other movement of the shard.
#[serde(default)]
pub migration_config: Option<MigrationConfig>,
pub origin_node_id: Option<NodeId>,
#[serde(default)]
pub migration_config: MigrationConfig,
}
#[derive(Serialize, Deserialize, Debug)]
#[derive(Serialize, Deserialize, Debug, PartialEq, Eq)]
pub struct MigrationConfig {
/// If true, the migration will be executed even if it is to a location with a sub-optimal scheduling
/// score: this is usually not what you want, and if you use this then you'll also need to set the
/// tenant's scheduling policy to Essential or Pause to avoid the optimiser reverting your migration.
///
/// Default: false
#[serde(default)]
pub override_scheduler: bool,
/// If true, the migration will be done gracefully by creating a secondary location first and
/// waiting for it to warm up before cutting over. If false, if there is no existing secondary
/// location at the destination, the tenant will be migrated immediately. If the tenant's data
/// can't be downloaded within [`Self::secondary_warmup_timeout`], then the migration will go
/// ahead but run with a cold cache that can severely reduce performance until it warms up.
///
/// When doing a graceful migration, the migration API returns as soon as it is started.
///
/// Default: true
#[serde(default = "default_prewarm")]
pub prewarm: bool,
/// For non-prewarm migrations which will immediately enter a cutover to the new node: how long to wait
/// overall for secondary warmup before cutting over
#[serde(default)]
#[serde(with = "humantime_serde")]
pub secondary_warmup_timeout: Option<Duration>,
/// For non-prewarm migrations which will immediately enter a cutover to the new node: how long to wait
/// within each secondary download poll call to pageserver.
#[serde(default)]
#[serde(with = "humantime_serde")]
pub secondary_download_request_timeout: Option<Duration>,
}
fn default_prewarm() -> bool {
true
}
impl Default for MigrationConfig {
fn default() -> Self {
Self {
override_scheduler: false,
prewarm: default_prewarm(),
secondary_warmup_timeout: None,
secondary_download_request_timeout: None,
}
}
}
#[derive(Serialize, Clone, Debug)]
#[serde(into = "NodeAvailabilityWrapper")]
pub enum NodeAvailability {
@@ -443,6 +489,7 @@ pub struct SafekeeperDescribeResponse {
pub host: String,
pub port: i32,
pub http_port: i32,
pub https_port: Option<i32>,
pub availability_zone_id: String,
pub scheduling_policy: SkSchedulingPolicy,
}
@@ -487,4 +534,43 @@ mod test {
err
);
}
/// Check that a minimal migrate request with no config results in the expected default settings
#[test]
fn test_migrate_request_decode_defaults() {
let json = r#"{
"node_id": 123
}"#;
let request: TenantShardMigrateRequest = serde_json::from_str(json).unwrap();
assert_eq!(request.node_id, NodeId(123));
assert_eq!(request.origin_node_id, None);
assert!(!request.migration_config.override_scheduler);
assert!(request.migration_config.prewarm);
assert_eq!(request.migration_config.secondary_warmup_timeout, None);
assert_eq!(
request.migration_config.secondary_download_request_timeout,
None
);
}
/// Check that a partially specified migration config results in the expected default settings
#[test]
fn test_migration_config_decode_defaults() {
// Specify just one field of the config
let json = r#"{
}"#;
let config: MigrationConfig = serde_json::from_str(json).unwrap();
// Check each field's expected default value
assert!(!config.override_scheduler);
assert!(config.prewarm);
assert_eq!(config.secondary_warmup_timeout, None);
assert_eq!(config.secondary_download_request_timeout, None);
assert_eq!(config.secondary_warmup_timeout, None);
// Consistency check that the Default impl agrees with our serde defaults
assert_eq!(MigrationConfig::default(), config);
}
}

View File

@@ -274,6 +274,31 @@ pub struct TimelineCreateRequest {
pub mode: TimelineCreateRequestMode,
}
/// Storage controller specific extensions to [`TimelineInfo`].
#[derive(Serialize, Deserialize, Clone)]
pub struct TimelineCreateResponseStorcon {
#[serde(flatten)]
pub timeline_info: TimelineInfo,
pub safekeepers: Option<SafekeepersInfo>,
}
/// Safekeepers as returned in timeline creation request to storcon or pushed to
/// cplane in the post migration hook.
#[derive(Serialize, Deserialize, Clone)]
pub struct SafekeepersInfo {
pub tenant_id: TenantId,
pub timeline_id: TimelineId,
pub generation: u32,
pub safekeepers: Vec<SafekeeperInfo>,
}
#[derive(Serialize, Deserialize, Clone)]
pub struct SafekeeperInfo {
pub id: NodeId,
pub hostname: String,
}
#[derive(Serialize, Deserialize, Clone)]
#[serde(untagged)]
pub enum TimelineCreateRequestMode {
@@ -1146,6 +1171,15 @@ pub struct TimelineArchivalConfigRequest {
pub state: TimelineArchivalState,
}
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone)]
pub struct TimelinePatchIndexPartRequest {
pub rel_size_migration: Option<RelSizeMigration>,
pub gc_compaction_last_completed_lsn: Option<Lsn>,
pub applied_gc_cutoff_lsn: Option<Lsn>,
#[serde(default)]
pub force_index_update: bool,
}
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct TimelinesInfoAndOffloaded {
pub timelines: Vec<TimelineInfo>,
@@ -1191,9 +1225,10 @@ pub struct TimelineInfo {
pub last_record_lsn: Lsn,
pub prev_record_lsn: Option<Lsn>,
/// Legacy field for compat with control plane. Synonym of `min_readable_lsn`.
/// TODO: remove once control plane no longer reads it.
pub latest_gc_cutoff_lsn: Lsn,
/// Legacy field, retained for one version to enable old storage controller to
/// decode (it was a mandatory field).
#[serde(default, rename = "latest_gc_cutoff_lsn")]
pub _unused: Lsn,
/// The LSN up to which GC has advanced: older data may still exist but it is not available for clients.
/// This LSN is not suitable for deciding where to create branches etc: use [`TimelineInfo::min_readable_lsn`] instead,
@@ -1442,8 +1477,14 @@ pub struct TenantScanRemoteStorageResponse {
#[derive(Serialize, Deserialize, Debug, Clone)]
#[serde(rename_all = "snake_case")]
pub enum TenantSorting {
/// Total size of layers on local disk for all timelines in a shard.
ResidentSize,
/// The logical size of the largest timeline within a _tenant_ (not shard). Only tracked on
/// shard 0, contains the sum across all shards.
MaxLogicalSize,
/// The logical size of the largest timeline within a _tenant_ (not shard), divided by number of
/// shards. Only tracked on shard 0, and estimates the per-shard logical size.
MaxLogicalSizePerShard,
}
impl Default for TenantSorting {
@@ -1473,14 +1514,20 @@ pub struct TopTenantShardsRequest {
pub struct TopTenantShardItem {
pub id: TenantShardId,
/// Total size of layers on local disk for all timelines in this tenant
/// Total size of layers on local disk for all timelines in this shard.
pub resident_size: u64,
/// Total size of layers in remote storage for all timelines in this tenant
/// Total size of layers in remote storage for all timelines in this shard.
pub physical_size: u64,
/// The largest logical size of a timeline within this tenant
/// The largest logical size of a timeline within this _tenant_ (not shard). This is only
/// tracked on shard 0, and contains the sum of the logical size across all shards.
pub max_logical_size: u64,
/// The largest logical size of a timeline within this _tenant_ (not shard) divided by number of
/// shards. This is only tracked on shard 0, and is only an estimate as we divide it evenly by
/// shard count, rounded up.
pub max_logical_size_per_shard: u64,
}
#[derive(Serialize, Deserialize, Debug, Default)]

View File

@@ -112,6 +112,16 @@ impl ShardIdentity {
}
}
/// An unsharded identity with the given stripe size (if non-zero). This is typically used to
/// carry over a stripe size for an unsharded tenant from persistent storage.
pub fn unsharded_with_stripe_size(stripe_size: ShardStripeSize) -> Self {
let mut shard_identity = Self::unsharded();
if stripe_size.0 > 0 {
shard_identity.stripe_size = stripe_size;
}
shard_identity
}
/// A broken instance of this type is only used for `TenantState::Broken` tenants,
/// which are constructed in code paths that don't have access to proper configuration.
///

View File

@@ -396,6 +396,14 @@ pub mod waldecoder {
self.lsn + self.inputbuf.remaining() as u64
}
/// Returns the LSN up to which the WAL decoder has processed.
///
/// If [`Self::poll_decode`] returned a record, then this will return
/// the end LSN of said record.
pub fn lsn(&self) -> Lsn {
self.lsn
}
pub fn feed_bytes(&mut self, buf: &[u8]) {
self.inputbuf.extend_from_slice(buf);
}

View File

@@ -135,8 +135,8 @@ impl Type {
pub enum Kind {
/// A simple type like `VARCHAR` or `INTEGER`.
Simple,
/// An enumerated type along with its variants.
Enum(Vec<String>),
/// An enumerated type.
Enum,
/// A pseudo-type.
Pseudo,
/// An array type along with the type of its elements.
@@ -146,9 +146,9 @@ pub enum Kind {
/// A multirange type along with the type of its elements.
Multirange(Type),
/// A domain type along with its underlying type.
Domain(Type),
/// A composite type along with information about its fields.
Composite(Vec<Field>),
Domain(Oid),
/// A composite type.
Composite(Oid),
}
/// Information about a field of a composite type.

View File

@@ -19,10 +19,10 @@ use crate::config::{Host, SslMode};
use crate::connection::{Request, RequestMessages};
use crate::query::RowStream;
use crate::simple_query::SimpleQueryStream;
use crate::types::{Oid, ToSql, Type};
use crate::types::{Oid, Type};
use crate::{
CancelToken, Error, ReadyForQueryStatus, Row, SimpleQueryMessage, Statement, Transaction,
TransactionBuilder, query, simple_query, slice_iter,
CancelToken, Error, ReadyForQueryStatus, SimpleQueryMessage, Statement, Transaction,
TransactionBuilder, query, simple_query,
};
pub struct Responses {
@@ -54,26 +54,18 @@ impl Responses {
/// A cache of type info and prepared statements for fetching type info
/// (corresponding to the queries in the [crate::prepare] module).
#[derive(Default)]
struct CachedTypeInfo {
pub(crate) struct CachedTypeInfo {
/// A statement for basic information for a type from its
/// OID. Corresponds to [TYPEINFO_QUERY](crate::prepare::TYPEINFO_QUERY) (or its
/// fallback).
typeinfo: Option<Statement>,
/// A statement for getting information for a composite type from its OID.
/// Corresponds to [TYPEINFO_QUERY](crate::prepare::TYPEINFO_COMPOSITE_QUERY).
typeinfo_composite: Option<Statement>,
/// A statement for getting information for a composite type from its OID.
/// Corresponds to [TYPEINFO_QUERY](crate::prepare::TYPEINFO_COMPOSITE_QUERY) (or
/// its fallback).
typeinfo_enum: Option<Statement>,
pub(crate) typeinfo: Option<Statement>,
/// Cache of types already looked up.
types: HashMap<Oid, Type>,
pub(crate) types: HashMap<Oid, Type>,
}
pub struct InnerClient {
sender: mpsc::UnboundedSender<Request>,
cached_typeinfo: Mutex<CachedTypeInfo>,
/// A buffer to use when writing out postgres commands.
buffer: Mutex<BytesMut>,
@@ -91,38 +83,6 @@ impl InnerClient {
})
}
pub fn typeinfo(&self) -> Option<Statement> {
self.cached_typeinfo.lock().typeinfo.clone()
}
pub fn set_typeinfo(&self, statement: &Statement) {
self.cached_typeinfo.lock().typeinfo = Some(statement.clone());
}
pub fn typeinfo_composite(&self) -> Option<Statement> {
self.cached_typeinfo.lock().typeinfo_composite.clone()
}
pub fn set_typeinfo_composite(&self, statement: &Statement) {
self.cached_typeinfo.lock().typeinfo_composite = Some(statement.clone());
}
pub fn typeinfo_enum(&self) -> Option<Statement> {
self.cached_typeinfo.lock().typeinfo_enum.clone()
}
pub fn set_typeinfo_enum(&self, statement: &Statement) {
self.cached_typeinfo.lock().typeinfo_enum = Some(statement.clone());
}
pub fn type_(&self, oid: Oid) -> Option<Type> {
self.cached_typeinfo.lock().types.get(&oid).cloned()
}
pub fn set_type(&self, oid: Oid, type_: &Type) {
self.cached_typeinfo.lock().types.insert(oid, type_.clone());
}
/// Call the given function with a buffer to be used when writing out
/// postgres commands.
pub fn with_buf<F, R>(&self, f: F) -> R
@@ -142,7 +102,6 @@ pub struct SocketConfig {
pub host: Host,
pub port: u16,
pub connect_timeout: Option<Duration>,
// pub keepalive: Option<KeepaliveConfig>,
}
/// An asynchronous PostgreSQL client.
@@ -151,6 +110,7 @@ pub struct SocketConfig {
/// through this client object.
pub struct Client {
inner: Arc<InnerClient>,
cached_typeinfo: CachedTypeInfo,
socket_config: SocketConfig,
ssl_mode: SslMode,
@@ -169,9 +129,9 @@ impl Client {
Client {
inner: Arc::new(InnerClient {
sender,
cached_typeinfo: Default::default(),
buffer: Default::default(),
}),
cached_typeinfo: Default::default(),
socket_config,
ssl_mode,
@@ -189,55 +149,6 @@ impl Client {
&self.inner
}
/// Executes a statement, returning a vector of the resulting rows.
///
/// A statement may contain parameters, specified by `$n`, where `n` is the index of the parameter of the list
/// provided, 1-indexed.
///
/// The `statement` argument can either be a `Statement`, or a raw query string. If the same statement will be
/// repeatedly executed (perhaps with different query parameters), consider preparing the statement up front
/// with the `prepare` method.
///
/// # Panics
///
/// Panics if the number of parameters provided does not match the number expected.
pub async fn query(
&self,
statement: Statement,
params: &[&(dyn ToSql + Sync)],
) -> Result<Vec<Row>, Error> {
self.query_raw(statement, slice_iter(params))
.await?
.try_collect()
.await
}
/// The maximally flexible version of [`query`].
///
/// A statement may contain parameters, specified by `$n`, where `n` is the index of the parameter of the list
/// provided, 1-indexed.
///
/// The `statement` argument can either be a `Statement`, or a raw query string. If the same statement will be
/// repeatedly executed (perhaps with different query parameters), consider preparing the statement up front
/// with the `prepare` method.
///
/// # Panics
///
/// Panics if the number of parameters provided does not match the number expected.
///
/// [`query`]: #method.query
pub async fn query_raw<'a, I>(
&self,
statement: Statement,
params: I,
) -> Result<RowStream, Error>
where
I: IntoIterator<Item = &'a (dyn ToSql + Sync)>,
I::IntoIter: ExactSizeIterator,
{
query::query(&self.inner, statement, params).await
}
/// Pass text directly to the Postgres backend to allow it to sort out typing itself and
/// to save a roundtrip
pub async fn query_raw_txt<S, I>(&self, statement: &str, params: I) -> Result<RowStream, Error>
@@ -284,6 +195,14 @@ impl Client {
simple_query::batch_execute(self.inner(), query).await
}
pub async fn discard_all(&mut self) -> Result<ReadyForQueryStatus, Error> {
// clear the prepared statements that are about to be nuked from the postgres session
self.cached_typeinfo.typeinfo = None;
self.batch_execute("discard all").await
}
/// Begins a new database transaction.
///
/// The transaction will roll back by default - use the `commit` method to commit it.
@@ -347,8 +266,8 @@ impl Client {
}
/// Query for type information
pub async fn get_type(&self, oid: Oid) -> Result<Type, Error> {
crate::prepare::get_type(&self.inner, oid).await
pub(crate) async fn get_type_inner(&mut self, oid: Oid) -> Result<Type, Error> {
crate::prepare::get_type(&self.inner, &mut self.cached_typeinfo, oid).await
}
/// Determines if the connection to the server has already closed.

View File

@@ -22,7 +22,7 @@ pub trait GenericClient: private::Sealed {
I::IntoIter: ExactSizeIterator + Sync + Send;
/// Query for type information
async fn get_type(&self, oid: Oid) -> Result<Type, Error>;
async fn get_type(&mut self, oid: Oid) -> Result<Type, Error>;
}
impl private::Sealed for Client {}
@@ -38,8 +38,8 @@ impl GenericClient for Client {
}
/// Query for type information
async fn get_type(&self, oid: Oid) -> Result<Type, Error> {
crate::prepare::get_type(self.inner(), oid).await
async fn get_type(&mut self, oid: Oid) -> Result<Type, Error> {
self.get_type_inner(oid).await
}
}
@@ -56,7 +56,7 @@ impl GenericClient for Transaction<'_> {
}
/// Query for type information
async fn get_type(&self, oid: Oid) -> Result<Type, Error> {
self.client().get_type(oid).await
async fn get_type(&mut self, oid: Oid) -> Result<Type, Error> {
self.client_mut().get_type(oid).await
}
}

View File

@@ -9,10 +9,10 @@ use log::debug;
use postgres_protocol2::message::backend::Message;
use postgres_protocol2::message::frontend;
use crate::client::InnerClient;
use crate::client::{CachedTypeInfo, InnerClient};
use crate::codec::FrontendMessage;
use crate::connection::RequestMessages;
use crate::types::{Field, Kind, Oid, Type};
use crate::types::{Kind, Oid, Type};
use crate::{Column, Error, Statement, query, slice_iter};
pub(crate) const TYPEINFO_QUERY: &str = "\
@@ -23,23 +23,7 @@ INNER JOIN pg_catalog.pg_namespace n ON t.typnamespace = n.oid
WHERE t.oid = $1
";
const TYPEINFO_ENUM_QUERY: &str = "\
SELECT enumlabel
FROM pg_catalog.pg_enum
WHERE enumtypid = $1
ORDER BY enumsortorder
";
pub(crate) const TYPEINFO_COMPOSITE_QUERY: &str = "\
SELECT attname, atttypid
FROM pg_catalog.pg_attribute
WHERE attrelid = $1
AND NOT attisdropped
AND attnum > 0
ORDER BY attnum
";
pub async fn prepare(
async fn prepare_typecheck(
client: &Arc<InnerClient>,
name: &'static str,
query: &str,
@@ -67,7 +51,7 @@ pub async fn prepare(
let mut parameters = vec![];
let mut it = parameter_description.parameters();
while let Some(oid) = it.next().map_err(Error::parse)? {
let type_ = get_type(client, oid).await?;
let type_ = Type::from_oid(oid).ok_or_else(Error::unexpected_message)?;
parameters.push(type_);
}
@@ -75,7 +59,7 @@ pub async fn prepare(
if let Some(row_description) = row_description {
let mut it = row_description.fields();
while let Some(field) = it.next().map_err(Error::parse)? {
let type_ = get_type(client, field.type_oid()).await?;
let type_ = Type::from_oid(field.type_oid()).ok_or_else(Error::unexpected_message)?;
let column = Column::new(field.name().to_string(), type_, field);
columns.push(column);
}
@@ -84,15 +68,6 @@ pub async fn prepare(
Ok(Statement::new(client, name, parameters, columns))
}
fn prepare_rec<'a>(
client: &'a Arc<InnerClient>,
name: &'static str,
query: &'a str,
types: &'a [Type],
) -> Pin<Box<dyn Future<Output = Result<Statement, Error>> + 'a + Send>> {
Box::pin(prepare(client, name, query, types))
}
fn encode(client: &InnerClient, name: &str, query: &str, types: &[Type]) -> Result<Bytes, Error> {
if types.is_empty() {
debug!("preparing query {}: {}", name, query);
@@ -108,16 +83,20 @@ fn encode(client: &InnerClient, name: &str, query: &str, types: &[Type]) -> Resu
})
}
pub async fn get_type(client: &Arc<InnerClient>, oid: Oid) -> Result<Type, Error> {
pub async fn get_type(
client: &Arc<InnerClient>,
typecache: &mut CachedTypeInfo,
oid: Oid,
) -> Result<Type, Error> {
if let Some(type_) = Type::from_oid(oid) {
return Ok(type_);
}
if let Some(type_) = client.type_(oid) {
return Ok(type_);
}
if let Some(type_) = typecache.types.get(&oid) {
return Ok(type_.clone());
};
let stmt = typeinfo_statement(client).await?;
let stmt = typeinfo_statement(client, typecache).await?;
let rows = query::query(client, stmt, slice_iter(&[&oid])).await?;
pin_mut!(rows);
@@ -136,100 +115,48 @@ pub async fn get_type(client: &Arc<InnerClient>, oid: Oid) -> Result<Type, Error
let relid: Oid = row.try_get(6)?;
let kind = if type_ == b'e' as i8 {
let variants = get_enum_variants(client, oid).await?;
Kind::Enum(variants)
Kind::Enum
} else if type_ == b'p' as i8 {
Kind::Pseudo
} else if basetype != 0 {
let type_ = get_type_rec(client, basetype).await?;
Kind::Domain(type_)
Kind::Domain(basetype)
} else if elem_oid != 0 {
let type_ = get_type_rec(client, elem_oid).await?;
let type_ = get_type_rec(client, typecache, elem_oid).await?;
Kind::Array(type_)
} else if relid != 0 {
let fields = get_composite_fields(client, relid).await?;
Kind::Composite(fields)
Kind::Composite(relid)
} else if let Some(rngsubtype) = rngsubtype {
let type_ = get_type_rec(client, rngsubtype).await?;
let type_ = get_type_rec(client, typecache, rngsubtype).await?;
Kind::Range(type_)
} else {
Kind::Simple
};
let type_ = Type::new(name, oid, kind, schema);
client.set_type(oid, &type_);
typecache.types.insert(oid, type_.clone());
Ok(type_)
}
fn get_type_rec<'a>(
client: &'a Arc<InnerClient>,
typecache: &'a mut CachedTypeInfo,
oid: Oid,
) -> Pin<Box<dyn Future<Output = Result<Type, Error>> + Send + 'a>> {
Box::pin(get_type(client, oid))
Box::pin(get_type(client, typecache, oid))
}
async fn typeinfo_statement(client: &Arc<InnerClient>) -> Result<Statement, Error> {
if let Some(stmt) = client.typeinfo() {
return Ok(stmt);
async fn typeinfo_statement(
client: &Arc<InnerClient>,
typecache: &mut CachedTypeInfo,
) -> Result<Statement, Error> {
if let Some(stmt) = &typecache.typeinfo {
return Ok(stmt.clone());
}
let typeinfo = "neon_proxy_typeinfo";
let stmt = prepare_rec(client, typeinfo, TYPEINFO_QUERY, &[]).await?;
let stmt = prepare_typecheck(client, typeinfo, TYPEINFO_QUERY, &[]).await?;
client.set_typeinfo(&stmt);
Ok(stmt)
}
async fn get_enum_variants(client: &Arc<InnerClient>, oid: Oid) -> Result<Vec<String>, Error> {
let stmt = typeinfo_enum_statement(client).await?;
query::query(client, stmt, slice_iter(&[&oid]))
.await?
.and_then(|row| async move { row.try_get(0) })
.try_collect()
.await
}
async fn typeinfo_enum_statement(client: &Arc<InnerClient>) -> Result<Statement, Error> {
if let Some(stmt) = client.typeinfo_enum() {
return Ok(stmt);
}
let typeinfo = "neon_proxy_typeinfo_enum";
let stmt = prepare_rec(client, typeinfo, TYPEINFO_ENUM_QUERY, &[]).await?;
client.set_typeinfo_enum(&stmt);
Ok(stmt)
}
async fn get_composite_fields(client: &Arc<InnerClient>, oid: Oid) -> Result<Vec<Field>, Error> {
let stmt = typeinfo_composite_statement(client).await?;
let rows = query::query(client, stmt, slice_iter(&[&oid]))
.await?
.try_collect::<Vec<_>>()
.await?;
let mut fields = vec![];
for row in rows {
let name = row.try_get(0)?;
let oid = row.try_get(1)?;
let type_ = get_type_rec(client, oid).await?;
fields.push(Field::new(name, type_));
}
Ok(fields)
}
async fn typeinfo_composite_statement(client: &Arc<InnerClient>) -> Result<Statement, Error> {
if let Some(stmt) = client.typeinfo_composite() {
return Ok(stmt);
}
let typeinfo = "neon_proxy_typeinfo_composite";
let stmt = prepare_rec(client, typeinfo, TYPEINFO_COMPOSITE_QUERY, &[]).await?;
client.set_typeinfo_composite(&stmt);
typecache.typeinfo = Some(stmt.clone());
Ok(stmt)
}

View File

@@ -72,4 +72,9 @@ impl<'a> Transaction<'a> {
pub fn client(&self) -> &Client {
self.client
}
/// Returns a reference to the underlying `Client`.
pub fn client_mut(&mut self) -> &mut Client {
self.client
}
}

View File

@@ -131,6 +131,14 @@ impl Configuration {
}
}
pub fn new(members: MemberSet) -> Self {
Configuration {
generation: INITIAL_GENERATION,
members,
new_members: None,
}
}
/// Is `sk_id` member of the configuration?
pub fn contains(&self, sk_id: NodeId) -> bool {
self.members.contains(sk_id) || self.new_members.as_ref().is_some_and(|m| m.contains(sk_id))

View File

@@ -18,7 +18,7 @@ pub struct SafekeeperStatus {
pub id: NodeId,
}
#[derive(Serialize, Deserialize)]
#[derive(Serialize, Deserialize, Clone)]
pub struct TimelineCreateRequest {
pub tenant_id: TenantId,
pub timeline_id: TimelineId,
@@ -283,7 +283,7 @@ pub struct SafekeeperUtilization {
}
/// pull_timeline request body.
#[derive(Debug, Deserialize, Serialize)]
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct PullTimelineRequest {
pub tenant_id: TenantId,
pub timeline_id: TimelineId,

View File

@@ -21,7 +21,7 @@
//! .with_writer(std::io::stderr);
//!
//! // Initialize OpenTelemetry. Exports tracing spans as OpenTelemetry traces
//! let otlp_layer = tracing_utils::init_tracing("my_application").await;
//! let otlp_layer = tracing_utils::init_tracing("my_application", tracing_utils::ExportConfig::default()).await;
//!
//! // Put it all together
//! tracing_subscriber::registry()
@@ -38,8 +38,12 @@ pub mod http;
use opentelemetry::KeyValue;
use opentelemetry::trace::TracerProvider;
use tracing::Subscriber;
use opentelemetry_otlp::WithExportConfig;
pub use opentelemetry_otlp::{ExportConfig, Protocol};
use tracing::level_filters::LevelFilter;
use tracing::{Dispatch, Subscriber};
use tracing_subscriber::Layer;
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::registry::LookupSpan;
/// Set up OpenTelemetry exporter, using configuration from environment variables.
@@ -69,19 +73,28 @@ use tracing_subscriber::registry::LookupSpan;
///
/// This doesn't block, but is marked as 'async' to hint that this must be called in
/// asynchronous execution context.
pub async fn init_tracing<S>(service_name: &str) -> Option<impl Layer<S>>
pub async fn init_tracing<S>(
service_name: &str,
export_config: ExportConfig,
) -> Option<impl Layer<S>>
where
S: Subscriber + for<'span> LookupSpan<'span>,
{
if std::env::var("OTEL_SDK_DISABLED") == Ok("true".to_string()) {
return None;
};
Some(init_tracing_internal(service_name.to_string()))
Some(init_tracing_internal(
service_name.to_string(),
export_config,
))
}
/// Like `init_tracing`, but creates a separate tokio Runtime for the tracing
/// tasks.
pub fn init_tracing_without_runtime<S>(service_name: &str) -> Option<impl Layer<S>>
pub fn init_tracing_without_runtime<S>(
service_name: &str,
export_config: ExportConfig,
) -> Option<impl Layer<S>>
where
S: Subscriber + for<'span> LookupSpan<'span>,
{
@@ -112,16 +125,22 @@ where
));
let _guard = runtime.enter();
Some(init_tracing_internal(service_name.to_string()))
Some(init_tracing_internal(
service_name.to_string(),
export_config,
))
}
fn init_tracing_internal<S>(service_name: String) -> impl Layer<S>
fn init_tracing_internal<S>(service_name: String, export_config: ExportConfig) -> impl Layer<S>
where
S: Subscriber + for<'span> LookupSpan<'span>,
{
// Sets up exporter from the OTEL_EXPORTER_* environment variables.
// Sets up exporter from the provided [`ExportConfig`] parameter.
// If the endpoint is not specified, it is loaded from the
// OTEL_EXPORTER_OTLP_ENDPOINT environment variable.
let exporter = opentelemetry_otlp::SpanExporter::builder()
.with_http()
.with_export_config(export_config)
.build()
.expect("could not initialize opentelemetry exporter");
@@ -151,3 +170,51 @@ where
pub fn shutdown_tracing() {
opentelemetry::global::shutdown_tracer_provider();
}
pub enum OtelEnablement {
Disabled,
Enabled {
service_name: String,
export_config: ExportConfig,
runtime: &'static tokio::runtime::Runtime,
},
}
pub struct OtelGuard {
pub dispatch: Dispatch,
}
impl Drop for OtelGuard {
fn drop(&mut self) {
shutdown_tracing();
}
}
/// Initializes OTEL infrastructure for performance tracing according to the provided configuration
///
/// Performance tracing is handled by a different [`tracing::Subscriber`]. This functions returns
/// an [`OtelGuard`] containing a [`tracing::Dispatch`] associated with a newly created subscriber.
/// Applications should use this dispatch for their performance traces.
///
/// The lifetime of the guard should match taht of the application. On drop, it tears down the
/// OTEL infra.
pub fn init_performance_tracing(otel_enablement: OtelEnablement) -> Option<OtelGuard> {
let otel_subscriber = match otel_enablement {
OtelEnablement::Disabled => None,
OtelEnablement::Enabled {
service_name,
export_config,
runtime,
} => {
let otel_layer = runtime
.block_on(init_tracing(&service_name, export_config))
.with_filter(LevelFilter::INFO);
let otel_subscriber = tracing_subscriber::registry().with(otel_layer);
let otel_dispatch = Dispatch::new(otel_subscriber);
Some(otel_dispatch)
}
};
otel_subscriber.map(|dispatch| OtelGuard { dispatch })
}

View File

@@ -42,6 +42,7 @@ toml_edit = { workspace = true, features = ["serde"] }
tracing.workspace = true
tracing-error.workspace = true
tracing-subscriber = { workspace = true, features = ["json", "registry"] }
tracing-utils.workspace = true
rand.workspace = true
scopeguard.workspace = true
strum.workspace = true

View File

@@ -49,7 +49,13 @@ pub fn bench_log_slow(c: &mut Criterion) {
// performance too. Use a simple noop future that yields once, to avoid any scheduler fast
// paths for a ready future.
if enabled {
b.iter(|| runtime.block_on(log_slow("ready", THRESHOLD, tokio::task::yield_now())));
b.iter(|| {
runtime.block_on(log_slow(
"ready",
THRESHOLD,
std::pin::pin!(tokio::task::yield_now()),
))
});
} else {
b.iter(|| runtime.block_on(tokio::task::yield_now()));
}

View File

@@ -165,6 +165,7 @@ pub fn init(
};
log_layer.with_filter(rust_log_env_filter())
});
let r = r.with(
TracingEventCountLayer(&TRACING_EVENT_COUNT_METRIC).with_filter(rust_log_env_filter()),
);
@@ -330,37 +331,90 @@ impl std::fmt::Debug for SecretString {
///
/// TODO: consider upgrading this to a warning, but currently it fires too often.
#[inline]
pub async fn log_slow<O>(name: &str, threshold: Duration, f: impl Future<Output = O>) -> O {
// TODO: we unfortunately have to pin the future on the heap, since GetPage futures are huge and
// won't fit on the stack.
let mut f = Box::pin(f);
pub async fn log_slow<F, O>(name: &str, threshold: Duration, f: std::pin::Pin<&mut F>) -> O
where
F: Future<Output = O>,
{
monitor_slow_future(
threshold,
threshold, // period = threshold
f,
|MonitorSlowFutureCallback {
ready,
is_slow,
elapsed_total,
elapsed_since_last_callback: _,
}| {
if !is_slow {
return;
}
if ready {
info!(
"slow {name} completed after {:.3}s",
elapsed_total.as_secs_f64()
);
} else {
info!(
"slow {name} still running after {:.3}s",
elapsed_total.as_secs_f64()
);
}
},
)
.await
}
/// Poll future `fut` to completion, invoking callback `cb` at the given `threshold` and every
/// `period` afterwards, and also unconditionally when the future completes.
#[inline]
pub async fn monitor_slow_future<F, O>(
threshold: Duration,
period: Duration,
mut fut: std::pin::Pin<&mut F>,
mut cb: impl FnMut(MonitorSlowFutureCallback),
) -> O
where
F: Future<Output = O>,
{
let started = Instant::now();
let mut attempt = 1;
let mut last_cb = started;
loop {
// NB: use timeout_at() instead of timeout() to avoid an extra clock reading in the common
// case where the timeout doesn't fire.
let deadline = started + attempt * threshold;
if let Ok(output) = tokio::time::timeout_at(deadline, &mut f).await {
// NB: we check if we exceeded the threshold even if the timeout never fired, because
// scheduling or execution delays may cause the future to succeed even if it exceeds the
// timeout. This costs an extra unconditional clock reading, but seems worth it to avoid
// false negatives.
let elapsed = started.elapsed();
if elapsed >= threshold {
info!("slow {name} completed after {:.3}s", elapsed.as_secs_f64());
}
let deadline = started + threshold + (attempt - 1) * period;
// TODO: still call the callback if the future panics? Copy how we do it for the page_service flush_in_progress counter.
let res = tokio::time::timeout_at(deadline, &mut fut).await;
let now = Instant::now();
let elapsed_total = now - started;
cb(MonitorSlowFutureCallback {
ready: res.is_ok(),
is_slow: elapsed_total >= threshold,
elapsed_total,
elapsed_since_last_callback: now - last_cb,
});
last_cb = now;
if let Ok(output) = res {
return output;
}
let elapsed = started.elapsed().as_secs_f64();
info!("slow {name} still running after {elapsed:.3}s",);
attempt += 1;
}
}
/// See [`monitor_slow_future`].
pub struct MonitorSlowFutureCallback {
/// Whether the future completed. If true, there will be no more callbacks.
pub ready: bool,
/// Whether the future is taking `>=` the specififed threshold duration to complete.
/// Monotonic: if true in one callback invocation, true in all subsequent onces.
pub is_slow: bool,
/// The time elapsed since the [`monitor_slow_future`] was first polled.
pub elapsed_total: Duration,
/// The time elapsed since the last callback invocation.
/// For the initial callback invocation, the time elapsed since the [`monitor_slow_future`] was first polled.
pub elapsed_since_last_callback: Duration,
}
#[cfg(test)]
mod tests {
use metrics::IntCounterVec;

View File

@@ -48,6 +48,9 @@ pprof.workspace = true
rand.workspace = true
range-set-blaze = { version = "0.1.16", features = ["alloc"] }
regex.workspace = true
rustls-pemfile.workspace = true
rustls-pki-types.workspace = true
rustls.workspace = true
scopeguard.workspace = true
send-future.workspace = true
serde.workspace = true
@@ -62,10 +65,12 @@ tokio = { workspace = true, features = ["process", "sync", "fs", "rt", "io-util"
tokio-epoll-uring.workspace = true
tokio-io-timeout.workspace = true
tokio-postgres.workspace = true
tokio-rustls.workspace = true
tokio-stream.workspace = true
tokio-util.workspace = true
toml_edit = { workspace = true, features = [ "serde" ] }
tracing.workspace = true
tracing-utils.workspace = true
url.workspace = true
walkdir.workspace = true
metrics.workspace = true
@@ -116,6 +121,10 @@ harness = false
name = "upload_queue"
harness = false
[[bench]]
name = "bench_metrics"
harness = false
[[bin]]
name = "test_helper_slow_client_reads"
required-features = [ "testing" ]

View File

@@ -0,0 +1,366 @@
use criterion::{BenchmarkId, Criterion, criterion_group, criterion_main};
use utils::id::{TenantId, TimelineId};
//
// Demonstrates that repeat label values lookup is a multicore scalability bottleneck
// that is worth avoiding.
//
criterion_group!(
label_values,
label_values::bench_naive_usage,
label_values::bench_cache_label_values_lookup
);
mod label_values {
use super::*;
pub fn bench_naive_usage(c: &mut Criterion) {
let mut g = c.benchmark_group("label_values__naive_usage");
for ntimelines in [1, 4, 8] {
g.bench_with_input(
BenchmarkId::new("ntimelines", ntimelines),
&ntimelines,
|b, ntimelines| {
b.iter_custom(|iters| {
let barrier = std::sync::Barrier::new(*ntimelines + 1);
let timelines = (0..*ntimelines)
.map(|_| {
(
TenantId::generate().to_string(),
"0000".to_string(),
TimelineId::generate().to_string(),
)
})
.collect::<Vec<_>>();
let metric_vec = metrics::UIntGaugeVec::new(
metrics::opts!("testmetric", "testhelp"),
&["tenant_id", "shard_id", "timeline_id"],
)
.unwrap();
std::thread::scope(|s| {
for (tenant_id, shard_id, timeline_id) in &timelines {
s.spawn(|| {
barrier.wait();
for _ in 0..iters {
metric_vec
.with_label_values(&[tenant_id, shard_id, timeline_id])
.inc();
}
barrier.wait();
});
}
barrier.wait();
let start = std::time::Instant::now();
barrier.wait();
start.elapsed()
})
})
},
);
}
g.finish();
}
pub fn bench_cache_label_values_lookup(c: &mut Criterion) {
let mut g = c.benchmark_group("label_values__cache_label_values_lookup");
for ntimelines in [1, 4, 8] {
g.bench_with_input(
BenchmarkId::new("ntimelines", ntimelines),
&ntimelines,
|b, ntimelines| {
b.iter_custom(|iters| {
let barrier = std::sync::Barrier::new(*ntimelines + 1);
let timelines = (0..*ntimelines)
.map(|_| {
(
TenantId::generate().to_string(),
"0000".to_string(),
TimelineId::generate().to_string(),
)
})
.collect::<Vec<_>>();
let metric_vec = metrics::UIntGaugeVec::new(
metrics::opts!("testmetric", "testhelp"),
&["tenant_id", "shard_id", "timeline_id"],
)
.unwrap();
std::thread::scope(|s| {
for (tenant_id, shard_id, timeline_id) in &timelines {
s.spawn(|| {
let metric = metric_vec.with_label_values(&[
tenant_id,
shard_id,
timeline_id,
]);
barrier.wait();
for _ in 0..iters {
metric.inc();
}
barrier.wait();
});
}
barrier.wait();
let start = std::time::Instant::now();
barrier.wait();
start.elapsed()
})
})
},
);
}
g.finish();
}
}
//
// Demonstrates that even a single metric can be a scalability bottleneck
// if multiple threads in it concurrently but there's nothing we can do
// about it without changing the metrics framework to use e.g. sharded counte atomics.
//
criterion_group!(
single_metric_multicore_scalability,
single_metric_multicore_scalability::bench,
);
mod single_metric_multicore_scalability {
use super::*;
pub fn bench(c: &mut Criterion) {
let mut g = c.benchmark_group("single_metric_multicore_scalability");
for nthreads in [1, 4, 8] {
g.bench_with_input(
BenchmarkId::new("nthreads", nthreads),
&nthreads,
|b, nthreads| {
b.iter_custom(|iters| {
let barrier = std::sync::Barrier::new(*nthreads + 1);
let metric = metrics::UIntGauge::new("testmetric", "testhelp").unwrap();
std::thread::scope(|s| {
for _ in 0..*nthreads {
s.spawn(|| {
barrier.wait();
for _ in 0..iters {
metric.inc();
}
barrier.wait();
});
}
barrier.wait();
let start = std::time::Instant::now();
barrier.wait();
start.elapsed()
})
})
},
);
}
g.finish();
}
}
//
// Demonstrates that even if we cache label value, the propagation of such a cached metric value
// by Clone'ing it is a scalability bottleneck.
// The reason is that it's an Arc internally and thus there's contention on the reference count atomics.
//
// We can avoid that by having long-lived references per thread (= indirection).
//
criterion_group!(
propagation_of_cached_label_value,
propagation_of_cached_label_value::bench_naive,
propagation_of_cached_label_value::bench_long_lived_reference_per_thread,
);
mod propagation_of_cached_label_value {
use std::sync::Arc;
use super::*;
pub fn bench_naive(c: &mut Criterion) {
let mut g = c.benchmark_group("propagation_of_cached_label_value__naive");
for nthreads in [1, 4, 8] {
g.bench_with_input(
BenchmarkId::new("nthreads", nthreads),
&nthreads,
|b, nthreads| {
b.iter_custom(|iters| {
let barrier = std::sync::Barrier::new(*nthreads + 1);
let metric = metrics::UIntGauge::new("testmetric", "testhelp").unwrap();
std::thread::scope(|s| {
for _ in 0..*nthreads {
s.spawn(|| {
barrier.wait();
for _ in 0..iters {
// propagating the metric means we'd clone it into the child RequestContext
let propagated = metric.clone();
// simulate some work
criterion::black_box(propagated);
}
barrier.wait();
});
}
barrier.wait();
let start = std::time::Instant::now();
barrier.wait();
start.elapsed()
})
})
},
);
}
g.finish();
}
pub fn bench_long_lived_reference_per_thread(c: &mut Criterion) {
let mut g =
c.benchmark_group("propagation_of_cached_label_value__long_lived_reference_per_thread");
for nthreads in [1, 4, 8] {
g.bench_with_input(
BenchmarkId::new("nthreads", nthreads),
&nthreads,
|b, nthreads| {
b.iter_custom(|iters| {
let barrier = std::sync::Barrier::new(*nthreads + 1);
let metric = metrics::UIntGauge::new("testmetric", "testhelp").unwrap();
std::thread::scope(|s| {
for _ in 0..*nthreads {
s.spawn(|| {
// This is the technique.
let this_threads_metric_reference = Arc::new(metric.clone());
barrier.wait();
for _ in 0..iters {
// propagating the metric means we'd clone it into the child RequestContext
let propagated = Arc::clone(&this_threads_metric_reference);
// simulate some work (include the pointer chase!)
criterion::black_box(&*propagated);
}
barrier.wait();
});
}
barrier.wait();
let start = std::time::Instant::now();
barrier.wait();
start.elapsed()
})
})
},
);
}
}
}
criterion_main!(
label_values,
single_metric_multicore_scalability,
propagation_of_cached_label_value
);
/*
RUST_BACKTRACE=full cargo bench --bench bench_metrics -- --discard-baseline --noplot
Results on an im4gn.2xlarge instance
label_values__naive_usage/ntimelines/1 time: [178.71 ns 178.74 ns 178.76 ns]
label_values__naive_usage/ntimelines/4 time: [532.94 ns 539.59 ns 546.31 ns]
label_values__naive_usage/ntimelines/8 time: [1.1082 µs 1.1109 µs 1.1135 µs]
label_values__cache_label_values_lookup/ntimelines/1 time: [6.4116 ns 6.4119 ns 6.4123 ns]
label_values__cache_label_values_lookup/ntimelines/4 time: [6.3482 ns 6.3819 ns 6.4079 ns]
label_values__cache_label_values_lookup/ntimelines/8 time: [6.4213 ns 6.5279 ns 6.6293 ns]
single_metric_multicore_scalability/nthreads/1 time: [6.0102 ns 6.0104 ns 6.0106 ns]
single_metric_multicore_scalability/nthreads/4 time: [38.127 ns 38.275 ns 38.416 ns]
single_metric_multicore_scalability/nthreads/8 time: [73.698 ns 74.882 ns 75.864 ns]
propagation_of_cached_label_value__naive/nthreads/1 time: [14.424 ns 14.425 ns 14.426 ns]
propagation_of_cached_label_value__naive/nthreads/4 time: [100.71 ns 102.53 ns 104.35 ns]
propagation_of_cached_label_value__naive/nthreads/8 time: [211.50 ns 214.44 ns 216.87 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/1 time: [14.135 ns 14.147 ns 14.160 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/4 time: [14.243 ns 14.255 ns 14.268 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/8 time: [14.470 ns 14.682 ns 14.895 ns]
Results on an i3en.3xlarge instance
label_values__naive_usage/ntimelines/1 time: [117.32 ns 117.53 ns 117.74 ns]
label_values__naive_usage/ntimelines/4 time: [736.58 ns 741.12 ns 745.61 ns]
label_values__naive_usage/ntimelines/8 time: [1.4513 µs 1.4596 µs 1.4665 µs]
label_values__cache_label_values_lookup/ntimelines/1 time: [8.0964 ns 8.0979 ns 8.0995 ns]
label_values__cache_label_values_lookup/ntimelines/4 time: [8.1620 ns 8.2912 ns 8.4491 ns]
label_values__cache_label_values_lookup/ntimelines/8 time: [14.148 ns 14.237 ns 14.324 ns]
single_metric_multicore_scalability/nthreads/1 time: [8.0993 ns 8.1013 ns 8.1046 ns]
single_metric_multicore_scalability/nthreads/4 time: [80.039 ns 80.672 ns 81.297 ns]
single_metric_multicore_scalability/nthreads/8 time: [153.58 ns 154.23 ns 154.90 ns]
propagation_of_cached_label_value__naive/nthreads/1 time: [13.924 ns 13.926 ns 13.928 ns]
propagation_of_cached_label_value__naive/nthreads/4 time: [143.66 ns 145.27 ns 146.59 ns]
propagation_of_cached_label_value__naive/nthreads/8 time: [296.51 ns 297.90 ns 299.30 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/1 time: [14.013 ns 14.149 ns 14.308 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/4 time: [14.311 ns 14.625 ns 14.984 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/8 time: [25.981 ns 26.227 ns 26.476 ns]
Results on an Standard L16s v3 (16 vcpus, 128 GiB memory) Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
label_values__naive_usage/ntimelines/1 time: [101.63 ns 101.84 ns 102.06 ns]
label_values__naive_usage/ntimelines/4 time: [417.55 ns 424.73 ns 432.63 ns]
label_values__naive_usage/ntimelines/8 time: [874.91 ns 889.51 ns 904.25 ns]
label_values__cache_label_values_lookup/ntimelines/1 time: [5.7724 ns 5.7760 ns 5.7804 ns]
label_values__cache_label_values_lookup/ntimelines/4 time: [7.8878 ns 7.9401 ns 8.0034 ns]
label_values__cache_label_values_lookup/ntimelines/8 time: [7.2621 ns 7.6354 ns 8.0337 ns]
single_metric_multicore_scalability/nthreads/1 time: [5.7710 ns 5.7744 ns 5.7785 ns]
single_metric_multicore_scalability/nthreads/4 time: [66.629 ns 66.994 ns 67.336 ns]
single_metric_multicore_scalability/nthreads/8 time: [130.85 ns 131.98 ns 132.91 ns]
propagation_of_cached_label_value__naive/nthreads/1 time: [11.540 ns 11.546 ns 11.553 ns]
propagation_of_cached_label_value__naive/nthreads/4 time: [131.22 ns 131.90 ns 132.56 ns]
propagation_of_cached_label_value__naive/nthreads/8 time: [260.99 ns 262.75 ns 264.26 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/1 time: [11.544 ns 11.550 ns 11.557 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/4 time: [11.568 ns 11.642 ns 11.763 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/8 time: [13.416 ns 14.121 ns 14.886 ns
Results on an M4 MAX MacBook Pro Total Number of Cores: 14 (10 performance and 4 efficiency)
label_values__naive_usage/ntimelines/1 time: [52.711 ns 53.026 ns 53.381 ns]
label_values__naive_usage/ntimelines/4 time: [323.99 ns 330.40 ns 337.53 ns]
label_values__naive_usage/ntimelines/8 time: [1.1615 µs 1.1998 µs 1.2399 µs]
label_values__cache_label_values_lookup/ntimelines/1 time: [1.6635 ns 1.6715 ns 1.6809 ns]
label_values__cache_label_values_lookup/ntimelines/4 time: [1.7786 ns 1.7876 ns 1.8028 ns]
label_values__cache_label_values_lookup/ntimelines/8 time: [1.8195 ns 1.8371 ns 1.8665 ns]
single_metric_multicore_scalability/nthreads/1 time: [1.7764 ns 1.7909 ns 1.8079 ns]
single_metric_multicore_scalability/nthreads/4 time: [33.875 ns 34.868 ns 35.923 ns]
single_metric_multicore_scalability/nthreads/8 time: [226.85 ns 235.30 ns 244.18 ns]
propagation_of_cached_label_value__naive/nthreads/1 time: [3.4337 ns 3.4491 ns 3.4660 ns]
propagation_of_cached_label_value__naive/nthreads/4 time: [69.486 ns 71.937 ns 74.472 ns]
propagation_of_cached_label_value__naive/nthreads/8 time: [434.87 ns 456.47 ns 477.84 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/1 time: [3.3767 ns 3.3974 ns 3.4220 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/4 time: [3.6105 ns 4.2355 ns 5.1463 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/8 time: [4.0889 ns 4.9714 ns 6.0779 ns]
Results on a Hetzner AX102 AMD Ryzen 9 7950X3D 16-Core Processor
label_values__naive_usage/ntimelines/1 time: [64.510 ns 64.559 ns 64.610 ns]
label_values__naive_usage/ntimelines/4 time: [309.71 ns 326.09 ns 342.32 ns]
label_values__naive_usage/ntimelines/8 time: [776.92 ns 819.35 ns 856.93 ns]
label_values__cache_label_values_lookup/ntimelines/1 time: [1.2855 ns 1.2943 ns 1.3021 ns]
label_values__cache_label_values_lookup/ntimelines/4 time: [1.3865 ns 1.4139 ns 1.4441 ns]
label_values__cache_label_values_lookup/ntimelines/8 time: [1.5311 ns 1.5669 ns 1.6046 ns]
single_metric_multicore_scalability/nthreads/1 time: [1.1927 ns 1.1981 ns 1.2049 ns]
single_metric_multicore_scalability/nthreads/4 time: [24.346 ns 25.439 ns 26.634 ns]
single_metric_multicore_scalability/nthreads/8 time: [58.666 ns 60.137 ns 61.486 ns]
propagation_of_cached_label_value__naive/nthreads/1 time: [2.7067 ns 2.7238 ns 2.7402 ns]
propagation_of_cached_label_value__naive/nthreads/4 time: [62.723 ns 66.214 ns 69.787 ns]
propagation_of_cached_label_value__naive/nthreads/8 time: [164.24 ns 170.10 ns 175.68 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/1 time: [2.2915 ns 2.2960 ns 2.3012 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/4 time: [2.5726 ns 2.6158 ns 2.6624 ns]
propagation_of_cached_label_value__long_lived_reference_per_thread/nthreads/8 time: [2.7068 ns 2.8243 ns 2.9824 ns]
*/

View File

@@ -7,7 +7,7 @@ use http_utils::error::HttpErrorBody;
use pageserver_api::models::*;
use pageserver_api::shard::TenantShardId;
pub use reqwest::Body as ReqwestBody;
use reqwest::{IntoUrl, Method, StatusCode};
use reqwest::{Certificate, IntoUrl, Method, StatusCode};
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
@@ -38,6 +38,9 @@ pub enum Error {
#[error("Cancelled")]
Cancelled,
#[error("create client: {0}{}", .0.source().map(|e| format!(": {e}")).unwrap_or_default())]
CreateClient(reqwest::Error),
}
pub type Result<T> = std::result::Result<T, Error>;
@@ -69,8 +72,17 @@ pub enum ForceAwaitLogicalSize {
}
impl Client {
pub fn new(mgmt_api_endpoint: String, jwt: Option<&str>) -> Self {
Self::from_client(reqwest::Client::new(), mgmt_api_endpoint, jwt)
pub fn new(
mgmt_api_endpoint: String,
jwt: Option<&str>,
ssl_ca_cert: Option<Certificate>,
) -> Result<Self> {
let mut http_client = reqwest::Client::builder();
if let Some(ssl_ca_cert) = ssl_ca_cert {
http_client = http_client.add_root_certificate(ssl_ca_cert);
}
let http_client = http_client.build().map_err(Error::CreateClient)?;
Ok(Self::from_client(http_client, mgmt_api_endpoint, jwt))
}
pub fn from_client(
@@ -101,12 +113,10 @@ impl Client {
debug_assert!(path.starts_with('/'));
let uri = format!("{}{}", self.mgmt_api_endpoint, path);
let req = self.client.request(Method::GET, uri);
let req = if let Some(value) = &self.authorization_header {
req.header(reqwest::header::AUTHORIZATION, value)
} else {
req
};
let mut req = self.client.request(Method::GET, uri);
if let Some(value) = &self.authorization_header {
req = req.header(reqwest::header::AUTHORIZATION, value);
}
req.send().await.map_err(Error::ReceiveBody)
}

View File

@@ -12,7 +12,7 @@ pub(crate) fn setup_logging() {
logging::TracingErrorLayerEnablement::EnableWithRustLogFilter,
logging::Output::Stdout,
)
.expect("Failed to init test logging")
.expect("Failed to init test logging");
});
}

View File

@@ -15,6 +15,7 @@ hdrhistogram.workspace = true
humantime.workspace = true
humantime-serde.workspace = true
rand.workspace = true
reqwest.workspace=true
serde.workspace = true
serde_json.workspace = true
tracing.workspace = true

View File

@@ -36,7 +36,8 @@ async fn main_impl(args: Args) -> anyhow::Result<()> {
let mgmt_api_client = Arc::new(pageserver_client::mgmt_api::Client::new(
args.mgmt_api_endpoint.clone(),
args.pageserver_jwt.as_deref(),
));
None, // TODO: support ssl_ca_file for https APIs in pagebench.
)?);
// discover targets
let timelines: Vec<TenantTimelineId> = crate::util::cli::targets::discover(

View File

@@ -77,7 +77,8 @@ async fn main_impl(
let mgmt_api_client = Arc::new(pageserver_client::mgmt_api::Client::new(
args.mgmt_api_endpoint.clone(),
args.pageserver_jwt.as_deref(),
));
None, // TODO: support ssl_ca_file for https APIs in pagebench.
)?);
// discover targets
let timelines: Vec<TenantTimelineId> = crate::util::cli::targets::discover(

View File

@@ -125,7 +125,8 @@ async fn main_impl(
let mgmt_api_client = Arc::new(pageserver_client::mgmt_api::Client::new(
args.mgmt_api_endpoint.clone(),
args.pageserver_jwt.as_deref(),
));
None, // TODO: support ssl_ca_file for https APIs in pagebench.
)?);
if let Some(engine_str) = &args.set_io_engine {
mgmt_api_client.put_io_engine(engine_str).await?;

View File

@@ -83,7 +83,8 @@ async fn main_impl(args: Args) -> anyhow::Result<()> {
let mgmt_api_client = Arc::new(pageserver_client::mgmt_api::Client::new(
args.mgmt_api_endpoint.clone(),
args.pageserver_jwt.as_deref(),
));
None, // TODO: support ssl_ca_file for https APIs in pagebench.
)?);
if let Some(engine_str) = &args.set_io_engine {
mgmt_api_client.put_io_engine(engine_str).await?;

View File

@@ -40,7 +40,8 @@ async fn main_impl(args: Args) -> anyhow::Result<()> {
let mgmt_api_client = Arc::new(pageserver_client::mgmt_api::Client::new(
args.mgmt_api_endpoint.clone(),
args.pageserver_jwt.as_deref(),
));
None, // TODO: support ssl_ca_file for https APIs in pagebench.
)?);
// discover targets
let timelines: Vec<TenantTimelineId> = crate::util::cli::targets::discover(

View File

@@ -25,11 +25,12 @@ use pageserver::task_mgr::{
};
use pageserver::tenant::{TenantSharedResources, mgr, secondary};
use pageserver::{
CancellableTask, ConsumptionMetricsTasks, HttpEndpointListener, http, page_cache, page_service,
task_mgr, virtual_file,
CancellableTask, ConsumptionMetricsTasks, HttpEndpointListener, HttpsEndpointListener, http,
page_cache, page_service, task_mgr, virtual_file,
};
use postgres_backend::AuthType;
use remote_storage::GenericRemoteStorage;
use rustls_pki_types::{CertificateDer, PrivateKeyDer};
use tokio::signal::unix::SignalKind;
use tokio::time::Instant;
use tokio_util::sync::CancellationToken;
@@ -110,6 +111,7 @@ fn main() -> anyhow::Result<()> {
} else {
TracingErrorLayerEnablement::Disabled
};
logging::init(
conf.log_format,
tracing_error_layer_enablement,
@@ -343,8 +345,15 @@ fn start_pageserver(
info!("Starting pageserver http handler on {http_addr}");
let http_listener = tcp_listener::bind(http_addr)?;
let pg_addr = &conf.listen_pg_addr;
let https_listener = match conf.listen_https_addr.as_ref() {
Some(https_addr) => {
info!("Starting pageserver https handler on {https_addr}");
Some(tcp_listener::bind(https_addr)?)
}
None => None,
};
let pg_addr = &conf.listen_pg_addr;
info!("Starting pageserver pg protocol handler on {pg_addr}");
let pageserver_listener = tcp_listener::bind(pg_addr)?;
@@ -575,9 +584,8 @@ fn start_pageserver(
// Start up the service to handle HTTP mgmt API request. We created the
// listener earlier already.
let http_endpoint_listener = {
let (http_endpoint_listener, https_endpoint_listener) = {
let _rt_guard = MGMT_REQUEST_RUNTIME.enter(); // for hyper
let cancel = CancellationToken::new();
let router_state = Arc::new(
http::routes::State::new(
@@ -592,22 +600,51 @@ fn start_pageserver(
)
.context("Failed to initialize router state")?,
);
let router = http::make_router(router_state, launch_ts, http_auth.clone())?
.build()
.map_err(|err| anyhow!(err))?;
let service = http_utils::RouterService::new(router).unwrap();
let server = hyper0::Server::from_tcp(http_listener)?
.serve(service)
.with_graceful_shutdown({
let cancel = cancel.clone();
async move { cancel.clone().cancelled().await }
});
let task = MGMT_REQUEST_RUNTIME.spawn(task_mgr::exit_on_panic_or_error(
"http endpoint listener",
server,
));
HttpEndpointListener(CancellableTask { task, cancel })
let service =
Arc::new(http_utils::RequestServiceBuilder::new(router).map_err(|err| anyhow!(err))?);
let http_task = {
let server =
http_utils::server::Server::new(Arc::clone(&service), http_listener, None)?;
let cancel = CancellationToken::new();
let task = MGMT_REQUEST_RUNTIME.spawn(task_mgr::exit_on_panic_or_error(
"http endpoint listener",
server.serve(cancel.clone()),
));
HttpEndpointListener(CancellableTask { task, cancel })
};
let https_task = match https_listener {
Some(https_listener) => {
let certs = load_certs(&conf.ssl_cert_file)?;
let key = load_private_key(&conf.ssl_key_file)?;
let server_config = rustls::ServerConfig::builder()
.with_no_client_auth()
.with_single_cert(certs, key)?;
let tls_acceptor = tokio_rustls::TlsAcceptor::from(Arc::new(server_config));
let server =
http_utils::server::Server::new(service, https_listener, Some(tls_acceptor))?;
let cancel = CancellationToken::new();
let task = MGMT_REQUEST_RUNTIME.spawn(task_mgr::exit_on_panic_or_error(
"https endpoint listener",
server.serve(cancel.clone()),
));
Some(HttpsEndpointListener(CancellableTask { task, cancel }))
}
None => None,
};
(http_task, https_task)
};
let consumption_metrics_tasks = {
@@ -683,6 +720,7 @@ fn start_pageserver(
shutdown_pageserver.cancel();
pageserver::shutdown_pageserver(
http_endpoint_listener,
https_endpoint_listener,
page_service,
consumption_metrics_tasks,
disk_usage_eviction_task,
@@ -697,6 +735,25 @@ fn start_pageserver(
})
}
fn load_certs(filename: &Utf8Path) -> std::io::Result<Vec<CertificateDer<'static>>> {
let file = std::fs::File::open(filename)?;
let mut reader = std::io::BufReader::new(file);
rustls_pemfile::certs(&mut reader).collect()
}
fn load_private_key(filename: &Utf8Path) -> anyhow::Result<PrivateKeyDer<'static>> {
let file = std::fs::File::open(filename)?;
let mut reader = std::io::BufReader::new(file);
let key = rustls_pemfile::private_key(&mut reader)?;
key.ok_or(anyhow::anyhow!(
"no private key found in {}",
filename.as_str(),
))
}
async fn create_remote_storage_client(
conf: &'static PageServerConf,
) -> anyhow::Result<GenericRemoteStorage> {

View File

@@ -53,6 +53,11 @@ pub struct PageServerConf {
pub listen_pg_addr: String,
/// Example (default): 127.0.0.1:9898
pub listen_http_addr: String,
/// Example: 127.0.0.1:9899
pub listen_https_addr: Option<String>,
pub ssl_key_file: Utf8PathBuf,
pub ssl_cert_file: Utf8PathBuf,
/// Current availability zone. Used for traffic metrics.
pub availability_zone: Option<String>,
@@ -317,6 +322,9 @@ impl PageServerConf {
let pageserver_api::config::ConfigToml {
listen_pg_addr,
listen_http_addr,
listen_https_addr,
ssl_key_file,
ssl_cert_file,
availability_zone,
wait_lsn_timeout,
wal_redo_timeout,
@@ -375,6 +383,9 @@ impl PageServerConf {
// ------------------------------------------------------------
listen_pg_addr,
listen_http_addr,
listen_https_addr,
ssl_key_file,
ssl_cert_file,
availability_zone,
wait_lsn_timeout,
wal_redo_timeout,
@@ -456,8 +467,8 @@ impl PageServerConf {
no_sync: no_sync.unwrap_or(false),
enable_read_path_debugging: enable_read_path_debugging.unwrap_or(false),
validate_wal_contiguity: validate_wal_contiguity.unwrap_or(false),
load_previous_heatmap: load_previous_heatmap.unwrap_or(false),
generate_unarchival_heatmap: generate_unarchival_heatmap.unwrap_or(false),
load_previous_heatmap: load_previous_heatmap.unwrap_or(true),
generate_unarchival_heatmap: generate_unarchival_heatmap.unwrap_or(true),
};
// ------------------------------------------------------------

View File

@@ -89,16 +89,112 @@
//! [`RequestContext`] argument. Functions in the middle of the call chain
//! only need to pass it on.
use crate::task_mgr::TaskKind;
use std::sync::Arc;
use once_cell::sync::Lazy;
use tracing::warn;
use utils::{id::TimelineId, shard::TenantShardId};
use crate::{
metrics::{StorageIoSizeMetrics, TimelineMetrics},
task_mgr::TaskKind,
tenant::Timeline,
};
// The main structure of this module, see module-level comment.
#[derive(Debug)]
pub struct RequestContext {
task_kind: TaskKind,
download_behavior: DownloadBehavior,
access_stats_behavior: AccessStatsBehavior,
page_content_kind: PageContentKind,
read_path_debug: bool,
scope: Scope,
}
#[derive(Clone)]
pub(crate) enum Scope {
Global {
io_size_metrics: &'static crate::metrics::StorageIoSizeMetrics,
},
SecondaryTenant {
io_size_metrics: &'static crate::metrics::StorageIoSizeMetrics,
},
SecondaryTimeline {
io_size_metrics: crate::metrics::StorageIoSizeMetrics,
},
Timeline {
// We wrap the `Arc<TimelineMetrics>`s inside another Arc to avoid child
// context creation contending for the ref counters of the Arc<TimelineMetrics>,
// which are shared among all tasks that operate on the timeline, especially
// concurrent page_service connections.
#[allow(clippy::redundant_allocation)]
arc_arc: Arc<Arc<TimelineMetrics>>,
},
#[cfg(test)]
UnitTest {
io_size_metrics: &'static crate::metrics::StorageIoSizeMetrics,
},
}
static GLOBAL_IO_SIZE_METRICS: Lazy<crate::metrics::StorageIoSizeMetrics> =
Lazy::new(|| crate::metrics::StorageIoSizeMetrics::new("*", "*", "*"));
impl Scope {
pub(crate) fn new_global() -> Self {
Scope::Global {
io_size_metrics: &GLOBAL_IO_SIZE_METRICS,
}
}
/// NB: this allocates, so, use only at relatively long-lived roots, e.g., at start
/// of a compaction iteration.
pub(crate) fn new_timeline(timeline: &Timeline) -> Self {
Scope::Timeline {
arc_arc: Arc::new(Arc::clone(&timeline.metrics)),
}
}
pub(crate) fn new_page_service_pagestream(
timeline_handle: &crate::tenant::timeline::handle::Handle<
crate::page_service::TenantManagerTypes,
>,
) -> Self {
Scope::Timeline {
arc_arc: Arc::clone(&timeline_handle.metrics),
}
}
pub(crate) fn new_secondary_timeline(
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
) -> Self {
// TODO(https://github.com/neondatabase/neon/issues/11156): secondary timelines have no infrastructure for metrics lifecycle.
let tenant_id = tenant_shard_id.tenant_id.to_string();
let shard_id = tenant_shard_id.shard_slug().to_string();
let timeline_id = timeline_id.to_string();
let io_size_metrics =
crate::metrics::StorageIoSizeMetrics::new(&tenant_id, &shard_id, &timeline_id);
Scope::SecondaryTimeline { io_size_metrics }
}
pub(crate) fn new_secondary_tenant(_tenant_shard_id: &TenantShardId) -> Self {
// Before propagating metrics via RequestContext, the labels were inferred from file path.
// The only user of VirtualFile at tenant scope is the heatmap download & read.
// The inferred labels for the path of the heatmap file on local disk were that of the global metric (*,*,*).
// Thus, we do the same here, and extend that for anything secondary-tenant scoped.
//
// If we want to have (tenant_id, shard_id, '*') labels for secondary tenants in the future,
// we will need to think about the metric lifecycle, i.e., remove them during secondary tenant shutdown,
// like we do for attached timelines. (We don't have attached-tenant-scoped usage of VirtualFile
// at this point, so, we were able to completely side-step tenant-scoped stuff there).
Scope::SecondaryTenant {
io_size_metrics: &GLOBAL_IO_SIZE_METRICS,
}
}
#[cfg(test)]
pub(crate) fn new_unit_test() -> Self {
Scope::UnitTest {
io_size_metrics: &GLOBAL_IO_SIZE_METRICS,
}
}
}
/// The kind of access to the page cache.
@@ -157,6 +253,7 @@ impl RequestContextBuilder {
access_stats_behavior: AccessStatsBehavior::Update,
page_content_kind: PageContentKind::Unknown,
read_path_debug: false,
scope: Scope::new_global(),
},
}
}
@@ -171,10 +268,16 @@ impl RequestContextBuilder {
access_stats_behavior: original.access_stats_behavior,
page_content_kind: original.page_content_kind,
read_path_debug: original.read_path_debug,
scope: original.scope.clone(),
},
}
}
pub fn task_kind(mut self, k: TaskKind) -> Self {
self.inner.task_kind = k;
self
}
/// Configure the DownloadBehavior of the context: whether to
/// download missing layers, and/or warn on the download.
pub fn download_behavior(mut self, b: DownloadBehavior) -> Self {
@@ -199,6 +302,11 @@ impl RequestContextBuilder {
self
}
pub(crate) fn scope(mut self, s: Scope) -> Self {
self.inner.scope = s;
self
}
pub fn build(self) -> RequestContext {
self.inner
}
@@ -281,7 +389,50 @@ impl RequestContext {
}
fn child_impl(&self, task_kind: TaskKind, download_behavior: DownloadBehavior) -> Self {
Self::new(task_kind, download_behavior)
RequestContextBuilder::extend(self)
.task_kind(task_kind)
.download_behavior(download_behavior)
.build()
}
pub fn with_scope_timeline(&self, timeline: &Arc<Timeline>) -> Self {
RequestContextBuilder::extend(self)
.scope(Scope::new_timeline(timeline))
.build()
}
pub(crate) fn with_scope_page_service_pagestream(
&self,
timeline_handle: &crate::tenant::timeline::handle::Handle<
crate::page_service::TenantManagerTypes,
>,
) -> Self {
RequestContextBuilder::extend(self)
.scope(Scope::new_page_service_pagestream(timeline_handle))
.build()
}
pub fn with_scope_secondary_timeline(
&self,
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
) -> Self {
RequestContextBuilder::extend(self)
.scope(Scope::new_secondary_timeline(tenant_shard_id, timeline_id))
.build()
}
pub fn with_scope_secondary_tenant(&self, tenant_shard_id: &TenantShardId) -> Self {
RequestContextBuilder::extend(self)
.scope(Scope::new_secondary_tenant(tenant_shard_id))
.build()
}
#[cfg(test)]
pub fn with_scope_unit_test(&self) -> Self {
RequestContextBuilder::new(TaskKind::UnitTest)
.scope(Scope::new_unit_test())
.build()
}
pub fn task_kind(&self) -> TaskKind {
@@ -303,4 +454,38 @@ impl RequestContext {
pub(crate) fn read_path_debug(&self) -> bool {
self.read_path_debug
}
pub(crate) fn io_size_metrics(&self) -> &StorageIoSizeMetrics {
match &self.scope {
Scope::Global { io_size_metrics } => {
let is_unit_test = cfg!(test);
let is_regress_test_build = cfg!(feature = "testing");
if is_unit_test || is_regress_test_build {
panic!("all VirtualFile instances are timeline-scoped");
} else {
use once_cell::sync::Lazy;
use std::sync::Mutex;
use std::time::Duration;
use utils::rate_limit::RateLimit;
static LIMIT: Lazy<Mutex<RateLimit>> =
Lazy::new(|| Mutex::new(RateLimit::new(Duration::from_secs(1))));
let mut guard = LIMIT.lock().unwrap();
guard.call2(|rate_limit_stats| {
warn!(
%rate_limit_stats,
backtrace=%std::backtrace::Backtrace::force_capture(),
"all VirtualFile instances are timeline-scoped",
);
});
io_size_metrics
}
}
Scope::Timeline { arc_arc } => &arc_arc.storage_io_size,
Scope::SecondaryTimeline { io_size_metrics } => io_size_metrics,
Scope::SecondaryTenant { io_size_metrics } => io_size_metrics,
#[cfg(test)]
Scope::UnitTest { io_size_metrics } => io_size_metrics,
}
}
}

View File

@@ -181,7 +181,7 @@ impl ControlPlaneGenerationsApi for ControllerUpcallClient {
listen_pg_port: m.postgres_port,
listen_http_addr: m.http_host,
listen_http_port: m.http_port,
listen_https_port: None, // TODO: Support https.
listen_https_port: m.https_port,
availability_zone_id: az_id.expect("Checked above"),
})
}

View File

@@ -1079,7 +1079,6 @@ components:
- last_record_lsn
- disk_consistent_lsn
- state
- latest_gc_cutoff_lsn
properties:
timeline_id:
type: string
@@ -1123,9 +1122,6 @@ components:
min_readable_lsn:
type: string
format: hex
latest_gc_cutoff_lsn:
type: string
format: hex
applied_gc_cutoff_lsn:
type: string
format: hex

View File

@@ -37,7 +37,8 @@ use pageserver_api::models::{
TenantShardSplitResponse, TenantSorting, TenantState, TenantWaitLsnRequest,
TimelineArchivalConfigRequest, TimelineCreateRequest, TimelineCreateRequestMode,
TimelineCreateRequestModeImportPgdata, TimelineGcRequest, TimelineInfo,
TimelinesInfoAndOffloaded, TopTenantShardItem, TopTenantShardsRequest, TopTenantShardsResponse,
TimelinePatchIndexPartRequest, TimelinesInfoAndOffloaded, TopTenantShardItem,
TopTenantShardsRequest, TopTenantShardsResponse,
};
use pageserver_api::shard::{ShardCount, TenantShardId};
use remote_storage::{DownloadError, GenericRemoteStorage, TimeTravelError};
@@ -54,6 +55,7 @@ use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
use crate::config::PageServerConf;
use crate::context;
use crate::context::{DownloadBehavior, RequestContext, RequestContextBuilder};
use crate::deletion_queue::DeletionQueueClient;
use crate::pgdatadir_mapping::LsnForTimestamp;
@@ -63,12 +65,14 @@ use crate::tenant::mgr::{
GetActiveTenantError, GetTenantError, TenantManager, TenantMapError, TenantMapInsertError,
TenantSlot, TenantSlotError, TenantSlotUpsertError, TenantStateError, UpsertLocationError,
};
use crate::tenant::remote_timeline_client::index::GcCompactionState;
use crate::tenant::remote_timeline_client::{
download_index_part, list_remote_tenant_shards, list_remote_timelines,
};
use crate::tenant::secondary::SecondaryController;
use crate::tenant::size::ModelInputs;
use crate::tenant::storage_layer::{IoConcurrency, LayerAccessStatsReset, LayerName};
use crate::tenant::timeline::detach_ancestor::DetachBehavior;
use crate::tenant::timeline::offload::{OffloadError, offload_timeline};
use crate::tenant::timeline::{
CompactFlags, CompactOptions, CompactRequest, CompactionError, Timeline, WaitLsnTimeout,
@@ -457,10 +461,7 @@ async fn build_timeline_info_common(
initdb_lsn,
last_record_lsn,
prev_record_lsn: Some(timeline.get_prev_record_lsn()),
// Externally, expose the lowest LSN that can be used to create a branch as the "GC cutoff", although internally
// we distinguish between the "planned" GC cutoff (PITR point) and the "latest" GC cutoff (where we
// actually trimmed data to), which can pass each other when PITR is changed.
latest_gc_cutoff_lsn: min_readable_lsn,
_unused: Default::default(), // Unused, for legacy decode only
min_readable_lsn,
applied_gc_cutoff_lsn: *timeline.get_applied_gc_cutoff_lsn(),
current_logical_size: current_logical_size.size_dont_care_about_accuracy(),
@@ -858,6 +859,75 @@ async fn timeline_archival_config_handler(
json_response(StatusCode::OK, ())
}
/// This API is used to patch the index part of a timeline. You must ensure such patches are safe to apply. Use this API as an emergency
/// measure only.
///
/// Some examples of safe patches:
/// - Increase the gc_cutoff and gc_compaction_cutoff to a larger value in case of a bug that didn't bump the cutoff and cause read errors.
/// - Force set the index part to use reldir v2 (migrating/migrated).
///
/// Some examples of unsafe patches:
/// - Force set the index part from v2 to v1 (legacy). This will cause the code path to ignore anything written to the new keyspace and cause
/// errors.
/// - Decrease the gc_cutoff without validating the data really exists. It will cause read errors in the background.
async fn timeline_patch_index_part_handler(
mut request: Request<Body>,
_cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
let request_data: TimelinePatchIndexPartRequest = json_request(&mut request).await?;
check_permission(&request, None)?; // require global permission for this request
let state = get_state(&request);
async {
let timeline =
active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id)
.await?;
if let Some(rel_size_migration) = request_data.rel_size_migration {
timeline
.update_rel_size_v2_status(rel_size_migration)
.map_err(ApiError::InternalServerError)?;
}
if let Some(gc_compaction_last_completed_lsn) =
request_data.gc_compaction_last_completed_lsn
{
timeline
.update_gc_compaction_state(GcCompactionState {
last_completed_lsn: gc_compaction_last_completed_lsn,
})
.map_err(ApiError::InternalServerError)?;
}
if let Some(applied_gc_cutoff_lsn) = request_data.applied_gc_cutoff_lsn {
{
let guard = timeline.applied_gc_cutoff_lsn.lock_for_write();
guard.store_and_unlock(applied_gc_cutoff_lsn);
}
}
if request_data.force_index_update {
timeline
.remote_client
.force_schedule_index_upload()
.context("force schedule index upload")
.map_err(ApiError::InternalServerError)?;
}
Ok::<_, ApiError>(())
}
.instrument(info_span!("timeline_patch_index_part",
tenant_id = %tenant_shard_id.tenant_id,
shard_id = %tenant_shard_id.shard_slug(),
%timeline_id))
.await?;
json_response(StatusCode::OK, ())
}
async fn timeline_detail_handler(
request: Request<Body>,
_cancel: CancellationToken,
@@ -882,12 +952,13 @@ async fn timeline_detail_handler(
tenant.wait_to_become_active(ACTIVE_TENANT_TIMEOUT).await?;
let timeline = tenant.get_timeline(timeline_id, false)?;
let ctx = &ctx.with_scope_timeline(&timeline);
let timeline_info = build_timeline_info(
&timeline,
include_non_incremental_logical_size.unwrap_or(false),
force_await_initial_logical_size.unwrap_or(false),
&ctx,
ctx,
)
.await
.context("get local timeline info")
@@ -931,7 +1002,8 @@ async fn get_lsn_by_timestamp_handler(
let timeline =
active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id)
.await?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download)
.with_scope_timeline(&timeline);
let result = timeline
.find_lsn_for_timestamp(timestamp_pg, &cancel, &ctx)
.await?;
@@ -1003,7 +1075,8 @@ async fn get_timestamp_of_lsn_handler(
let timeline =
active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id)
.await?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download)
.with_scope_timeline(&timeline);
let result = timeline.get_timestamp_for_lsn(lsn, &ctx).await?;
match result {
@@ -1358,7 +1431,8 @@ async fn timeline_layer_scan_disposable_keys(
active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id)
.await?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download)
.with_scope_timeline(&timeline);
let guard = timeline.layers.read().await;
let Some(layer) = guard.try_get_from_key(&layer_name.clone().into()) else {
@@ -1444,7 +1518,8 @@ async fn timeline_download_heatmap_layers_handler(
let timeline =
active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id)
.await?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download)
.with_scope_timeline(&timeline);
let max_concurrency = get_config(&request)
.remote_storage_config
@@ -1492,7 +1567,8 @@ async fn layer_download_handler(
let timeline =
active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id)
.await?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download)
.with_scope_timeline(&timeline);
let downloaded = timeline
.download_layer(&layer_name, &ctx)
.await
@@ -2228,8 +2304,8 @@ async fn timeline_compact_handler(
.unwrap_or(false);
async {
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let timeline = active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id).await?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download).with_scope_timeline(&timeline);
if scheduled {
let tenant = state
.tenant_manager
@@ -2316,6 +2392,7 @@ async fn timeline_checkpoint_handler(
let state = get_state(&request);
let mut flags = EnumSet::empty();
flags |= CompactFlags::NoYield; // run compaction to completion
if Some(true) == parse_query_param::<_, bool>(&request, "force_l0_compaction")? {
flags |= CompactFlags::ForceL0Compaction;
}
@@ -2336,8 +2413,8 @@ async fn timeline_checkpoint_handler(
parse_query_param::<_, bool>(&request, "wait_until_uploaded")?.unwrap_or(false);
async {
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let timeline = active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id).await?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download).with_scope_timeline(&timeline);
if wait_until_flushed {
timeline.freeze_and_flush().await
} else {
@@ -2392,7 +2469,8 @@ async fn timeline_download_remote_layers_handler_post(
let timeline =
active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id)
.await?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download)
.with_scope_timeline(&timeline);
match timeline.spawn_download_all_remote_layers(body, &ctx).await {
Ok(st) => json_response(StatusCode::ACCEPTED, st),
Err(st) => json_response(StatusCode::CONFLICT, st),
@@ -2429,6 +2507,8 @@ async fn timeline_detach_ancestor_handler(
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
let behavior: Option<DetachBehavior> = parse_query_param(&request, "detach_behavior")?;
let behavior = behavior.unwrap_or_default();
let span = tracing::info_span!("detach_ancestor", tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), %timeline_id);
@@ -2475,9 +2555,10 @@ async fn timeline_detach_ancestor_handler(
tracing::info!("all timeline upload queues are drained");
let timeline = tenant.get_timeline(timeline_id, true)?;
let ctx = &ctx.with_scope_timeline(&timeline);
let progress = timeline
.prepare_to_detach_from_ancestor(&tenant, options, ctx)
.prepare_to_detach_from_ancestor(&tenant, options, behavior, ctx)
.await?;
// uncomment to allow early as possible Tenant::drop
@@ -2492,6 +2573,7 @@ async fn timeline_detach_ancestor_handler(
tenant_shard_id,
timeline_id,
prepared,
behavior,
attempt,
ctx,
)
@@ -2581,8 +2663,9 @@ async fn getpage_at_lsn_handler_inner(
async {
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
// Enable read path debugging
let ctx = RequestContextBuilder::extend(&ctx).read_path_debug(true).build();
let timeline = active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id).await?;
let ctx = RequestContextBuilder::extend(&ctx).read_path_debug(true)
.scope(context::Scope::new_timeline(&timeline)).build();
// Use last_record_lsn if no lsn is provided
let lsn = lsn.unwrap_or_else(|| timeline.get_last_record_lsn());
@@ -2616,8 +2699,8 @@ async fn timeline_collect_keyspace(
let at_lsn: Option<Lsn> = parse_query_param(&request, "at_lsn")?;
async {
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let timeline = active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id).await?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download).with_scope_timeline(&timeline);
let at_lsn = at_lsn.unwrap_or_else(|| timeline.get_last_record_lsn());
let (dense_ks, sparse_ks) = timeline
.collect_keyspace(at_lsn, &ctx)
@@ -3142,6 +3225,7 @@ async fn post_top_tenants(
match order_by {
TenantSorting::ResidentSize => sizes.resident_size,
TenantSorting::MaxLogicalSize => sizes.max_logical_size,
TenantSorting::MaxLogicalSizePerShard => sizes.max_logical_size_per_shard,
}
}
@@ -3254,7 +3338,7 @@ async fn put_tenant_timeline_import_basebackup(
tenant.wait_to_become_active(ACTIVE_TENANT_TIMEOUT).await?;
let timeline = tenant
let (timeline, timeline_ctx) = tenant
.create_empty_timeline(timeline_id, base_lsn, pg_version, &ctx)
.map_err(ApiError::InternalServerError)
.await?;
@@ -3273,7 +3357,13 @@ async fn put_tenant_timeline_import_basebackup(
info!("importing basebackup");
timeline
.import_basebackup_from_tar(tenant.clone(), &mut body, base_lsn, broker_client, &ctx)
.import_basebackup_from_tar(
tenant.clone(),
&mut body,
base_lsn,
broker_client,
&timeline_ctx,
)
.await
.map_err(ApiError::InternalServerError)?;
@@ -3313,6 +3403,7 @@ async fn put_tenant_timeline_import_wal(
let state = get_state(&request);
let timeline = active_timeline_of_active_tenant(&state.tenant_manager, TenantShardId::unsharded(tenant_id), timeline_id).await?;
let ctx = RequestContextBuilder::extend(&ctx).scope(context::Scope::new_timeline(&timeline)).build();
let mut body = StreamReader::new(request.into_body().map(|res| {
res.map_err(|error| {
@@ -3629,6 +3720,10 @@ pub fn make_router(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/get_timestamp_of_lsn",
|r| api_handler(r, get_timestamp_of_lsn_handler),
)
.post(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/patch_index_part",
|r| api_handler(r, timeline_patch_index_part_handler),
)
.post(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/lsn_lease",
|r| api_handler(r, lsn_lease_handler),

View File

@@ -64,6 +64,7 @@ pub struct CancellableTask {
pub cancel: CancellationToken,
}
pub struct HttpEndpointListener(pub CancellableTask);
pub struct HttpsEndpointListener(pub CancellableTask);
pub struct ConsumptionMetricsTasks(pub CancellableTask);
pub struct DiskUsageEvictionTask(pub CancellableTask);
impl CancellableTask {
@@ -77,6 +78,7 @@ impl CancellableTask {
#[allow(clippy::too_many_arguments)]
pub async fn shutdown_pageserver(
http_listener: HttpEndpointListener,
https_listener: Option<HttpsEndpointListener>,
page_service: page_service::Listener,
consumption_metrics_worker: ConsumptionMetricsTasks,
disk_usage_eviction_task: Option<DiskUsageEvictionTask>,
@@ -213,6 +215,15 @@ pub async fn shutdown_pageserver(
)
.await;
if let Some(https_listener) = https_listener {
timed(
https_listener.0.shutdown(),
"shutdown https",
Duration::from_secs(1),
)
.await;
}
// Shut down the HTTP endpoint last, so that you can still check the server's
// status while it's shutting down.
// FIXME: We should probably stop accepting commands like attach/detach earlier.

View File

@@ -465,12 +465,40 @@ pub(crate) fn page_cache_errors_inc(error_kind: PageCacheErrorKind) {
pub(crate) static WAIT_LSN_TIME: Lazy<Histogram> = Lazy::new(|| {
register_histogram!(
"pageserver_wait_lsn_seconds",
"Time spent waiting for WAL to arrive",
"Time spent waiting for WAL to arrive. Updated on completion of the wait_lsn operation.",
CRITICAL_OP_BUCKETS.into(),
)
.expect("failed to define a metric")
});
pub(crate) static WAIT_LSN_START_FINISH_COUNTERPAIR: Lazy<IntCounterPairVec> = Lazy::new(|| {
register_int_counter_pair_vec!(
"pageserver_wait_lsn_started_count",
"Number of wait_lsn operations started.",
"pageserver_wait_lsn_finished_count",
"Number of wait_lsn operations finished.",
&["tenant_id", "shard_id", "timeline_id"],
)
.expect("failed to define a metric")
});
pub(crate) static WAIT_LSN_IN_PROGRESS_MICROS: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_wait_lsn_in_progress_micros",
"Time spent waiting for WAL to arrive, by timeline_id. Updated periodically while waiting.",
&["tenant_id", "shard_id", "timeline_id"],
)
.expect("failed to define a metric")
});
pub(crate) static WAIT_LSN_IN_PROGRESS_GLOBAL_MICROS: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pageserver_wait_lsn_in_progress_micros_global",
"Time spent waiting for WAL to arrive, globally. Updated periodically while waiting."
)
.expect("failed to define a metric")
});
static FLUSH_WAIT_UPLOAD_TIME: Lazy<GaugeVec> = Lazy::new(|| {
register_gauge_vec!(
"pageserver_flush_wait_upload_seconds",
@@ -1227,11 +1255,24 @@ impl StorageIoTime {
pub(crate) static STORAGE_IO_TIME_METRIC: Lazy<StorageIoTime> = Lazy::new(StorageIoTime::new);
const STORAGE_IO_SIZE_OPERATIONS: &[&str] = &["read", "write"];
#[derive(Clone, Copy)]
#[repr(usize)]
enum StorageIoSizeOperation {
Read,
Write,
}
impl StorageIoSizeOperation {
const VARIANTS: &'static [&'static str] = &["read", "write"];
fn as_str(&self) -> &'static str {
Self::VARIANTS[*self as usize]
}
}
// Needed for the https://neonprod.grafana.net/d/5uK9tHL4k/picking-tenant-for-relocation?orgId=1
pub(crate) static STORAGE_IO_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
register_int_gauge_vec!(
static STORAGE_IO_SIZE: Lazy<UIntGaugeVec> = Lazy::new(|| {
register_uint_gauge_vec!(
"pageserver_io_operations_bytes_total",
"Total amount of bytes read/written in IO operations",
&["operation", "tenant_id", "shard_id", "timeline_id"]
@@ -1239,6 +1280,34 @@ pub(crate) static STORAGE_IO_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
.expect("failed to define a metric")
});
#[derive(Clone, Debug)]
pub(crate) struct StorageIoSizeMetrics {
pub read: UIntGauge,
pub write: UIntGauge,
}
impl StorageIoSizeMetrics {
pub(crate) fn new(tenant_id: &str, shard_id: &str, timeline_id: &str) -> Self {
let read = STORAGE_IO_SIZE
.get_metric_with_label_values(&[
StorageIoSizeOperation::Read.as_str(),
tenant_id,
shard_id,
timeline_id,
])
.unwrap();
let write = STORAGE_IO_SIZE
.get_metric_with_label_values(&[
StorageIoSizeOperation::Write.as_str(),
tenant_id,
shard_id,
timeline_id,
])
.unwrap();
Self { read, write }
}
}
#[cfg(not(test))]
pub(crate) mod virtual_file_descriptor_cache {
use super::*;
@@ -2789,7 +2858,6 @@ impl StorageTimeMetrics {
}
}
#[derive(Debug)]
pub(crate) struct TimelineMetrics {
tenant_id: String,
shard_id: String,
@@ -2821,6 +2889,9 @@ pub(crate) struct TimelineMetrics {
/// Number of valid LSN leases.
pub valid_lsn_lease_count_gauge: UIntGauge,
pub wal_records_received: IntCounter,
pub storage_io_size: StorageIoSizeMetrics,
pub wait_lsn_in_progress_micros: GlobalAndPerTenantIntCounter,
pub wait_lsn_start_finish_counterpair: IntCounterPair,
shutdown: std::sync::atomic::AtomicBool,
}
@@ -2956,6 +3027,19 @@ impl TimelineMetrics {
.get_metric_with_label_values(&[&tenant_id, &shard_id, &timeline_id])
.unwrap();
let storage_io_size = StorageIoSizeMetrics::new(&tenant_id, &shard_id, &timeline_id);
let wait_lsn_in_progress_micros = GlobalAndPerTenantIntCounter {
global: WAIT_LSN_IN_PROGRESS_GLOBAL_MICROS.clone(),
per_tenant: WAIT_LSN_IN_PROGRESS_MICROS
.get_metric_with_label_values(&[&tenant_id, &shard_id, &timeline_id])
.unwrap(),
};
let wait_lsn_start_finish_counterpair = WAIT_LSN_START_FINISH_COUNTERPAIR
.get_metric_with_label_values(&[&tenant_id, &shard_id, &timeline_id])
.unwrap();
TimelineMetrics {
tenant_id,
shard_id,
@@ -2985,8 +3069,11 @@ impl TimelineMetrics {
evictions_with_low_residence_duration: std::sync::RwLock::new(
evictions_with_low_residence_duration,
),
storage_io_size,
valid_lsn_lease_count_gauge,
wal_records_received,
wait_lsn_in_progress_micros,
wait_lsn_start_finish_counterpair,
shutdown: std::sync::atomic::AtomicBool::default(),
}
}
@@ -3175,10 +3262,19 @@ impl TimelineMetrics {
]);
}
for op in STORAGE_IO_SIZE_OPERATIONS {
for op in StorageIoSizeOperation::VARIANTS {
let _ = STORAGE_IO_SIZE.remove_label_values(&[op, tenant_id, shard_id, timeline_id]);
}
let _ =
WAIT_LSN_IN_PROGRESS_MICROS.remove_label_values(&[tenant_id, shard_id, timeline_id]);
{
let mut res = [Ok(()), Ok(())];
WAIT_LSN_START_FINISH_COUNTERPAIR
.remove_label_values(&mut res, &[tenant_id, shard_id, timeline_id]);
}
let _ = SMGR_QUERY_STARTED_PER_TENANT_TIMELINE.remove_label_values(&[
SmgrQueryType::GetPageAtLsn.into(),
tenant_id,
@@ -3791,27 +3887,29 @@ pub mod tokio_epoll_uring {
});
}
pub(crate) struct GlobalAndPerTenantIntCounter {
global: IntCounter,
per_tenant: IntCounter,
}
impl GlobalAndPerTenantIntCounter {
#[inline(always)]
pub(crate) fn inc(&self) {
self.inc_by(1)
}
#[inline(always)]
pub(crate) fn inc_by(&self, n: u64) {
self.global.inc_by(n);
self.per_tenant.inc_by(n);
}
}
pub(crate) mod tenant_throttling {
use metrics::{IntCounter, register_int_counter_vec};
use metrics::register_int_counter_vec;
use once_cell::sync::Lazy;
use utils::shard::TenantShardId;
pub(crate) struct GlobalAndPerTenantIntCounter {
global: IntCounter,
per_tenant: IntCounter,
}
impl GlobalAndPerTenantIntCounter {
#[inline(always)]
pub(crate) fn inc(&self) {
self.inc_by(1)
}
#[inline(always)]
pub(crate) fn inc_by(&self, n: u64) {
self.global.inc_by(n);
self.per_tenant.inc_by(n);
}
}
use super::GlobalAndPerTenantIntCounter;
pub(crate) struct Metrics<const KIND: usize> {
pub(super) count_accounted_start: GlobalAndPerTenantIntCounter,
@@ -4057,6 +4155,7 @@ pub fn preinitialize_metrics(conf: &'static PageServerConf) {
&CIRCUIT_BREAKERS_BROKEN,
&CIRCUIT_BREAKERS_UNBROKEN,
&PAGE_SERVICE_SMGR_FLUSH_INPROGRESS_MICROS_GLOBAL,
&WAIT_LSN_IN_PROGRESS_GLOBAL_MICROS,
]
.into_iter()
.for_each(|c| {

View File

@@ -56,6 +56,7 @@ use crate::config::PageServerConf;
use crate::context::{DownloadBehavior, RequestContext};
use crate::metrics::{
self, COMPUTE_COMMANDS_COUNTERS, ComputeCommandKind, LIVE_CONNECTIONS, SmgrOpTimer,
TimelineMetrics,
};
use crate::pgdatadir_mapping::Version;
use crate::span::{
@@ -423,6 +424,9 @@ impl timeline::handle::Types for TenantManagerTypes {
pub(crate) struct TenantManagerCacheItem {
pub(crate) timeline: Arc<Timeline>,
// allow() for cheap propagation through RequestContext inside a task
#[allow(clippy::redundant_allocation)]
pub(crate) metrics: Arc<Arc<TimelineMetrics>>,
#[allow(dead_code)] // we store it to keep the gate open
pub(crate) gate_guard: GateGuard,
}
@@ -506,8 +510,11 @@ impl timeline::handle::TenantManager<TenantManagerTypes> for TenantManagerWrappe
}
};
let metrics = Arc::new(Arc::clone(&timeline.metrics));
Ok(TenantManagerCacheItem {
timeline,
metrics,
gate_guard,
})
}
@@ -1099,12 +1106,19 @@ impl PageServerHandler {
};
// Dispatch the batch to the appropriate request handler.
let (mut handler_results, span) = log_slow(
batch.as_static_str(),
LOG_SLOW_GETPAGE_THRESHOLD,
self.pagestream_dispatch_batched_message(batch, io_concurrency, ctx),
)
.await?;
let log_slow_name = batch.as_static_str();
let (mut handler_results, span) = {
// TODO: we unfortunately have to pin the future on the heap, since GetPage futures are huge and
// won't fit on the stack.
let mut boxpinned =
Box::pin(self.pagestream_dispatch_batched_message(batch, io_concurrency, ctx));
log_slow(
log_slow_name,
LOG_SLOW_GETPAGE_THRESHOLD,
boxpinned.as_mut(),
)
.await?
};
// We purposefully don't count flush time into the smgr operation timer.
//
@@ -1238,6 +1252,14 @@ impl PageServerHandler {
),
QueryError,
> {
macro_rules! upgrade_handle_and_set_context {
($shard:ident) => {{
let weak_handle = &$shard;
let handle = weak_handle.upgrade()?;
let ctx = ctx.with_scope_page_service_pagestream(&handle);
(handle, ctx)
}};
}
Ok(match batch {
BatchedFeMessage::Exists {
span,
@@ -1246,9 +1268,10 @@ impl PageServerHandler {
req,
} => {
fail::fail_point!("ps::handle-pagerequest-message::exists");
let (shard, ctx) = upgrade_handle_and_set_context!(shard);
(
vec![
self.handle_get_rel_exists_request(&*shard.upgrade()?, &req, ctx)
self.handle_get_rel_exists_request(&shard, &req, &ctx)
.instrument(span.clone())
.await
.map(|msg| (msg, timer))
@@ -1264,9 +1287,10 @@ impl PageServerHandler {
req,
} => {
fail::fail_point!("ps::handle-pagerequest-message::nblocks");
let (shard, ctx) = upgrade_handle_and_set_context!(shard);
(
vec![
self.handle_get_nblocks_request(&*shard.upgrade()?, &req, ctx)
self.handle_get_nblocks_request(&shard, &req, &ctx)
.instrument(span.clone())
.await
.map(|msg| (msg, timer))
@@ -1282,17 +1306,18 @@ impl PageServerHandler {
pages,
} => {
fail::fail_point!("ps::handle-pagerequest-message::getpage");
let (shard, ctx) = upgrade_handle_and_set_context!(shard);
(
{
let npages = pages.len();
trace!(npages, "handling getpage request");
let res = self
.handle_get_page_at_lsn_request_batched(
&*shard.upgrade()?,
&shard,
effective_request_lsn,
pages,
io_concurrency,
ctx,
&ctx,
)
.instrument(span.clone())
.await;
@@ -1309,9 +1334,10 @@ impl PageServerHandler {
req,
} => {
fail::fail_point!("ps::handle-pagerequest-message::dbsize");
let (shard, ctx) = upgrade_handle_and_set_context!(shard);
(
vec![
self.handle_db_size_request(&*shard.upgrade()?, &req, ctx)
self.handle_db_size_request(&shard, &req, &ctx)
.instrument(span.clone())
.await
.map(|msg| (msg, timer))
@@ -1327,9 +1353,10 @@ impl PageServerHandler {
req,
} => {
fail::fail_point!("ps::handle-pagerequest-message::slrusegment");
let (shard, ctx) = upgrade_handle_and_set_context!(shard);
(
vec![
self.handle_get_slru_segment_request(&*shard.upgrade()?, &req, ctx)
self.handle_get_slru_segment_request(&shard, &req, &ctx)
.instrument(span.clone())
.await
.map(|msg| (msg, timer))
@@ -1345,12 +1372,13 @@ impl PageServerHandler {
requests,
} => {
fail::fail_point!("ps::handle-pagerequest-message::test");
let (shard, ctx) = upgrade_handle_and_set_context!(shard);
(
{
let npages = requests.len();
trace!(npages, "handling getpage request");
let res = self
.handle_test_request_batch(&*shard.upgrade()?, requests, ctx)
.handle_test_request_batch(&shard, requests, &ctx)
.instrument(span.clone())
.await;
assert_eq!(res.len(), npages);
@@ -2126,6 +2154,7 @@ impl PageServerHandler {
.get(tenant_id, timeline_id, ShardSelector::Zero)
.await?;
set_tracing_field_shard_id(&timeline);
let ctx = ctx.with_scope_timeline(&timeline);
if timeline.is_archived() == Some(true) {
tracing::info!(
@@ -2143,7 +2172,7 @@ impl PageServerHandler {
lsn,
crate::tenant::timeline::WaitLsnWaiter::PageService,
crate::tenant::timeline::WaitLsnTimeout::Default,
ctx,
&ctx,
)
.await?;
timeline
@@ -2169,7 +2198,7 @@ impl PageServerHandler {
prev_lsn,
full_backup,
replica,
ctx,
&ctx,
)
.await
.map_err(map_basebackup_error)?;
@@ -2192,7 +2221,7 @@ impl PageServerHandler {
prev_lsn,
full_backup,
replica,
ctx,
&ctx,
)
.await
.map_err(map_basebackup_error)?;
@@ -2209,7 +2238,7 @@ impl PageServerHandler {
prev_lsn,
full_backup,
replica,
ctx,
&ctx,
)
.await
.map_err(map_basebackup_error)?;

View File

@@ -2758,7 +2758,7 @@ mod tests {
TimelineId::from_array(hex!("11223344556677881122334455667788"));
let (tenant, ctx) = harness.load().await;
let tline = tenant
let (tline, ctx) = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0x10), DEFAULT_PG_VERSION, &ctx)
.await?;
let tline = tline.raw_timeline().unwrap();

View File

@@ -77,6 +77,8 @@ use self::timeline::{
EvictionTaskTenantState, GcCutoffs, TimelineDeleteProgress, TimelineResources, WaitLsnError,
};
use crate::config::PageServerConf;
use crate::context;
use crate::context::RequestContextBuilder;
use crate::context::{DownloadBehavior, RequestContext};
use crate::deletion_queue::{DeletionQueueClient, DeletionQueueError};
use crate::l0_flush::L0FlushGlobalState;
@@ -1114,7 +1116,7 @@ impl Tenant {
}
};
let timeline = self.create_timeline_struct(
let (timeline, timeline_ctx) = self.create_timeline_struct(
timeline_id,
&metadata,
previous_heatmap,
@@ -1124,6 +1126,7 @@ impl Tenant {
idempotency.clone(),
index_part.gc_compaction.clone(),
index_part.rel_size_migration.clone(),
ctx,
)?;
let disk_consistent_lsn = timeline.get_disk_consistent_lsn();
anyhow::ensure!(
@@ -1257,7 +1260,7 @@ impl Tenant {
match activate {
ActivateTimelineArgs::Yes { broker_client } => {
info!("activating timeline after reload from pgdata import task");
timeline.activate(self.clone(), broker_client, None, ctx);
timeline.activate(self.clone(), broker_client, None, &timeline_ctx);
}
ActivateTimelineArgs::No => (),
}
@@ -1765,6 +1768,7 @@ impl Tenant {
import_pgdata,
ActivateTimelineArgs::No,
guard,
ctx.detached_child(TaskKind::ImportPgdata, DownloadBehavior::Warn),
));
}
}
@@ -1782,6 +1786,7 @@ impl Tenant {
timeline_id,
&index_part.metadata,
remote_timeline_client,
ctx,
)
.instrument(tracing::info_span!("timeline_delete", %timeline_id))
.await
@@ -2219,7 +2224,7 @@ impl Tenant {
self.clone(),
broker_client.clone(),
background_jobs_can_start,
&ctx,
&ctx.with_scope_timeline(&timeline),
);
}
@@ -2416,8 +2421,8 @@ impl Tenant {
new_timeline_id: TimelineId,
initdb_lsn: Lsn,
pg_version: u32,
_ctx: &RequestContext,
) -> anyhow::Result<UninitializedTimeline> {
ctx: &RequestContext,
) -> anyhow::Result<(UninitializedTimeline, RequestContext)> {
anyhow::ensure!(
self.is_active(),
"Cannot create empty timelines on inactive tenant"
@@ -2452,6 +2457,7 @@ impl Tenant {
initdb_lsn,
None,
None,
ctx,
)
.await
}
@@ -2469,7 +2475,7 @@ impl Tenant {
pg_version: u32,
ctx: &RequestContext,
) -> anyhow::Result<Arc<Timeline>> {
let uninit_tl = self
let (uninit_tl, ctx) = self
.create_empty_timeline(new_timeline_id, initdb_lsn, pg_version, ctx)
.await?;
let tline = uninit_tl.raw_timeline().expect("we just created it");
@@ -2481,7 +2487,7 @@ impl Tenant {
.init_empty_test_timeline()
.context("init_empty_test_timeline")?;
modification
.commit(ctx)
.commit(&ctx)
.await
.context("commit init_empty_test_timeline modification")?;
@@ -2699,7 +2705,12 @@ impl Tenant {
// doing stuff before the IndexPart is durable in S3, which is done by the previous section.
let activated_timeline = match result {
CreateTimelineResult::Created(timeline) => {
timeline.activate(self.clone(), broker_client, None, ctx);
timeline.activate(
self.clone(),
broker_client,
None,
&ctx.with_scope_timeline(&timeline),
);
timeline
}
CreateTimelineResult::Idempotent(timeline) => {
@@ -2761,10 +2772,9 @@ impl Tenant {
}
};
let mut uninit_timeline = {
let (mut uninit_timeline, timeline_ctx) = {
let this = &self;
let initdb_lsn = Lsn(0);
let _ctx = ctx;
async move {
let new_metadata = TimelineMetadata::new(
// Initialize disk_consistent LSN to 0, The caller must import some data to
@@ -2784,6 +2794,7 @@ impl Tenant {
initdb_lsn,
None,
None,
ctx,
)
.await
}
@@ -2813,6 +2824,7 @@ impl Tenant {
index_part,
activate,
timeline_create_guard,
timeline_ctx.detached_child(TaskKind::ImportPgdata, DownloadBehavior::Warn),
));
// NB: the timeline doesn't exist in self.timelines at this point
@@ -2826,6 +2838,7 @@ impl Tenant {
index_part: import_pgdata::index_part_format::Root,
activate: ActivateTimelineArgs,
timeline_create_guard: TimelineCreateGuard,
ctx: RequestContext,
) {
debug_assert_current_span_has_tenant_and_timeline_id();
info!("starting");
@@ -2837,6 +2850,7 @@ impl Tenant {
index_part,
activate,
timeline_create_guard,
ctx,
)
.await;
if let Err(err) = &res {
@@ -2852,9 +2866,8 @@ impl Tenant {
index_part: import_pgdata::index_part_format::Root,
activate: ActivateTimelineArgs,
timeline_create_guard: TimelineCreateGuard,
ctx: RequestContext,
) -> Result<(), anyhow::Error> {
let ctx = RequestContext::new(TaskKind::ImportPgdata, DownloadBehavior::Warn);
info!("importing pgdata");
import_pgdata::doit(&timeline, index_part, &ctx, self.cancel.clone())
.await
@@ -3063,6 +3076,7 @@ impl Tenant {
let mut has_pending_l0 = false;
for timeline in compact_l0 {
let ctx = &ctx.with_scope_timeline(&timeline);
let outcome = timeline
.compact(cancel, CompactFlags::OnlyL0Compaction.into(), ctx)
.instrument(info_span!("compact_timeline", timeline_id = %timeline.timeline_id))
@@ -3096,6 +3110,7 @@ impl Tenant {
if !timeline.is_active() {
continue;
}
let ctx = &ctx.with_scope_timeline(&timeline);
let mut outcome = timeline
.compact(cancel, EnumSet::default(), ctx)
@@ -3321,7 +3336,7 @@ impl Tenant {
self.clone(),
broker_client.clone(),
background_jobs_can_start,
ctx,
&ctx.with_scope_timeline(timeline),
);
activated_timelines += 1;
}
@@ -3827,6 +3842,7 @@ impl Tenant {
resident_size: 0,
physical_size: 0,
max_logical_size: 0,
max_logical_size_per_shard: 0,
};
for timeline in self.timelines.lock().unwrap().values() {
@@ -3843,6 +3859,10 @@ impl Tenant {
);
}
result.max_logical_size_per_shard = result
.max_logical_size
.div_ceil(self.tenant_shard_id.shard_count.count() as u64);
result
}
}
@@ -4136,7 +4156,8 @@ impl Tenant {
create_idempotency: CreateTimelineIdempotency,
gc_compaction_state: Option<GcCompactionState>,
rel_size_v2_status: Option<RelSizeMigration>,
) -> anyhow::Result<Arc<Timeline>> {
ctx: &RequestContext,
) -> anyhow::Result<(Arc<Timeline>, RequestContext)> {
let state = match cause {
CreateTimelineCause::Load => {
let ancestor_id = new_metadata.ancestor_timeline();
@@ -4172,7 +4193,11 @@ impl Tenant {
self.cancel.child_token(),
);
Ok(timeline)
let timeline_ctx = RequestContextBuilder::extend(ctx)
.scope(context::Scope::new_timeline(&timeline))
.build();
Ok((timeline, timeline_ctx))
}
/// [`Tenant::shutdown`] must be called before dropping the returned [`Tenant`] object
@@ -4588,6 +4613,7 @@ impl Tenant {
// Ensures all timelines use the same start time when computing the time cutoff.
let now_ts_for_pitr_calc = SystemTime::now();
for timeline in timelines.iter() {
let ctx = &ctx.with_scope_timeline(timeline);
let cutoff = timeline
.get_last_record_lsn()
.checked_sub(horizon)
@@ -4761,7 +4787,7 @@ impl Tenant {
src_timeline: &Arc<Timeline>,
dst_id: TimelineId,
start_lsn: Option<Lsn>,
_ctx: &RequestContext,
ctx: &RequestContext,
) -> Result<CreateTimelineResult, CreateTimelineError> {
let src_id = src_timeline.timeline_id;
@@ -4864,7 +4890,7 @@ impl Tenant {
src_timeline.pg_version,
);
let uninitialized_timeline = self
let (uninitialized_timeline, _timeline_ctx) = self
.prepare_new_timeline(
dst_id,
&metadata,
@@ -4872,6 +4898,7 @@ impl Tenant {
start_lsn + 1,
Some(Arc::clone(src_timeline)),
Some(src_timeline.get_rel_size_v2_status()),
ctx,
)
.await?;
@@ -5138,7 +5165,7 @@ impl Tenant {
pgdata_lsn,
pg_version,
);
let mut raw_timeline = self
let (mut raw_timeline, timeline_ctx) = self
.prepare_new_timeline(
timeline_id,
&new_metadata,
@@ -5146,6 +5173,7 @@ impl Tenant {
pgdata_lsn,
None,
None,
ctx,
)
.await?;
@@ -5156,7 +5184,7 @@ impl Tenant {
&unfinished_timeline,
&pgdata_path,
pgdata_lsn,
ctx,
&timeline_ctx,
)
.await
.with_context(|| {
@@ -5217,6 +5245,7 @@ impl Tenant {
/// An empty layer map is initialized, and new data and WAL can be imported starting
/// at 'disk_consistent_lsn'. After any initial data has been imported, call
/// `finish_creation` to insert the Timeline into the timelines map.
#[allow(clippy::too_many_arguments)]
async fn prepare_new_timeline<'a>(
&'a self,
new_timeline_id: TimelineId,
@@ -5225,7 +5254,8 @@ impl Tenant {
start_lsn: Lsn,
ancestor: Option<Arc<Timeline>>,
rel_size_v2_status: Option<RelSizeMigration>,
) -> anyhow::Result<UninitializedTimeline<'a>> {
ctx: &RequestContext,
) -> anyhow::Result<(UninitializedTimeline<'a>, RequestContext)> {
let tenant_shard_id = self.tenant_shard_id;
let resources = self.build_timeline_resources(new_timeline_id);
@@ -5233,7 +5263,7 @@ impl Tenant {
.remote_client
.init_upload_queue_for_empty_remote(new_metadata, rel_size_v2_status.clone())?;
let timeline_struct = self
let (timeline_struct, timeline_ctx) = self
.create_timeline_struct(
new_timeline_id,
new_metadata,
@@ -5244,6 +5274,7 @@ impl Tenant {
create_guard.idempotency.clone(),
None,
rel_size_v2_status,
ctx,
)
.context("Failed to create timeline data structure")?;
@@ -5264,10 +5295,13 @@ impl Tenant {
"Successfully created initial files for timeline {tenant_shard_id}/{new_timeline_id}"
);
Ok(UninitializedTimeline::new(
self,
new_timeline_id,
Some((timeline_struct, create_guard)),
Ok((
UninitializedTimeline::new(
self,
new_timeline_id,
Some((timeline_struct, create_guard)),
),
timeline_ctx,
))
}
@@ -5720,7 +5754,7 @@ pub(crate) mod harness {
logging::TracingErrorLayerEnablement::EnableWithRustLogFilter,
logging::Output::Stdout,
)
.expect("Failed to init test logging")
.expect("Failed to init test logging");
});
}
@@ -5802,7 +5836,8 @@ pub(crate) mod harness {
}
pub(crate) async fn load(&self) -> (Arc<Tenant>, RequestContext) {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error)
.with_scope_unit_test();
(
self.do_try_load(&ctx)
.await
@@ -6825,7 +6860,7 @@ mod tests {
let (tenant, ctx) = harness.load().await;
let io_concurrency = IoConcurrency::spawn_for_test();
let tline = tenant
let (tline, ctx) = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION, &ctx)
.await?;
let tline = tline.raw_timeline().unwrap();
@@ -7447,7 +7482,7 @@ mod tests {
.await;
let initdb_lsn = Lsn(0x20);
let utline = tenant
let (utline, ctx) = tenant
.create_empty_timeline(TIMELINE_ID, initdb_lsn, DEFAULT_PG_VERSION, &ctx)
.await?;
let tline = utline.raw_timeline().unwrap();
@@ -7514,7 +7549,7 @@ mod tests {
let harness = TenantHarness::create(name).await?;
{
let (tenant, ctx) = harness.load().await;
let tline = tenant
let (tline, _ctx) = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION, &ctx)
.await?;
// Leave the timeline ID in [`Tenant::timelines_creating`] to exclude attempting to create it again

View File

@@ -471,7 +471,8 @@ pub(crate) mod tests {
blobs: &[Vec<u8>],
compression: bool,
) -> Result<(), Error> {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let ctx =
RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error).with_scope_unit_test();
let (_temp_dir, pathbuf, offsets) =
write_maybe_compressed::<BUFFERED>(blobs, compression, &ctx).await?;

View File

@@ -219,7 +219,11 @@ impl LocationConf {
};
let shard = if conf.shard_count == 0 {
ShardIdentity::unsharded()
// NB: carry over the persisted stripe size instead of using the default. This doesn't
// matter for most practical purposes, since unsharded tenants don't use the stripe
// size, but can cause inconsistencies between storcon and Pageserver and cause manual
// splits without `new_stripe_size` to use an unintended stripe size.
ShardIdentity::unsharded_with_stripe_size(ShardStripeSize(conf.shard_stripe_size))
} else {
ShardIdentity::new(
ShardNumber(conf.shard_number),

View File

@@ -32,8 +32,7 @@ use hex;
use thiserror::Error;
use tracing::error;
use crate::context::{DownloadBehavior, RequestContext};
use crate::task_mgr::TaskKind;
use crate::context::RequestContext;
use crate::tenant::block_io::{BlockReader, BlockWriter};
// The maximum size of a value stored in the B-tree. 5 bytes is enough currently.
@@ -477,16 +476,15 @@ where
}
#[allow(dead_code)]
pub async fn dump(&self) -> Result<()> {
pub async fn dump(&self, ctx: &RequestContext) -> Result<()> {
let mut stack = Vec::new();
let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
stack.push((self.root_blk, String::new(), 0, 0, 0));
let block_cursor = self.reader.block_cursor();
while let Some((blknum, path, depth, child_idx, key_off)) = stack.pop() {
let blk = block_cursor.read_blk(self.start_blk + blknum, &ctx).await?;
let blk = block_cursor.read_blk(self.start_blk + blknum, ctx).await?;
let buf: &[u8] = blk.as_ref();
let node = OnDiskNode::<L>::deparse(buf)?;
@@ -835,6 +833,8 @@ pub(crate) mod tests {
use rand::Rng;
use super::*;
use crate::context::DownloadBehavior;
use crate::task_mgr::TaskKind;
use crate::tenant::block_io::{BlockCursor, BlockLease, BlockReaderRef};
#[derive(Clone, Default)]
@@ -869,7 +869,8 @@ pub(crate) mod tests {
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 6>::new(&mut disk);
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let ctx =
RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error).with_scope_unit_test();
let all_keys: Vec<&[u8; 6]> = vec![
b"xaaaaa", b"xaaaba", b"xaaaca", b"xabaaa", b"xababa", b"xabaca", b"xabada", b"xabadb",
@@ -887,7 +888,7 @@ pub(crate) mod tests {
let reader = DiskBtreeReader::new(0, root_offset, disk);
reader.dump().await?;
reader.dump(&ctx).await?;
// Test the `get` function on all the keys.
for (key, val) in all_data.iter() {
@@ -979,7 +980,8 @@ pub(crate) mod tests {
async fn lots_of_keys() -> Result<()> {
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 8>::new(&mut disk);
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let ctx =
RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error).with_scope_unit_test();
const NUM_KEYS: u64 = 1000;
@@ -997,7 +999,7 @@ pub(crate) mod tests {
let reader = DiskBtreeReader::new(0, root_offset, disk);
reader.dump().await?;
reader.dump(&ctx).await?;
use std::sync::Mutex;
@@ -1167,7 +1169,8 @@ pub(crate) mod tests {
// Build a tree from it
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 26>::new(&mut disk);
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let ctx =
RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error).with_scope_unit_test();
for (key, val) in disk_btree_test_data::TEST_DATA {
writer.append(&key, val)?;
@@ -1198,7 +1201,7 @@ pub(crate) mod tests {
.await?;
assert_eq!(count, disk_btree_test_data::TEST_DATA.len());
reader.dump().await?;
reader.dump(&ctx).await?;
Ok(())
}

View File

@@ -9,7 +9,7 @@ use camino::Utf8PathBuf;
use num_traits::Num;
use pageserver_api::shard::TenantShardId;
use tokio_epoll_uring::{BoundedBuf, Slice};
use tracing::error;
use tracing::{error, info_span};
use utils::id::TimelineId;
use crate::assert_u64_eq_usize::{U64IsUsize, UsizeIsU64};
@@ -76,6 +76,7 @@ impl EphemeralFile {
|| IoBufferMut::with_capacity(TAIL_SZ),
gate.enter()?,
ctx,
info_span!(parent: None, "ephemeral_file_buffered_writer", tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), timeline_id=%timeline_id, path = %filename),
),
_gate_guard: gate.enter()?,
})
@@ -351,7 +352,8 @@ mod tests {
let timeline_id = TimelineId::from_str("22000000000000000000000000000000").unwrap();
fs::create_dir_all(conf.timeline_path(&tenant_shard_id, &timeline_id))?;
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let ctx =
RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error).with_scope_unit_test();
Ok((conf, tenant_shard_id, timeline_id, ctx))
}

View File

@@ -300,9 +300,8 @@ impl TimelineMetadata {
/// Returns true if anything was changed
pub fn detach_from_ancestor(&mut self, branchpoint: &(TimelineId, Lsn)) {
if let Some(ancestor) = self.body.ancestor_timeline {
assert_eq!(ancestor, branchpoint.0);
}
// Detaching from ancestor now doesn't always detach directly to the direct ancestor, but we
// ensure the LSN is the same. So we don't check the timeline ID.
if self.body.ancestor_lsn != Lsn(0) {
assert_eq!(self.body.ancestor_lsn, branchpoint.1);
}

View File

@@ -1914,6 +1914,7 @@ impl TenantManager {
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
prepared: PreparedTimelineDetach,
behavior: detach_ancestor::DetachBehavior,
mut attempt: detach_ancestor::Attempt,
ctx: &RequestContext,
) -> Result<HashSet<TimelineId>, detach_ancestor::Error> {
@@ -1957,7 +1958,14 @@ impl TenantManager {
.map_err(Error::NotFound)?;
let resp = timeline
.detach_from_ancestor_and_reparent(&tenant, prepared, ctx)
.detach_from_ancestor_and_reparent(
&tenant,
prepared,
attempt.ancestor_timeline_id,
attempt.ancestor_lsn,
behavior,
ctx,
)
.await?;
let mut slot_guard = slot_guard;

View File

@@ -954,6 +954,14 @@ impl RemoteTimelineClient {
Ok(())
}
/// Only used in the `patch_index_part` HTTP API to force trigger an index upload.
pub fn force_schedule_index_upload(self: &Arc<Self>) -> Result<(), NotInitialized> {
let mut guard = self.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut()?;
self.schedule_index_upload(upload_queue);
Ok(())
}
/// Launch an index-file upload operation in the background (internal function)
fn schedule_index_upload(self: &Arc<Self>, upload_queue: &mut UploadQueueInitialized) {
let disk_consistent_lsn = upload_queue.dirty.metadata.disk_consistent_lsn();

View File

@@ -229,6 +229,7 @@ async fn download_object(
|| IoBufferMut::with_capacity(super::BUFFER_SIZE),
gate.enter().map_err(|_| DownloadError::Cancelled)?,
ctx,
tracing::info_span!(parent: None, "download_object_buffered_writer", %dst_path),
);
// TODO: use vectored write (writev) once supported by tokio-epoll-uring.

View File

@@ -491,7 +491,10 @@ impl JobGenerator<PendingDownload, RunningDownload, CompleteDownload, DownloadCo
let remote_storage = self.remote_storage.clone();
let conf = self.tenant_manager.get_conf();
let tenant_shard_id = *secondary_state.get_tenant_shard_id();
let download_ctx = self.root_ctx.attached_child();
let download_ctx = self
.root_ctx
.attached_child()
.with_scope_secondary_tenant(&tenant_shard_id);
(RunningDownload { barrier }, Box::pin(async move {
let _completion = completion;
@@ -771,6 +774,7 @@ impl<'a> TenantDownloader<'a> {
// Download the layers in the heatmap
for timeline in heatmap.timelines {
let ctx = &ctx.with_scope_secondary_timeline(tenant_shard_id, &timeline.timeline_id);
let timeline_state = timeline_states
.remove(&timeline.timeline_id)
.expect("Just populated above");

View File

@@ -474,7 +474,7 @@ async fn fill_logical_sizes(
if cached_size.is_none() {
let timeline = Arc::clone(timeline_hash.get(&timeline_id).unwrap());
let parallel_size_calcs = Arc::clone(limit);
let ctx = ctx.attached_child();
let ctx = ctx.attached_child().with_scope_timeline(&timeline);
joinset.spawn(
calculate_logical_size(parallel_size_calcs, timeline, lsn, cause, ctx)
.in_current_span(),

View File

@@ -1334,7 +1334,7 @@ impl DeltaLayerInner {
block_reader,
);
tree_reader.dump().await?;
tree_reader.dump(ctx).await?;
let keys = self.index_entries(ctx).await?;
@@ -1972,6 +1972,7 @@ pub(crate) mod test {
.create_test_timeline(TimelineId::generate(), Lsn(0x10), 14, ctx)
.await
.unwrap();
let ctx = &ctx.with_scope_timeline(&timeline);
let initdb_layer = timeline
.layers

View File

@@ -199,7 +199,7 @@ impl ImageLayerInner {
block_reader,
);
tree_reader.dump().await?;
tree_reader.dump(ctx).await?;
tree_reader
.visit(

View File

@@ -8,7 +8,6 @@ use utils::id::TimelineId;
use super::failpoints::{Failpoint, FailpointKind};
use super::*;
use crate::context::DownloadBehavior;
use crate::task_mgr::TaskKind;
use crate::tenant::harness::{TenantHarness, test_img};
use crate::tenant::storage_layer::{IoConcurrency, LayerVisibilityHint};
@@ -27,11 +26,9 @@ async fn smoke_test() {
let h = TenantHarness::create("smoke_test").await.unwrap();
let span = h.span();
let download_span = span.in_scope(|| tracing::info_span!("downloading", timeline_id = 1));
let (tenant, _) = h.load().await;
let (tenant, ctx) = h.load().await;
let io_concurrency = IoConcurrency::spawn_for_test();
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Download);
let image_layers = vec![(
Lsn(0x40),
vec![(
@@ -56,6 +53,7 @@ async fn smoke_test() {
)
.await
.unwrap();
let ctx = &ctx.with_scope_timeline(&timeline);
// Grab one of the timeline's layers to exercise in the test, and the other layer that is just
// there to avoid the timeline being illegally empty
@@ -94,7 +92,7 @@ async fn smoke_test() {
controlfile_keyspace.clone(),
Lsn(0x10)..Lsn(0x11),
&mut data,
&ctx,
ctx,
)
.await
.unwrap();
@@ -129,7 +127,7 @@ async fn smoke_test() {
controlfile_keyspace.clone(),
Lsn(0x10)..Lsn(0x11),
&mut data,
&ctx,
ctx,
)
.instrument(download_span.clone())
.await
@@ -179,7 +177,7 @@ async fn smoke_test() {
// plain downloading is rarely needed
layer
.download_and_keep_resident(&ctx)
.download_and_keep_resident(ctx)
.instrument(download_span)
.await
.unwrap();
@@ -341,6 +339,7 @@ fn read_wins_pending_eviction() {
.create_test_timeline(TimelineId::generate(), Lsn(0x10), 14, &ctx)
.await
.unwrap();
let ctx = ctx.with_scope_timeline(&timeline);
let layer = {
let mut layers = {
@@ -473,6 +472,7 @@ fn multiple_pending_evictions_scenario(name: &'static str, in_order: bool) {
.create_test_timeline(TimelineId::generate(), Lsn(0x10), 14, &ctx)
.await
.unwrap();
let ctx = ctx.with_scope_timeline(&timeline);
let layer = {
let mut layers = {
@@ -642,12 +642,12 @@ async fn cancelled_get_or_maybe_download_does_not_cancel_eviction() {
.create_test_timeline(TimelineId::generate(), Lsn(0x10), 14, &ctx)
.await
.unwrap();
let ctx = ctx.with_scope_timeline(&timeline);
// This test does downloads
let ctx = RequestContextBuilder::extend(&ctx)
.download_behavior(DownloadBehavior::Download)
.build();
let layer = {
let mut layers = {
let layers = timeline.layers.read().await;
@@ -727,6 +727,7 @@ async fn evict_and_wait_does_not_wait_for_download() {
.create_test_timeline(TimelineId::generate(), Lsn(0x10), 14, &ctx)
.await
.unwrap();
let ctx = ctx.with_scope_timeline(&timeline);
// This test does downloads
let ctx = RequestContextBuilder::extend(&ctx)

View File

@@ -67,6 +67,7 @@ use tracing::*;
use utils::generation::Generation;
use utils::guard_arc_swap::GuardArcSwap;
use utils::id::TimelineId;
use utils::logging::{MonitorSlowFutureCallback, monitor_slow_future};
use utils::lsn::{AtomicLsn, Lsn, RecordLsn};
use utils::postgres_client::PostgresClientProtocol;
use utils::rate_limit::RateLimit;
@@ -287,7 +288,7 @@ pub struct Timeline {
// The LSN of gc-compaction that was last applied to this timeline.
gc_compaction_state: ArcSwap<Option<GcCompactionState>>,
pub(super) metrics: TimelineMetrics,
pub(crate) metrics: Arc<TimelineMetrics>,
// `Timeline` doesn't write these metrics itself, but it manages the lifetime. Code
// in `crate::page_service` writes these metrics.
@@ -439,6 +440,8 @@ pub struct Timeline {
heatmap_layers_downloader: Mutex<Option<heatmap_layers_downloader::HeatmapLayersDownloader>>,
pub(crate) rel_size_v2_status: ArcSwapOption<RelSizeMigration>,
wait_lsn_log_slow: tokio::sync::Semaphore,
}
pub(crate) enum PreviousHeatmap {
@@ -1479,17 +1482,67 @@ impl Timeline {
WaitLsnTimeout::Default => self.conf.wait_lsn_timeout,
};
let _timer = crate::metrics::WAIT_LSN_TIME.start_timer();
let timer = crate::metrics::WAIT_LSN_TIME.start_timer();
let start_finish_counterpair_guard = self.metrics.wait_lsn_start_finish_counterpair.guard();
match self.last_record_lsn.wait_for_timeout(lsn, timeout).await {
let wait_for_timeout = self.last_record_lsn.wait_for_timeout(lsn, timeout);
let wait_for_timeout = std::pin::pin!(wait_for_timeout);
// Use threshold of 1 because even 1 second of wait for ingest is very much abnormal.
let log_slow_threshold = Duration::from_secs(1);
// Use period of 10 to avoid flooding logs during an outage that affects all timelines.
let log_slow_period = Duration::from_secs(10);
let mut logging_permit = None;
let wait_for_timeout = monitor_slow_future(
log_slow_threshold,
log_slow_period,
wait_for_timeout,
|MonitorSlowFutureCallback {
ready,
is_slow,
elapsed_total,
elapsed_since_last_callback,
}| {
self.metrics
.wait_lsn_in_progress_micros
.inc_by(u64::try_from(elapsed_since_last_callback.as_micros()).unwrap());
if !is_slow {
return;
}
// It's slow, see if we should log it.
// (We limit the logging to one per invocation per timeline to avoid excessive
// logging during an extended broker / networking outage that affects all timelines.)
if logging_permit.is_none() {
logging_permit = self.wait_lsn_log_slow.try_acquire().ok();
}
if logging_permit.is_none() {
return;
}
// We log it.
if ready {
info!(
"slow wait_lsn completed after {:.3}s",
elapsed_total.as_secs_f64()
);
} else {
info!(
"slow wait_lsn still running for {:.3}s",
elapsed_total.as_secs_f64()
);
}
},
);
let res = wait_for_timeout.await;
// don't count the time spent waiting for lock below, and also in walreceiver.status(), towards the wait_lsn_time_histo
drop(logging_permit);
drop(start_finish_counterpair_guard);
drop(timer);
match res {
Ok(()) => Ok(()),
Err(e) => {
use utils::seqwait::SeqWaitError::*;
match e {
Shutdown => Err(WaitLsnError::Shutdown),
Timeout => {
// don't count the time spent waiting for lock below, and also in walreceiver.status(), towards the wait_lsn_time_histo
drop(_timer);
let walreceiver_status = self.walreceiver_status();
Err(WaitLsnError::Timeout(format!(
"Timed out while waiting for WAL record at LSN {} to arrive, last_record_lsn {} disk consistent LSN={}, WalReceiver status: {}",
@@ -2423,8 +2476,9 @@ impl Timeline {
}
fn get_l0_flush_delay_threshold(&self) -> Option<usize> {
// Disable L0 flushes by default. This and compaction needs further tuning.
const DEFAULT_L0_FLUSH_DELAY_FACTOR: usize = 0; // TODO: default to e.g. 3
// By default, delay L0 flushes at 3x the compaction threshold. The compaction threshold
// defaults to 10, and L0 compaction is generally able to keep L0 counts below 30.
const DEFAULT_L0_FLUSH_DELAY_FACTOR: usize = 3;
// If compaction is disabled, don't delay.
if self.get_compaction_period() == Duration::ZERO {
@@ -2452,8 +2506,9 @@ impl Timeline {
}
fn get_l0_flush_stall_threshold(&self) -> Option<usize> {
// Disable L0 stalls by default. In ingest benchmarks, we see image compaction take >10
// minutes, blocking L0 compaction, and we can't stall L0 flushes for that long.
// Disable L0 stalls by default. Stalling can cause unavailability if L0 compaction isn't
// responsive, and it can e.g. block on other compaction via the compaction semaphore or
// sibling timelines. We need more confidence before enabling this.
const DEFAULT_L0_FLUSH_STALL_FACTOR: usize = 0; // TODO: default to e.g. 5
// If compaction is disabled, don't stall.
@@ -2685,14 +2740,14 @@ impl Timeline {
}
Arc::new_cyclic(|myself| {
let metrics = TimelineMetrics::new(
let metrics = Arc::new(TimelineMetrics::new(
&tenant_shard_id,
&timeline_id,
crate::metrics::EvictionsWithLowResidenceDurationBuilder::new(
"mtime",
evictions_low_residence_duration_metric_threshold,
),
);
));
let aux_file_metrics = metrics.aux_file_size_gauge.clone();
let mut result = Timeline {
@@ -2821,6 +2876,8 @@ impl Timeline {
heatmap_layers_downloader: Mutex::new(None),
rel_size_v2_status: ArcSwapOption::from_pointee(rel_size_v2_status),
wait_lsn_log_slow: tokio::sync::Semaphore::new(1),
};
result.repartition_threshold =
@@ -2876,7 +2933,7 @@ impl Timeline {
"layer flush task",
async move {
let _guard = guard;
let background_ctx = RequestContext::todo_child(TaskKind::LayerFlushTask, DownloadBehavior::Error);
let background_ctx = RequestContext::todo_child(TaskKind::LayerFlushTask, DownloadBehavior::Error).with_scope_timeline(&self_clone);
self_clone.flush_loop(layer_flush_start_rx, &background_ctx).await;
let mut flush_loop_state = self_clone.flush_loop_state.lock().unwrap();
assert!(matches!(*flush_loop_state, FlushLoopState::Running{..}));
@@ -5388,9 +5445,10 @@ impl Timeline {
self: &Arc<Timeline>,
tenant: &crate::tenant::Tenant,
options: detach_ancestor::Options,
behavior: detach_ancestor::DetachBehavior,
ctx: &RequestContext,
) -> Result<detach_ancestor::Progress, detach_ancestor::Error> {
detach_ancestor::prepare(self, tenant, options, ctx).await
detach_ancestor::prepare(self, tenant, behavior, options, ctx).await
}
/// Second step of detach from ancestor; detaches the `self` from it's current ancestor and
@@ -5406,9 +5464,21 @@ impl Timeline {
self: &Arc<Timeline>,
tenant: &crate::tenant::Tenant,
prepared: detach_ancestor::PreparedTimelineDetach,
ancestor_timeline_id: TimelineId,
ancestor_lsn: Lsn,
behavior: detach_ancestor::DetachBehavior,
ctx: &RequestContext,
) -> Result<detach_ancestor::DetachingAndReparenting, detach_ancestor::Error> {
detach_ancestor::detach_and_reparent(self, tenant, prepared, ctx).await
detach_ancestor::detach_and_reparent(
self,
tenant,
prepared,
ancestor_timeline_id,
ancestor_lsn,
behavior,
ctx,
)
.await
}
/// Final step which unblocks the GC.
@@ -7127,6 +7197,7 @@ mod tests {
)
.await
.unwrap();
let ctx = &ctx.with_scope_timeline(&timeline);
// Layer visibility is an input to heatmap generation, so refresh it first
timeline.update_layer_visibility().await.unwrap();
@@ -7192,7 +7263,7 @@ mod tests {
eprintln!("Downloading {layer} and re-generating heatmap");
let ctx = &RequestContextBuilder::extend(&ctx)
let ctx = &RequestContextBuilder::extend(ctx)
.download_behavior(crate::context::DownloadBehavior::Download)
.build();

View File

@@ -393,6 +393,9 @@ impl GcCompactionQueue {
if job.dry_run {
flags |= CompactFlags::DryRun;
}
if options.flags.contains(CompactFlags::NoYield) {
flags |= CompactFlags::NoYield;
}
let options = CompactOptions {
flags,
sub_compaction: false,
@@ -1091,7 +1094,7 @@ impl Timeline {
let latest_gc_cutoff = self.get_applied_gc_cutoff_lsn();
tracing::info!(
"latest_gc_cutoff: {}, pitr cutoff {}",
"starting shard ancestor compaction, latest_gc_cutoff: {}, pitr cutoff {}",
*latest_gc_cutoff,
self.gc_info.read().unwrap().cutoffs.time
);
@@ -1120,6 +1123,7 @@ impl Timeline {
// Expensive, exhaustive check of keys in this layer: this guards against ShardedRange's calculations being
// wrong. If ShardedRange claims the local page count is zero, then no keys in this layer
// should be !is_key_disposable()
// TODO: exclude sparse keyspace from this check, otherwise it will infinitely loop.
let range = layer_desc.get_key_range();
let mut key = range.start;
while key < range.end {
@@ -2616,6 +2620,7 @@ impl Timeline {
) -> Result<CompactionOutcome, CompactionError> {
let sub_compaction = options.sub_compaction;
let job = GcCompactJob::from_compact_options(options.clone());
let no_yield = options.flags.contains(CompactFlags::NoYield);
if sub_compaction {
info!(
"running enhanced gc bottom-most compaction with sub-compaction, splitting compaction jobs"
@@ -2630,14 +2635,15 @@ impl Timeline {
idx + 1,
jobs_len
);
self.compact_with_gc_inner(cancel, job, ctx).await?;
self.compact_with_gc_inner(cancel, job, ctx, no_yield)
.await?;
}
if jobs_len == 0 {
info!("no jobs to run, skipping gc bottom-most compaction");
}
return Ok(CompactionOutcome::Done);
}
self.compact_with_gc_inner(cancel, job, ctx).await
self.compact_with_gc_inner(cancel, job, ctx, no_yield).await
}
async fn compact_with_gc_inner(
@@ -2645,6 +2651,7 @@ impl Timeline {
cancel: &CancellationToken,
job: GcCompactJob,
ctx: &RequestContext,
no_yield: bool,
) -> Result<CompactionOutcome, CompactionError> {
// Block other compaction/GC tasks from running for now. GC-compaction could run along
// with legacy compaction tasks in the future. Always ensure the lock order is compaction -> gc.
@@ -2914,14 +2921,18 @@ impl Timeline {
if cancel.is_cancelled() {
return Err(CompactionError::ShuttingDown);
}
let should_yield = self
.l0_compaction_trigger
.notified()
.now_or_never()
.is_some();
if should_yield {
tracing::info!("preempt gc-compaction when downloading layers: too many L0 layers");
return Ok(CompactionOutcome::YieldForL0);
if !no_yield {
let should_yield = self
.l0_compaction_trigger
.notified()
.now_or_never()
.is_some();
if should_yield {
tracing::info!(
"preempt gc-compaction when downloading layers: too many L0 layers"
);
return Ok(CompactionOutcome::YieldForL0);
}
}
let resident_layer = layer
.download_and_keep_resident(ctx)
@@ -3054,16 +3065,21 @@ impl Timeline {
if cancel.is_cancelled() {
return Err(CompactionError::ShuttingDown);
}
keys_processed += 1;
if keys_processed % 1000 == 0 {
let should_yield = self
.l0_compaction_trigger
.notified()
.now_or_never()
.is_some();
if should_yield {
tracing::info!("preempt gc-compaction in the main loop: too many L0 layers");
return Ok(CompactionOutcome::YieldForL0);
if !no_yield {
keys_processed += 1;
if keys_processed % 1000 == 0 {
let should_yield = self
.l0_compaction_trigger
.notified()
.now_or_never()
.is_some();
if should_yield {
tracing::info!(
"preempt gc-compaction in the main loop: too many L0 layers"
);
return Ok(CompactionOutcome::YieldForL0);
}
}
}
if self.shard_identity.is_key_disposable(&key) {
@@ -3173,7 +3189,11 @@ impl Timeline {
}
// TODO: move the below part to the loop body
let last_key = last_key.expect("no keys produced during compaction");
let Some(last_key) = last_key else {
return Err(CompactionError::Other(anyhow!(
"no keys produced during compaction"
)));
};
stat.on_unique_key_visited();
let retention = self

View File

@@ -11,6 +11,7 @@ use utils::id::TimelineId;
use utils::{crashsafe, fs_ext, pausable_failpoint};
use crate::config::PageServerConf;
use crate::context::RequestContext;
use crate::task_mgr::{self, TaskKind};
use crate::tenant::metadata::TimelineMetadata;
use crate::tenant::remote_timeline_client::{
@@ -291,10 +292,11 @@ impl DeleteTimelineFlow {
timeline_id: TimelineId,
local_metadata: &TimelineMetadata,
remote_client: RemoteTimelineClient,
ctx: &RequestContext,
) -> anyhow::Result<()> {
// Note: here we even skip populating layer map. Timeline is essentially uninitialized.
// RemoteTimelineClient is the only functioning part.
let timeline = tenant
let (timeline, _timeline_ctx) = tenant
.create_timeline_struct(
timeline_id,
local_metadata,
@@ -307,6 +309,7 @@ impl DeleteTimelineFlow {
crate::tenant::CreateTimelineIdempotency::FailWithConflict, // doesn't matter what we put here
None, // doesn't matter what we put here
None, // doesn't matter what we put here
ctx,
)
.context("create_timeline_struct")?;

View File

@@ -32,6 +32,9 @@ pub(crate) enum Error {
#[error("too many ancestors")]
TooManyAncestors,
#[error("ancestor is not empty")]
AncestorNotEmpty,
#[error("shutting down, please retry later")]
ShuttingDown,
@@ -89,7 +92,9 @@ impl From<Error> for ApiError {
fn from(value: Error) -> Self {
match value {
Error::NoAncestor => ApiError::Conflict(value.to_string()),
Error::TooManyAncestors => ApiError::BadRequest(anyhow::anyhow!("{value}")),
Error::TooManyAncestors | Error::AncestorNotEmpty => {
ApiError::BadRequest(anyhow::anyhow!("{value}"))
}
Error::ShuttingDown => ApiError::ShuttingDown,
Error::Archived(_) => ApiError::BadRequest(anyhow::anyhow!("{value}")),
Error::OtherTimelineDetachOngoing(_) | Error::FailedToReparentAll => {
@@ -127,13 +132,37 @@ pub(crate) struct PreparedTimelineDetach {
layers: Vec<Layer>,
}
/// TODO: this should be part of PageserverConf because we cannot easily modify cplane arguments.
// TODO: this should be part of PageserverConf because we cannot easily modify cplane arguments.
#[derive(Debug)]
pub(crate) struct Options {
pub(crate) rewrite_concurrency: std::num::NonZeroUsize,
pub(crate) copy_concurrency: std::num::NonZeroUsize,
}
/// Controls the detach ancestor behavior.
/// - When set to `NoAncestorAndReparent`, we will only detach a branch if its ancestor is a root branch. It will automatically reparent any children of the ancestor before and at the branch point.
/// - When set to `MultiLevelAndNoReparent`, we will detach a branch from multiple levels of ancestors, and no reparenting will happen at all.
#[derive(Debug, Clone, Copy, Default)]
pub enum DetachBehavior {
#[default]
NoAncestorAndReparent,
MultiLevelAndNoReparent,
}
impl std::str::FromStr for DetachBehavior {
type Err = &'static str;
fn from_str(s: &str) -> Result<Self, Self::Err> {
match s {
"no_ancestor_and_reparent" => Ok(DetachBehavior::NoAncestorAndReparent),
"multi_level_and_no_reparent" => Ok(DetachBehavior::MultiLevelAndNoReparent),
"v1" => Ok(DetachBehavior::NoAncestorAndReparent),
"v2" => Ok(DetachBehavior::MultiLevelAndNoReparent),
_ => Err("cannot parse detach behavior"),
}
}
}
impl Default for Options {
fn default() -> Self {
Self {
@@ -147,7 +176,8 @@ impl Default for Options {
#[derive(Debug)]
pub(crate) struct Attempt {
pub(crate) timeline_id: TimelineId,
pub(crate) ancestor_timeline_id: TimelineId,
pub(crate) ancestor_lsn: Lsn,
_guard: completion::Completion,
gate_entered: Option<utils::sync::gate::GateGuard>,
}
@@ -167,25 +197,30 @@ impl Attempt {
pub(super) async fn prepare(
detached: &Arc<Timeline>,
tenant: &Tenant,
behavior: DetachBehavior,
options: Options,
ctx: &RequestContext,
) -> Result<Progress, Error> {
use Error::*;
let Some((ancestor, ancestor_lsn)) = detached
let Some((mut ancestor, mut ancestor_lsn)) = detached
.ancestor_timeline
.as_ref()
.map(|tl| (tl.clone(), detached.ancestor_lsn))
else {
let ancestor_id;
let ancestor_lsn;
let still_in_progress = {
let accessor = detached.remote_client.initialized_upload_queue()?;
// we are safe to inspect the latest uploaded, because we can only witness this after
// restart is complete and ancestor is no more.
let latest = accessor.latest_uploaded_index_part();
if latest.lineage.detached_previous_ancestor().is_none() {
let Some((id, lsn)) = latest.lineage.detached_previous_ancestor() else {
return Err(NoAncestor);
};
ancestor_id = id;
ancestor_lsn = lsn;
latest
.gc_blocking
@@ -196,7 +231,8 @@ pub(super) async fn prepare(
if still_in_progress {
// gc is still blocked, we can still reparent and complete.
// we are safe to reparent remaining, because they were locked in in the beginning.
let attempt = continue_with_blocked_gc(detached, tenant).await?;
let attempt =
continue_with_blocked_gc(detached, tenant, ancestor_id, ancestor_lsn).await?;
// because the ancestor of detached is already set to none, we have published all
// of the layers, so we are still "prepared."
@@ -224,13 +260,34 @@ pub(super) async fn prepare(
check_no_archived_children_of_ancestor(tenant, detached, &ancestor, ancestor_lsn)?;
if ancestor.ancestor_timeline.is_some() {
if let DetachBehavior::MultiLevelAndNoReparent = behavior {
// If the ancestor has an ancestor, we might be able to fast-path detach it if the current ancestor does not have any data written/used by the detaching timeline.
while let Some(ancestor_of_ancestor) = ancestor.ancestor_timeline.clone() {
if ancestor_lsn != ancestor.ancestor_lsn {
// non-technical requirement; we could flatten still if ancestor LSN does not match but that needs
// us to copy and cut more layers.
return Err(AncestorNotEmpty);
}
// Use the ancestor of the ancestor as the new ancestor (only when the ancestor LSNs are the same)
ancestor_lsn = ancestor.ancestor_lsn; // Get the LSN first before resetting the `ancestor` variable
ancestor = ancestor_of_ancestor;
// TODO: do we still need to check if we don't want to reparent?
check_no_archived_children_of_ancestor(tenant, detached, &ancestor, ancestor_lsn)?;
}
} else if ancestor.ancestor_timeline.is_some() {
// non-technical requirement; we could flatten N ancestors just as easily but we chose
// not to, at least initially
return Err(TooManyAncestors);
}
let attempt = start_new_attempt(detached, tenant).await?;
tracing::info!(
"attempt to detach the timeline from the ancestor: {}@{}, behavior={:?}",
ancestor.timeline_id,
ancestor_lsn,
behavior
);
let attempt = start_new_attempt(detached, tenant, ancestor.timeline_id, ancestor_lsn).await?;
utils::pausable_failpoint!("timeline-detach-ancestor::before_starting_after_locking-pausable");
@@ -450,8 +507,13 @@ pub(super) async fn prepare(
Ok(Progress::Prepared(attempt, prepared))
}
async fn start_new_attempt(detached: &Timeline, tenant: &Tenant) -> Result<Attempt, Error> {
let attempt = obtain_exclusive_attempt(detached, tenant)?;
async fn start_new_attempt(
detached: &Timeline,
tenant: &Tenant,
ancestor_timeline_id: TimelineId,
ancestor_lsn: Lsn,
) -> Result<Attempt, Error> {
let attempt = obtain_exclusive_attempt(detached, tenant, ancestor_timeline_id, ancestor_lsn)?;
// insert the block in the index_part.json, if not already there.
let _dont_care = tenant
@@ -466,13 +528,23 @@ async fn start_new_attempt(detached: &Timeline, tenant: &Tenant) -> Result<Attem
Ok(attempt)
}
async fn continue_with_blocked_gc(detached: &Timeline, tenant: &Tenant) -> Result<Attempt, Error> {
async fn continue_with_blocked_gc(
detached: &Timeline,
tenant: &Tenant,
ancestor_timeline_id: TimelineId,
ancestor_lsn: Lsn,
) -> Result<Attempt, Error> {
// FIXME: it would be nice to confirm that there is an in-memory version, since we've just
// verified there is a persistent one?
obtain_exclusive_attempt(detached, tenant)
obtain_exclusive_attempt(detached, tenant, ancestor_timeline_id, ancestor_lsn)
}
fn obtain_exclusive_attempt(detached: &Timeline, tenant: &Tenant) -> Result<Attempt, Error> {
fn obtain_exclusive_attempt(
detached: &Timeline,
tenant: &Tenant,
ancestor_timeline_id: TimelineId,
ancestor_lsn: Lsn,
) -> Result<Attempt, Error> {
use Error::{OtherTimelineDetachOngoing, ShuttingDown};
// ensure we are the only active attempt for this tenant
@@ -493,6 +565,8 @@ fn obtain_exclusive_attempt(detached: &Timeline, tenant: &Tenant) -> Result<Atte
Ok(Attempt {
timeline_id: detached.timeline_id,
ancestor_timeline_id,
ancestor_lsn,
_guard: guard,
gate_entered: Some(_gate_entered),
})
@@ -795,6 +869,9 @@ pub(super) async fn detach_and_reparent(
detached: &Arc<Timeline>,
tenant: &Tenant,
prepared: PreparedTimelineDetach,
ancestor_timeline_id: TimelineId,
ancestor_lsn: Lsn,
behavior: DetachBehavior,
_ctx: &RequestContext,
) -> Result<DetachingAndReparenting, Error> {
let PreparedTimelineDetach { layers } = prepared;
@@ -822,7 +899,30 @@ pub(super) async fn detach_and_reparent(
"cannot (detach? reparent)? complete if the operation is not still ongoing"
);
let ancestor = match (detached.ancestor_timeline.as_ref(), recorded_branchpoint) {
let ancestor_to_detach = match detached.ancestor_timeline.as_ref() {
Some(mut ancestor) => {
while ancestor.timeline_id != ancestor_timeline_id {
match ancestor.ancestor_timeline.as_ref() {
Some(found) => {
if ancestor_lsn != ancestor.ancestor_lsn {
return Err(Error::DetachReparent(anyhow::anyhow!(
"cannot find the ancestor timeline to detach from: wrong ancestor lsn"
)));
}
ancestor = found;
}
None => {
return Err(Error::DetachReparent(anyhow::anyhow!(
"cannot find the ancestor timeline to detach from"
)));
}
}
}
Some(ancestor)
}
None => None,
};
let ancestor = match (ancestor_to_detach, recorded_branchpoint) {
(Some(ancestor), None) => {
assert!(
!layers.is_empty(),
@@ -895,6 +995,11 @@ pub(super) async fn detach_and_reparent(
Ancestor::Detached(ancestor, ancestor_lsn) => (ancestor, ancestor_lsn, false),
};
if let DetachBehavior::MultiLevelAndNoReparent = behavior {
// Do not reparent if the user requests to behave so.
return Ok(DetachingAndReparenting::Reparented(HashSet::new()));
}
let mut tasks = tokio::task::JoinSet::new();
// Returns a single permit semaphore which will be used to make one reparenting succeed,
@@ -1032,6 +1137,11 @@ pub(super) async fn complete(
}
/// Query against a locked `Tenant::timelines`.
///
/// A timeline is reparentable if:
///
/// - It is not the timeline being detached.
/// - It has the same ancestor as the timeline being detached. Note that the ancestor might not be the direct ancestor.
fn reparentable_timelines<'a, I>(
timelines: I,
detached: &'a Arc<Timeline>,

View File

@@ -93,7 +93,8 @@ impl Timeline {
}
}
let ctx = RequestContext::new(TaskKind::Eviction, DownloadBehavior::Warn);
let ctx = RequestContext::new(TaskKind::Eviction, DownloadBehavior::Warn)
.with_scope_timeline(&self);
loop {
let policy = self.get_eviction_policy();
let cf = self

Some files were not shown because too many files have changed in this diff Show More