Compare commits

...

50 Commits

Author SHA1 Message Date
John Spray
4363132305 libs/http: include path when logging auth failures 2024-03-12 10:55:12 +00:00
Heikki Linnakangas
621ea2ec44 tests: try to make restored-datadir comparison tests not flaky v2
This test occasionally fails with a difference in "pg_xact/0000" file
between the local and restored datadirs. My hypothesis is that
something changed in the database between the last explicit checkpoint
and the shutdown. I suspect autovacuum, it could certainly create
transactions.

To fix, be more precise about the point in time that we compare. Shut
down the endpoint first, then read the last LSN (i.e. the shutdown
checkpoint's LSN), from the local disk with pg_controldata. And use
exactly that LSN in the basebackup.

Closes #559
2024-03-11 23:29:32 +04:00
Heikki Linnakangas
74d09b78c7 Keep walproposer alive until shutdown checkpoint is safe on safekepeers
The walproposer pretends to be a walsender in many ways. It has a
WalSnd slot, it claims to be a walsender by calling
MarkPostmasterChildWalSender() etc. But one different to real
walsenders was that the postmaster still treated it as a bgworker
rather than a walsender. The difference is that at shutdown,
walsenders are not killed until the very end, after the checkpointer
process has written the shutdown checkpoint and exited.

As a result, the walproposer always got killed before the shutdown
checkpoint was written, so the shutdown checkpoint never made it to
safekeepers. That's fine in principle, we don't require a clean
shutdown after all. But it also feels a bit silly not to stream the
shutdown checkpoint. It could be useful for initializing hot standby
mode in a read replica, for example.

Change postmaster to treat background workers that have called
MarkPostmasterChildWalSender() as walsenders. That unfortunately
requires another small change in postgres core.

After doing that, walproposers stay alive longer. However, it also
means that the checkpointer will wait for the walproposer to switch to
WALSNDSTATE_STOPPING state, when the checkpointer sends the
PROCSIG_WALSND_INIT_STOPPING signal. We don't have the machinery in
walproposer to receive and handle that signal reliably. Instead, we
mark walproposer as being in WALSNDSTATE_STOPPING always.

In commit 568f91420a, I assumed that shutdown will wait for all the
remaining WAL to be streamed to safekeepers, but before this commit
that was not true, and the test became flaky. This should make it
stable again.

Some tests wrongly assumed that no WAL could have been written between
pg_current_wal_flush_lsn and quick pg stop after it. Fix them by introducing
flush_ep_to_pageserver which first stops the endpoint and then waits till all
committed WAL reaches the pageserver.

In passing extract safekeeper http client to its own module.
2024-03-11 23:29:32 +04:00
Arseny Sher
0cf0731d8b SIGQUIT instead of SIGKILL prewarmed postgres.
To avoid orphaned processes using wiped datadir with confusing logging.
2024-03-11 22:36:52 +04:00
Sasha Krassovsky
98723844ee Don't return from inside PG_TRY (#7095)
## Problem
Returning from PG_TRY is a bug, and we currently do that

## Summary of changes
Make it break and then return false. This should also help stabilize
test_bad_connection.py
2024-03-11 18:36:39 +00:00
Alex Chi Z
73a8c97ac8 fix: warnings when compiling neon extensions (#7053)
proceeding https://github.com/neondatabase/neon/pull/7010, close
https://github.com/neondatabase/neon/issues/6188

## Summary of changes

This pull request (should) fix all warnings except
`-Wdeclaration-after-statement` in the neon extension compilation.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-11 17:49:58 +00:00
Christian Schwarz
17a3c9036e follow-up(#7077): adjust flaky-test-detection cutoff date for tokio-epoll-uring (#7090)
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-03-11 16:36:49 +00:00
Joonas Koivunen
8c5b310090 fix: Layer delete on drop and eviction can outlive timeline shutdown (#7082)
This is a follow-up to #7051 where `LayerInner::drop` and
`LayerInner::evict_blocking` were not noticed to require a gate before
the file deletion. The lack of entering a gate opens up a similar
possibility of deleting a layer file which a newer Timeline instance has
already checked out to be resident in a similar case as #7051.
2024-03-11 16:54:06 +01:00
Christian Schwarz
8224580f3e fix(tenant/timeline metrics): race condition during shutdown + recreation (#7064)
Tenant::shutdown or Timeline::shutdown completes and becomes externally
observable before the corresponding Tenant/Timeline object is dropped.

For example, after observing a Tenant::shutdown to complete, we could
attach the same tenant_id again. The shut down Tenant object might still
be around at the time of the attach.

The race is then the following:
- old object's metrics are still around
- new object uses with_label_values
- old object calls remove_label_values

The outcome is that the new object will have the metric objects (they're
an Arc internall) but the metrics won't be part of the internal registry
and hence they'll be missing in `/metrics`.

Later, when the new object gets shut down and tries to
remove_label_value, it will observe an error because
the metric was already removed by the old object.

Changes
-------

This PR moves metric removal to `shutdown()`.

An alternative design would be to multi-version the metrics using a
distinguishing label, or, to use a better metrics crate that allows
removing metrics from the registry through the locally held metric
handle instead of interacting with the (globally shared) registry.

refs https://github.com/neondatabase/neon/pull/7051
2024-03-11 15:41:41 +01:00
Christian Schwarz
2b0f3549f7 default to tokio-epoll-uring in CI tests & on Linux (#7077)
All of production is using it now as of
https://github.com/neondatabase/aws/pull/1121

The change in `flaky_tests.py` resets the flakiness detection logic.

The alternative would have been to repeat the choice of io engine in
each test name, which would junk up the various test reports too much.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-03-11 14:35:59 +00:00
John Spray
b4972d07d4 storage controller: refactor non-mutable members up into Service (#7086)
result_tx and compute_hook were in ServiceState (i.e. behind a sync
mutex), but didn't need to be.

Moving them up into Service removes a bunch of boilerplate clones.

While we're here, create a helper `Service::maybe_reconcile_shard` which
avoids writing out all the `&self.` arguments to
`TenantState::maybe_reconcile` everywhere we call it.
2024-03-11 14:29:32 +00:00
Joonas Koivunen
26ae7b0b3e fix(metrics): reset TENANT_STATE metric on startup (#7084)
Otherwise, it might happen that we never get to witness the same state
on subsequent restarts, thus the time series will show the value from a
few restarts ago.

The actual case here was that "Activating" was showing `3` while I was
doing tenant migration testing on staging. The number 3 was however from
a startup that happened some time ago which had been interrupted by
another deployment.
2024-03-11 13:25:53 +00:00
John Spray
f8483cc4a3 pageserver: update swagger for HA APIs (#7070)
- The type of heatmap_period in tenant config was wrrong
- Secondary download and heatmap upload endpoints weren't in swagger.
2024-03-11 09:32:17 +00:00
Conrad Ludgate
cc5d6c66b3 proxy: categorise new cplane error message (#7057)
## Problem

`422 Unprocessable Entity: compute time quota of non-primary branches is
exceeded` being marked as a control plane error.

## Summary of changes

Add the manual checks to make this a user error that should not be
retried.
2024-03-11 09:20:09 +01:00
Roman Zaynetdinov
d894d2b450 Export db size, deadlocks and changed row metrics (#7050)
## Problem

We want to report metrics for the oldest user database.
2024-03-11 08:10:04 +00:00
Joonas Koivunen
b09d686335 fix: on-demand downloads can outlive timeline shutdown (#7051)
## Problem

Before this PR, it was possible that on-demand downloads were started
after `Timeline::shutdown()`.

For example, we have observed a walreceiver-connection-handler-initiated
on-demand download that was started after `Timeline::shutdown()`s final
`task_mgr::shutdown_tasks()` call.

The underlying issue is that `task_mgr::shutdown_tasks()` isn't sticky,
i.e., new tasks can be spawned during or after
`task_mgr::shutdown_tasks()`.

Cc: https://github.com/neondatabase/neon/issues/4175 in lieu of a more
specific issue for task_mgr. We already decided we want to get rid of it
anyways.

Original investigation:
https://neondb.slack.com/archives/C033RQ5SPDH/p1709824952465949

## Changes

- enter gate while downloading
- use timeline cancellation token for cancelling download

thereby, fixes #7054

Entering the gate might also remove recent "kept the gate from closing"
in staging.
2024-03-09 13:09:08 +00:00
Christian Schwarz
74d24582cf throttling: exclude throttled time from basebackup (fixup of #6953) (#7072)
PR #6953 only excluded throttled time from the handle_pagerequests
(aka smgr metrics).

This PR implements the deduction for `basebackup ` queries.

The other page_service methods either don't use Timeline::get
or they aren't used in production.

Found by manually inspecting in [staging
logs](https://neonprod.grafana.net/explore?schemaVersion=1&panes=%7B%22wx8%22:%7B%22datasource%22:%22xHHYY0dVz%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bhostname%3D%5C%22pageserver-0.eu-west-1.aws.neon.build%5C%22%7D%20%7C~%20%60git-env%7CERR%7CWARN%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22xHHYY0dVz%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22to%22:%221709919114642%22,%22from%22:%221709904430898%22%7D%7D%7D).
2024-03-09 13:37:02 +01:00
Sasha Krassovsky
4834d22d2d Revoke REPLICATION (#7052)
## Problem
Currently users can cause problems with replication
## Summary of changes
Don't let them replicate
2024-03-08 22:24:30 +00:00
Anastasia Lubennikova
86e8c43ddf Add downgrade scripts for neon extension. (#7065)
## Problem

When we start compute with newer version of extension (i.e. 1.2) and
then rollback the release, downgrading the compute version, next compute
start will try to update extension to the latest version available in
neon.control (i.e. 1.1).

Thus we need to provide downgrade scripts like neon--1.2--1.1.sql

These scripts must revert the changes made by the upgrade scripts in the
reverse order. This is necessary to ensure that the next upgrade will
work correctly.

In general, we need to write upgrade and downgrade scripts to be more
robust and add IF EXISTS / CREATE OR REPLACE clauses to all statements
(where applicable).

## Summary of changes
Adds downgrade scripts.
Adds test cases for extension downgrade/upgrade. 

fixes #7066

This is a follow-up for
https://app.incident.io/neondb/incidents/167?tab=follow-ups

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Alex Chi Z <iskyzh@gmail.com>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
2024-03-08 20:42:35 +00:00
John Spray
7329413705 storage controller: enable setting PlacementPolicy in tenant creation (#7037)
## Problem

Tenants created via the storage controller have a `PlacementPolicy` that
defines their HA/secondary/detach intent. For backward compat we can
just set it to Single, for onboarding tenants using /location_conf it is
automatically set to Double(1) if there are at least two pageservers,
but for freshly created tenants we didn't have a way to specify it.

This unblocks writing tests that create HA tenants on the storage
controller and do failure injection testing.

## Summary of changes

- Add optional fields to TenantCreateRequest for specifying
PlacementPolicy. This request structure is used both on pageserver API
and storage controller API, but this method is only meaningful for the
storage controller (same as existing `shard_parameters` attribute).
- Use the value from the creation request in tenant creation, if
provided.
2024-03-08 15:34:53 +00:00
Conrad Ludgate
2c132e45cb proxy: do not store ephemeral endpoints in http pool (#6819)
## Problem

For the ephemeral endpoint feature, it's not really too helpful to keep
them around in the connection pool. This isn't really pressing but I
think it's still a bit better this way.

## Summary of changes

Add `is_ephemeral` function to `NeonOptions`. Allow
`serverless::ConnInfo::endpoint_cache_key()` to return an `Option`.
Handle that option appropriately
2024-03-08 07:56:23 +00:00
Vlad Lazar
0f05ef67e2 pageserver: revert open layer rolling revert (#6962)
## Problem
We reverted https://github.com/neondatabase/neon/pull/6661 a few days
ago. The change led to OOMs in
benchmarks followed by large WAL reingests.

The issue was that we removed [this
code](d04af08567/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs (L409-L417)).
That call may trigger a roll of the open layer due to
the keepalive messages received from the safekeeper. Removing it meant
that enforcing
of checkpoint timeout became even more lax and led to using up large
amounts of memory
for the in memory layer indices.

## Summary of changes
Piggyback on keep alive messages to enforce checkpoint timeout. This is
a hack, but it's exactly what
the current code is doing.

## Alternatives
Christhian, Joonas and myself sketched out a timer based approach
[here](https://github.com/neondatabase/neon/pull/6940). While discussing
it further, it became obvious that's also a bit of a hack and not the
desired end state. I chose not
to take that further since it's not what we ultimately want and it'll be
harder to rip out.

Right now it's unclear what the ideal system behaviour is:
* early flushing on memory pressure, or ...
* detaching tenants on memory pressure
2024-03-07 19:53:10 +00:00
Conrad Ludgate
02358b21a4 update rustls (#7048)
## Summary of changes

Update rustls from 0.21 to 0.22.

reqwest/tonic/aws-smithy still use rustls 0.21. no upgrade route
available yet.
2024-03-07 18:23:19 +00:00
Sasha Krassovsky
2fc89428c3 Hopefully stabilize test_bad_connection.py (#6976)
## Problem
It seems that even though we have a retry on basebackup, it still
sometimes fails to fetch it with the failpoint enabled, resulting in a
test error.

## Summary of changes
If we fail to get the basebackup, disable the failpoint and try again.
2024-03-07 10:12:06 -08:00
Arpad Müller
ce7a82db05 Update svg_fmt (#7049)
Gets upstream PR https://github.com/nical/rust_debug/pull/3 , removes
trailing "s from output.
2024-03-07 17:32:09 +00:00
John Spray
d5a6a2a16d storage controller: robustness improvements (#7027)
## Problem


Closes: https://github.com/neondatabase/neon/issues/6847
Closes: https://github.com/neondatabase/neon/issues/7006

## Summary of changes

- Pageserver API calls are wrapped in timeout/retry logic: this prevents
a reconciler getting hung on a pageserver API hang, and prevents
reconcilers having to totally retry if one API call returns a retryable
error (e.g. 503).
- Add a cancellation token to `Node`, so that when we mark a node
offline we will cancel any API calls in progress to that node, and avoid
issuing any more API calls to that offline node.
- If the dirty locations of a shard are all on offline nodes, then don't
spawn a reconciler
- In re-attach, if we have no observed state object for a tenant then
construct one with conf: None (which means "unknown"). Then in
Reconciler, implement a TODO for scanning such locations before running,
so that we will avoid spuriously incrementing a generation in the case
of a node that was offline while we started (this is the case that
tripped up #7006)
- Refactoring: make Node contents private (and thereby guarantee that
updates to availability mode reliably update the cancellation token.)
- Refactoring: don't pass the whole map of nodes into Reconciler (and
thereby remove a bunch of .expect() calls)

Some of this was discovered/tested with a new failure injection test
that will come in a separate PR, once it is stable enough for CI.
2024-03-07 17:10:03 +00:00
Vlad Lazar
871977f14c pageserver: fix early bail out in vectored get (#7038)
## Problem
When vectored get encountered a portion of the key range that could
not be mapped to any layer in the current timeline it would incorrectly
bail out of the current timeline. This is incorrect since we may have
had layers queued for a visit in the fringe.

## Summary of changes
* Add a repro unit test
* Remove the early bail out path
* Simplify range search return value
2024-03-07 16:02:20 +00:00
Joonas Koivunen
602a4da9a5 bench: run branch_creation_many at 500, seeded (#6959)
We have a benchmark for creating a lot of branches, but it does random
things, and the branch count is not what we is the largest maximum we
aim to support. If this PR would stabilize the benchmark total duration
it means that there are some structures which are very much slower than
others. Then we should add a seed-outputting variant to help find and
reproduce such cases.

Additionally, record for the benchmark:
- shutdown duration
- startup metrics once done (on restart)
- duration of first compaction completion via debug logging
2024-03-07 16:23:42 +02:00
John Spray
d3c583efbe Rename binary attachment_service -> storage_controller (#7042)
## Problem

The storage controller binary still has its historic
`attachment_service` name -- it will be painful to change this later
because we can't atomically update this repo and the helm charts used to
deploy.

Companion helm chart change:
https://github.com/neondatabase/helm-charts/pull/70

## Summary of changes

- Change the name of the binary to `storage_controller`
- Skipping renaming things in the source right now: this is just to get
rid of the legacy name in external interfaces.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-03-07 14:06:48 +00:00
Vlad Lazar
d03ec9d998 pageserver: don't validate vectored get on shut-down (#7039)
## Problem
We attempted validation for cancelled errors under the assumption that
if vectored get fails, sequential get will too.
That's not right 100% of times though because sequential get may have
the values cached and slip them through
even when shutting down.

## Summary of changes
Don't validate if either search impl failed due to tenant shutdown.
2024-03-07 12:37:52 +00:00
Conrad Ludgate
c2876ec55d proxy http tls investigations (#7045)
## Problem

Some HTTP-specific TLS errors

## Summary of changes

Add more logging, vendor `tls-listener` with minor modifications.
2024-03-07 12:36:47 +00:00
Alex Chi Z
0b330e1310 upgrade neon extension on startup (#7029)
## Problem

Fix https://github.com/neondatabase/neon/issues/7003. Fix
https://github.com/neondatabase/neon/issues/6982. Currently, neon
extension is only upgraded when new compute spec gets applied, for
example, when creating a new role or creating a new database. This also
resolves `neon.lfc_stat` not found warnings in prod.

## Summary of changes

This pull request adds the logic to spawn a background thread to upgrade
the neon extension version if the compute is a primary. If for whatever
reason the upgrade fails, it reports an error to the console and does
not impact compute node state.

This change can be further applied to 3rd-party extension upgrades. We
can silently upgrade the version of 3rd party extensions in the
background in the future.

Questions:

* Does alter extension takes some kind of lock that will block user
requests?
* Does `ALTER EXTENSION` writes to the database if nothing needs to be
upgraded? (may impact storage size).

Otherwise it's safe to land this pull request.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-06 12:20:44 -05:00
Alexander Bayandin
f40b13d801 Update client libs for test_runner/pg_clients to their latest versions (#7022)
## Problem
Closes https://github.com/neondatabase/neon/security/dependabot/56
Supersedes https://github.com/neondatabase/neon/pull/7013

Workflow run:
https://github.com/neondatabase/neon/actions/runs/8157302480

## Summary of changes
- Update client libs for `test_runner/pg_clients` to their latest
versions
2024-03-06 17:09:54 +00:00
John Spray
a9a4a76d13 storage controller: misc fixes (#7036)
## Problem

Collection of small changes, batched together to reduce CI overhead.

## Summary of changes

- Layer download messages include size -- this is useful when watching a
pageserver hydrate its on disk cache in the log.
- Controller migrate API could put an invalid NodeId into TenantState
- Scheduling errors during tenant create could result in creating some
shards and not others.
- Consistency check could give hard-to-understand failures in tests if a
reconcile was in process: explicitly fail the check if reconciles are in
progress instead.
2024-03-06 16:47:32 +00:00
Alex Chi Z
5dc2088cf3 fix(test): drop subscription when test completes (#6975)
This pull request mitigates
https://github.com/neondatabase/neon/issues/6969, but the longer-term
problem is that we cannot properly stop Postgres if there is a
subscription.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-06 15:52:24 +00:00
John Spray
4a31e18c81 storage controller: include stripe size in compute notifications (#6974)
## Problem

- The storage controller is the source of truth for a tenant's stripe
size, but doesn't currently have a way to propagate that to compute:
we're just using the default stripe size everywhere.

Closes: https://github.com/neondatabase/neon/issues/6903

## Summary of changes

- Include stripe size in `ComputeHookNotifyRequest`
- Include stripe size in `LocationConfigResponse`

The stripe size is optional: it will only be advertised for
multi-sharded tenants. This enables the controller to defer the choice
of stripe size until we split a tenant for the first time.
2024-03-06 13:56:30 +00:00
John Spray
a3ef50c9b6 storage controller: use 'lazy' mode for location_config (#6987)
## Problem

If large numbers of shards are attached to a pageserver concurrently,
for example after another node fails, it can cause excessive I/O queue
depths due to all the newly attached shards trying to calculate logical
sizes concurrently.

#6907 added the `lazy` flag to handle this.

## Summary of changes

- Use `lazy=true` from all /location_config calls in the storage
controller Reconciler.
2024-03-06 11:26:29 +00:00
Arpad Müller
2f88e7a921 Move compaction code to compaction.rs (#7026)
Moves some of the (legacy) compaction code to compaction.rs. No
functional changes, just moves of code.

Before, compaction.rs was only for the new tiered compaction mechanism,
now it's for both the old and new mechanisms.

Part of #6768
2024-03-06 01:40:23 +00:00
Christian Schwarz
eacdc179dc fixup(#6991): it broke the macOS build (#7024) 2024-03-05 17:03:51 +00:00
Vlad Lazar
2daa2f1d10 test: disable large slru basebackup bench in ci (#7025)
The test is flaky due to
https://github.com/neondatabase/neon/issues/7006.
2024-03-05 15:41:05 +00:00
Anna Khanova
15b3665dc4 proxy: fix bug with populating the data (#7023)
## Problem

Branch/project and coldStart were not populated to data events.

## Summary of changes

Populate it. Also added logging for the coldstart info.
2024-03-05 15:32:58 +00:00
Arpad Müller
e69a25542b Minor improvements to tiered compaction (#7020)
Minor non-functional improvements to tiered compaction, mostly
consisting of comment fixes.

Followup of  #6830, part of #6768
2024-03-05 16:26:51 +01:00
Alex Chi Z
b036c32262 fix -Wmissing-prototypes for neon extension (#7010)
## Problem

ref https://github.com/neondatabase/neon/issues/6188

## Summary of changes

This pull request fixes `-Wmissing-prototypes` for the neon extension.
Note that (1) the gcc version in CI and macOS is different, therefore
some of the warning does not get reported when developing the neon
extension locally. (2) the CI env variable `COPT = -Werror` does not get
passed into the docker build process, therefore warnings are not treated
as errors on CI.


e62baa9704/.github/workflows/build_and_test.yml (L22)

There will be follow-up pull requests on solving other warnings. By the
way, I did not figure out the default compile parameters in the CI env,
and therefore this pull request is tested by manually adding
`-Wmissing-prototypes` into the `COPT`.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-05 10:03:44 -05:00
Anna Khanova
bdbb2f4afc proxy: report redis broken message metric (#7021)
## Problem

Not really a problem. Improving visibility around redis communication.

## Summary of changes

Added metric on the number of broken messages.
2024-03-05 16:02:51 +01:00
Christian Schwarz
270d3be507 feat(per-tenant throttling): exclude throttled time from page_service metrics + regression test (#6953)
part of https://github.com/neondatabase/neon/issues/5899

Problem
-------

Before this PR, the time spent waiting on the throttle was charged
towards the higher-level page_service metrics, i.e.,
`pageserver_smgr_query_seconds`.
The metrics are the foundation of internal SLIs / SLOs.
A throttled tenant would cause the SLI to degrade / SLO alerts to fire.

Changes
-------


- don't charge time spent in throttle towards the page_service metrics
- record time spent in throttle in RequestContext and subtract it from
the elapsed time
- this works because the page_service path doesn't create child context,
so, all the throttle time is recorded in the parent
- it's quite brittle and will break if we ever decide to spawn child
tasks that need child RequestContexts, which would have separate
instances of the `micros_spent_throttled` counter.
- however, let's punt that to a more general refactoring of
RequestContext
- add a test case that ensures that
- throttling happens for getpage requests; this aspect of the test
passed before this PR
- throttling delays aren't charged towards the page_service metrics;
this aspect of the test only passes with this PR
- drive-by: make the throttle log message `info!`, it's an expected
condition

Performance
-----------

I took the same measurements as in #6706 , no meaningful change in CPU
overhead.

Future Work
-----------

This PR enables us to experiment with the throttle for select tenants
without affecting the SLI metrics / triggering SLO alerts.

Before declaring this feature done, we need more work to happen,
specifically:

- decide on whether we want to retain the flexibility of throttling any
`Timeline::get` call, filtered by TaskKind
- versus: separate throttles for each page_service endpoint, potentially
with separate config options
- the trouble here is that this decision implies changes to the
TenantConfig, so, if we start using the current config style now, then
decide to switch to a different config, it'll be a breaking change

Nice-to-haves but probably not worth the time right now:

- Equivalent tests to ensure the throttle applies to all other
page_service handlers.
2024-03-05 13:44:00 +00:00
Vlad Lazar
9dec65b75b pageserver: fix vectored read path delta layer index traversal (#7001)
## Problem
Last weeks enablement of vectored get generated a number of panics.
From them, I diagnosed two issues in the delta layer index traversal
logic
1. The `key >= range.start && lsn >= lsn_range.start`
was too aggressive. Lsns are not monotonically increasing in the delta
layer index (keys are though), so we cannot assert on them.
2. Lsns greater or equal to `lsn_range.end` were not skipped. This
caused the query to consider records newer than the request Lsn.

## Summary of changes
* Fix the issues mentioned above inline
* Refactor the layer traversal logic to make it unit testable
* Add unit test which reproduces the failure modes listed above.
2024-03-05 13:35:45 +00:00
Vlad Lazar
ae8468f97e pageserver: fix AUX key vectored get validation (#7018)
## Problem
The value reconstruct of AUX_FILES_KEY from records is not deterministic
since it uses a hash map under the hood. This caused vectored get validation
failures when enabled in staging.

## Summary of changes
Deserialise AUX_FILES_KEY blobs comparing. All other keys should
reconstruct deterministically, so we simply compare the blobs.
2024-03-05 13:30:43 +00:00
Christian Schwarz
f3e4f85e65 layer file download: final rename: fix durability (#6991)
Before this PR, the layer file download code would fsync the inode after
rename instead of the timeline directory. That is not in line with what
a comment further up says we're doing, and it's obviously not achieving
the goal of making the rename durable.

part of https://github.com/neondatabase/neon/issues/6663
2024-03-05 11:09:13 +00:00
Joonas Koivunen
752bf5a22f build: clippy disallow futures::pin_mut macro (#7016)
`std` has had `pin!` macro for some time, there is no need for us to use
the older alternatives. Cannot disallow `tokio::pin` because tokio
macros use that.
2024-03-05 10:14:37 +00:00
Christian Schwarz
3da410c8fe tokio-epoll-uring: use it on the layer-creating code paths (#6378)
part of #6663 
See that epic for more context & related commits.

Problem
-------

Before this PR, the layer-file-creating code paths were using
VirtualFile, but under the hood these were still blocking system calls.

Generally this meant we'd stall the executor thread, unless the caller
"knew" and used the following pattern instead:

```
spawn_blocking(|| {
    Handle::block_on(async {
        VirtualFile::....().await;
    })
}).await
```

Solution
--------

This PR adopts `tokio-epoll-uring` on the layer-file-creating code paths
in pageserver.

Note that on-demand downloads still use `tokio::fs`, these will be
converted in a future PR.

Design: Avoiding Regressions With `std-fs` 
------------------------------------------

If we make the VirtualFile write path truly async using
`tokio-epoll-uring`, should we then remove the `spawn_blocking` +
`Handle::block_on` usage upstack in the same commit?

No, because if we’re still using the `std-fs` io engine, we’d then block
the executor in those places where previously we were protecting us from
that through the `spawn_blocking` .

So, if we want to see benefits from `tokio-epoll-uring` on the write
path while also preserving the ability to switch between
`tokio-epoll-uring` and `std-fs` , where `std-fs` will behave identical
to what we have now, we need to ***conditionally* use `spawn_blocking +
Handle::block_on`** .

I.e., in the places where we use that know, we’ll need to make that
conditional based on the currently configured io engine.

It boils down to investigating all the places where we do
`spawn_blocking(... block_on(... VirtualFile::...))`.

Detailed [write-up of that investigation in
Notion](https://neondatabase.notion.site/Surveying-VirtualFile-write-path-usage-wrt-tokio-epoll-uring-integration-spawn_blocking-Handle-bl-5dc2270dbb764db7b2e60803f375e015?pvs=4
), made publicly accessible.

tl;dr: Preceding PRs addressed the relevant call sites:
- `metadata` file: turns out we could simply remove it (#6777, #6769,
#6775)
- `create_delta_layer()`: made sensitive to `virtual_file_io_engine` in
#6986

NB: once we are switched over to `tokio-epoll-uring` everywhere in
production, we can deprecate `std-fs`; to keep macOS support, we can use
`tokio::fs` instead. That will remove this whole headache.


Code Changes In This PR
-----------------------

- VirtualFile API changes
  - `VirtualFile::write_at`
- implement an `ioengine` operation and switch `VirtualFile::write_at`
to it
  - `VirtualFile::metadata()`
- curiously, we only use it from the layer writers' `finish()` methods
- introduce a wrapper `Metadata` enum because `std::fs::Metadata` cannot
be constructed by code outside rust std
- `VirtualFile::sync_all()` and for completeness sake, add
`VirtualFile::sync_data()`

Testing & Rollout
-----------------

Before merging this PR, we ran the CI with both io engines.

Additionally, the changes will soak in staging.

We could have a feature gate / add a new io engine
`tokio-epoll-uring-write-path` to do a gradual rollout. However, that's
not part of this PR.


Future Work
-----------

There's still some use of `std::fs` and/or `tokio::fs` for directory
namespace operations, e.g. `std::fs::rename`.

We're not addressing those in this PR, as we'll need to add the support
in tokio-epoll-uring first. Note that rename itself is usually fast if
the directory is in the kernel dentry cache, and only the fsync after
rename is slow. These fsyncs are using tokio-epoll-uring, so, the impact
should be small.
2024-03-05 09:03:54 +00:00
157 changed files with 5524 additions and 2712 deletions

View File

@@ -474,7 +474,7 @@ jobs:
TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
CHECK_ONDISK_DATA_COMPATIBILITY: nonempty
BUILD_TAG: ${{ needs.tag.outputs.build-tag }}
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: std-fs
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: tokio-epoll-uring
PAGESERVER_GET_VECTORED_IMPL: vectored
# Temporary disable this step until we figure out why it's so flaky
@@ -554,7 +554,7 @@ jobs:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
TEST_RESULT_CONNSTR: "${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}"
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: std-fs
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: tokio-epoll-uring
# XXX: no coverage data handling here, since benchmarks are run on release builds,
# while coverage is currently collected for the debug ones

312
Cargo.lock generated
View File

@@ -241,7 +241,7 @@ checksum = "16e62a023e7c117e27523144c5d2459f4397fcc3cab0085af8e2224f643a0193"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -252,7 +252,7 @@ checksum = "b9ccdd8f2a161be9bd5c023df56f1b2a0bd1d83872ae53b71a84a12c9bf6e842"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -626,7 +626,7 @@ dependencies = [
"once_cell",
"pin-project-lite",
"pin-utils",
"rustls",
"rustls 0.21.9",
"tokio",
"tracing",
]
@@ -907,6 +907,16 @@ version = "1.6.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8c3c1a368f70d6cf7302d78f8f7093da241fb8e8807c05cc9e51a125895a6d5b"
[[package]]
name = "bcder"
version = "0.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c627747a6774aab38beb35990d88309481378558875a41da1a4b2e373c906ef0"
dependencies = [
"bytes",
"smallvec",
]
[[package]]
name = "bincode"
version = "1.3.3"
@@ -935,7 +945,7 @@ dependencies = [
"regex",
"rustc-hash",
"shlex",
"syn 2.0.32",
"syn 2.0.52",
"which",
]
@@ -986,9 +996,9 @@ checksum = "14c189c53d098945499cdfa7ecc63567cf3886b3332b312a5b4585d8d3a6a610"
[[package]]
name = "bytes"
version = "1.4.0"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "89b2fd2a0dcf38d7971e2194b6b6eebab45ae01067456a7fd93d5547a61b70be"
checksum = "a2bd12c1caf447e69cd4528f47f94d203fd2582878ecb9e9465484c4148a8223"
dependencies = [
"serde",
]
@@ -1149,7 +1159,7 @@ dependencies = [
"heck",
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -1574,7 +1584,7 @@ dependencies = [
"proc-macro2",
"quote",
"strsim",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -1585,7 +1595,7 @@ checksum = "29a358ff9f12ec09c3e61fef9b5a9902623a695a46a917b07f269bff1445611a"
dependencies = [
"darling_core",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -1627,6 +1637,16 @@ dependencies = [
"zeroize",
]
[[package]]
name = "der"
version = "0.7.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fffa369a668c8af7dbf8b5e56c9f744fbd399949ed171606040001947de40b1c"
dependencies = [
"const-oid",
"zeroize",
]
[[package]]
name = "der-parser"
version = "8.2.0"
@@ -1681,7 +1701,7 @@ dependencies = [
"diesel_table_macro_syntax",
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -1701,7 +1721,7 @@ version = "0.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fc5557efc453706fed5e4fa85006fe9817c224c3f480a34c7e5959fd700921c5"
dependencies = [
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -1723,7 +1743,7 @@ checksum = "487585f4d0c6655fe74905e2504d8ad6908e4db67f744eb140876906c2f3175d"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -1747,10 +1767,10 @@ version = "0.14.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "413301934810f597c1d19ca71c8710e99a3f1ba28a0d2ebc01551a2daeea3c5c"
dependencies = [
"der",
"der 0.6.1",
"elliptic-curve",
"rfc6979",
"signature",
"signature 1.6.4",
]
[[package]]
@@ -1767,7 +1787,7 @@ checksum = "e7bb888ab5300a19b8e5bceef25ac745ad065f3c9f7efc6de1b91958110891d3"
dependencies = [
"base16ct",
"crypto-bigint 0.4.9",
"der",
"der 0.6.1",
"digest",
"ff",
"generic-array",
@@ -1827,7 +1847,7 @@ dependencies = [
"darling",
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -2087,7 +2107,7 @@ checksum = "87750cf4b7a4c0625b1529e4c543c2182106e4dedc60a2a6455e00d212c489ac"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -2470,10 +2490,10 @@ dependencies = [
"http 0.2.9",
"hyper",
"log",
"rustls",
"rustls 0.21.9",
"rustls-native-certs",
"tokio",
"tokio-rustls",
"tokio-rustls 0.24.0",
]
[[package]]
@@ -2711,7 +2731,7 @@ checksum = "5c7ea04a7c5c055c175f189b6dc6ba036fd62306b58c66c9f6389036c503a3f4"
dependencies = [
"base64 0.21.1",
"js-sys",
"pem 3.0.3",
"pem",
"ring 0.17.6",
"serde",
"serde_json",
@@ -3234,7 +3254,7 @@ checksum = "a948666b637a0f465e8564c73e89d4dde00d72d4d473cc972f390fc3dcee7d9c"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -3716,7 +3736,7 @@ dependencies = [
"parquet",
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -3754,16 +3774,6 @@ version = "0.1.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "19b17cddbe7ec3f8bc800887bab5e717348c95ea2ca0b1bf0837fb964dc67099"
[[package]]
name = "pem"
version = "2.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6b13fe415cdf3c8e44518e18a7c95a13431d9bdf6d15367d82b23c377fdd441a"
dependencies = [
"base64 0.21.1",
"serde",
]
[[package]]
name = "pem"
version = "3.0.3"
@@ -3825,7 +3835,7 @@ checksum = "39407670928234ebc5e6e580247dd567ad73a3578460c5990f9503df207e8f07"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -3846,8 +3856,8 @@ version = "0.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9eca2c590a5f85da82668fa685c09ce2888b9430e83299debf1f34b65fd4a4ba"
dependencies = [
"der",
"spki",
"der 0.6.1",
"spki 0.6.0",
]
[[package]]
@@ -3946,14 +3956,14 @@ dependencies = [
"futures",
"once_cell",
"pq_proto",
"rustls",
"rustls-pemfile",
"rustls 0.22.2",
"rustls-pemfile 2.1.1",
"serde",
"thiserror",
"tokio",
"tokio-postgres",
"tokio-postgres-rustls",
"tokio-rustls",
"tokio-rustls 0.25.0",
"tracing",
"workspace_hack",
]
@@ -4042,7 +4052,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3b69d39aab54d069e7f2fe8cb970493e7834601ca2d8c65fd7bbd183578080d1"
dependencies = [
"proc-macro2",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -4053,9 +4063,9 @@ checksum = "dc375e1527247fe1a97d8b7156678dfe7c1af2fc075c9a4db3690ecd2a148068"
[[package]]
name = "proc-macro2"
version = "1.0.66"
version = "1.0.78"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "18fb31db3f9bddb2ea821cde30a9f70117e3f119938b5ee630b7403aa6e2ead9"
checksum = "e2422ad645d89c99f8f3e6b88a9fdeca7fabeac836b1002371c4367c8f984aae"
dependencies = [
"unicode-ident",
]
@@ -4202,8 +4212,8 @@ dependencies = [
"routerify",
"rstest",
"rustc-hash",
"rustls",
"rustls-pemfile",
"rustls 0.22.2",
"rustls-pemfile 2.1.1",
"scopeguard",
"serde",
"serde_json",
@@ -4216,11 +4226,10 @@ dependencies = [
"thiserror",
"tikv-jemalloc-ctl",
"tikv-jemallocator",
"tls-listener",
"tokio",
"tokio-postgres",
"tokio-postgres-rustls",
"tokio-rustls",
"tokio-rustls 0.25.0",
"tokio-util",
"tracing",
"tracing-opentelemetry",
@@ -4248,9 +4257,9 @@ dependencies = [
[[package]]
name = "quote"
version = "1.0.32"
version = "1.0.35"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "50f3b39ccfb720540debaa0164757101c08ecb8d326b15358ce76a62c7e85965"
checksum = "291ec9ab5efd934aaf503a6466c5d5251535d108ee747472c3977cc5acc868ef"
dependencies = [
"proc-macro2",
]
@@ -4371,12 +4380,12 @@ dependencies = [
[[package]]
name = "rcgen"
version = "0.11.1"
version = "0.12.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4954fbc00dcd4d8282c987710e50ba513d351400dbdd00e803a05172a90d8976"
checksum = "48406db8ac1f3cbc7dcdb56ec355343817958a356ff430259bb07baf7607e1e1"
dependencies = [
"pem 2.0.1",
"ring 0.16.20",
"pem",
"ring 0.17.6",
"time",
"yasna",
]
@@ -4394,15 +4403,15 @@ dependencies = [
"itoa",
"percent-encoding",
"pin-project-lite",
"rustls",
"rustls 0.21.9",
"rustls-native-certs",
"rustls-pemfile",
"rustls-pemfile 1.0.2",
"rustls-webpki 0.101.7",
"ryu",
"sha1_smol",
"socket2 0.4.9",
"tokio",
"tokio-rustls",
"tokio-rustls 0.24.0",
"tokio-util",
"url",
]
@@ -4548,14 +4557,14 @@ dependencies = [
"once_cell",
"percent-encoding",
"pin-project-lite",
"rustls",
"rustls-pemfile",
"rustls 0.21.9",
"rustls-pemfile 1.0.2",
"serde",
"serde_json",
"serde_urlencoded",
"tokio",
"tokio-native-tls",
"tokio-rustls",
"tokio-rustls 0.24.0",
"tokio-util",
"tower-service",
"url",
@@ -4721,7 +4730,7 @@ dependencies = [
"regex",
"relative-path",
"rustc_version",
"syn 2.0.32",
"syn 2.0.52",
"unicode-ident",
]
@@ -4805,6 +4814,20 @@ dependencies = [
"sct",
]
[[package]]
name = "rustls"
version = "0.22.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e87c9956bd9807afa1f77e0f7594af32566e830e088a5576d27c5b6f30f49d41"
dependencies = [
"log",
"ring 0.17.6",
"rustls-pki-types",
"rustls-webpki 0.102.2",
"subtle",
"zeroize",
]
[[package]]
name = "rustls-native-certs"
version = "0.6.2"
@@ -4812,7 +4835,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0167bac7a9f490495f3c33013e7722b53cb087ecbe082fb0c6387c96f634ea50"
dependencies = [
"openssl-probe",
"rustls-pemfile",
"rustls-pemfile 1.0.2",
"schannel",
"security-framework",
]
@@ -4826,6 +4849,22 @@ dependencies = [
"base64 0.21.1",
]
[[package]]
name = "rustls-pemfile"
version = "2.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f48172685e6ff52a556baa527774f61fcaa884f59daf3375c62a3f1cd2549dab"
dependencies = [
"base64 0.21.1",
"rustls-pki-types",
]
[[package]]
name = "rustls-pki-types"
version = "1.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5ede67b28608b4c60685c7d54122d4400d90f62b40caee7700e700380a390fa8"
[[package]]
name = "rustls-webpki"
version = "0.100.2"
@@ -4846,6 +4885,17 @@ dependencies = [
"untrusted 0.9.0",
]
[[package]]
name = "rustls-webpki"
version = "0.102.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "faaa0a62740bedb9b2ef5afa303da42764c012f743917351dc9a237ea1663610"
dependencies = [
"ring 0.17.6",
"rustls-pki-types",
"untrusted 0.9.0",
]
[[package]]
name = "rustversion"
version = "1.0.12"
@@ -4888,7 +4938,7 @@ dependencies = [
"serde_with",
"thiserror",
"tokio",
"tokio-rustls",
"tokio-rustls 0.25.0",
"tokio-stream",
"tracing",
"tracing-appender",
@@ -5023,7 +5073,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3be24c1842290c45df0a7bf069e0c268a747ad05a192f2fd7dcfdbc1cba40928"
dependencies = [
"base16ct",
"der",
"der 0.6.1",
"generic-array",
"pkcs8",
"subtle",
@@ -5067,7 +5117,7 @@ checksum = "2e95efd0cefa32028cdb9766c96de71d96671072f9fb494dc9fb84c0ef93e52b"
dependencies = [
"httpdate",
"reqwest",
"rustls",
"rustls 0.21.9",
"sentry-backtrace",
"sentry-contexts",
"sentry-core",
@@ -5189,7 +5239,7 @@ checksum = "aafe972d60b0b9bee71a91b92fee2d4fb3c9d7e8f6b179aa99f27203d99a4816"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -5270,7 +5320,7 @@ dependencies = [
"darling",
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -5356,6 +5406,15 @@ dependencies = [
"rand_core 0.6.4",
]
[[package]]
name = "signature"
version = "2.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "77549399552de45a898a580c1b41d445bf730df867cc44e6c0233bbc4b8329de"
dependencies = [
"rand_core 0.6.4",
]
[[package]]
name = "simple_asn1"
version = "0.6.2"
@@ -5440,7 +5499,17 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "67cf02bbac7a337dc36e4f5a693db6c21e7863f45070f7064577eb4367a3212b"
dependencies = [
"base64ct",
"der",
"der 0.6.1",
]
[[package]]
name = "spki"
version = "0.7.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d91ed6c858b01f942cd56b37a94b3e0a1798290327d1236e4d9cf4eaca44d29d"
dependencies = [
"base64ct",
"der 0.7.8",
]
[[package]]
@@ -5526,9 +5595,9 @@ checksum = "81cdd64d312baedb58e21336b31bc043b77e01cc99033ce76ef539f78e965ebc"
[[package]]
name = "svg_fmt"
version = "0.4.1"
version = "0.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8fb1df15f412ee2e9dfc1c504260fa695c1c3f10fe9f4a6ee2d2184d7d6450e2"
checksum = "f83ba502a3265efb76efb89b0a2f7782ad6f2675015d4ce37e4b547dda42b499"
[[package]]
name = "syn"
@@ -5543,9 +5612,9 @@ dependencies = [
[[package]]
name = "syn"
version = "2.0.32"
version = "2.0.52"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "239814284fd6f1a4ffe4ca893952cdd93c224b6a1571c9a9eadd670295c0c9e2"
checksum = "b699d15b36d1f02c3e7c69f8ffef53de37aefae075d8488d4ba1a7788d574a07"
dependencies = [
"proc-macro2",
"quote",
@@ -5660,22 +5729,22 @@ dependencies = [
[[package]]
name = "thiserror"
version = "1.0.47"
version = "1.0.57"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "97a802ec30afc17eee47b2855fc72e0c4cd62be9b4efe6591edde0ec5bd68d8f"
checksum = "1e45bcbe8ed29775f228095caf2cd67af7a4ccf756ebff23a306bf3e8b47b24b"
dependencies = [
"thiserror-impl",
]
[[package]]
name = "thiserror-impl"
version = "1.0.47"
version = "1.0.57"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6bb623b56e39ab7dcd4b1b98bb6c8f8d907ed255b18de254088016b27a8ee19b"
checksum = "a953cb265bef375dae3de6663da4d3804eee9682ea80d8e2542529b73c531c81"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -5794,20 +5863,6 @@ version = "0.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1f3ccbac311fea05f86f61904b462b55fb3df8837a366dfc601a0161d0532f20"
[[package]]
name = "tls-listener"
version = "0.7.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "81294c017957a1a69794f506723519255879e15a870507faf45dfed288b763dd"
dependencies = [
"futures-util",
"hyper",
"pin-project-lite",
"thiserror",
"tokio",
"tokio-rustls",
]
[[package]]
name = "tokio"
version = "1.36.0"
@@ -5860,7 +5915,7 @@ checksum = "5b8a1e28f2deaa14e508979454cb3a223b10b938b45af148bc0986de36f1923b"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -5898,16 +5953,17 @@ dependencies = [
[[package]]
name = "tokio-postgres-rustls"
version = "0.10.0"
version = "0.11.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "dd5831152cb0d3f79ef5523b357319ba154795d64c7078b2daa95a803b54057f"
checksum = "0ea13f22eda7127c827983bdaf0d7fff9df21c8817bab02815ac277a21143677"
dependencies = [
"futures",
"ring 0.16.20",
"rustls",
"ring 0.17.6",
"rustls 0.22.2",
"tokio",
"tokio-postgres",
"tokio-rustls",
"tokio-rustls 0.25.0",
"x509-certificate",
]
[[package]]
@@ -5916,7 +5972,18 @@ version = "0.24.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e0d409377ff5b1e3ca6437aa86c1eb7d40c134bfec254e44c830defa92669db5"
dependencies = [
"rustls",
"rustls 0.21.9",
"tokio",
]
[[package]]
name = "tokio-rustls"
version = "0.25.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "775e0c0f0adb3a2f22a00c4745d728b479985fc15ee7ca6a2608388c5569860f"
dependencies = [
"rustls 0.22.2",
"rustls-pki-types",
"tokio",
]
@@ -6031,9 +6098,9 @@ dependencies = [
"pin-project",
"prost",
"rustls-native-certs",
"rustls-pemfile",
"rustls-pemfile 1.0.2",
"tokio",
"tokio-rustls",
"tokio-rustls 0.24.0",
"tokio-stream",
"tower",
"tower-layer",
@@ -6129,7 +6196,7 @@ checksum = "0f57e3ca2a01450b1a921183a9c9cbfda207fd822cef4ccb00a65402cbba7a74"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -6345,7 +6412,7 @@ dependencies = [
"base64 0.21.1",
"log",
"once_cell",
"rustls",
"rustls 0.21.9",
"rustls-webpki 0.100.2",
"url",
"webpki-roots 0.23.1",
@@ -6587,7 +6654,7 @@ dependencies = [
"once_cell",
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
"wasm-bindgen-shared",
]
@@ -6621,7 +6688,7 @@ checksum = "e128beba882dd1eb6200e1dc92ae6c5dbaa4311aa7bb211ca035779e5efc39f8"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
"wasm-bindgen-backend",
"wasm-bindgen-shared",
]
@@ -6954,19 +7021,18 @@ dependencies = [
"regex-automata 0.4.3",
"regex-syntax 0.8.2",
"reqwest",
"ring 0.16.20",
"rustls",
"rustls 0.21.9",
"scopeguard",
"serde",
"serde_json",
"smallvec",
"subtle",
"syn 1.0.109",
"syn 2.0.32",
"syn 2.0.52",
"time",
"time-macros",
"tokio",
"tokio-rustls",
"tokio-rustls 0.24.0",
"tokio-util",
"toml_datetime",
"toml_edit",
@@ -6977,11 +7043,31 @@ dependencies = [
"tungstenite",
"url",
"uuid",
"zeroize",
"zstd",
"zstd-safe",
"zstd-sys",
]
[[package]]
name = "x509-certificate"
version = "0.23.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "66534846dec7a11d7c50a74b7cdb208b9a581cad890b7866430d438455847c85"
dependencies = [
"bcder",
"bytes",
"chrono",
"der 0.7.8",
"hex",
"pem",
"ring 0.17.6",
"signature 2.2.0",
"spki 0.7.3",
"thiserror",
"zeroize",
]
[[package]]
name = "x509-parser"
version = "0.15.0"
@@ -7040,7 +7126,7 @@ checksum = "b3c129550b3e6de3fd0ba67ba5c81818f9805e58b8d7fee80a3a59d2c9fc601a"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.32",
"syn 2.0.52",
]
[[package]]
@@ -7048,6 +7134,20 @@ name = "zeroize"
version = "1.6.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2a0956f1ba7c7909bfb66c2e9e4124ab6f6482560f6628b5aaeba39207c9aad9"
dependencies = [
"zeroize_derive",
]
[[package]]
name = "zeroize_derive"
version = "1.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ce36e65b0d2999d2aafac989fb249189a141aee1f53c612c1f37d72631959f69"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.52",
]
[[package]]
name = "zstd"

View File

@@ -129,8 +129,8 @@ reqwest-retry = "0.2.2"
routerify = "3"
rpds = "0.13"
rustc-hash = "1.1.0"
rustls = "0.21"
rustls-pemfile = "1"
rustls = "0.22"
rustls-pemfile = "2"
rustls-split = "0.3"
scopeguard = "1.1"
sysinfo = "0.29.2"
@@ -156,12 +156,11 @@ test-context = "0.1"
thiserror = "1.0"
tikv-jemallocator = "0.5"
tikv-jemalloc-ctl = "0.5"
tls-listener = { version = "0.7", features = ["rustls", "hyper-h1"] }
tokio = { version = "1.17", features = ["macros"] }
tokio-epoll-uring = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
tokio-io-timeout = "1.2.0"
tokio-postgres-rustls = "0.10.0"
tokio-rustls = "0.24"
tokio-postgres-rustls = "0.11.0"
tokio-rustls = "0.25"
tokio-stream = "0.1"
tokio-tar = "0.3"
tokio-util = { version = "0.7.10", features = ["io", "rt"] }
@@ -220,7 +219,7 @@ workspace_hack = { version = "0.1", path = "./workspace_hack/" }
## Build dependencies
criterion = "0.5.1"
rcgen = "0.11"
rcgen = "0.12"
rstest = "0.18"
camino-tempfile = "1.0.2"
tonic-build = "0.9"

View File

@@ -53,7 +53,7 @@ RUN set -e \
--bin pagectl \
--bin safekeeper \
--bin storage_broker \
--bin attachment_service \
--bin storage_controller \
--bin proxy \
--bin neon_local \
--locked --release \
@@ -81,7 +81,7 @@ COPY --from=build --chown=neon:neon /home/nonroot/target/release/pageserver
COPY --from=build --chown=neon:neon /home/nonroot/target/release/pagectl /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/safekeeper /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_broker /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/attachment_service /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_controller /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/proxy /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/neon_local /usr/local/bin

View File

@@ -3,3 +3,10 @@ disallowed-methods = [
# Allow this for now, to deny it later once we stop using Handle::block_on completely
# "tokio::runtime::Handle::block_on",
]
disallowed-macros = [
# use std::pin::pin
"futures::pin_mut",
# cannot disallow this, because clippy finds used from tokio macros
#"tokio::pin",
]

View File

@@ -17,6 +17,7 @@ use chrono::{DateTime, Utc};
use futures::future::join_all;
use futures::stream::FuturesUnordered;
use futures::StreamExt;
use nix::unistd::Pid;
use postgres::error::SqlState;
use postgres::{Client, NoTls};
use tracing::{debug, error, info, instrument, warn};
@@ -396,9 +397,9 @@ impl ComputeNode {
// Gets the basebackup in a retry loop
#[instrument(skip_all, fields(%lsn))]
pub fn get_basebackup(&self, compute_state: &ComputeState, lsn: Lsn) -> Result<()> {
let mut retry_period_ms = 500;
let mut retry_period_ms = 500.0;
let mut attempts = 0;
let max_attempts = 5;
let max_attempts = 10;
loop {
let result = self.try_get_basebackup(compute_state, lsn);
match result {
@@ -410,8 +411,8 @@ impl ComputeNode {
"Failed to get basebackup: {} (attempt {}/{})",
e, attempts, max_attempts
);
std::thread::sleep(std::time::Duration::from_millis(retry_period_ms));
retry_period_ms *= 2;
std::thread::sleep(std::time::Duration::from_millis(retry_period_ms as u64));
retry_period_ms *= 1.5;
}
Err(_) => {
return result;
@@ -722,8 +723,12 @@ impl ComputeNode {
// Stop it when it's ready
info!("waiting for postgres");
wait_for_postgres(&mut pg, Path::new(pgdata))?;
pg.kill()?;
info!("sent kill signal");
// SIGQUIT orders postgres to exit immediately. We don't want to SIGKILL
// it to avoid orphaned processes prowling around while datadir is
// wiped.
let pm_pid = Pid::from_raw(pg.id() as i32);
kill(pm_pid, Signal::SIGQUIT)?;
info!("sent SIGQUIT signal");
pg.wait()?;
info!("done prewarming");
@@ -764,6 +769,26 @@ impl ComputeNode {
Ok((pg, logs_handle))
}
/// Do post configuration of the already started Postgres. This function spawns a background thread to
/// configure the database after applying the compute spec. Currently, it upgrades the neon extension
/// version. In the future, it may upgrade all 3rd-party extensions.
#[instrument(skip_all)]
pub fn post_apply_config(&self) -> Result<()> {
let connstr = self.connstr.clone();
thread::spawn(move || {
let func = || {
let mut client = Client::connect(connstr.as_str(), NoTls)?;
handle_neon_extension_upgrade(&mut client)
.context("handle_neon_extension_upgrade")?;
Ok::<_, anyhow::Error>(())
};
if let Err(err) = func() {
error!("error while post_apply_config: {err:#}");
}
});
Ok(())
}
/// Do initial configuration of the already started Postgres.
#[instrument(skip_all)]
pub fn apply_config(&self, compute_state: &ComputeState) -> Result<()> {
@@ -998,18 +1023,21 @@ impl ComputeNode {
let pg_process = self.start_postgres(pspec.storage_auth_token.clone())?;
let config_time = Utc::now();
if pspec.spec.mode == ComputeMode::Primary && !pspec.spec.skip_pg_catalog_updates {
let pgdata_path = Path::new(&self.pgdata);
// temporarily reset max_cluster_size in config
// to avoid the possibility of hitting the limit, while we are applying config:
// creating new extensions, roles, etc...
config::compute_ctl_temp_override_create(pgdata_path, "neon.max_cluster_size=-1")?;
self.pg_reload_conf()?;
if pspec.spec.mode == ComputeMode::Primary {
if !pspec.spec.skip_pg_catalog_updates {
let pgdata_path = Path::new(&self.pgdata);
// temporarily reset max_cluster_size in config
// to avoid the possibility of hitting the limit, while we are applying config:
// creating new extensions, roles, etc...
config::compute_ctl_temp_override_create(pgdata_path, "neon.max_cluster_size=-1")?;
self.pg_reload_conf()?;
self.apply_config(&compute_state)?;
self.apply_config(&compute_state)?;
config::compute_ctl_temp_override_remove(pgdata_path)?;
self.pg_reload_conf()?;
config::compute_ctl_temp_override_remove(pgdata_path)?;
self.pg_reload_conf()?;
}
self.post_apply_config()?;
}
let startup_end_time = Utc::now();

View File

@@ -302,9 +302,9 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
RoleAction::Create => {
// This branch only runs when roles are created through the console, so it is
// safe to add more permissions here. BYPASSRLS and REPLICATION are inherited
// from neon_superuser.
// from neon_superuser. (NOTE: REPLICATION has been removed from here for now).
let mut query: String = format!(
"CREATE ROLE {} INHERIT CREATEROLE CREATEDB BYPASSRLS REPLICATION IN ROLE neon_superuser",
"CREATE ROLE {} INHERIT CREATEROLE CREATEDB BYPASSRLS IN ROLE neon_superuser",
name.pg_quote()
);
info!("running role create query: '{}'", &query);
@@ -744,7 +744,17 @@ pub fn handle_extension_neon(client: &mut Client) -> Result<()> {
// - extension was just installed
// - extension was already installed and is up to date
let query = "ALTER EXTENSION neon UPDATE";
info!("update neon extension schema with query: {}", query);
info!("update neon extension version with query: {}", query);
client.simple_query(query)?;
Ok(())
}
#[instrument(skip_all)]
pub fn handle_neon_extension_upgrade(client: &mut Client) -> Result<()> {
info!("handle neon extension upgrade");
let query = "ALTER EXTENSION neon UPDATE";
info!("update neon extension version with query: {}", query);
client.simple_query(query)?;
Ok(())
@@ -795,6 +805,18 @@ $$;"#,
"",
"",
// Add new migrations below.
r#"
DO $$
DECLARE
role_name TEXT;
BEGIN
FOR role_name IN SELECT rolname FROM pg_roles WHERE rolreplication IS TRUE
LOOP
RAISE NOTICE 'EXECUTING ALTER ROLE % NOREPLICATION', quote_ident(role_name);
EXECUTE 'ALTER ROLE ' || quote_ident(role_name) || ' NOREPLICATION';
END LOOP;
END
$$;"#,
];
let mut query = "CREATE SCHEMA IF NOT EXISTS neon_migration";

View File

@@ -4,6 +4,10 @@ version = "0.1.0"
edition.workspace = true
license.workspace = true
[[bin]]
name = "storage_controller"
path = "src/main.rs"
[features]
default = []
# Enables test-only APIs and behaviors

View File

@@ -3,7 +3,7 @@ use std::{collections::HashMap, time::Duration};
use control_plane::endpoint::{ComputeControlPlane, EndpointStatus};
use control_plane::local_env::LocalEnv;
use hyper::{Method, StatusCode};
use pageserver_api::shard::{ShardIndex, ShardNumber, TenantShardId};
use pageserver_api::shard::{ShardCount, ShardNumber, ShardStripeSize, TenantShardId};
use postgres_connection::parse_host_port;
use serde::{Deserialize, Serialize};
use tokio_util::sync::CancellationToken;
@@ -19,8 +19,66 @@ const SLOWDOWN_DELAY: Duration = Duration::from_secs(5);
pub(crate) const API_CONCURRENCY: usize = 32;
pub(super) struct ComputeHookTenant {
shards: Vec<(ShardIndex, NodeId)>,
struct ShardedComputeHookTenant {
stripe_size: ShardStripeSize,
shard_count: ShardCount,
shards: Vec<(ShardNumber, NodeId)>,
}
enum ComputeHookTenant {
Unsharded(NodeId),
Sharded(ShardedComputeHookTenant),
}
impl ComputeHookTenant {
/// Construct with at least one shard's information
fn new(tenant_shard_id: TenantShardId, stripe_size: ShardStripeSize, node_id: NodeId) -> Self {
if tenant_shard_id.shard_count.count() > 1 {
Self::Sharded(ShardedComputeHookTenant {
shards: vec![(tenant_shard_id.shard_number, node_id)],
stripe_size,
shard_count: tenant_shard_id.shard_count,
})
} else {
Self::Unsharded(node_id)
}
}
/// Set one shard's location. If stripe size or shard count have changed, Self is reset
/// and drops existing content.
fn update(
&mut self,
tenant_shard_id: TenantShardId,
stripe_size: ShardStripeSize,
node_id: NodeId,
) {
match self {
Self::Unsharded(existing_node_id) if tenant_shard_id.shard_count.count() == 1 => {
*existing_node_id = node_id
}
Self::Sharded(sharded_tenant)
if sharded_tenant.stripe_size == stripe_size
&& sharded_tenant.shard_count == tenant_shard_id.shard_count =>
{
if let Some(existing) = sharded_tenant
.shards
.iter()
.position(|s| s.0 == tenant_shard_id.shard_number)
{
sharded_tenant.shards.get_mut(existing).unwrap().1 = node_id;
} else {
sharded_tenant
.shards
.push((tenant_shard_id.shard_number, node_id));
sharded_tenant.shards.sort_by_key(|s| s.0)
}
}
_ => {
// Shard count changed: reset struct.
*self = Self::new(tenant_shard_id, stripe_size, node_id);
}
}
}
}
#[derive(Serialize, Deserialize, Debug)]
@@ -33,6 +91,7 @@ struct ComputeHookNotifyRequestShard {
#[derive(Serialize, Deserialize, Debug)]
struct ComputeHookNotifyRequest {
tenant_id: TenantId,
stripe_size: Option<ShardStripeSize>,
shards: Vec<ComputeHookNotifyRequestShard>,
}
@@ -63,42 +122,43 @@ pub(crate) enum NotifyError {
}
impl ComputeHookTenant {
async fn maybe_reconfigure(&mut self, tenant_id: TenantId) -> Option<ComputeHookNotifyRequest> {
// Find the highest shard count and drop any shards that aren't
// for that shard count.
let shard_count = self.shards.iter().map(|(k, _v)| k.shard_count).max();
let Some(shard_count) = shard_count else {
// No shards, nothing to do.
tracing::info!("ComputeHookTenant::maybe_reconfigure: no shards");
return None;
};
self.shards.retain(|(k, _v)| k.shard_count == shard_count);
self.shards
.sort_by_key(|(shard, _node_id)| shard.shard_number);
if self.shards.len() == shard_count.count() as usize || shard_count.is_unsharded() {
// We have pageservers for all the shards: emit a configuration update
return Some(ComputeHookNotifyRequest {
fn maybe_reconfigure(&self, tenant_id: TenantId) -> Option<ComputeHookNotifyRequest> {
match self {
Self::Unsharded(node_id) => Some(ComputeHookNotifyRequest {
tenant_id,
shards: self
.shards
.iter()
.map(|(shard, node_id)| ComputeHookNotifyRequestShard {
shard_number: shard.shard_number,
node_id: *node_id,
})
.collect(),
});
} else {
tracing::info!(
"ComputeHookTenant::maybe_reconfigure: not enough shards ({}/{})",
self.shards.len(),
shard_count.count()
);
}
shards: vec![ComputeHookNotifyRequestShard {
shard_number: ShardNumber(0),
node_id: *node_id,
}],
stripe_size: None,
}),
Self::Sharded(sharded_tenant)
if sharded_tenant.shards.len() == sharded_tenant.shard_count.count() as usize =>
{
Some(ComputeHookNotifyRequest {
tenant_id,
shards: sharded_tenant
.shards
.iter()
.map(|(shard_number, node_id)| ComputeHookNotifyRequestShard {
shard_number: *shard_number,
node_id: *node_id,
})
.collect(),
stripe_size: Some(sharded_tenant.stripe_size),
})
}
Self::Sharded(sharded_tenant) => {
// Sharded tenant doesn't yet have information for all its shards
None
tracing::info!(
"ComputeHookTenant::maybe_reconfigure: not enough shards ({}/{})",
sharded_tenant.shards.len(),
sharded_tenant.shard_count.count()
);
None
}
}
}
}
@@ -139,7 +199,11 @@ impl ComputeHook {
};
let cplane =
ComputeControlPlane::load(env.clone()).expect("Error loading compute control plane");
let ComputeHookNotifyRequest { tenant_id, shards } = reconfigure_request;
let ComputeHookNotifyRequest {
tenant_id,
shards,
stripe_size,
} = reconfigure_request;
let compute_pageservers = shards
.into_iter()
@@ -156,7 +220,9 @@ impl ComputeHook {
for (endpoint_name, endpoint) in &cplane.endpoints {
if endpoint.tenant_id == tenant_id && endpoint.status() == EndpointStatus::Running {
tracing::info!("Reconfiguring endpoint {}", endpoint_name,);
endpoint.reconfigure(compute_pageservers.clone()).await?;
endpoint
.reconfigure(compute_pageservers.clone(), stripe_size)
.await?;
}
}
@@ -271,30 +337,26 @@ impl ComputeHook {
&self,
tenant_shard_id: TenantShardId,
node_id: NodeId,
stripe_size: ShardStripeSize,
cancel: &CancellationToken,
) -> Result<(), NotifyError> {
let mut locked = self.state.lock().await;
let entry = locked
.entry(tenant_shard_id.tenant_id)
.or_insert_with(|| ComputeHookTenant { shards: Vec::new() });
let shard_index = ShardIndex {
shard_count: tenant_shard_id.shard_count,
shard_number: tenant_shard_id.shard_number,
use std::collections::hash_map::Entry;
let tenant = match locked.entry(tenant_shard_id.tenant_id) {
Entry::Vacant(e) => e.insert(ComputeHookTenant::new(
tenant_shard_id,
stripe_size,
node_id,
)),
Entry::Occupied(e) => {
let tenant = e.into_mut();
tenant.update(tenant_shard_id, stripe_size, node_id);
tenant
}
};
let mut set = false;
for (existing_shard, existing_node) in &mut entry.shards {
if *existing_shard == shard_index {
*existing_node = node_id;
set = true;
}
}
if !set {
entry.shards.push((shard_index, node_id));
}
let reconfigure_request = entry.maybe_reconfigure(tenant_shard_id.tenant_id).await;
let reconfigure_request = tenant.maybe_reconfigure(tenant_shard_id.tenant_id);
let Some(reconfigure_request) = reconfigure_request else {
// The tenant doesn't yet have pageservers for all its shards: we won't notify anything
// until it does.
@@ -316,3 +378,85 @@ impl ComputeHook {
}
}
}
#[cfg(test)]
pub(crate) mod tests {
use pageserver_api::shard::{ShardCount, ShardNumber};
use utils::id::TenantId;
use super::*;
#[test]
fn tenant_updates() -> anyhow::Result<()> {
let tenant_id = TenantId::generate();
let mut tenant_state = ComputeHookTenant::new(
TenantShardId {
tenant_id,
shard_count: ShardCount::new(0),
shard_number: ShardNumber(0),
},
ShardStripeSize(12345),
NodeId(1),
);
// An unsharded tenant is always ready to emit a notification
assert!(tenant_state.maybe_reconfigure(tenant_id).is_some());
assert_eq!(
tenant_state
.maybe_reconfigure(tenant_id)
.unwrap()
.shards
.len(),
1
);
assert!(tenant_state
.maybe_reconfigure(tenant_id)
.unwrap()
.stripe_size
.is_none());
// Writing the first shard of a multi-sharded situation (i.e. in a split)
// resets the tenant state and puts it in an non-notifying state (need to
// see all shards)
tenant_state.update(
TenantShardId {
tenant_id,
shard_count: ShardCount::new(2),
shard_number: ShardNumber(1),
},
ShardStripeSize(32768),
NodeId(1),
);
assert!(tenant_state.maybe_reconfigure(tenant_id).is_none());
// Writing the second shard makes it ready to notify
tenant_state.update(
TenantShardId {
tenant_id,
shard_count: ShardCount::new(2),
shard_number: ShardNumber(0),
},
ShardStripeSize(32768),
NodeId(1),
);
assert!(tenant_state.maybe_reconfigure(tenant_id).is_some());
assert_eq!(
tenant_state
.maybe_reconfigure(tenant_id)
.unwrap()
.shards
.len(),
2
);
assert_eq!(
tenant_state
.maybe_reconfigure(tenant_id)
.unwrap()
.stripe_size,
Some(ShardStripeSize(32768))
);
Ok(())
}
}

View File

@@ -1,6 +1,5 @@
use crate::reconciler::ReconcileError;
use crate::service::{Service, STARTUP_RECONCILE_TIMEOUT};
use crate::PlacementPolicy;
use hyper::{Body, Request, Response};
use hyper::{StatusCode, Uri};
use pageserver_api::models::{
@@ -119,13 +118,9 @@ async fn handle_tenant_create(
let create_req = json_request::<TenantCreateRequest>(&mut req).await?;
// TODO: enable specifying this. Using Single as a default helps legacy tests to work (they
// have no expectation of HA).
let placement_policy = PlacementPolicy::Single;
json_response(
StatusCode::CREATED,
service.tenant_create(create_req, placement_policy).await?,
service.tenant_create(create_req).await?,
)
}

View File

@@ -1,4 +1,4 @@
use serde::{Deserialize, Serialize};
use serde::Serialize;
use utils::seqwait::MonotonicCounter;
mod auth;
@@ -13,23 +13,6 @@ mod schema;
pub mod service;
mod tenant_state;
#[derive(Clone, Serialize, Deserialize, Debug, PartialEq, Eq)]
enum PlacementPolicy {
/// Cheapest way to attach a tenant: just one pageserver, no secondary
Single,
/// Production-ready way to attach a tenant: one attached pageserver and
/// some number of secondaries.
Double(usize),
/// Create one secondary mode locations. This is useful when onboarding
/// a tenant, or for an idle tenant that we might want to bring online quickly.
Secondary,
/// Do not attach to any pageservers. This is appropriate for tenants that
/// have been idle for a long time, where we do not mind some delay in making
/// them available in future.
Detached,
}
#[derive(Ord, PartialOrd, Eq, PartialEq, Copy, Clone, Serialize)]
struct Sequence(u64);
@@ -66,9 +49,3 @@ impl Sequence {
Sequence(self.0 + 1)
}
}
impl Default for PlacementPolicy {
fn default() -> Self {
PlacementPolicy::Double(1)
}
}

View File

@@ -1,6 +1,16 @@
use pageserver_api::controller_api::{NodeAvailability, NodeSchedulingPolicy};
use std::{str::FromStr, time::Duration};
use hyper::StatusCode;
use pageserver_api::{
controller_api::{
NodeAvailability, NodeRegisterRequest, NodeSchedulingPolicy, TenantLocateResponseShard,
},
shard::TenantShardId,
};
use pageserver_client::mgmt_api;
use serde::Serialize;
use utils::id::NodeId;
use tokio_util::sync::CancellationToken;
use utils::{backoff, id::NodeId};
use crate::persistence::NodePersistence;
@@ -12,16 +22,29 @@ use crate::persistence::NodePersistence;
/// implementation of serialization on this type is only for debug dumps.
#[derive(Clone, Serialize)]
pub(crate) struct Node {
pub(crate) id: NodeId,
id: NodeId,
pub(crate) availability: NodeAvailability,
pub(crate) scheduling: NodeSchedulingPolicy,
availability: NodeAvailability,
scheduling: NodeSchedulingPolicy,
pub(crate) listen_http_addr: String,
pub(crate) listen_http_port: u16,
listen_http_addr: String,
listen_http_port: u16,
pub(crate) listen_pg_addr: String,
pub(crate) listen_pg_port: u16,
listen_pg_addr: String,
listen_pg_port: u16,
// This cancellation token means "stop any RPCs in flight to this node, and don't start
// any more". It is not related to process shutdown.
#[serde(skip)]
cancel: CancellationToken,
}
/// When updating [`Node::availability`] we use this type to indicate to the caller
/// whether/how they changed it.
pub(crate) enum AvailabilityTransition {
ToActive,
ToOffline,
Unchanged,
}
impl Node {
@@ -29,6 +52,71 @@ impl Node {
format!("http://{}:{}", self.listen_http_addr, self.listen_http_port)
}
pub(crate) fn get_id(&self) -> NodeId {
self.id
}
pub(crate) fn set_scheduling(&mut self, scheduling: NodeSchedulingPolicy) {
self.scheduling = scheduling
}
/// Does this registration request match `self`? This is used when deciding whether a registration
/// request should be allowed to update an existing record with the same node ID.
pub(crate) fn registration_match(&self, register_req: &NodeRegisterRequest) -> bool {
self.id == register_req.node_id
&& self.listen_http_addr == register_req.listen_http_addr
&& self.listen_http_port == register_req.listen_http_port
&& self.listen_pg_addr == register_req.listen_pg_addr
&& self.listen_pg_port == register_req.listen_pg_port
}
/// For a shard located on this node, populate a response object
/// with this node's address information.
pub(crate) fn shard_location(&self, shard_id: TenantShardId) -> TenantLocateResponseShard {
TenantLocateResponseShard {
shard_id,
node_id: self.id,
listen_http_addr: self.listen_http_addr.clone(),
listen_http_port: self.listen_http_port,
listen_pg_addr: self.listen_pg_addr.clone(),
listen_pg_port: self.listen_pg_port,
}
}
pub(crate) fn set_availability(
&mut self,
availability: NodeAvailability,
) -> AvailabilityTransition {
use NodeAvailability::*;
let transition = match (self.availability, availability) {
(Offline, Active) => {
// Give the node a new cancellation token, effectively resetting it to un-cancelled. Any
// users of previously-cloned copies of the node will still see the old cancellation
// state. For example, Reconcilers in flight will have to complete and be spawned
// again to realize that the node has become available.
self.cancel = CancellationToken::new();
AvailabilityTransition::ToActive
}
(Active, Offline) => {
// Fire the node's cancellation token to cancel any in-flight API requests to it
self.cancel.cancel();
AvailabilityTransition::ToOffline
}
_ => AvailabilityTransition::Unchanged,
};
self.availability = availability;
transition
}
/// Whether we may send API requests to this node.
pub(crate) fn is_available(&self) -> bool {
// When we clone a node, [`Self::availability`] is a snapshot, but [`Self::cancel`] holds
// a reference to the original Node's cancellation status. Checking both of these results
// in a "pessimistic" check where we will consider a Node instance unavailable if it was unavailable
// when we cloned it, or if the original Node instance's cancellation token was fired.
matches!(self.availability, NodeAvailability::Active) && !self.cancel.is_cancelled()
}
/// Is this node elegible to have work scheduled onto it?
pub(crate) fn may_schedule(&self) -> bool {
match self.availability {
@@ -44,6 +132,26 @@ impl Node {
}
}
pub(crate) fn new(
id: NodeId,
listen_http_addr: String,
listen_http_port: u16,
listen_pg_addr: String,
listen_pg_port: u16,
) -> Self {
Self {
id,
listen_http_addr,
listen_http_port,
listen_pg_addr,
listen_pg_port,
scheduling: NodeSchedulingPolicy::Filling,
// TODO: we shouldn't really call this Active until we've heartbeated it.
availability: NodeAvailability::Active,
cancel: CancellationToken::new(),
}
}
pub(crate) fn to_persistent(&self) -> NodePersistence {
NodePersistence {
node_id: self.id.0 as i64,
@@ -54,4 +162,96 @@ impl Node {
listen_pg_port: self.listen_pg_port as i32,
}
}
pub(crate) fn from_persistent(np: NodePersistence) -> Self {
Self {
id: NodeId(np.node_id as u64),
// At startup we consider a node offline until proven otherwise.
availability: NodeAvailability::Offline,
scheduling: NodeSchedulingPolicy::from_str(&np.scheduling_policy)
.expect("Bad scheduling policy in DB"),
listen_http_addr: np.listen_http_addr,
listen_http_port: np.listen_http_port as u16,
listen_pg_addr: np.listen_pg_addr,
listen_pg_port: np.listen_pg_port as u16,
cancel: CancellationToken::new(),
}
}
/// Wrapper for issuing requests to pageserver management API: takes care of generic
/// retry/backoff for retryable HTTP status codes.
///
/// This will return None to indicate cancellation. Cancellation may happen from
/// the cancellation token passed in, or from Self's cancellation token (i.e. node
/// going offline).
pub(crate) async fn with_client_retries<T, O, F>(
&self,
mut op: O,
jwt: &Option<String>,
warn_threshold: u32,
max_retries: u32,
timeout: Duration,
cancel: &CancellationToken,
) -> Option<mgmt_api::Result<T>>
where
O: FnMut(mgmt_api::Client) -> F,
F: std::future::Future<Output = mgmt_api::Result<T>>,
{
fn is_fatal(e: &mgmt_api::Error) -> bool {
use mgmt_api::Error::*;
match e {
ReceiveBody(_) | ReceiveErrorBody(_) => false,
ApiError(StatusCode::SERVICE_UNAVAILABLE, _)
| ApiError(StatusCode::GATEWAY_TIMEOUT, _)
| ApiError(StatusCode::REQUEST_TIMEOUT, _) => false,
ApiError(_, _) => true,
Cancelled => true,
}
}
backoff::retry(
|| {
let http_client = reqwest::ClientBuilder::new()
.timeout(timeout)
.build()
.expect("Failed to construct HTTP client");
let client =
mgmt_api::Client::from_client(http_client, self.base_url(), jwt.as_deref());
let node_cancel_fut = self.cancel.cancelled();
let op_fut = op(client);
async {
tokio::select! {
r = op_fut=> {r},
_ = node_cancel_fut => {
Err(mgmt_api::Error::Cancelled)
}}
}
},
is_fatal,
warn_threshold,
max_retries,
&format!(
"Call to node {} ({}:{}) management API",
self.id, self.listen_http_addr, self.listen_http_port
),
cancel,
)
.await
}
}
impl std::fmt::Display for Node {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{} ({})", self.id, self.listen_http_addr)
}
}
impl std::fmt::Debug for Node {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{} ({})", self.id, self.listen_http_addr)
}
}

View File

@@ -7,11 +7,9 @@ use self::split_state::SplitState;
use camino::Utf8Path;
use camino::Utf8PathBuf;
use diesel::pg::PgConnection;
use diesel::{
Connection, ExpressionMethods, Insertable, QueryDsl, QueryResult, Queryable, RunQueryDsl,
Selectable, SelectableHelper,
};
use pageserver_api::controller_api::NodeSchedulingPolicy;
use diesel::prelude::*;
use diesel::Connection;
use pageserver_api::controller_api::{NodeSchedulingPolicy, PlacementPolicy};
use pageserver_api::models::TenantConfig;
use pageserver_api::shard::{ShardCount, ShardNumber, TenantShardId};
use serde::{Deserialize, Serialize};
@@ -19,7 +17,6 @@ use utils::generation::Generation;
use utils::id::{NodeId, TenantId};
use crate::node::Node;
use crate::PlacementPolicy;
/// ## What do we store?
///
@@ -210,7 +207,7 @@ impl Persistence {
tenant.tenant_id = tenant_id.to_string();
tenant.config = serde_json::to_string(&TenantConfig::default())
.map_err(|e| DatabaseError::Logical(format!("Serialization error: {e}")))?;
tenant.placement_policy = serde_json::to_string(&PlacementPolicy::default())
tenant.placement_policy = serde_json::to_string(&PlacementPolicy::Single)
.map_err(|e| DatabaseError::Logical(format!("Serialization error: {e}")))?;
}
}

View File

@@ -1,6 +1,5 @@
use crate::persistence::Persistence;
use crate::service;
use pageserver_api::controller_api::NodeAvailability;
use pageserver_api::models::{
LocationConfig, LocationConfigMode, LocationConfigSecondary, TenantConfig,
};
@@ -28,15 +27,16 @@ pub(super) struct Reconciler {
pub(crate) shard: ShardIdentity,
pub(crate) generation: Option<Generation>,
pub(crate) intent: TargetState,
/// Nodes not referenced by [`Self::intent`], from which we should try
/// to detach this tenant shard.
pub(crate) detach: Vec<Node>,
pub(crate) config: TenantConfig,
pub(crate) observed: ObservedState,
pub(crate) service_config: service::Config,
/// A snapshot of the pageservers as they were when we were asked
/// to reconcile.
pub(crate) pageservers: Arc<HashMap<NodeId, Node>>,
/// A hook to notify the running postgres instances when we change the location
/// of a tenant. Use this via [`Self::compute_notify`] to update our failure flag
/// and guarantee eventual retries.
@@ -67,29 +67,37 @@ pub(super) struct Reconciler {
/// and the TargetState is just the instruction for a particular Reconciler run.
#[derive(Debug)]
pub(crate) struct TargetState {
pub(crate) attached: Option<NodeId>,
pub(crate) secondary: Vec<NodeId>,
pub(crate) attached: Option<Node>,
pub(crate) secondary: Vec<Node>,
}
impl TargetState {
pub(crate) fn from_intent(intent: &IntentState) -> Self {
pub(crate) fn from_intent(nodes: &HashMap<NodeId, Node>, intent: &IntentState) -> Self {
Self {
attached: *intent.get_attached(),
secondary: intent.get_secondary().clone(),
attached: intent.get_attached().map(|n| {
nodes
.get(&n)
.expect("Intent attached referenced non-existent node")
.clone()
}),
secondary: intent
.get_secondary()
.iter()
.map(|n| {
nodes
.get(n)
.expect("Intent secondary referenced non-existent node")
.clone()
})
.collect(),
}
}
fn all_pageservers(&self) -> Vec<NodeId> {
let mut result = self.secondary.clone();
if let Some(node_id) = &self.attached {
result.push(*node_id);
}
result
}
}
#[derive(thiserror::Error, Debug)]
pub(crate) enum ReconcileError {
#[error(transparent)]
Remote(#[from] mgmt_api::Error),
#[error(transparent)]
Notify(#[from] NotifyError),
#[error("Cancelled")]
@@ -101,44 +109,83 @@ pub(crate) enum ReconcileError {
impl Reconciler {
async fn location_config(
&mut self,
node_id: NodeId,
node: &Node,
config: LocationConfig,
flush_ms: Option<Duration>,
) -> anyhow::Result<()> {
let node = self
.pageservers
.get(&node_id)
.expect("Pageserver may not be removed while referenced");
lazy: bool,
) -> Result<(), ReconcileError> {
self.observed
.locations
.insert(node.get_id(), ObservedStateLocation { conf: None });
// TODO: amend locations that use long-polling: they will hit this timeout.
let timeout = Duration::from_secs(25);
tracing::info!("location_config({node}) calling: {:?}", config);
let tenant_shard_id = self.tenant_shard_id;
let config_ref = &config;
match node
.with_client_retries(
|client| async move {
let config = config_ref.clone();
client
.location_config(tenant_shard_id, config.clone(), flush_ms, lazy)
.await
},
&self.service_config.jwt_token,
1,
3,
timeout,
&self.cancel,
)
.await
{
Some(Ok(_)) => {}
Some(Err(e)) => return Err(e.into()),
None => return Err(ReconcileError::Cancel),
};
tracing::info!("location_config({node}) complete: {:?}", config);
self.observed
.locations
.insert(node.id, ObservedStateLocation { conf: None });
tracing::info!("location_config({}) calling: {:?}", node_id, config);
let client =
mgmt_api::Client::new(node.base_url(), self.service_config.jwt_token.as_deref());
client
.location_config(self.tenant_shard_id, config.clone(), flush_ms)
.await?;
tracing::info!("location_config({}) complete: {:?}", node_id, config);
self.observed
.locations
.insert(node.id, ObservedStateLocation { conf: Some(config) });
.insert(node.get_id(), ObservedStateLocation { conf: Some(config) });
Ok(())
}
fn get_node(&self, node_id: &NodeId) -> Option<&Node> {
if let Some(node) = self.intent.attached.as_ref() {
if node.get_id() == *node_id {
return Some(node);
}
}
if let Some(node) = self
.intent
.secondary
.iter()
.find(|n| n.get_id() == *node_id)
{
return Some(node);
}
if let Some(node) = self.detach.iter().find(|n| n.get_id() == *node_id) {
return Some(node);
}
None
}
async fn maybe_live_migrate(&mut self) -> Result<(), ReconcileError> {
let destination = if let Some(node_id) = self.intent.attached {
match self.observed.locations.get(&node_id) {
let destination = if let Some(node) = &self.intent.attached {
match self.observed.locations.get(&node.get_id()) {
Some(conf) => {
// We will do a live migration only if the intended destination is not
// currently in an attached state.
match &conf.conf {
Some(conf) if conf.mode == LocationConfigMode::Secondary => {
// Fall through to do a live migration
node_id
node
}
None | Some(_) => {
// Attached or uncertain: don't do a live migration, proceed
@@ -151,7 +198,7 @@ impl Reconciler {
None => {
// Our destination is not attached: maybe live migrate if some other
// node is currently attached. Fall through.
node_id
node
}
}
} else {
@@ -164,15 +211,13 @@ impl Reconciler {
for (node_id, state) in &self.observed.locations {
if let Some(observed_conf) = &state.conf {
if observed_conf.mode == LocationConfigMode::AttachedSingle {
let node = self
.pageservers
.get(node_id)
.expect("Nodes may not be removed while referenced");
// We will only attempt live migration if the origin is not offline: this
// avoids trying to do it while reconciling after responding to an HA failover.
if !matches!(node.availability, NodeAvailability::Offline) {
origin = Some(*node_id);
break;
if let Some(node) = self.get_node(node_id) {
if node.is_available() {
origin = Some(node.clone());
break;
}
}
}
}
@@ -185,7 +230,7 @@ impl Reconciler {
// We have an origin and a destination: proceed to do the live migration
tracing::info!("Live migrating {}->{}", origin, destination);
self.live_migrate(origin, destination).await?;
self.live_migrate(origin, destination.clone()).await?;
Ok(())
}
@@ -193,13 +238,8 @@ impl Reconciler {
async fn get_lsns(
&self,
tenant_shard_id: TenantShardId,
node_id: &NodeId,
node: &Node,
) -> anyhow::Result<HashMap<TimelineId, Lsn>> {
let node = self
.pageservers
.get(node_id)
.expect("Pageserver may not be removed while referenced");
let client =
mgmt_api::Client::new(node.base_url(), self.service_config.jwt_token.as_deref());
@@ -210,19 +250,27 @@ impl Reconciler {
.collect())
}
async fn secondary_download(&self, tenant_shard_id: TenantShardId, node_id: &NodeId) {
let node = self
.pageservers
.get(node_id)
.expect("Pageserver may not be removed while referenced");
let client =
mgmt_api::Client::new(node.base_url(), self.service_config.jwt_token.as_deref());
match client.tenant_secondary_download(tenant_shard_id).await {
Ok(()) => {}
Err(_) => {
tracing::info!(" (skipping, destination wasn't in secondary mode)")
async fn secondary_download(
&self,
tenant_shard_id: TenantShardId,
node: &Node,
) -> Result<(), ReconcileError> {
match node
.with_client_retries(
|client| async move { client.tenant_secondary_download(tenant_shard_id).await },
&self.service_config.jwt_token,
1,
1,
Duration::from_secs(60),
&self.cancel,
)
.await
{
None => Err(ReconcileError::Cancel),
Some(Ok(_)) => Ok(()),
Some(Err(e)) => {
tracing::info!(" (skipping destination download: {})", e);
Ok(())
}
}
}
@@ -230,17 +278,14 @@ impl Reconciler {
async fn await_lsn(
&self,
tenant_shard_id: TenantShardId,
pageserver_id: &NodeId,
node: &Node,
baseline: HashMap<TimelineId, Lsn>,
) -> anyhow::Result<()> {
loop {
let latest = match self.get_lsns(tenant_shard_id, pageserver_id).await {
let latest = match self.get_lsns(tenant_shard_id, node).await {
Ok(l) => l,
Err(e) => {
println!(
"🕑 Can't get LSNs on pageserver {} yet, waiting ({e})",
pageserver_id
);
tracing::info!("🕑 Can't get LSNs on node {node} yet, waiting ({e})",);
std::thread::sleep(Duration::from_millis(500));
continue;
}
@@ -250,7 +295,7 @@ impl Reconciler {
for (timeline_id, baseline_lsn) in &baseline {
match latest.get(timeline_id) {
Some(latest_lsn) => {
println!("🕑 LSN origin {baseline_lsn} vs destination {latest_lsn}");
tracing::info!("🕑 LSN origin {baseline_lsn} vs destination {latest_lsn}");
if latest_lsn < baseline_lsn {
any_behind = true;
}
@@ -265,7 +310,7 @@ impl Reconciler {
}
if !any_behind {
println!("✅ LSN caught up. Proceeding...");
tracing::info!("✅ LSN caught up. Proceeding...");
break;
} else {
std::thread::sleep(Duration::from_millis(500));
@@ -277,11 +322,11 @@ impl Reconciler {
pub async fn live_migrate(
&mut self,
origin_ps_id: NodeId,
dest_ps_id: NodeId,
) -> anyhow::Result<()> {
origin_ps: Node,
dest_ps: Node,
) -> Result<(), ReconcileError> {
// `maybe_live_migrate` is responsibble for sanity of inputs
assert!(origin_ps_id != dest_ps_id);
assert!(origin_ps.get_id() != dest_ps.get_id());
fn build_location_config(
shard: &ShardIdentity,
@@ -301,10 +346,7 @@ impl Reconciler {
}
}
tracing::info!(
"🔁 Switching origin pageserver {} to stale mode",
origin_ps_id
);
tracing::info!("🔁 Switching origin node {origin_ps} to stale mode",);
// FIXME: it is incorrect to use self.generation here, we should use the generation
// from the ObservedState of the origin pageserver (it might be older than self.generation)
@@ -315,21 +357,18 @@ impl Reconciler {
self.generation,
None,
);
self.location_config(origin_ps_id, stale_conf, Some(Duration::from_secs(10)))
self.location_config(&origin_ps, stale_conf, Some(Duration::from_secs(10)), false)
.await?;
let baseline_lsns = Some(self.get_lsns(self.tenant_shard_id, &origin_ps_id).await?);
let baseline_lsns = Some(self.get_lsns(self.tenant_shard_id, &origin_ps).await?);
// If we are migrating to a destination that has a secondary location, warm it up first
if let Some(destination_conf) = self.observed.locations.get(&dest_ps_id) {
if let Some(destination_conf) = self.observed.locations.get(&dest_ps.get_id()) {
if let Some(destination_conf) = &destination_conf.conf {
if destination_conf.mode == LocationConfigMode::Secondary {
tracing::info!(
"🔁 Downloading latest layers to destination pageserver {}",
dest_ps_id,
);
self.secondary_download(self.tenant_shard_id, &dest_ps_id)
.await;
tracing::info!("🔁 Downloading latest layers to destination node {dest_ps}",);
self.secondary_download(self.tenant_shard_id, &dest_ps)
.await?;
}
}
}
@@ -337,7 +376,7 @@ impl Reconciler {
// Increment generation before attaching to new pageserver
self.generation = Some(
self.persistence
.increment_generation(self.tenant_shard_id, dest_ps_id)
.increment_generation(self.tenant_shard_id, dest_ps.get_id())
.await?,
);
@@ -349,22 +388,23 @@ impl Reconciler {
None,
);
tracing::info!("🔁 Attaching to pageserver {}", dest_ps_id);
self.location_config(dest_ps_id, dest_conf, None).await?;
tracing::info!("🔁 Attaching to pageserver {dest_ps}");
self.location_config(&dest_ps, dest_conf, None, false)
.await?;
if let Some(baseline) = baseline_lsns {
tracing::info!("🕑 Waiting for LSN to catch up...");
self.await_lsn(self.tenant_shard_id, &dest_ps_id, baseline)
self.await_lsn(self.tenant_shard_id, &dest_ps, baseline)
.await?;
}
tracing::info!("🔁 Notifying compute to use pageserver {}", dest_ps_id);
tracing::info!("🔁 Notifying compute to use pageserver {dest_ps}");
// During a live migration it is unhelpful to proceed if we couldn't notify compute: if we detach
// the origin without notifying compute, we will render the tenant unavailable.
while let Err(e) = self.compute_notify().await {
match e {
NotifyError::Fatal(_) => return Err(anyhow::anyhow!(e)),
NotifyError::Fatal(_) => return Err(ReconcileError::Notify(e)),
_ => {
tracing::warn!(
"Live migration blocked by compute notification error, retrying: {e}"
@@ -382,22 +422,19 @@ impl Reconciler {
None,
Some(LocationConfigSecondary { warm: true }),
);
self.location_config(origin_ps_id, origin_secondary_conf.clone(), None)
self.location_config(&origin_ps, origin_secondary_conf.clone(), None, false)
.await?;
// TODO: we should also be setting the ObservedState on earlier API calls, in case we fail
// partway through. In fact, all location conf API calls should be in a wrapper that sets
// the observed state to None, then runs, then sets it to what we wrote.
self.observed.locations.insert(
origin_ps_id,
origin_ps.get_id(),
ObservedStateLocation {
conf: Some(origin_secondary_conf),
},
);
println!(
"🔁 Switching to AttachedSingle mode on pageserver {}",
dest_ps_id
);
tracing::info!("🔁 Switching to AttachedSingle mode on node {dest_ps}",);
let dest_final_conf = build_location_config(
&self.shard,
&self.config,
@@ -405,16 +442,61 @@ impl Reconciler {
self.generation,
None,
);
self.location_config(dest_ps_id, dest_final_conf.clone(), None)
self.location_config(&dest_ps, dest_final_conf.clone(), None, false)
.await?;
self.observed.locations.insert(
dest_ps_id,
dest_ps.get_id(),
ObservedStateLocation {
conf: Some(dest_final_conf),
},
);
println!("✅ Migration complete");
tracing::info!("✅ Migration complete");
Ok(())
}
async fn maybe_refresh_observed(&mut self) -> Result<(), ReconcileError> {
// If the attached node has uncertain state, read it from the pageserver before proceeding: this
// is important to avoid spurious generation increments.
//
// We don't need to do this for secondary/detach locations because it's harmless to just PUT their
// location conf, whereas for attached locations it can interrupt clients if we spuriously destroy/recreate
// the `Timeline` object in the pageserver.
let Some(attached_node) = self.intent.attached.as_ref() else {
// Nothing to do
return Ok(());
};
if matches!(
self.observed.locations.get(&attached_node.get_id()),
Some(ObservedStateLocation { conf: None })
) {
let tenant_shard_id = self.tenant_shard_id;
let observed_conf = match attached_node
.with_client_retries(
|client| async move { client.get_location_config(tenant_shard_id).await },
&self.service_config.jwt_token,
1,
1,
Duration::from_secs(5),
&self.cancel,
)
.await
{
Some(Ok(observed)) => observed,
Some(Err(e)) => return Err(e.into()),
None => return Err(ReconcileError::Cancel),
};
tracing::info!("Scanned location configuration on {attached_node}: {observed_conf:?}");
self.observed.locations.insert(
attached_node.get_id(),
ObservedStateLocation {
conf: observed_conf,
},
);
}
Ok(())
}
@@ -426,14 +508,14 @@ impl Reconciler {
/// general case reconciliation where we walk through the intent by pageserver
/// and call out to the pageserver to apply the desired state.
pub(crate) async fn reconcile(&mut self) -> Result<(), ReconcileError> {
// TODO: if any of self.observed is None, call to remote pageservers
// to learn correct state.
// Prepare: if we have uncertain `observed` state for our would-be attachement location, then refresh it
self.maybe_refresh_observed().await?;
// Special case: live migration
self.maybe_live_migrate().await?;
// If the attached pageserver is not attached, do so now.
if let Some(node_id) = self.intent.attached {
if let Some(node) = self.intent.attached.as_ref() {
// If we are in an attached policy, then generation must have been set (null generations
// are only present when a tenant is initially loaded with a secondary policy)
debug_assert!(self.generation.is_some());
@@ -444,10 +526,10 @@ impl Reconciler {
};
let mut wanted_conf = attached_location_conf(generation, &self.shard, &self.config);
match self.observed.locations.get(&node_id) {
match self.observed.locations.get(&node.get_id()) {
Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {
// Nothing to do
tracing::info!(%node_id, "Observed configuration already correct.")
tracing::info!(node_id=%node.get_id(), "Observed configuration already correct.")
}
observed => {
// In all cases other than a matching observed configuration, we will
@@ -485,13 +567,21 @@ impl Reconciler {
if increment_generation {
let generation = self
.persistence
.increment_generation(self.tenant_shard_id, node_id)
.increment_generation(self.tenant_shard_id, node.get_id())
.await?;
self.generation = Some(generation);
wanted_conf.generation = generation.into();
}
tracing::info!(%node_id, "Observed configuration requires update.");
self.location_config(node_id, wanted_conf, None).await?;
tracing::info!(node_id=%node.get_id(), "Observed configuration requires update.");
// Because `node` comes from a ref to &self, clone it before calling into a &mut self
// function: this could be avoided by refactoring the state mutated by location_config into
// a separate type to Self.
let node = node.clone();
// Use lazy=true, because we may run many of Self concurrently, and do not want to
// overload the pageserver with logical size calculations.
self.location_config(&node, wanted_conf, None, true).await?;
self.compute_notify().await?;
}
}
@@ -500,33 +590,27 @@ impl Reconciler {
// Configure secondary locations: if these were previously attached this
// implicitly downgrades them from attached to secondary.
let mut changes = Vec::new();
for node_id in &self.intent.secondary {
for node in &self.intent.secondary {
let wanted_conf = secondary_location_conf(&self.shard, &self.config);
match self.observed.locations.get(node_id) {
match self.observed.locations.get(&node.get_id()) {
Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {
// Nothing to do
tracing::info!(%node_id, "Observed configuration already correct.")
tracing::info!(node_id=%node.get_id(), "Observed configuration already correct.")
}
_ => {
// In all cases other than a matching observed configuration, we will
// reconcile this location.
tracing::info!(%node_id, "Observed configuration requires update.");
changes.push((*node_id, wanted_conf))
tracing::info!(node_id=%node.get_id(), "Observed configuration requires update.");
changes.push((node.clone(), wanted_conf))
}
}
}
// Detach any extraneous pageservers that are no longer referenced
// by our intent.
let all_pageservers = self.intent.all_pageservers();
for node_id in self.observed.locations.keys() {
if all_pageservers.contains(node_id) {
// We are only detaching pageservers that aren't used at all.
continue;
}
for node in &self.detach {
changes.push((
*node_id,
node.clone(),
LocationConfig {
mode: LocationConfigMode::Detached,
generation: None,
@@ -539,11 +623,11 @@ impl Reconciler {
));
}
for (node_id, conf) in changes {
for (node, conf) in changes {
if self.cancel.is_cancelled() {
return Err(ReconcileError::Cancel);
}
self.location_config(node_id, conf, None).await?;
self.location_config(&node, conf, None, false).await?;
}
Ok(())
@@ -552,16 +636,21 @@ impl Reconciler {
pub(crate) async fn compute_notify(&mut self) -> Result<(), NotifyError> {
// Whenever a particular Reconciler emits a notification, it is always notifying for the intended
// destination.
if let Some(node_id) = self.intent.attached {
if let Some(node) = &self.intent.attached {
let result = self
.compute_hook
.notify(self.tenant_shard_id, node_id, &self.cancel)
.notify(
self.tenant_shard_id,
node.get_id(),
self.shard.stripe_size,
&self.cancel,
)
.await;
if let Err(e) = &result {
// It is up to the caller whether they want to drop out on this error, but they don't have to:
// in general we should avoid letting unavailability of the cloud control plane stop us from
// making progress.
tracing::warn!("Failed to notify compute of attached pageserver {node_id}: {e}");
tracing::warn!("Failed to notify compute of attached pageserver {node}: {e}");
// Set this flag so that in our ReconcileResult we will set the flag on the shard that it
// needs to retry at some point.
self.compute_notify_failure = true;

View File

@@ -43,7 +43,7 @@ impl Scheduler {
let mut scheduler_nodes = HashMap::new();
for node in nodes {
scheduler_nodes.insert(
node.id,
node.get_id(),
SchedulerNode {
shard_count: 0,
may_schedule: node.may_schedule(),
@@ -68,7 +68,7 @@ impl Scheduler {
let mut expect_nodes: HashMap<NodeId, SchedulerNode> = HashMap::new();
for node in nodes {
expect_nodes.insert(
node.id,
node.get_id(),
SchedulerNode {
shard_count: 0,
may_schedule: node.may_schedule(),
@@ -156,7 +156,7 @@ impl Scheduler {
pub(crate) fn node_upsert(&mut self, node: &Node) {
use std::collections::hash_map::Entry::*;
match self.nodes.entry(node.id) {
match self.nodes.entry(node.get_id()) {
Occupied(mut entry) => {
entry.get_mut().may_schedule = node.may_schedule();
}
@@ -255,7 +255,6 @@ impl Scheduler {
pub(crate) mod test_utils {
use crate::node::Node;
use pageserver_api::controller_api::{NodeAvailability, NodeSchedulingPolicy};
use std::collections::HashMap;
use utils::id::NodeId;
/// Test helper: synthesize the requested number of nodes, all in active state.
@@ -264,18 +263,17 @@ pub(crate) mod test_utils {
pub(crate) fn make_test_nodes(n: u64) -> HashMap<NodeId, Node> {
(1..n + 1)
.map(|i| {
(
NodeId(i),
Node {
id: NodeId(i),
availability: NodeAvailability::Active,
scheduling: NodeSchedulingPolicy::Active,
listen_http_addr: format!("httphost-{i}"),
listen_http_port: 80 + i as u16,
listen_pg_addr: format!("pghost-{i}"),
listen_pg_port: 5432 + i as u16,
},
)
(NodeId(i), {
let node = Node::new(
NodeId(i),
format!("httphost-{i}"),
80 + i as u16,
format!("pghost-{i}"),
5432 + i as u16,
);
assert!(node.is_available());
node
})
})
.collect()
}

File diff suppressed because it is too large Load Diff

View File

@@ -1,7 +1,11 @@
use std::{collections::HashMap, sync::Arc, time::Duration};
use std::{
collections::{HashMap, HashSet},
sync::Arc,
time::Duration,
};
use crate::{metrics, persistence::TenantShardPersistence};
use pageserver_api::controller_api::NodeAvailability;
use pageserver_api::controller_api::PlacementPolicy;
use pageserver_api::{
models::{LocationConfig, LocationConfigMode, TenantConfig},
shard::{ShardIdentity, TenantShardId},
@@ -25,7 +29,7 @@ use crate::{
attached_location_conf, secondary_location_conf, ReconcileError, Reconciler, TargetState,
},
scheduler::{ScheduleError, Scheduler},
service, PlacementPolicy, Sequence,
service, Sequence,
};
/// Serialization helper
@@ -370,7 +374,7 @@ impl TenantState {
/// [`ObservedState`], even if it violates my [`PlacementPolicy`]. Call [`Self::schedule`] next,
/// to get an intent state that complies with placement policy. The overall goal is to do scheduling
/// in a way that makes use of any configured locations that already exist in the outside world.
pub(crate) fn intent_from_observed(&mut self) {
pub(crate) fn intent_from_observed(&mut self, scheduler: &mut Scheduler) {
// Choose an attached location by filtering observed locations, and then sorting to get the highest
// generation
let mut attached_locs = self
@@ -395,7 +399,7 @@ impl TenantState {
attached_locs.sort_by_key(|i| i.1);
if let Some((node_id, _gen)) = attached_locs.into_iter().last() {
self.intent.attached = Some(*node_id);
self.intent.set_attached(scheduler, Some(*node_id));
}
// All remaining observed locations generate secondary intents. This includes None
@@ -406,7 +410,7 @@ impl TenantState {
// will take care of promoting one of these secondaries to be attached.
self.observed.locations.keys().for_each(|node_id| {
if Some(*node_id) != self.intent.attached {
self.intent.secondary.push(*node_id);
self.intent.push_secondary(scheduler, *node_id);
}
});
}
@@ -564,7 +568,9 @@ impl TenantState {
}
}
fn dirty(&self) -> bool {
fn dirty(&self, nodes: &Arc<HashMap<NodeId, Node>>) -> bool {
let mut dirty_nodes = HashSet::new();
if let Some(node_id) = self.intent.attached {
// Maybe panic: it is a severe bug if we try to attach while generation is null.
let generation = self
@@ -575,7 +581,7 @@ impl TenantState {
match self.observed.locations.get(&node_id) {
Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {}
Some(_) | None => {
return true;
dirty_nodes.insert(node_id);
}
}
}
@@ -585,7 +591,7 @@ impl TenantState {
match self.observed.locations.get(node_id) {
Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {}
Some(_) | None => {
return true;
dirty_nodes.insert(*node_id);
}
}
}
@@ -593,24 +599,25 @@ impl TenantState {
for node_id in self.observed.locations.keys() {
if self.intent.attached != Some(*node_id) && !self.intent.secondary.contains(node_id) {
// We have observed state that isn't part of our intent: need to clean it up.
return true;
dirty_nodes.insert(*node_id);
}
}
// Even if there is no pageserver work to be done, if we have a pending notification to computes,
// wake up a reconciler to send it.
if self.pending_compute_notification {
return true;
}
dirty_nodes.retain(|node_id| {
nodes
.get(node_id)
.map(|n| n.is_available())
.unwrap_or(false)
});
false
!dirty_nodes.is_empty()
}
#[allow(clippy::too_many_arguments)]
#[instrument(skip_all, fields(tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug()))]
pub(crate) fn maybe_reconcile(
&mut self,
result_tx: tokio::sync::mpsc::UnboundedSender<ReconcileResult>,
result_tx: &tokio::sync::mpsc::UnboundedSender<ReconcileResult>,
pageservers: &Arc<HashMap<NodeId, Node>>,
compute_hook: &Arc<ComputeHook>,
service_config: &service::Config,
@@ -625,15 +632,20 @@ impl TenantState {
let node = pageservers
.get(node_id)
.expect("Nodes may not be removed while referenced");
if observed_loc.conf.is_none()
&& !matches!(node.availability, NodeAvailability::Offline)
{
if observed_loc.conf.is_none() && node.is_available() {
dirty_observed = true;
break;
}
}
if !self.dirty() && !dirty_observed {
let active_nodes_dirty = self.dirty(pageservers);
// Even if there is no pageserver work to be done, if we have a pending notification to computes,
// wake up a reconciler to send it.
let do_reconcile =
active_nodes_dirty || dirty_observed || self.pending_compute_notification;
if !do_reconcile {
tracing::info!("Not dirty, no reconciliation needed.");
return None;
}
@@ -663,6 +675,21 @@ impl TenantState {
}
}
// Build list of nodes from which the reconciler should detach
let mut detach = Vec::new();
for node_id in self.observed.locations.keys() {
if self.intent.get_attached() != &Some(*node_id)
&& !self.intent.secondary.contains(node_id)
{
detach.push(
pageservers
.get(node_id)
.expect("Intent references non-existent pageserver")
.clone(),
)
}
}
// Reconcile in flight for a stale sequence? Our sequence's task will wait for it before
// doing our sequence's work.
let old_handle = self.reconciler.take();
@@ -677,14 +704,15 @@ impl TenantState {
self.sequence = self.sequence.next();
let reconciler_cancel = cancel.child_token();
let reconciler_intent = TargetState::from_intent(pageservers, &self.intent);
let mut reconciler = Reconciler {
tenant_shard_id: self.tenant_shard_id,
shard: self.shard,
generation: self.generation,
intent: TargetState::from_intent(&self.intent),
intent: reconciler_intent,
detach,
config: self.config.clone(),
observed: self.observed.clone(),
pageservers: pageservers.clone(),
compute_hook: compute_hook.clone(),
service_config: service_config.clone(),
_gate_guard: gate_guard,
@@ -701,6 +729,7 @@ impl TenantState {
tenant_id=%reconciler.tenant_shard_id.tenant_id,
shard_id=%reconciler.tenant_shard_id.shard_slug());
metrics::RECONCILER.spawned.inc();
let result_tx = result_tx.clone();
let join_handle = tokio::task::spawn(
async move {
// Wait for any previous reconcile task to complete before we start
@@ -819,7 +848,10 @@ impl TenantState {
#[cfg(test)]
pub(crate) mod tests {
use pageserver_api::shard::{ShardCount, ShardNumber};
use pageserver_api::{
controller_api::NodeAvailability,
shard::{ShardCount, ShardNumber},
};
use utils::id::TenantId;
use crate::scheduler::test_utils::make_test_nodes;
@@ -878,7 +910,10 @@ pub(crate) mod tests {
assert_eq!(tenant_state.intent.secondary.len(), 2);
// Update the scheduler state to indicate the node is offline
nodes.get_mut(&attached_node_id).unwrap().availability = NodeAvailability::Offline;
nodes
.get_mut(&attached_node_id)
.unwrap()
.set_availability(NodeAvailability::Offline);
scheduler.node_upsert(nodes.get(&attached_node_id).unwrap());
// Scheduling the node should promote the still-available secondary node to attached
@@ -897,4 +932,54 @@ pub(crate) mod tests {
Ok(())
}
#[test]
fn intent_from_observed() -> anyhow::Result<()> {
let nodes = make_test_nodes(3);
let mut scheduler = Scheduler::new(nodes.values());
let mut tenant_state = make_test_tenant_shard(PlacementPolicy::Double(1));
tenant_state.observed.locations.insert(
NodeId(3),
ObservedStateLocation {
conf: Some(LocationConfig {
mode: LocationConfigMode::AttachedMulti,
generation: Some(2),
secondary_conf: None,
shard_number: tenant_state.shard.number.0,
shard_count: tenant_state.shard.count.literal(),
shard_stripe_size: tenant_state.shard.stripe_size.0,
tenant_conf: TenantConfig::default(),
}),
},
);
tenant_state.observed.locations.insert(
NodeId(2),
ObservedStateLocation {
conf: Some(LocationConfig {
mode: LocationConfigMode::AttachedStale,
generation: Some(1),
secondary_conf: None,
shard_number: tenant_state.shard.number.0,
shard_count: tenant_state.shard.count.literal(),
shard_stripe_size: tenant_state.shard.stripe_size.0,
tenant_conf: TenantConfig::default(),
}),
},
);
tenant_state.intent_from_observed(&mut scheduler);
// The highest generationed attached location gets used as attached
assert_eq!(tenant_state.intent.attached, Some(NodeId(3)));
// Other locations get used as secondary
assert_eq!(tenant_state.intent.secondary, vec![NodeId(2)]);
scheduler.consistency_check(nodes.values(), [&tenant_state].into_iter())?;
tenant_state.intent.clear(&mut scheduler);
Ok(())
}
}

View File

@@ -34,7 +34,7 @@ pub struct AttachmentService {
client: reqwest::Client,
}
const COMMAND: &str = "attachment_service";
const COMMAND: &str = "storage_controller";
const ATTACHMENT_SERVICE_POSTGRES_VERSION: u32 = 16;

View File

@@ -15,7 +15,7 @@ use control_plane::pageserver::{PageServerNode, PAGESERVER_REMOTE_STORAGE_DIR};
use control_plane::safekeeper::SafekeeperNode;
use control_plane::{broker, local_env};
use pageserver_api::controller_api::{
NodeAvailability, NodeConfigureRequest, NodeSchedulingPolicy,
NodeAvailability, NodeConfigureRequest, NodeSchedulingPolicy, PlacementPolicy,
};
use pageserver_api::models::{
ShardParameters, TenantCreateRequest, TimelineCreateRequest, TimelineInfo,
@@ -435,6 +435,11 @@ async fn handle_tenant(
let shard_stripe_size: Option<u32> =
create_match.get_one::<u32>("shard-stripe-size").cloned();
let placement_policy = match create_match.get_one::<String>("placement-policy") {
Some(s) if !s.is_empty() => serde_json::from_str::<PlacementPolicy>(s)?,
_ => PlacementPolicy::Single,
};
let tenant_conf = PageServerNode::parse_config(tenant_conf)?;
// If tenant ID was not specified, generate one
@@ -456,6 +461,7 @@ async fn handle_tenant(
.map(ShardStripeSize)
.unwrap_or(ShardParameters::DEFAULT_STRIPE_SIZE),
},
placement_policy: Some(placement_policy),
config: tenant_conf,
})
.await?;
@@ -1024,7 +1030,7 @@ async fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Re
})
.collect::<Vec<_>>()
};
endpoint.reconfigure(pageservers).await?;
endpoint.reconfigure(pageservers, None).await?;
}
"stop" => {
let endpoint_id = sub_args
@@ -1562,6 +1568,7 @@ fn cli() -> Command {
.help("Use this tenant in future CLI commands where tenant_id is needed, but not specified"))
.arg(Arg::new("shard-count").value_parser(value_parser!(u8)).long("shard-count").action(ArgAction::Set).help("Number of shards in the new tenant (default 1)"))
.arg(Arg::new("shard-stripe-size").value_parser(value_parser!(u32)).long("shard-stripe-size").action(ArgAction::Set).help("Sharding stripe size in pages"))
.arg(Arg::new("placement-policy").value_parser(value_parser!(String)).long("placement-policy").action(ArgAction::Set).help("Placement policy shards in this tenant"))
)
.subcommand(Command::new("set-default").arg(tenant_id_arg.clone().required(true))
.about("Set a particular tenant as default in future CLI commands where tenant_id is needed, but not specified"))

View File

@@ -52,6 +52,7 @@ use compute_api::spec::RemoteExtSpec;
use compute_api::spec::Role;
use nix::sys::signal::kill;
use nix::sys::signal::Signal;
use pageserver_api::shard::ShardStripeSize;
use serde::{Deserialize, Serialize};
use url::Host;
use utils::id::{NodeId, TenantId, TimelineId};
@@ -655,7 +656,7 @@ impl Endpoint {
// Wait for it to start
let mut attempt = 0;
const ATTEMPT_INTERVAL: Duration = Duration::from_millis(100);
const MAX_ATTEMPTS: u32 = 10 * 30; // Wait up to 30 s
const MAX_ATTEMPTS: u32 = 10 * 90; // Wait up to 1.5 min
loop {
attempt += 1;
match self.get_status().await {
@@ -735,7 +736,11 @@ impl Endpoint {
}
}
pub async fn reconfigure(&self, mut pageservers: Vec<(Host, u16)>) -> Result<()> {
pub async fn reconfigure(
&self,
mut pageservers: Vec<(Host, u16)>,
stripe_size: Option<ShardStripeSize>,
) -> Result<()> {
let mut spec: ComputeSpec = {
let spec_path = self.endpoint_path().join("spec.json");
let file = std::fs::File::open(spec_path)?;
@@ -765,6 +770,9 @@ impl Endpoint {
let pageserver_connstr = Self::build_pageserver_connstr(&pageservers);
assert!(!pageserver_connstr.is_empty());
spec.pageserver_connstring = Some(pageserver_connstr);
if stripe_size.is_some() {
spec.shard_stripe_size = stripe_size.map(|s| s.0 as usize);
}
let client = reqwest::Client::new();
let response = client

View File

@@ -232,7 +232,7 @@ impl LocalEnv {
// run from the same location as neon_local. This means that for compatibility
// tests that run old pageserver/safekeeper, they still run latest attachment service.
let neon_local_bin_dir = env::current_exe().unwrap().parent().unwrap().to_owned();
neon_local_bin_dir.join("attachment_service")
neon_local_bin_dir.join("storage_controller")
}
pub fn safekeeper_bin(&self) -> PathBuf {

View File

@@ -429,6 +429,8 @@ impl PageServerNode {
generation,
config,
shard_parameters: ShardParameters::default(),
// Placement policy is not meaningful for creations not done via storage controller
placement_policy: None,
};
if !settings.is_empty() {
bail!("Unrecognized tenant settings: {settings:?}")
@@ -537,10 +539,11 @@ impl PageServerNode {
tenant_shard_id: TenantShardId,
config: LocationConfig,
flush_ms: Option<Duration>,
lazy: bool,
) -> anyhow::Result<()> {
Ok(self
.http_client
.location_config(tenant_shard_id, config, flush_ms)
.location_config(tenant_shard_id, config, flush_ms, lazy)
.await?)
}
@@ -605,7 +608,7 @@ impl PageServerNode {
eprintln!("connection error: {}", e);
}
});
tokio::pin!(client);
let client = std::pin::pin!(client);
// Init base reader
let (start_lsn, base_tarfile_path) = base;

View File

@@ -29,7 +29,6 @@ pub mod launch_timestamp;
mod wrappers;
pub use wrappers::{CountedReader, CountedWriter};
mod hll;
pub mod metric_vec_duration;
pub use hll::{HyperLogLog, HyperLogLogVec};
#[cfg(target_os = "linux")]
pub mod more_process_metrics;

View File

@@ -1,23 +0,0 @@
//! Helpers for observing duration on `HistogramVec` / `CounterVec` / `GaugeVec` / `MetricVec<T>`.
use std::{future::Future, time::Instant};
pub trait DurationResultObserver {
fn observe_result<T, E>(&self, res: &Result<T, E>, duration: std::time::Duration);
}
pub async fn observe_async_block_duration_by_result<
T,
E,
F: Future<Output = Result<T, E>>,
O: DurationResultObserver,
>(
observer: &O,
block: F,
) -> Result<T, E> {
let start = Instant::now();
let result = block.await;
let duration = start.elapsed();
observer.observe_result(&result, duration);
result
}

View File

@@ -125,5 +125,45 @@ impl From<NodeSchedulingPolicy> for String {
}
}
/// Controls how tenant shards are mapped to locations on pageservers, e.g. whether
/// to create secondary locations.
#[derive(Clone, Serialize, Deserialize, Debug, PartialEq, Eq)]
pub enum PlacementPolicy {
/// Cheapest way to attach a tenant: just one pageserver, no secondary
Single,
/// Production-ready way to attach a tenant: one attached pageserver and
/// some number of secondaries.
Double(usize),
/// Create one secondary mode locations. This is useful when onboarding
/// a tenant, or for an idle tenant that we might want to bring online quickly.
Secondary,
/// Do not attach to any pageservers. This is appropriate for tenants that
/// have been idle for a long time, where we do not mind some delay in making
/// them available in future.
Detached,
}
#[derive(Serialize, Deserialize, Debug)]
pub struct TenantShardMigrateResponse {}
#[cfg(test)]
mod test {
use super::*;
use serde_json;
/// Check stability of PlacementPolicy's serialization
#[test]
fn placement_policy_encoding() -> anyhow::Result<()> {
let v = PlacementPolicy::Double(1);
let encoded = serde_json::to_string(&v)?;
assert_eq!(encoded, "{\"Double\":1}");
assert_eq!(serde_json::from_str::<PlacementPolicy>(&encoded)?, v);
let v = PlacementPolicy::Single;
let encoded = serde_json::to_string(&v)?;
assert_eq!(encoded, "\"Single\"");
assert_eq!(serde_json::from_str::<PlacementPolicy>(&encoded)?, v);
Ok(())
}
}

View File

@@ -21,6 +21,7 @@ use utils::{
lsn::Lsn,
};
use crate::controller_api::PlacementPolicy;
use crate::{
reltag::RelTag,
shard::{ShardCount, ShardStripeSize, TenantShardId},
@@ -242,6 +243,11 @@ pub struct TenantCreateRequest {
#[serde(skip_serializing_if = "ShardParameters::is_unsharded")]
pub shard_parameters: ShardParameters,
// This parameter is only meaningful in requests sent to the storage controller
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub placement_policy: Option<PlacementPolicy>,
#[serde(flatten)]
pub config: TenantConfig, // as we have a flattened field, we should reject all unknown fields in it
}
@@ -435,6 +441,8 @@ pub struct TenantShardLocation {
#[serde(deny_unknown_fields)]
pub struct TenantLocationConfigResponse {
pub shards: Vec<TenantShardLocation>,
// If the shards' ShardCount count is >1, stripe_size will be set.
pub stripe_size: Option<ShardStripeSize>,
}
#[derive(Serialize, Deserialize, Debug)]

View File

@@ -6,7 +6,6 @@
#![deny(clippy::undocumented_unsafe_blocks)]
use anyhow::Context;
use bytes::Bytes;
use futures::pin_mut;
use serde::{Deserialize, Serialize};
use std::io::ErrorKind;
use std::net::SocketAddr;
@@ -378,8 +377,7 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> PostgresBackend<IO> {
&mut self,
cx: &mut std::task::Context<'_>,
) -> Poll<Result<(), std::io::Error>> {
let flush_fut = self.flush();
pin_mut!(flush_fut);
let flush_fut = std::pin::pin!(self.flush());
flush_fut.poll(cx)
}

View File

@@ -72,14 +72,19 @@ async fn simple_select() {
}
}
static KEY: Lazy<rustls::PrivateKey> = Lazy::new(|| {
static KEY: Lazy<rustls::pki_types::PrivateKeyDer<'static>> = Lazy::new(|| {
let mut cursor = Cursor::new(include_bytes!("key.pem"));
rustls::PrivateKey(rustls_pemfile::rsa_private_keys(&mut cursor).unwrap()[0].clone())
let key = rustls_pemfile::rsa_private_keys(&mut cursor)
.next()
.unwrap()
.unwrap();
rustls::pki_types::PrivateKeyDer::Pkcs1(key)
});
static CERT: Lazy<rustls::Certificate> = Lazy::new(|| {
static CERT: Lazy<rustls::pki_types::CertificateDer<'static>> = Lazy::new(|| {
let mut cursor = Cursor::new(include_bytes!("cert.pem"));
rustls::Certificate(rustls_pemfile::certs(&mut cursor).unwrap()[0].clone())
let cert = rustls_pemfile::certs(&mut cursor).next().unwrap().unwrap();
cert
});
// test that basic select with ssl works
@@ -88,9 +93,8 @@ async fn simple_select_ssl() {
let (client_sock, server_sock) = make_tcp_pair().await;
let server_cfg = rustls::ServerConfig::builder()
.with_safe_defaults()
.with_no_client_auth()
.with_single_cert(vec![CERT.clone()], KEY.clone())
.with_single_cert(vec![CERT.clone()], KEY.clone_key())
.unwrap();
let tls_config = Some(Arc::new(server_cfg));
let pgbackend =
@@ -102,10 +106,9 @@ async fn simple_select_ssl() {
});
let client_cfg = rustls::ClientConfig::builder()
.with_safe_defaults()
.with_root_certificates({
let mut store = rustls::RootCertStore::empty();
store.add(&CERT).unwrap();
store.add(CERT.clone()).unwrap();
store
})
.with_no_client_auth();

View File

@@ -17,6 +17,7 @@ use remote_storage::{
};
use test_context::test_context;
use test_context::AsyncTestContext;
use tokio::io::AsyncBufReadExt;
use tokio_util::sync::CancellationToken;
use tracing::info;
@@ -484,32 +485,33 @@ async fn download_is_cancelled(ctx: &mut MaybeEnabledStorage) {
))
.unwrap();
let len = upload_large_enough_file(&ctx.client, &path, &cancel).await;
let file_len = upload_large_enough_file(&ctx.client, &path, &cancel).await;
{
let mut stream = ctx
let stream = ctx
.client
.download(&path, &cancel)
.await
.expect("download succeeds")
.download_stream;
let first = stream
.next()
.await
.expect("should have the first blob")
.expect("should have succeeded");
let mut reader = std::pin::pin!(tokio_util::io::StreamReader::new(stream));
tracing::info!(len = first.len(), "downloaded first chunk");
let first = reader.fill_buf().await.expect("should have the first blob");
let len = first.len();
tracing::info!(len, "downloaded first chunk");
assert!(
first.len() < len,
first.len() < file_len,
"uploaded file is too small, we downloaded all on first chunk"
);
reader.consume(len);
cancel.cancel();
let next = stream.next().await.expect("stream should have more");
let next = reader.fill_buf().await;
let e = next.expect_err("expected an error, but got a chunk?");
@@ -520,6 +522,10 @@ async fn download_is_cancelled(ctx: &mut MaybeEnabledStorage) {
.is_some_and(|e| matches!(e, DownloadError::Cancelled)),
"{inner:?}"
);
let e = DownloadError::from(e);
assert!(matches!(e, DownloadError::Cancelled), "{e:?}");
}
let cancel = CancellationToken::new();

View File

@@ -84,6 +84,9 @@ where
info!("Handling request");
}
// Take a copy of the path for error logging
let path = request.uri().path().to_string();
// No special handling for panics here. There's a `tracing_panic_hook` from another
// module to do that globally.
let res = handler(request).await;
@@ -110,7 +113,7 @@ where
}
Ok(response)
}
Err(err) => Ok(api_error_handler(err)),
Err(err) => Ok(api_error_handler(err, Some(&path))),
}
}
.instrument(request_span)

View File

@@ -108,7 +108,7 @@ impl HttpErrorBody {
pub async fn route_error_handler(err: routerify::RouteError) -> Response<Body> {
match err.downcast::<ApiError>() {
Ok(api_error) => api_error_handler(*api_error),
Ok(api_error) => api_error_handler(*api_error, None),
Err(other_error) => {
// We expect all the request handlers to return an ApiError, so this should
// not be reached. But just in case.
@@ -121,12 +121,16 @@ pub async fn route_error_handler(err: routerify::RouteError) -> Response<Body> {
}
}
pub fn api_error_handler(api_error: ApiError) -> Response<Body> {
pub fn api_error_handler(api_error: ApiError, path: Option<&str>) -> Response<Body> {
// Print a stack trace for Internal Server errors
match api_error {
ApiError::Forbidden(_) | ApiError::Unauthorized(_) => {
warn!("Error processing HTTP request: {api_error:#}")
warn!(
"Error processing HTTP request: {api_error:#} {}{}",
path.as_ref().map(|_| "at").unwrap_or(""),
path.unwrap_or("")
)
}
ApiError::ResourceUnavailable(_) => info!("Error processing HTTP request: {api_error:#}"),
ApiError::NotFound(_) => info!("Error processing HTTP request: {api_error:#}"),

View File

@@ -324,11 +324,11 @@ extern "C" fn finish_sync_safekeepers(wp: *mut WalProposer, lsn: XLogRecPtr) {
}
}
extern "C" fn process_safekeeper_feedback(wp: *mut WalProposer, commit_lsn: XLogRecPtr) {
extern "C" fn process_safekeeper_feedback(wp: *mut WalProposer) {
unsafe {
let callback_data = (*(*wp).config).callback_data;
let api = callback_data as *mut Box<dyn ApiImpl>;
(*api).process_safekeeper_feedback(&mut (*wp), commit_lsn)
(*api).process_safekeeper_feedback(&mut (*wp))
}
}

View File

@@ -142,7 +142,7 @@ pub trait ApiImpl {
todo!()
}
fn process_safekeeper_feedback(&self, _wp: &mut WalProposer, _commit_lsn: u64) {
fn process_safekeeper_feedback(&mut self, _wp: &mut WalProposer) {
todo!()
}

View File

@@ -7,7 +7,7 @@ use utils::{
pub mod util;
#[derive(Debug)]
#[derive(Debug, Clone)]
pub struct Client {
mgmt_api_endpoint: String,
authorization_header: Option<String>,
@@ -24,6 +24,9 @@ pub enum Error {
#[error("pageserver API: {1}")]
ApiError(StatusCode, String),
#[error("Cancelled")]
Cancelled,
}
pub type Result<T> = std::result::Result<T, Error>;
@@ -251,21 +254,30 @@ impl Client {
tenant_shard_id: TenantShardId,
config: LocationConfig,
flush_ms: Option<std::time::Duration>,
lazy: bool,
) -> Result<()> {
let req_body = TenantLocationConfigRequest {
tenant_id: tenant_shard_id,
config,
};
let path = format!(
let mut path = reqwest::Url::parse(&format!(
"{}/v1/tenant/{}/location_config",
self.mgmt_api_endpoint, tenant_shard_id
);
let path = if let Some(flush_ms) = flush_ms {
format!("{}?flush_ms={}", path, flush_ms.as_millis())
} else {
path
};
self.request(Method::PUT, &path, &req_body).await?;
))
// Should always work: mgmt_api_endpoint is configuration, not user input.
.expect("Cannot build URL");
if lazy {
path.query_pairs_mut().append_pair("lazy", "true");
}
if let Some(flush_ms) = flush_ms {
path.query_pairs_mut()
.append_pair("flush_ms", &format!("{}", flush_ms.as_millis()));
}
self.request(Method::PUT, path, &req_body).await?;
Ok(())
}
@@ -278,6 +290,21 @@ impl Client {
.map_err(Error::ReceiveBody)
}
pub async fn get_location_config(
&self,
tenant_shard_id: TenantShardId,
) -> Result<Option<LocationConfig>> {
let path = format!(
"{}/v1/location_config/{tenant_shard_id}",
self.mgmt_api_endpoint
);
self.request(Method::GET, &path, ())
.await?
.json()
.await
.map_err(Error::ReceiveBody)
}
pub async fn timeline_create(
&self,
tenant_shard_id: TenantShardId,

View File

@@ -63,7 +63,7 @@ pub async fn compact_tiered<E: CompactionJobExecutor>(
);
// Identify the range of LSNs that belong to this level. We assume that
// each file in this level span an LSN range up to 1.75x target file
// each file in this level spans an LSN range up to 1.75x target file
// size. That should give us enough slop that if we created a slightly
// oversized L0 layer, e.g. because flushing the in-memory layer was
// delayed for some reason, we don't consider the oversized layer to
@@ -248,7 +248,6 @@ enum CompactionStrategy {
CreateImage,
}
#[allow(dead_code)] // Todo
struct CompactionJob<E: CompactionJobExecutor> {
key_range: Range<E::Key>,
lsn_range: Range<Lsn>,
@@ -345,7 +344,7 @@ where
///
/// TODO: Currently, this is called exactly once for the level, and we
/// decide whether to create new image layers to cover the whole level, or
/// write a new set of delta. In the future, this should try to partition
/// write a new set of deltas. In the future, this should try to partition
/// the key space, and make the decision separately for each partition.
async fn divide_job(&mut self, job_id: JobId, ctx: &E::RequestContext) -> anyhow::Result<()> {
let job = &self.jobs[job_id.0];
@@ -709,18 +708,6 @@ where
}
}
// Sliding window through keyspace and values
//
// This is used to decide what layer to write next, from the beginning of the window.
//
// Candidates:
//
// 1. Create an image layer, snapping to previous images
// 2. Create a delta layer, snapping to previous images
// 3. Create an image layer, snapping to
//
//
// Take previous partitioning, based on the image layers below.
//
// Candidate is at the front:
@@ -739,6 +726,10 @@ struct WindowElement<K> {
last_key: K, // inclusive
accum_size: u64,
}
// Sliding window through keyspace and values
//
// This is used to decide what layer to write next, from the beginning of the window.
struct Window<K> {
elems: VecDeque<WindowElement<K>>,

View File

@@ -1,5 +1,5 @@
//! An LSM tree consists of multiple levels, each exponential larger than the
//! previous level. And each level consists of be multiple "tiers". With tiered
//! An LSM tree consists of multiple levels, each exponentially larger than the
//! previous level. And each level consists of multiple "tiers". With tiered
//! compaction, a level is compacted when it has accumulated more than N tiers,
//! forming one tier on the next level.
//!
@@ -170,13 +170,6 @@ where
})
}
// helper struct used in depth()
struct Event<K> {
key: K,
layer_idx: usize,
start: bool,
}
impl<L> Level<L> {
/// Count the number of deltas stacked on each other.
pub fn depth<K>(&self) -> u64
@@ -184,6 +177,11 @@ impl<L> Level<L> {
K: CompactionKey,
L: CompactionLayer<K>,
{
struct Event<K> {
key: K,
layer_idx: usize,
start: bool,
}
let mut events: Vec<Event<K>> = Vec::new();
for (idx, l) in self.layers.iter().enumerate() {
events.push(Event {
@@ -202,7 +200,7 @@ impl<L> Level<L> {
// Sweep the key space left to right. Stop at each distinct key, and
// count the number of deltas on top of the highest image at that key.
//
// This is a little enefficient, as we walk through the active_set on
// This is a little inefficient, as we walk through the active_set on
// every key. We could increment/decrement a counter on each step
// instead, but that'd require a bit more complex bookkeeping.
let mut active_set: BTreeSet<(Lsn, bool, usize)> = BTreeSet::new();
@@ -236,6 +234,7 @@ impl<L> Level<L> {
}
}
}
debug_assert_eq!(active_set, BTreeSet::new());
max_depth
}
}

View File

@@ -4,12 +4,12 @@
//! All the heavy lifting is done by the create_image and create_delta
//! functions that the implementor provides.
use async_trait::async_trait;
use futures::Future;
use pageserver_api::{key::Key, keyspace::key_range_size};
use std::ops::Range;
use utils::lsn::Lsn;
/// Public interface. This is the main thing that the implementor needs to provide
#[async_trait]
pub trait CompactionJobExecutor {
// Type system.
//
@@ -17,8 +17,7 @@ pub trait CompactionJobExecutor {
// compaction doesn't distinguish whether they are stored locally or
// remotely.
//
// The keyspace is defined by CompactionKey trait.
//
// The keyspace is defined by the CompactionKey trait.
type Key: CompactionKey;
type Layer: CompactionLayer<Self::Key> + Clone;
@@ -35,27 +34,27 @@ pub trait CompactionJobExecutor {
// ----
/// Return all layers that overlap the given bounding box.
async fn get_layers(
fn get_layers(
&mut self,
key_range: &Range<Self::Key>,
lsn_range: &Range<Lsn>,
ctx: &Self::RequestContext,
) -> anyhow::Result<Vec<Self::Layer>>;
) -> impl Future<Output = anyhow::Result<Vec<Self::Layer>>> + Send;
async fn get_keyspace(
fn get_keyspace(
&mut self,
key_range: &Range<Self::Key>,
lsn: Lsn,
ctx: &Self::RequestContext,
) -> anyhow::Result<CompactionKeySpace<Self::Key>>;
) -> impl Future<Output = anyhow::Result<CompactionKeySpace<Self::Key>>> + Send;
/// NB: This is a pretty expensive operation. In the real pageserver
/// implementation, it downloads the layer, and keeps it resident
/// until the DeltaLayer is dropped.
async fn downcast_delta_layer(
fn downcast_delta_layer(
&self,
layer: &Self::Layer,
) -> anyhow::Result<Option<Self::DeltaLayer>>;
) -> impl Future<Output = anyhow::Result<Option<Self::DeltaLayer>>> + Send;
// ----
// Functions to execute the plan
@@ -63,33 +62,33 @@ pub trait CompactionJobExecutor {
/// Create a new image layer, materializing all the values in the key range,
/// at given 'lsn'.
async fn create_image(
fn create_image(
&mut self,
lsn: Lsn,
key_range: &Range<Self::Key>,
ctx: &Self::RequestContext,
) -> anyhow::Result<()>;
) -> impl Future<Output = anyhow::Result<()>> + Send;
/// Create a new delta layer, containing all the values from 'input_layers'
/// in the given key and LSN range.
async fn create_delta(
fn create_delta(
&mut self,
lsn_range: &Range<Lsn>,
key_range: &Range<Self::Key>,
input_layers: &[Self::DeltaLayer],
ctx: &Self::RequestContext,
) -> anyhow::Result<()>;
) -> impl Future<Output = anyhow::Result<()>> + Send;
/// Delete a layer. The compaction implementation will call this only after
/// all the create_image() or create_delta() calls that deletion of this
/// layer depends on have finished. But if the implementor has extra lazy
/// background tasks, like uploading the index json file to remote storage,
/// background tasks, like uploading the index json file to remote storage.
/// it is the implementation's responsibility to track those.
async fn delete_layer(
fn delete_layer(
&mut self,
layer: &Self::Layer,
ctx: &Self::RequestContext,
) -> anyhow::Result<()>;
) -> impl Future<Output = anyhow::Result<()>> + Send;
}
pub trait CompactionKey: std::cmp::Ord + Clone + Copy + std::fmt::Display {

View File

@@ -429,7 +429,6 @@ impl From<&Arc<MockImageLayer>> for MockLayer {
}
}
#[async_trait]
impl interface::CompactionJobExecutor for MockTimeline {
type Key = Key;
type Layer = MockLayer;

View File

@@ -83,6 +83,10 @@ pub mod defaults {
pub const DEFAULT_INGEST_BATCH_SIZE: u64 = 100;
#[cfg(target_os = "linux")]
pub const DEFAULT_VIRTUAL_FILE_IO_ENGINE: &str = "tokio-epoll-uring";
#[cfg(not(target_os = "linux"))]
pub const DEFAULT_VIRTUAL_FILE_IO_ENGINE: &str = "std-fs";
pub const DEFAULT_GET_VECTORED_IMPL: &str = "sequential";

View File

@@ -88,13 +88,16 @@
use crate::task_mgr::TaskKind;
pub(crate) mod optional_counter;
// The main structure of this module, see module-level comment.
#[derive(Clone, Debug)]
#[derive(Debug)]
pub struct RequestContext {
task_kind: TaskKind,
download_behavior: DownloadBehavior,
access_stats_behavior: AccessStatsBehavior,
page_content_kind: PageContentKind,
pub micros_spent_throttled: optional_counter::MicroSecondsCounterU32,
}
/// The kind of access to the page cache.
@@ -150,6 +153,7 @@ impl RequestContextBuilder {
download_behavior: DownloadBehavior::Download,
access_stats_behavior: AccessStatsBehavior::Update,
page_content_kind: PageContentKind::Unknown,
micros_spent_throttled: Default::default(),
},
}
}
@@ -163,6 +167,7 @@ impl RequestContextBuilder {
download_behavior: original.download_behavior,
access_stats_behavior: original.access_stats_behavior,
page_content_kind: original.page_content_kind,
micros_spent_throttled: Default::default(),
},
}
}

View File

@@ -0,0 +1,101 @@
use std::{
sync::atomic::{AtomicU32, Ordering},
time::Duration,
};
#[derive(Debug)]
pub struct CounterU32 {
inner: AtomicU32,
}
impl Default for CounterU32 {
fn default() -> Self {
Self {
inner: AtomicU32::new(u32::MAX),
}
}
}
impl CounterU32 {
pub fn open(&self) -> Result<(), &'static str> {
match self
.inner
.compare_exchange(u32::MAX, 0, Ordering::Relaxed, Ordering::Relaxed)
{
Ok(_) => Ok(()),
Err(_) => Err("open() called on clsoed state"),
}
}
pub fn close(&self) -> Result<u32, &'static str> {
match self.inner.swap(u32::MAX, Ordering::Relaxed) {
u32::MAX => Err("close() called on closed state"),
x => Ok(x),
}
}
pub fn add(&self, count: u32) -> Result<(), &'static str> {
if count == 0 {
return Ok(());
}
let mut had_err = None;
self.inner
.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |cur| match cur {
u32::MAX => {
had_err = Some("add() called on closed state");
None
}
x => {
let (new, overflowed) = x.overflowing_add(count);
if new == u32::MAX || overflowed {
had_err = Some("add() overflowed the counter");
None
} else {
Some(new)
}
}
})
.map_err(|_| had_err.expect("we set it whenever the function returns None"))
.map(|_| ())
}
}
#[derive(Default, Debug)]
pub struct MicroSecondsCounterU32 {
inner: CounterU32,
}
impl MicroSecondsCounterU32 {
pub fn open(&self) -> Result<(), &'static str> {
self.inner.open()
}
pub fn add(&self, duration: Duration) -> Result<(), &'static str> {
match duration.as_micros().try_into() {
Ok(x) => self.inner.add(x),
Err(_) => Err("add(): duration conversion error"),
}
}
pub fn close_and_checked_sub_from(&self, from: Duration) -> Result<Duration, &'static str> {
let val = self.inner.close()?;
let val = Duration::from_micros(val as u64);
let subbed = match from.checked_sub(val) {
Some(v) => v,
None => return Err("Duration::checked_sub"),
};
Ok(subbed)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_basic() {
let counter = MicroSecondsCounterU32::default();
counter.open().unwrap();
counter.add(Duration::from_micros(23)).unwrap();
let res = counter
.close_and_checked_sub_from(Duration::from_micros(42))
.unwrap();
assert_eq!(res, Duration::from_micros(42 - 23));
}
}

View File

@@ -932,6 +932,59 @@ paths:
schema:
$ref: "#/components/schemas/ServiceUnavailableError"
/v1/tenant/{tenant_shard_id}/heatmap_upload:
parameters:
- name: tenant_shard_id
in: path
required: true
schema:
type: string
post:
description: |
If the location is in an attached mode, upload the current state to the remote heatmap
responses:
"200":
description: Success
"500":
description: Generic operation error
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
"503":
description: Temporarily unavailable, please retry.
content:
application/json:
schema:
$ref: "#/components/schemas/ServiceUnavailableError"
/v1/tenant/{tenant_shard_id}/secondary/download:
parameters:
- name: tenant_shard_id
in: path
required: true
schema:
type: string
post:
description: |
If the location is in secondary mode, download latest heatmap and layers
responses:
"200":
description: Success
"500":
description: Generic operation error
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
"503":
description: Temporarily unavailable, please retry.
content:
application/json:
schema:
$ref: "#/components/schemas/ServiceUnavailableError"
/v1/tenant/{tenant_id}/timeline/:
parameters:
@@ -1339,6 +1392,10 @@ components:
type: array
items:
$ref: "#/components/schemas/TenantShardLocation"
stripe_size:
description: If multiple shards are present, this field contains the sharding stripe size, else it is null.
type: integer
nullable: true
TenantShardLocation:
type: object
required:
@@ -1387,7 +1444,7 @@ components:
trace_read_requests:
type: boolean
heatmap_period:
type: integer
type: string
TenantConfigResponse:
type: object
properties:

View File

@@ -14,6 +14,7 @@ use hyper::header;
use hyper::StatusCode;
use hyper::{Body, Request, Response, Uri};
use metrics::launch_timestamp::LaunchTimestamp;
use pageserver_api::models::LocationConfig;
use pageserver_api::models::LocationConfigListResponse;
use pageserver_api::models::ShardParameters;
use pageserver_api::models::TenantDetails;
@@ -1451,11 +1452,12 @@ async fn put_tenant_location_config_handler(
tenant::SpawnMode::Eager
};
let attached = state
let tenant = state
.tenant_manager
.upsert_location(tenant_shard_id, location_conf, flush, spawn_mode, &ctx)
.await?
.is_some();
.await?;
let stripe_size = tenant.as_ref().map(|t| t.get_shard_stripe_size());
let attached = tenant.is_some();
if let Some(_flush_ms) = flush {
match state
@@ -1477,12 +1479,20 @@ async fn put_tenant_location_config_handler(
// This API returns a vector of pageservers where the tenant is attached: this is
// primarily for use in the sharding service. For compatibilty, we also return this
// when called directly on a pageserver, but the payload is always zero or one shards.
let mut response = TenantLocationConfigResponse { shards: Vec::new() };
let mut response = TenantLocationConfigResponse {
shards: Vec::new(),
stripe_size: None,
};
if attached {
response.shards.push(TenantShardLocation {
shard_id: tenant_shard_id,
node_id: state.conf.id,
})
});
if tenant_shard_id.shard_count.count() > 1 {
// Stripe size should be set if we are attached
debug_assert!(stripe_size.is_some());
response.stripe_size = stripe_size;
}
}
json_response(StatusCode::OK, response)
@@ -1510,6 +1520,29 @@ async fn list_location_config_handler(
json_response(StatusCode::OK, result)
}
async fn get_location_config_handler(
request: Request<Body>,
_cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
let state = get_state(&request);
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
let slot = state.tenant_manager.get(tenant_shard_id);
let Some(slot) = slot else {
return Err(ApiError::NotFound(
anyhow::anyhow!("Tenant shard not found").into(),
));
};
let result: Option<LocationConfig> = match slot {
TenantSlot::Attached(t) => Some(t.get_location_conf()),
TenantSlot::Secondary(s) => Some(s.get_location_conf()),
TenantSlot::InProgress(_) => None,
};
json_response(StatusCode::OK, result)
}
// Do a time travel recovery on the given tenant/tenant shard. Tenant needs to be detached
// (from all pageservers) as it invalidates consistency assumptions.
async fn tenant_time_travel_remote_storage_handler(
@@ -2214,6 +2247,9 @@ pub fn make_router(
.get("/v1/location_config", |r| {
api_handler(r, list_location_config_handler)
})
.get("/v1/location_config/:tenant_id", |r| {
api_handler(r, get_location_config_handler)
})
.put(
"/v1/tenant/:tenant_shard_id/time_travel_remote_storage",
|r| api_handler(r, tenant_time_travel_remote_storage_handler),

View File

@@ -1,5 +1,4 @@
use enum_map::EnumMap;
use metrics::metric_vec_duration::DurationResultObserver;
use metrics::{
register_counter_vec, register_gauge_vec, register_histogram, register_histogram_vec,
register_int_counter, register_int_counter_pair_vec, register_int_counter_vec,
@@ -11,6 +10,7 @@ use once_cell::sync::Lazy;
use pageserver_api::shard::TenantShardId;
use strum::{EnumCount, IntoEnumIterator, VariantNames};
use strum_macros::{EnumVariantNames, IntoStaticStr};
use tracing::warn;
use utils::id::TimelineId;
/// Prometheus histogram buckets (in seconds) for operations in the critical
@@ -1005,15 +1005,39 @@ impl GlobalAndPerTimelineHistogram {
}
}
struct GlobalAndPerTimelineHistogramTimer<'a> {
struct GlobalAndPerTimelineHistogramTimer<'a, 'c> {
h: &'a GlobalAndPerTimelineHistogram,
ctx: &'c RequestContext,
start: std::time::Instant,
op: SmgrQueryType,
}
impl<'a> Drop for GlobalAndPerTimelineHistogramTimer<'a> {
impl<'a, 'c> Drop for GlobalAndPerTimelineHistogramTimer<'a, 'c> {
fn drop(&mut self) {
let elapsed = self.start.elapsed();
self.h.observe(elapsed.as_secs_f64());
let ex_throttled = self
.ctx
.micros_spent_throttled
.close_and_checked_sub_from(elapsed);
let ex_throttled = match ex_throttled {
Ok(res) => res,
Err(error) => {
use utils::rate_limit::RateLimit;
static LOGGED: Lazy<Mutex<enum_map::EnumMap<SmgrQueryType, RateLimit>>> =
Lazy::new(|| {
Mutex::new(enum_map::EnumMap::from_array(std::array::from_fn(|_| {
RateLimit::new(Duration::from_secs(10))
})))
});
let mut guard = LOGGED.lock().unwrap();
let rate_limit = &mut guard[self.op];
rate_limit.call(|| {
warn!(op=?self.op, error, "error deducting time spent throttled; this message is logged at a global rate limit");
});
elapsed
}
};
self.h.observe(ex_throttled.as_secs_f64());
}
}
@@ -1025,6 +1049,7 @@ impl<'a> Drop for GlobalAndPerTimelineHistogramTimer<'a> {
strum_macros::EnumCount,
strum_macros::EnumIter,
strum_macros::FromRepr,
enum_map::Enum,
)]
#[strum(serialize_all = "snake_case")]
pub enum SmgrQueryType {
@@ -1130,11 +1155,35 @@ impl SmgrQueryTimePerTimeline {
});
Self { metrics }
}
pub(crate) fn start_timer(&self, op: SmgrQueryType) -> impl Drop + '_ {
pub(crate) fn start_timer<'c: 'a, 'a>(
&'a self,
op: SmgrQueryType,
ctx: &'c RequestContext,
) -> impl Drop + '_ {
let metric = &self.metrics[op as usize];
let start = Instant::now();
match ctx.micros_spent_throttled.open() {
Ok(()) => (),
Err(error) => {
use utils::rate_limit::RateLimit;
static LOGGED: Lazy<Mutex<enum_map::EnumMap<SmgrQueryType, RateLimit>>> =
Lazy::new(|| {
Mutex::new(enum_map::EnumMap::from_array(std::array::from_fn(|_| {
RateLimit::new(Duration::from_secs(10))
})))
});
let mut guard = LOGGED.lock().unwrap();
let rate_limit = &mut guard[op];
rate_limit.call(|| {
warn!(?op, error, "error opening micros_spent_throttled; this message is logged at a global rate limit");
});
}
}
GlobalAndPerTimelineHistogramTimer {
h: metric,
start: std::time::Instant::now(),
ctx,
start,
op,
}
}
}
@@ -1145,6 +1194,11 @@ mod smgr_query_time_tests {
use strum::IntoEnumIterator;
use utils::id::{TenantId, TimelineId};
use crate::{
context::{DownloadBehavior, RequestContext},
task_mgr::TaskKind,
};
// Regression test, we used hard-coded string constants before using an enum.
#[test]
fn op_label_name() {
@@ -1193,7 +1247,8 @@ mod smgr_query_time_tests {
let (pre_global, pre_per_tenant_timeline) = get_counts();
assert_eq!(pre_per_tenant_timeline, 0);
let timer = metrics.start_timer(*op);
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Download);
let timer = metrics.start_timer(*op, &ctx);
drop(timer);
let (post_global, post_per_tenant_timeline) = get_counts();
@@ -1227,11 +1282,65 @@ pub(crate) static BASEBACKUP_QUERY_TIME: Lazy<BasebackupQueryTime> = Lazy::new(|
})
});
impl DurationResultObserver for BasebackupQueryTime {
fn observe_result<T, E>(&self, res: &Result<T, E>, duration: std::time::Duration) {
pub(crate) struct BasebackupQueryTimeOngoingRecording<'a, 'c> {
parent: &'a BasebackupQueryTime,
ctx: &'c RequestContext,
start: std::time::Instant,
}
impl BasebackupQueryTime {
pub(crate) fn start_recording<'c: 'a, 'a>(
&'a self,
ctx: &'c RequestContext,
) -> BasebackupQueryTimeOngoingRecording<'_, '_> {
let start = Instant::now();
match ctx.micros_spent_throttled.open() {
Ok(()) => (),
Err(error) => {
use utils::rate_limit::RateLimit;
static LOGGED: Lazy<Mutex<RateLimit>> =
Lazy::new(|| Mutex::new(RateLimit::new(Duration::from_secs(10))));
let mut rate_limit = LOGGED.lock().unwrap();
rate_limit.call(|| {
warn!(error, "error opening micros_spent_throttled; this message is logged at a global rate limit");
});
}
}
BasebackupQueryTimeOngoingRecording {
parent: self,
ctx,
start,
}
}
}
impl<'a, 'c> BasebackupQueryTimeOngoingRecording<'a, 'c> {
pub(crate) fn observe<T, E>(self, res: &Result<T, E>) {
let elapsed = self.start.elapsed();
let ex_throttled = self
.ctx
.micros_spent_throttled
.close_and_checked_sub_from(elapsed);
let ex_throttled = match ex_throttled {
Ok(ex_throttled) => ex_throttled,
Err(error) => {
use utils::rate_limit::RateLimit;
static LOGGED: Lazy<Mutex<RateLimit>> =
Lazy::new(|| Mutex::new(RateLimit::new(Duration::from_secs(10))));
let mut rate_limit = LOGGED.lock().unwrap();
rate_limit.call(|| {
warn!(error, "error deducting time spent throttled; this message is logged at a global rate limit");
});
elapsed
}
};
let label_value = if res.is_ok() { "ok" } else { "error" };
let metric = self.0.get_metric_with_label_values(&[label_value]).unwrap();
metric.observe(duration.as_secs_f64());
let metric = self
.parent
.0
.get_metric_with_label_values(&[label_value])
.unwrap();
metric.observe(ex_throttled.as_secs_f64());
}
}
@@ -1908,10 +2017,8 @@ impl TimelineMetrics {
pub(crate) fn resident_physical_size_get(&self) -> u64 {
self.resident_physical_size_gauge.get()
}
}
impl Drop for TimelineMetrics {
fn drop(&mut self) {
pub(crate) fn shutdown(&self) {
let tenant_id = &self.tenant_id;
let timeline_id = &self.timeline_id;
let shard_id = &self.shard_id;
@@ -2567,6 +2674,12 @@ pub fn preinitialize_metrics() {
Lazy::force(&crate::tenant::storage_layer::layer::LAYER_IMPL_METRICS);
Lazy::force(&disk_usage_based_eviction::METRICS);
for state_name in pageserver_api::models::TenantState::VARIANTS {
// initialize the metric for all gauges, otherwise the time series might seemingly show
// values from last restart.
TENANT_STATE_METRIC.with_label_values(&[state_name]).set(0);
}
// countervecs
[&BACKGROUND_LOOP_PERIOD_OVERRUN_COUNT]
.into_iter()

View File

@@ -910,7 +910,7 @@ impl PageServerHandler {
let timeline = self.get_timeline_shard_zero(tenant_id, timeline_id).await?;
let _timer = timeline
.query_metrics
.start_timer(metrics::SmgrQueryType::GetRelExists);
.start_timer(metrics::SmgrQueryType::GetRelExists, ctx);
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let lsn =
@@ -938,7 +938,7 @@ impl PageServerHandler {
let _timer = timeline
.query_metrics
.start_timer(metrics::SmgrQueryType::GetRelSize);
.start_timer(metrics::SmgrQueryType::GetRelSize, ctx);
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let lsn =
@@ -966,7 +966,7 @@ impl PageServerHandler {
let _timer = timeline
.query_metrics
.start_timer(metrics::SmgrQueryType::GetDbSize);
.start_timer(metrics::SmgrQueryType::GetDbSize, ctx);
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let lsn =
@@ -1144,7 +1144,7 @@ impl PageServerHandler {
let _timer = timeline
.query_metrics
.start_timer(metrics::SmgrQueryType::GetPageAtLsn);
.start_timer(metrics::SmgrQueryType::GetPageAtLsn, ctx);
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let lsn =
@@ -1172,7 +1172,7 @@ impl PageServerHandler {
let _timer = timeline
.query_metrics
.start_timer(metrics::SmgrQueryType::GetSlruSegment);
.start_timer(metrics::SmgrQueryType::GetSlruSegment, ctx);
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let lsn =
@@ -1199,7 +1199,7 @@ impl PageServerHandler {
prev_lsn: Option<Lsn>,
full_backup: bool,
gzip: bool,
ctx: RequestContext,
ctx: &RequestContext,
) -> Result<(), QueryError>
where
IO: AsyncRead + AsyncWrite + Send + Sync + Unpin,
@@ -1214,7 +1214,7 @@ impl PageServerHandler {
if let Some(lsn) = lsn {
// Backup was requested at a particular LSN. Wait for it to arrive.
info!("waiting for {}", lsn);
timeline.wait_lsn(lsn, &ctx).await?;
timeline.wait_lsn(lsn, ctx).await?;
timeline
.check_lsn_is_in_scope(lsn, &latest_gc_cutoff_lsn)
.context("invalid basebackup lsn")?;
@@ -1236,7 +1236,7 @@ impl PageServerHandler {
lsn,
prev_lsn,
full_backup,
&ctx,
ctx,
)
.await?;
} else {
@@ -1257,7 +1257,7 @@ impl PageServerHandler {
lsn,
prev_lsn,
full_backup,
&ctx,
ctx,
)
.await?;
// shutdown the encoder to ensure the gzip footer is written
@@ -1269,7 +1269,7 @@ impl PageServerHandler {
lsn,
prev_lsn,
full_backup,
&ctx,
ctx,
)
.await?;
}
@@ -1449,25 +1449,25 @@ where
false
};
::metrics::metric_vec_duration::observe_async_block_duration_by_result(
&*metrics::BASEBACKUP_QUERY_TIME,
async move {
self.handle_basebackup_request(
pgb,
tenant_id,
timeline_id,
lsn,
None,
false,
gzip,
ctx,
)
.await?;
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
Result::<(), QueryError>::Ok(())
},
)
.await?;
let metric_recording = metrics::BASEBACKUP_QUERY_TIME.start_recording(&ctx);
let res = async {
self.handle_basebackup_request(
pgb,
tenant_id,
timeline_id,
lsn,
None,
false,
gzip,
&ctx,
)
.await?;
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
Result::<(), QueryError>::Ok(())
}
.await;
metric_recording.observe(&res);
res?;
}
// return pair of prev_lsn and last_lsn
else if query_string.starts_with("get_last_record_rlsn ") {
@@ -1563,7 +1563,7 @@ where
prev_lsn,
true,
false,
ctx,
&ctx,
)
.await?;
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;

View File

@@ -15,6 +15,7 @@ use crate::walrecord::NeonWalRecord;
use anyhow::{ensure, Context};
use bytes::{Buf, Bytes, BytesMut};
use enum_map::Enum;
use itertools::Itertools;
use pageserver_api::key::{
dbdir_key_range, is_rel_block_key, is_slru_block_key, rel_block_to_key, rel_dir_to_key,
rel_key_range, rel_size_to_key, relmap_file_key, slru_block_to_key, slru_dir_to_key,
@@ -1498,7 +1499,7 @@ impl<'a> DatadirModification<'a> {
return Ok(());
}
let writer = self.tline.writer().await;
let mut writer = self.tline.writer().await;
// Flush relation and SLRU data blocks, keep metadata.
let mut retained_pending_updates = HashMap::<_, Vec<_>>::new();
@@ -1537,14 +1538,22 @@ impl<'a> DatadirModification<'a> {
/// All the modifications in this atomic update are stamped by the specified LSN.
///
pub async fn commit(&mut self, ctx: &RequestContext) -> anyhow::Result<()> {
let writer = self.tline.writer().await;
let mut writer = self.tline.writer().await;
let pending_nblocks = self.pending_nblocks;
self.pending_nblocks = 0;
if !self.pending_updates.is_empty() {
writer.put_batch(&self.pending_updates, ctx).await?;
self.pending_updates.clear();
// The put_batch call below expects expects the inputs to be sorted by Lsn,
// so we do that first.
let lsn_ordered_batch: Vec<(Key, Lsn, Value)> = self
.pending_updates
.drain()
.map(|(key, vals)| vals.into_iter().map(move |(lsn, val)| (key, lsn, val)))
.kmerge_by(|lhs, rhs| lhs.1 .0 < rhs.1 .0)
.collect();
writer.put_batch(lsn_ordered_batch, ctx).await?;
}
if !self.pending_deletions.is_empty() {
@@ -1677,7 +1686,7 @@ struct RelDirectory {
rels: HashSet<(Oid, u8)>,
}
#[derive(Debug, Serialize, Deserialize, Default)]
#[derive(Debug, Serialize, Deserialize, Default, PartialEq)]
pub(crate) struct AuxFilesDirectory {
pub(crate) files: HashMap<String, Bytes>,
}

View File

@@ -272,9 +272,6 @@ pub enum TaskKind {
// Task that uploads a file to remote storage
RemoteUploadTask,
// Task that downloads a file from remote storage
RemoteDownloadTask,
// task that handles the initial downloading of all tenants
InitialLoad,

View File

@@ -22,6 +22,7 @@ use pageserver_api::models;
use pageserver_api::models::TimelineState;
use pageserver_api::models::WalRedoManagerStatus;
use pageserver_api::shard::ShardIdentity;
use pageserver_api::shard::ShardStripeSize;
use pageserver_api::shard::TenantShardId;
use remote_storage::DownloadError;
use remote_storage::GenericRemoteStorage;
@@ -1845,6 +1846,8 @@ impl Tenant {
// Wait for any in-flight operations to complete
self.gate.close().await;
remove_tenant_metrics(&self.tenant_shard_id);
Ok(())
}
@@ -2086,6 +2089,10 @@ impl Tenant {
&self.tenant_shard_id
}
pub(crate) fn get_shard_stripe_size(&self) -> ShardStripeSize {
self.shard_identity.stripe_size
}
pub(crate) fn get_generation(&self) -> Generation {
self.generation
}
@@ -3552,11 +3559,6 @@ async fn run_initdb(
Ok(())
}
impl Drop for Tenant {
fn drop(&mut self) {
remove_tenant_metrics(&self.tenant_shard_id);
}
}
/// Dump contents of a layer file to stdout.
pub async fn dump_layerfile_from_path(
path: &Utf8Path,
@@ -3674,7 +3676,10 @@ pub(crate) mod harness {
}
impl TenantHarness {
pub fn create(test_name: &'static str) -> anyhow::Result<Self> {
pub fn create_custom(
test_name: &'static str,
tenant_conf: TenantConf,
) -> anyhow::Result<Self> {
setup_logging();
let repo_dir = PageServerConf::test_repo_dir(test_name);
@@ -3686,14 +3691,6 @@ pub(crate) mod harness {
// OK in a test.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
// Disable automatic GC and compaction to make the unit tests more deterministic.
// The tests perform them manually if needed.
let tenant_conf = TenantConf {
gc_period: Duration::ZERO,
compaction_period: Duration::ZERO,
..TenantConf::default()
};
let tenant_id = TenantId::generate();
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
fs::create_dir_all(conf.tenant_path(&tenant_shard_id))?;
@@ -3721,6 +3718,18 @@ pub(crate) mod harness {
})
}
pub fn create(test_name: &'static str) -> anyhow::Result<Self> {
// Disable automatic GC and compaction to make the unit tests more deterministic.
// The tests perform them manually if needed.
let tenant_conf = TenantConf {
gc_period: Duration::ZERO,
compaction_period: Duration::ZERO,
..TenantConf::default()
};
Self::create_custom(test_name, tenant_conf)
}
pub fn span(&self) -> tracing::Span {
info_span!("TenantHarness", tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug())
}
@@ -3828,6 +3837,7 @@ mod tests {
use crate::keyspace::KeySpaceAccum;
use crate::repository::{Key, Value};
use crate::tenant::harness::*;
use crate::tenant::timeline::CompactFlags;
use crate::DEFAULT_PG_VERSION;
use bytes::BytesMut;
use hex_literal::hex;
@@ -3844,7 +3854,7 @@ mod tests {
.create_test_timeline(TIMELINE_ID, Lsn(0x08), DEFAULT_PG_VERSION, &ctx)
.await?;
let writer = tline.writer().await;
let mut writer = tline.writer().await;
writer
.put(
*TEST_KEY,
@@ -3856,7 +3866,7 @@ mod tests {
writer.finish_write(Lsn(0x10));
drop(writer);
let writer = tline.writer().await;
let mut writer = tline.writer().await;
writer
.put(
*TEST_KEY,
@@ -3922,7 +3932,7 @@ mod tests {
let tline = tenant
.create_test_timeline(TIMELINE_ID, Lsn(0x10), DEFAULT_PG_VERSION, &ctx)
.await?;
let writer = tline.writer().await;
let mut writer = tline.writer().await;
#[allow(non_snake_case)]
let TEST_KEY_A: Key = Key::from_hex("110000000033333333444444445500000001").unwrap();
@@ -3956,7 +3966,7 @@ mod tests {
let newtline = tenant
.get_timeline(NEW_TIMELINE_ID, true)
.expect("Should have a local timeline");
let new_writer = newtline.writer().await;
let mut new_writer = newtline.writer().await;
new_writer
.put(TEST_KEY_A, Lsn(0x40), &test_value("bar at 0x40"), &ctx)
.await?;
@@ -3988,7 +3998,7 @@ mod tests {
) -> anyhow::Result<()> {
let mut lsn = start_lsn;
{
let writer = tline.writer().await;
let mut writer = tline.writer().await;
// Create a relation on the timeline
writer
.put(
@@ -4013,7 +4023,7 @@ mod tests {
}
tline.freeze_and_flush().await?;
{
let writer = tline.writer().await;
let mut writer = tline.writer().await;
writer
.put(
*TEST_KEY,
@@ -4376,7 +4386,7 @@ mod tests {
.create_test_timeline(TIMELINE_ID, Lsn(0x08), DEFAULT_PG_VERSION, &ctx)
.await?;
let writer = tline.writer().await;
let mut writer = tline.writer().await;
writer
.put(
*TEST_KEY,
@@ -4393,7 +4403,7 @@ mod tests {
.compact(&CancellationToken::new(), EnumSet::empty(), &ctx)
.await?;
let writer = tline.writer().await;
let mut writer = tline.writer().await;
writer
.put(
*TEST_KEY,
@@ -4410,7 +4420,7 @@ mod tests {
.compact(&CancellationToken::new(), EnumSet::empty(), &ctx)
.await?;
let writer = tline.writer().await;
let mut writer = tline.writer().await;
writer
.put(
*TEST_KEY,
@@ -4427,7 +4437,7 @@ mod tests {
.compact(&CancellationToken::new(), EnumSet::empty(), &ctx)
.await?;
let writer = tline.writer().await;
let mut writer = tline.writer().await;
writer
.put(
*TEST_KEY,
@@ -4484,7 +4494,7 @@ mod tests {
for _ in 0..repeat {
for _ in 0..key_count {
test_key.field6 = blknum;
let writer = timeline.writer().await;
let mut writer = timeline.writer().await;
writer
.put(
test_key,
@@ -4632,6 +4642,145 @@ mod tests {
Ok(())
}
// Test that vectored get handles layer gaps correctly
// by advancing into the next ancestor timeline if required.
//
// The test generates timelines that look like the diagram below.
// We leave a gap in one of the L1 layers at `gap_at_key` (`/` in the diagram).
// The reconstruct data for that key lies in the ancestor timeline (`X` in the diagram).
//
// ```
//-------------------------------+
// ... |
// [ L1 ] |
// [ / L1 ] | Child Timeline
// ... |
// ------------------------------+
// [ X L1 ] | Parent Timeline
// ------------------------------+
// ```
#[tokio::test]
async fn test_get_vectored_key_gap() -> anyhow::Result<()> {
let tenant_conf = TenantConf {
// Make compaction deterministic
gc_period: Duration::ZERO,
compaction_period: Duration::ZERO,
// Encourage creation of L1 layers
checkpoint_distance: 16 * 1024,
compaction_target_size: 8 * 1024,
..TenantConf::default()
};
let harness = TenantHarness::create_custom("test_get_vectored_key_gap", tenant_conf)?;
let (tenant, ctx) = harness.load().await;
let mut current_key = Key::from_hex("010000000033333333444444445500000000").unwrap();
let gap_at_key = current_key.add(100);
let mut current_lsn = Lsn(0x10);
const KEY_COUNT: usize = 10_000;
let timeline_id = TimelineId::generate();
let current_timeline = tenant
.create_test_timeline(timeline_id, current_lsn, DEFAULT_PG_VERSION, &ctx)
.await?;
current_lsn += 0x100;
let mut writer = current_timeline.writer().await;
writer
.put(
gap_at_key,
current_lsn,
&Value::Image(test_img(&format!("{} at {}", gap_at_key, current_lsn))),
&ctx,
)
.await?;
writer.finish_write(current_lsn);
drop(writer);
let mut latest_lsns = HashMap::new();
latest_lsns.insert(gap_at_key, current_lsn);
current_timeline.freeze_and_flush().await?;
let child_timeline_id = TimelineId::generate();
tenant
.branch_timeline_test(
&current_timeline,
child_timeline_id,
Some(current_lsn),
&ctx,
)
.await?;
let child_timeline = tenant
.get_timeline(child_timeline_id, true)
.expect("Should have the branched timeline");
for i in 0..KEY_COUNT {
if current_key == gap_at_key {
current_key = current_key.next();
continue;
}
current_lsn += 0x10;
let mut writer = child_timeline.writer().await;
writer
.put(
current_key,
current_lsn,
&Value::Image(test_img(&format!("{} at {}", current_key, current_lsn))),
&ctx,
)
.await?;
writer.finish_write(current_lsn);
drop(writer);
latest_lsns.insert(current_key, current_lsn);
current_key = current_key.next();
// Flush every now and then to encourage layer file creation.
if i % 500 == 0 {
child_timeline.freeze_and_flush().await?;
}
}
child_timeline.freeze_and_flush().await?;
let mut flags = EnumSet::new();
flags.insert(CompactFlags::ForceRepartition);
child_timeline
.compact(&CancellationToken::new(), flags, &ctx)
.await?;
let key_near_end = {
let mut tmp = current_key;
tmp.field6 -= 10;
tmp
};
let key_near_gap = {
let mut tmp = gap_at_key;
tmp.field6 -= 10;
tmp
};
let read = KeySpace {
ranges: vec![key_near_gap..gap_at_key.next(), key_near_end..current_key],
};
let results = child_timeline
.get_vectored_impl(read.clone(), current_lsn, &ctx)
.await?;
for (key, img_res) in results {
let expected = test_img(&format!("{} at {}", key, latest_lsns[&key]));
assert_eq!(img_res?, expected);
}
Ok(())
}
#[tokio::test]
async fn test_random_updates() -> anyhow::Result<()> {
let harness = TenantHarness::create("test_random_updates")?;
@@ -4655,7 +4804,7 @@ mod tests {
for blknum in 0..NUM_KEYS {
lsn = Lsn(lsn.0 + 0x10);
test_key.field6 = blknum as u32;
let writer = tline.writer().await;
let mut writer = tline.writer().await;
writer
.put(
test_key,
@@ -4676,7 +4825,7 @@ mod tests {
lsn = Lsn(lsn.0 + 0x10);
let blknum = thread_rng().gen_range(0..NUM_KEYS);
test_key.field6 = blknum as u32;
let writer = tline.writer().await;
let mut writer = tline.writer().await;
writer
.put(
test_key,
@@ -4744,7 +4893,7 @@ mod tests {
for blknum in 0..NUM_KEYS {
lsn = Lsn(lsn.0 + 0x10);
test_key.field6 = blknum as u32;
let writer = tline.writer().await;
let mut writer = tline.writer().await;
writer
.put(
test_key,
@@ -4773,7 +4922,7 @@ mod tests {
lsn = Lsn(lsn.0 + 0x10);
let blknum = thread_rng().gen_range(0..NUM_KEYS);
test_key.field6 = blknum as u32;
let writer = tline.writer().await;
let mut writer = tline.writer().await;
writer
.put(
test_key,
@@ -4850,7 +4999,7 @@ mod tests {
lsn = Lsn(lsn.0 + 0x10);
let blknum = thread_rng().gen_range(0..NUM_KEYS);
test_key.field6 = blknum as u32;
let writer = tline.writer().await;
let mut writer = tline.writer().await;
writer
.put(
test_key,

View File

@@ -12,7 +12,7 @@
//! len >= 128: 1XXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
//!
use bytes::{BufMut, BytesMut};
use tokio_epoll_uring::{BoundedBuf, Slice};
use tokio_epoll_uring::{BoundedBuf, IoBuf, Slice};
use crate::context::RequestContext;
use crate::page_cache::PAGE_SZ;
@@ -127,7 +127,7 @@ impl<const BUFFERED: bool> BlobWriter<BUFFERED> {
/// You need to make sure that the internal buffer is empty, otherwise
/// data will be written in wrong order.
#[inline(always)]
async fn write_all_unbuffered<B: BoundedBuf>(
async fn write_all_unbuffered<B: BoundedBuf<Buf = Buf>, Buf: IoBuf + Send>(
&mut self,
src_buf: B,
) -> (B::Buf, Result<(), Error>) {
@@ -162,7 +162,10 @@ impl<const BUFFERED: bool> BlobWriter<BUFFERED> {
}
/// Internal, possibly buffered, write function
async fn write_all<B: BoundedBuf>(&mut self, src_buf: B) -> (B::Buf, Result<(), Error>) {
async fn write_all<B: BoundedBuf<Buf = Buf>, Buf: IoBuf + Send>(
&mut self,
src_buf: B,
) -> (B::Buf, Result<(), Error>) {
if !BUFFERED {
assert!(self.buf.is_empty());
return self.write_all_unbuffered(src_buf).await;
@@ -210,7 +213,10 @@ impl<const BUFFERED: bool> BlobWriter<BUFFERED> {
/// Write a blob of data. Returns the offset that it was written to,
/// which can be used to retrieve the data later.
pub async fn write_blob<B: BoundedBuf>(&mut self, srcbuf: B) -> (B::Buf, Result<u64, Error>) {
pub async fn write_blob<B: BoundedBuf<Buf = Buf>, Buf: IoBuf + Send>(
&mut self,
srcbuf: B,
) -> (B::Buf, Result<u64, Error>) {
let offset = self.offset;
let len = srcbuf.bytes_init();

View File

@@ -354,6 +354,7 @@ pub struct TenantConf {
/// If non-zero, the period between uploads of a heatmap from attached tenants. This
/// may be disabled if a Tenant will not have secondary locations: only secondary
/// locations will use the heatmap uploaded by attached locations.
#[serde(with = "humantime_serde")]
pub heatmap_period: Duration,
/// If true then SLRU segments are dowloaded on demand, if false SLRU segments are included in basebackup

View File

@@ -18,10 +18,19 @@
//! - An Iterator interface would be more convenient for the callers than the
//! 'visit' function
//!
use async_stream::try_stream;
use byteorder::{ReadBytesExt, BE};
use bytes::{BufMut, Bytes, BytesMut};
use either::Either;
use std::{cmp::Ordering, io, result};
use futures::Stream;
use hex;
use std::{
cmp::Ordering,
io,
iter::Rev,
ops::{Range, RangeInclusive},
result,
};
use thiserror::Error;
use tracing::error;
@@ -250,6 +259,90 @@ where
Ok(result)
}
/// Return a stream which yields all key, value pairs from the index
/// starting from the first key greater or equal to `start_key`.
///
/// Note that this is a copy of [`Self::visit`].
/// TODO: Once the sequential read path is removed this will become
/// the only index traversal method.
pub fn get_stream_from<'a>(
&'a self,
start_key: &'a [u8; L],
ctx: &'a RequestContext,
) -> impl Stream<Item = std::result::Result<(Vec<u8>, u64), DiskBtreeError>> + 'a {
try_stream! {
let mut stack = Vec::new();
stack.push((self.root_blk, None));
let block_cursor = self.reader.block_cursor();
while let Some((node_blknum, opt_iter)) = stack.pop() {
// Locate the node.
let node_buf = block_cursor
.read_blk(self.start_blk + node_blknum, ctx)
.await?;
let node = OnDiskNode::deparse(node_buf.as_ref())?;
let prefix_len = node.prefix_len as usize;
let suffix_len = node.suffix_len as usize;
assert!(node.num_children > 0);
let mut keybuf = Vec::new();
keybuf.extend(node.prefix);
keybuf.resize(prefix_len + suffix_len, 0);
let mut iter: Either<Range<usize>, Rev<RangeInclusive<usize>>> = if let Some(iter) = opt_iter {
iter
} else {
// Locate the first match
let idx = match node.binary_search(start_key, keybuf.as_mut_slice()) {
Ok(idx) => idx,
Err(idx) => {
if node.level == 0 {
// Imagine that the node contains the following keys:
//
// 1
// 3 <-- idx
// 5
//
// If the search key is '2' and there is exact match,
// the binary search would return the index of key
// '3'. That's cool, '3' is the first key to return.
idx
} else {
// This is an internal page, so each key represents a lower
// bound for what's in the child page. If there is no exact
// match, we have to return the *previous* entry.
//
// 1 <-- return this
// 3 <-- idx
// 5
idx.saturating_sub(1)
}
}
};
Either::Left(idx..node.num_children.into())
};
// idx points to the first match now. Keep going from there
while let Some(idx) = iter.next() {
let key_off = idx * suffix_len;
let suffix = &node.keys[key_off..key_off + suffix_len];
keybuf[prefix_len..].copy_from_slice(suffix);
let value = node.value(idx);
#[allow(clippy::collapsible_if)]
if node.level == 0 {
// leaf
yield (keybuf.clone(), value.to_u64());
} else {
stack.push((node_blknum, Some(iter)));
stack.push((value.to_blknum(), None));
break;
}
}
}
}
}
///
/// Scan the tree, starting from 'search_key', in the given direction. 'visitor'
/// will be called for every key >= 'search_key' (or <= 'search_key', if scanning

View File

@@ -460,15 +460,22 @@ impl LayerMap {
}
}
pub fn range_search(&self, key_range: Range<Key>, end_lsn: Lsn) -> Option<RangeSearchResult> {
let version = self.historic.get().unwrap().get_version(end_lsn.0 - 1)?;
pub fn range_search(&self, key_range: Range<Key>, end_lsn: Lsn) -> RangeSearchResult {
let version = match self.historic.get().unwrap().get_version(end_lsn.0 - 1) {
Some(version) => version,
None => {
let mut result = RangeSearchResult::new();
result.not_found.add_range(key_range);
return result;
}
};
let raw_range = key_range.start.to_i128()..key_range.end.to_i128();
let delta_changes = version.delta_coverage.range_overlaps(&raw_range);
let image_changes = version.image_coverage.range_overlaps(&raw_range);
let collector = RangeSearchCollector::new(key_range, end_lsn, delta_changes, image_changes);
Some(collector.collect())
collector.collect()
}
/// Start a batch of updates, applied on drop
@@ -995,8 +1002,13 @@ mod tests {
let layer_map = LayerMap::default();
let range = Key::from_i128(100)..Key::from_i128(200);
let res = layer_map.range_search(range, Lsn(100));
assert!(res.is_none());
let res = layer_map.range_search(range.clone(), Lsn(100));
assert_eq!(
res.not_found.to_keyspace(),
KeySpace {
ranges: vec![range]
}
);
}
#[test]
@@ -1033,7 +1045,7 @@ mod tests {
for start in 0..60 {
for end in (start + 1)..60 {
let range = Key::from_i128(start)..Key::from_i128(end);
let result = layer_map.range_search(range.clone(), Lsn(100)).unwrap();
let result = layer_map.range_search(range.clone(), Lsn(100));
let expected = brute_force_range_search(&layer_map, range, Lsn(100));
assert_range_search_result_eq(result, expected);

View File

@@ -1358,6 +1358,16 @@ impl TenantManager {
}
}
pub(crate) fn get(&self, tenant_shard_id: TenantShardId) -> Option<TenantSlot> {
let locked = self.tenants.read().unwrap();
match &*locked {
TenantsMap::Initializing => None,
TenantsMap::Open(map) | TenantsMap::ShuttingDown(map) => {
map.get(&tenant_shard_id).cloned()
}
}
}
pub(crate) async fn delete_tenant(
&self,
tenant_shard_id: TenantShardId,

View File

@@ -14,14 +14,14 @@ use tokio::io::{AsyncSeekExt, AsyncWriteExt};
use tokio_util::io::StreamReader;
use tokio_util::sync::CancellationToken;
use tracing::warn;
use utils::{backoff, crashsafe};
use utils::backoff;
use crate::config::PageServerConf;
use crate::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::tenant::remote_timeline_client::{remote_layer_path, remote_timelines_path};
use crate::tenant::storage_layer::LayerFileName;
use crate::tenant::Generation;
use crate::virtual_file::on_fatal_io_error;
use crate::virtual_file::{on_fatal_io_error, MaybeFatalIo, VirtualFile};
use crate::TEMP_FILE_SUFFIX;
use remote_storage::{DownloadError, GenericRemoteStorage, ListingMode};
use utils::crashsafe::path_with_suffix_extension;
@@ -50,9 +50,8 @@ pub async fn download_layer_file<'a>(
) -> Result<u64, DownloadError> {
debug_assert_current_span_has_tenant_and_timeline_id();
let local_path = conf
.timeline_path(&tenant_shard_id, &timeline_id)
.join(layer_file_name.file_name());
let timeline_path = conf.timeline_path(&tenant_shard_id, &timeline_id);
let local_path = timeline_path.join(layer_file_name.file_name());
let remote_path = remote_layer_path(
&tenant_shard_id.tenant_id,
@@ -149,10 +148,21 @@ pub async fn download_layer_file<'a>(
.with_context(|| format!("rename download layer file to {local_path}"))
.map_err(DownloadError::Other)?;
crashsafe::fsync_async(&local_path)
.await
.with_context(|| format!("fsync layer file {local_path}"))
.map_err(DownloadError::Other)?;
// We use fatal_err() below because the after the rename above,
// the in-memory state of the filesystem already has the layer file in its final place,
// and subsequent pageserver code could think it's durable while it really isn't.
let work = async move {
let timeline_dir = VirtualFile::open(&timeline_path)
.await
.fatal_err("VirtualFile::open for timeline dir fsync");
timeline_dir
.sync_all()
.await
.fatal_err("VirtualFile::sync_all timeline dir");
};
crate::virtual_file::io_engine::get()
.spawn_blocking_and_block_on_if_std(work)
.await;
tracing::debug!("download complete: {local_path}");

View File

@@ -46,6 +46,7 @@ use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION};
use anyhow::{anyhow, bail, ensure, Context, Result};
use bytes::BytesMut;
use camino::{Utf8Path, Utf8PathBuf};
use futures::StreamExt;
use pageserver_api::keyspace::KeySpace;
use pageserver_api::models::LayerAccessKind;
use pageserver_api::shard::TenantShardId;
@@ -847,10 +848,33 @@ impl DeltaLayerInner {
reconstruct_state: &mut ValuesReconstructState,
ctx: &RequestContext,
) -> Result<(), GetVectoredError> {
let reads = self
.plan_reads(keyspace, lsn_range, reconstruct_state, ctx)
.await
.map_err(GetVectoredError::Other)?;
let block_reader = FileBlockReader::new(&self.file, self.file_id);
let index_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
self.index_start_blk,
self.index_root_blk,
block_reader,
);
let planner = VectoredReadPlanner::new(
self.max_vectored_read_bytes
.expect("Layer is loaded with max vectored bytes config")
.0
.into(),
);
let data_end_offset = self.index_start_blk as u64 * PAGE_SZ as u64;
let reads = Self::plan_reads(
keyspace,
lsn_range,
data_end_offset,
index_reader,
planner,
reconstruct_state,
ctx,
)
.await
.map_err(GetVectoredError::Other)?;
self.do_reads_and_update_state(reads, reconstruct_state)
.await;
@@ -858,73 +882,64 @@ impl DeltaLayerInner {
Ok(())
}
async fn plan_reads(
&self,
async fn plan_reads<Reader>(
keyspace: KeySpace,
lsn_range: Range<Lsn>,
data_end_offset: u64,
index_reader: DiskBtreeReader<Reader, DELTA_KEY_SIZE>,
mut planner: VectoredReadPlanner,
reconstruct_state: &mut ValuesReconstructState,
ctx: &RequestContext,
) -> anyhow::Result<Vec<VectoredRead>> {
let mut planner = VectoredReadPlanner::new(
self.max_vectored_read_bytes
.expect("Layer is loaded with max vectored bytes config")
.0
.into(),
);
let block_reader = FileBlockReader::new(&self.file, self.file_id);
let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
self.index_start_blk,
self.index_root_blk,
block_reader,
);
) -> anyhow::Result<Vec<VectoredRead>>
where
Reader: BlockReader,
{
let ctx = RequestContextBuilder::extend(ctx)
.page_content_kind(PageContentKind::DeltaLayerBtreeNode)
.build();
for range in keyspace.ranges.iter() {
let mut range_end_handled = false;
let start_key = DeltaKey::from_key_lsn(&range.start, lsn_range.start);
tree_reader
.visit(
&start_key.0,
VisitDirection::Forwards,
|raw_key, value| {
let key = Key::from_slice(&raw_key[..KEY_SIZE]);
let lsn = DeltaKey::extract_lsn_from_buf(raw_key);
let blob_ref = BlobRef(value);
let index_stream = index_reader.get_stream_from(&start_key.0, &ctx);
let mut index_stream = std::pin::pin!(index_stream);
assert!(key >= range.start && lsn >= lsn_range.start);
while let Some(index_entry) = index_stream.next().await {
let (raw_key, value) = index_entry?;
let key = Key::from_slice(&raw_key[..KEY_SIZE]);
let lsn = DeltaKey::extract_lsn_from_buf(&raw_key);
let blob_ref = BlobRef(value);
let cached_lsn = reconstruct_state.get_cached_lsn(&key);
let flag = {
if cached_lsn >= Some(lsn) {
BlobFlag::Ignore
} else if blob_ref.will_init() {
BlobFlag::Replaces
} else {
BlobFlag::None
}
};
// Lsns are not monotonically increasing across keys, so we don't assert on them.
assert!(key >= range.start);
if key >= range.end || (key.next() == range.end && lsn >= lsn_range.end) {
planner.handle_range_end(blob_ref.pos());
range_end_handled = true;
false
} else {
planner.handle(key, lsn, blob_ref.pos(), flag);
true
}
},
&RequestContextBuilder::extend(ctx)
.page_content_kind(PageContentKind::DeltaLayerBtreeNode)
.build(),
)
.await
.map_err(|err| anyhow!(err))?;
let outside_lsn_range = !lsn_range.contains(&lsn);
let below_cached_lsn = reconstruct_state.get_cached_lsn(&key) >= Some(lsn);
let flag = {
if outside_lsn_range || below_cached_lsn {
BlobFlag::Ignore
} else if blob_ref.will_init() {
BlobFlag::ReplaceAll
} else {
// Usual path: add blob to the read
BlobFlag::None
}
};
if key >= range.end || (key.next() == range.end && lsn >= lsn_range.end) {
planner.handle_range_end(blob_ref.pos());
range_end_handled = true;
break;
} else {
planner.handle(key, lsn, blob_ref.pos(), flag);
}
}
if !range_end_handled {
let payload_end = self.index_start_blk as u64 * PAGE_SZ as u64;
tracing::info!("Handling range end fallback at {}", payload_end);
planner.handle_range_end(payload_end);
tracing::info!("Handling range end fallback at {}", data_end_offset);
planner.handle_range_end(data_end_offset);
}
}
@@ -1190,3 +1205,131 @@ impl<'a> pageserver_compaction::interface::CompactionDeltaEntry<'a, Key> for Del
self.size
}
}
#[cfg(test)]
mod test {
use std::collections::BTreeMap;
use super::*;
use crate::{
context::DownloadBehavior, task_mgr::TaskKind, tenant::disk_btree::tests::TestDisk,
};
/// Construct an index for a fictional delta layer and and then
/// traverse in order to plan vectored reads for a query. Finally,
/// verify that the traversal fed the right index key and value
/// pairs into the planner.
#[tokio::test]
async fn test_delta_layer_index_traversal() {
let base_key = Key {
field1: 0,
field2: 1663,
field3: 12972,
field4: 16396,
field5: 0,
field6: 246080,
};
// Populate the index with some entries
let entries: BTreeMap<Key, Vec<Lsn>> = BTreeMap::from([
(base_key, vec![Lsn(1), Lsn(5), Lsn(25), Lsn(26), Lsn(28)]),
(base_key.add(1), vec![Lsn(2), Lsn(5), Lsn(10), Lsn(50)]),
(base_key.add(2), vec![Lsn(2), Lsn(5), Lsn(10), Lsn(50)]),
(base_key.add(5), vec![Lsn(10), Lsn(15), Lsn(16), Lsn(20)]),
]);
let mut disk = TestDisk::default();
let mut writer = DiskBtreeBuilder::<_, DELTA_KEY_SIZE>::new(&mut disk);
let mut disk_offset = 0;
for (key, lsns) in &entries {
for lsn in lsns {
let index_key = DeltaKey::from_key_lsn(key, *lsn);
let blob_ref = BlobRef::new(disk_offset, false);
writer
.append(&index_key.0, blob_ref.0)
.expect("In memory disk append should never fail");
disk_offset += 1;
}
}
// Prepare all the arguments for the call into `plan_reads` below
let (root_offset, _writer) = writer
.finish()
.expect("In memory disk finish should never fail");
let reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(0, root_offset, disk);
let planner = VectoredReadPlanner::new(100);
let mut reconstruct_state = ValuesReconstructState::new();
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let keyspace = KeySpace {
ranges: vec![
base_key..base_key.add(3),
base_key.add(3)..base_key.add(100),
],
};
let lsn_range = Lsn(2)..Lsn(40);
// Plan and validate
let vectored_reads = DeltaLayerInner::plan_reads(
keyspace.clone(),
lsn_range.clone(),
disk_offset,
reader,
planner,
&mut reconstruct_state,
&ctx,
)
.await
.expect("Read planning should not fail");
validate(keyspace, lsn_range, vectored_reads, entries);
}
fn validate(
keyspace: KeySpace,
lsn_range: Range<Lsn>,
vectored_reads: Vec<VectoredRead>,
index_entries: BTreeMap<Key, Vec<Lsn>>,
) {
#[derive(Debug, PartialEq, Eq)]
struct BlobSpec {
key: Key,
lsn: Lsn,
at: u64,
}
let mut planned_blobs = Vec::new();
for read in vectored_reads {
for (at, meta) in read.blobs_at.as_slice() {
planned_blobs.push(BlobSpec {
key: meta.key,
lsn: meta.lsn,
at: *at,
});
}
}
let mut expected_blobs = Vec::new();
let mut disk_offset = 0;
for (key, lsns) in index_entries {
for lsn in lsns {
let key_included = keyspace.ranges.iter().any(|range| range.contains(&key));
let lsn_included = lsn_range.contains(&lsn);
if key_included && lsn_included {
expected_blobs.push(BlobSpec {
key,
lsn,
at: disk_offset,
});
}
disk_offset += 1;
}
}
assert_eq!(planned_blobs, expected_blobs);
}
}

View File

@@ -43,6 +43,7 @@ use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION, TEMP_FILE_SUFFIX};
use anyhow::{anyhow, bail, ensure, Context, Result};
use bytes::{Bytes, BytesMut};
use camino::{Utf8Path, Utf8PathBuf};
use hex;
use pageserver_api::keyspace::KeySpace;
use pageserver_api::models::LayerAccessKind;
use pageserver_api::shard::TenantShardId;
@@ -54,6 +55,7 @@ use std::ops::Range;
use std::os::unix::prelude::FileExt;
use std::sync::Arc;
use tokio::sync::OnceCell;
use tokio_stream::StreamExt;
use tracing::*;
use utils::{
@@ -488,35 +490,33 @@ impl ImageLayerInner {
let tree_reader =
DiskBtreeReader::new(self.index_start_blk, self.index_root_blk, block_reader);
let ctx = RequestContextBuilder::extend(ctx)
.page_content_kind(PageContentKind::ImageLayerBtreeNode)
.build();
for range in keyspace.ranges.iter() {
let mut range_end_handled = false;
let mut search_key: [u8; KEY_SIZE] = [0u8; KEY_SIZE];
range.start.write_to_byte_slice(&mut search_key);
tree_reader
.visit(
&search_key,
VisitDirection::Forwards,
|raw_key, offset| {
let key = Key::from_slice(&raw_key[..KEY_SIZE]);
assert!(key >= range.start);
let index_stream = tree_reader.get_stream_from(&search_key, &ctx);
let mut index_stream = std::pin::pin!(index_stream);
if key >= range.end {
planner.handle_range_end(offset);
range_end_handled = true;
false
} else {
planner.handle(key, self.lsn, offset, BlobFlag::None);
true
}
},
&RequestContextBuilder::extend(ctx)
.page_content_kind(PageContentKind::ImageLayerBtreeNode)
.build(),
)
.await
.map_err(|err| GetVectoredError::Other(anyhow!(err)))?;
while let Some(index_entry) = index_stream.next().await {
let (raw_key, offset) = index_entry?;
let key = Key::from_slice(&raw_key[..KEY_SIZE]);
assert!(key >= range.start);
if key >= range.end {
planner.handle_range_end(offset);
range_end_handled = true;
break;
} else {
planner.handle(key, self.lsn, offset, BlobFlag::None);
}
}
if !range_end_handled {
let payload_end = self.index_start_blk as u64 * PAGE_SZ as u64;

View File

@@ -336,32 +336,17 @@ impl InMemoryLayer {
/// Common subroutine of the public put_wal_record() and put_page_image() functions.
/// Adds the page version to the in-memory tree
pub(crate) async fn put_value(
&self,
key: Key,
lsn: Lsn,
val: &Value,
buf: &[u8],
ctx: &RequestContext,
) -> Result<()> {
let mut inner = self.inner.write().await;
self.assert_writable();
self.put_value_locked(&mut inner, key, lsn, val, ctx).await
}
pub(crate) async fn put_values(
&self,
values: &HashMap<Key, Vec<(Lsn, Value)>>,
ctx: &RequestContext,
) -> Result<()> {
let mut inner = self.inner.write().await;
self.assert_writable();
for (key, vals) in values {
for (lsn, val) in vals {
self.put_value_locked(&mut inner, *key, *lsn, val, ctx)
.await?;
}
}
Ok(())
self.put_value_locked(&mut inner, key, lsn, buf, ctx).await
}
async fn put_value_locked(
@@ -369,22 +354,16 @@ impl InMemoryLayer {
locked_inner: &mut RwLockWriteGuard<'_, InMemoryLayerInner>,
key: Key,
lsn: Lsn,
val: &Value,
buf: &[u8],
ctx: &RequestContext,
) -> Result<()> {
trace!("put_value key {} at {}/{}", key, self.timeline_id, lsn);
let off = {
// Avoid doing allocations for "small" values.
// In the regression test suite, the limit of 256 avoided allocations in 95% of cases:
// https://github.com/neondatabase/neon/pull/5056#discussion_r1301975061
let mut buf = smallvec::SmallVec::<[u8; 256]>::new();
buf.clear();
val.ser_into(&mut buf)?;
locked_inner
.file
.write_blob(
&buf,
buf,
&RequestContextBuilder::extend(ctx)
.page_content_kind(PageContentKind::InMemoryLayer)
.build(),
@@ -412,7 +391,12 @@ impl InMemoryLayer {
pub async fn freeze(&self, end_lsn: Lsn) {
let inner = self.inner.write().await;
assert!(self.start_lsn < end_lsn);
assert!(
self.start_lsn < end_lsn,
"{} >= {}",
self.start_lsn,
end_lsn
);
self.end_lsn.set(end_lsn).expect("end_lsn set only once");
for vec_map in inner.index.values() {

View File

@@ -195,6 +195,7 @@ impl Layer {
let downloaded = resident.expect("just initialized");
// if the rename works, the path is as expected
// TODO: sync system call
std::fs::rename(temp_path, owner.local_path())
.with_context(|| format!("rename temporary file as correct path for {owner}"))?;
@@ -535,6 +536,18 @@ impl Drop for LayerInner {
// carry this until we are finished for [`Layer::wait_drop`] support
let _status = status;
let Some(timeline) = timeline.upgrade() else {
// no need to nag that timeline is gone: under normal situation on
// task_mgr::remove_tenant_from_memory the timeline is gone before we get dropped.
LAYER_IMPL_METRICS.inc_deletes_failed(DeleteFailed::TimelineGone);
return;
};
let Ok(_guard) = timeline.gate.enter() else {
LAYER_IMPL_METRICS.inc_deletes_failed(DeleteFailed::TimelineGone);
return;
};
let removed = match std::fs::remove_file(path) {
Ok(()) => true,
Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
@@ -553,32 +566,26 @@ impl Drop for LayerInner {
}
};
if let Some(timeline) = timeline.upgrade() {
if removed {
timeline.metrics.resident_physical_size_sub(file_size);
}
if let Some(remote_client) = timeline.remote_client.as_ref() {
let res = remote_client.schedule_deletion_of_unlinked(vec![(file_name, meta)]);
if removed {
timeline.metrics.resident_physical_size_sub(file_size);
}
if let Some(remote_client) = timeline.remote_client.as_ref() {
let res = remote_client.schedule_deletion_of_unlinked(vec![(file_name, meta)]);
if let Err(e) = res {
// test_timeline_deletion_with_files_stuck_in_upload_queue is good at
// demonstrating this deadlock (without spawn_blocking): stop will drop
// queued items, which will have ResidentLayer's, and those drops would try
// to re-entrantly lock the RemoteTimelineClient inner state.
if !timeline.is_active() {
tracing::info!("scheduling deletion on drop failed: {e:#}");
} else {
tracing::warn!("scheduling deletion on drop failed: {e:#}");
}
LAYER_IMPL_METRICS.inc_deletes_failed(DeleteFailed::DeleteSchedulingFailed);
if let Err(e) = res {
// test_timeline_deletion_with_files_stuck_in_upload_queue is good at
// demonstrating this deadlock (without spawn_blocking): stop will drop
// queued items, which will have ResidentLayer's, and those drops would try
// to re-entrantly lock the RemoteTimelineClient inner state.
if !timeline.is_active() {
tracing::info!("scheduling deletion on drop failed: {e:#}");
} else {
LAYER_IMPL_METRICS.inc_completed_deletes();
tracing::warn!("scheduling deletion on drop failed: {e:#}");
}
LAYER_IMPL_METRICS.inc_deletes_failed(DeleteFailed::DeleteSchedulingFailed);
} else {
LAYER_IMPL_METRICS.inc_completed_deletes();
}
} else {
// no need to nag that timeline is gone: under normal situation on
// task_mgr::remove_tenant_from_memory the timeline is gone before we get dropped.
LAYER_IMPL_METRICS.inc_deletes_failed(DeleteFailed::TimelineGone);
}
});
}
@@ -879,23 +886,18 @@ impl LayerInner {
) -> Result<heavier_once_cell::InitPermit, DownloadError> {
debug_assert_current_span_has_tenant_and_timeline_id();
let task_name = format!("download layer {}", self);
let (tx, rx) = tokio::sync::oneshot::channel();
// this is sadly needed because of task_mgr::shutdown_tasks, otherwise we cannot
// block tenant::mgr::remove_tenant_from_memory.
let this: Arc<Self> = self.clone();
crate::task_mgr::spawn(
&tokio::runtime::Handle::current(),
crate::task_mgr::TaskKind::RemoteDownloadTask,
Some(self.desc.tenant_shard_id),
Some(self.desc.timeline_id),
&task_name,
false,
async move {
let guard = timeline
.gate
.enter()
.map_err(|_| DownloadError::DownloadCancelled)?;
tokio::task::spawn(async move {
let _guard = guard;
let client = timeline
.remote_client
@@ -905,7 +907,7 @@ impl LayerInner {
let result = client.download_layer_file(
&this.desc.filename(),
&this.metadata(),
&crate::task_mgr::shutdown_token()
&timeline.cancel
)
.await;
@@ -928,7 +930,6 @@ impl LayerInner {
tokio::select! {
_ = tokio::time::sleep(backoff) => {},
_ = crate::task_mgr::shutdown_token().cancelled_owned() => {},
_ = timeline.cancel.cancelled() => {},
};
@@ -958,11 +959,10 @@ impl LayerInner {
}
}
}
Ok(())
}
.in_current_span(),
);
match rx.await {
Ok((Ok(()), permit)) => {
if let Some(reason) = self
@@ -975,7 +975,7 @@ impl LayerInner {
}
self.consecutive_failures.store(0, Ordering::Relaxed);
tracing::info!("on-demand download successful");
tracing::info!(size=%self.desc.file_size, "on-demand download successful");
Ok(permit)
}
@@ -1101,6 +1101,10 @@ impl LayerInner {
return Err(EvictionCancelled::TimelineGone);
};
let Ok(_gate) = timeline.gate.enter() else {
return Err(EvictionCancelled::TimelineGone);
};
// to avoid starting a new download while we evict, keep holding on to the
// permit.
let _permit = {

View File

@@ -101,6 +101,7 @@ pub fn start_background_loops(
_ = completion::Barrier::maybe_wait(background_jobs_can_start) => {}
};
compaction_loop(tenant, cancel)
// If you rename this span, change the RUST_LOG env variable in test_runner/performance/test_branch_creation.py
.instrument(info_span!("compaction_loop", tenant_id = %tenant_shard_id.tenant_id, shard_id = %tenant_shard_id.shard_slug()))
.await;
Ok(())
@@ -198,7 +199,11 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
}
};
warn_when_period_overrun(started_at.elapsed(), period, BackgroundLoopKind::Compaction);
let elapsed = started_at.elapsed();
warn_when_period_overrun(elapsed, period, BackgroundLoopKind::Compaction);
// the duration is recorded by performance tests by enabling debug in this function
tracing::debug!(elapsed_ms=elapsed.as_millis(), "compaction iteration complete");
// Perhaps we did no work and the walredo process has been idle for some time:
// give it a chance to shut down to avoid leaving walredo process running indefinitely.
@@ -217,7 +222,7 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
}
let allowed_rps = tenant.timeline_get_throttle.steady_rps();
let delta = now - prev;
warn!(
info!(
n_seconds=%format_args!("{:.3}",
delta.as_secs_f64()),
count_accounted,

View File

@@ -2,14 +2,14 @@ use std::{
str::FromStr,
sync::{
atomic::{AtomicU64, Ordering},
Arc,
Arc, Mutex,
},
time::{Duration, Instant},
};
use arc_swap::ArcSwap;
use enumset::EnumSet;
use tracing::error;
use tracing::{error, warn};
use crate::{context::RequestContext, task_mgr::TaskKind};
@@ -157,6 +157,19 @@ where
.fetch_add(wait_time.as_micros() as u64, Ordering::Relaxed);
let observation = Observation { wait_time };
self.metric.observe_throttling(&observation);
match ctx.micros_spent_throttled.add(wait_time) {
Ok(res) => res,
Err(error) => {
use once_cell::sync::Lazy;
use utils::rate_limit::RateLimit;
static WARN_RATE_LIMIT: Lazy<Mutex<RateLimit>> =
Lazy::new(|| Mutex::new(RateLimit::new(Duration::from_secs(10))));
let mut guard = WARN_RATE_LIMIT.lock().unwrap();
guard.call(move || {
warn!(error, "error adding time spent throttled; this message is logged at a global rate limit");
});
}
}
}
}
}

File diff suppressed because it is too large Load Diff

View File

@@ -4,24 +4,32 @@
//!
//! The old legacy algorithm is implemented directly in `timeline.rs`.
use std::collections::BinaryHeap;
use std::ops::{Deref, Range};
use std::sync::Arc;
use super::Timeline;
use super::layer_manager::LayerManager;
use super::{CompactFlags, DurationRecorder, RecordedDuration, Timeline};
use anyhow::{anyhow, Context};
use async_trait::async_trait;
use enumset::EnumSet;
use fail::fail_point;
use itertools::Itertools;
use pageserver_api::shard::TenantShardId;
use tokio_util::sync::CancellationToken;
use tracing::{debug, trace, warn};
use tracing::{debug, info, info_span, trace, warn, Instrument};
use utils::id::TimelineId;
use crate::context::RequestContext;
use crate::context::{AccessStatsBehavior, RequestContext, RequestContextBuilder};
use crate::tenant::storage_layer::{AsLayerDesc, PersistentLayerDesc};
use crate::tenant::timeline::{is_rel_fsm_block_key, is_rel_vm_block_key};
use crate::tenant::timeline::{drop_rlock, is_rel_fsm_block_key, is_rel_vm_block_key, Hole};
use crate::tenant::timeline::{DeltaLayerWriter, ImageLayerWriter};
use crate::tenant::timeline::{Layer, ResidentLayer};
use crate::tenant::DeltaLayer;
use crate::tenant::PageReconstructError;
use crate::ZERO_PAGE;
use crate::virtual_file::{MaybeFatalIo, VirtualFile};
use crate::{page_cache, ZERO_PAGE};
use crate::keyspace::KeySpace;
use crate::repository::Key;
@@ -33,6 +41,694 @@ use pageserver_compaction::interface::*;
use super::CompactionError;
impl Timeline {
/// TODO: cancellation
pub(crate) async fn compact_legacy(
self: &Arc<Self>,
_cancel: &CancellationToken,
flags: EnumSet<CompactFlags>,
ctx: &RequestContext,
) -> Result<(), CompactionError> {
// High level strategy for compaction / image creation:
//
// 1. First, calculate the desired "partitioning" of the
// currently in-use key space. The goal is to partition the
// key space into roughly fixed-size chunks, but also take into
// account any existing image layers, and try to align the
// chunk boundaries with the existing image layers to avoid
// too much churn. Also try to align chunk boundaries with
// relation boundaries. In principle, we don't know about
// relation boundaries here, we just deal with key-value
// pairs, and the code in pgdatadir_mapping.rs knows how to
// map relations into key-value pairs. But in practice we know
// that 'field6' is the block number, and the fields 1-5
// identify a relation. This is just an optimization,
// though.
//
// 2. Once we know the partitioning, for each partition,
// decide if it's time to create a new image layer. The
// criteria is: there has been too much "churn" since the last
// image layer? The "churn" is fuzzy concept, it's a
// combination of too many delta files, or too much WAL in
// total in the delta file. Or perhaps: if creating an image
// file would allow to delete some older files.
//
// 3. After that, we compact all level0 delta files if there
// are too many of them. While compacting, we also garbage
// collect any page versions that are no longer needed because
// of the new image layers we created in step 2.
//
// TODO: This high level strategy hasn't been implemented yet.
// Below are functions compact_level0() and create_image_layers()
// but they are a bit ad hoc and don't quite work like it's explained
// above. Rewrite it.
// Is the timeline being deleted?
if self.is_stopping() {
trace!("Dropping out of compaction on timeline shutdown");
return Err(CompactionError::ShuttingDown);
}
let target_file_size = self.get_checkpoint_distance();
// Define partitioning schema if needed
// FIXME: the match should only cover repartitioning, not the next steps
match self
.repartition(
self.get_last_record_lsn(),
self.get_compaction_target_size(),
flags,
ctx,
)
.await
{
Ok((partitioning, lsn)) => {
// Disables access_stats updates, so that the files we read remain candidates for eviction after we're done with them
let image_ctx = RequestContextBuilder::extend(ctx)
.access_stats_behavior(AccessStatsBehavior::Skip)
.build();
// 2. Compact
let timer = self.metrics.compact_time_histo.start_timer();
self.compact_level0(target_file_size, ctx).await?;
timer.stop_and_record();
// 3. Create new image layers for partitions that have been modified
// "enough".
let layers = self
.create_image_layers(
&partitioning,
lsn,
flags.contains(CompactFlags::ForceImageLayerCreation),
&image_ctx,
)
.await
.map_err(anyhow::Error::from)?;
if let Some(remote_client) = &self.remote_client {
for layer in layers {
remote_client.schedule_layer_file_upload(layer)?;
}
}
if let Some(remote_client) = &self.remote_client {
// should any new image layer been created, not uploading index_part will
// result in a mismatch between remote_physical_size and layermap calculated
// size, which will fail some tests, but should not be an issue otherwise.
remote_client.schedule_index_upload_for_file_changes()?;
}
}
Err(err) => {
// no partitioning? This is normal, if the timeline was just created
// as an empty timeline. Also in unit tests, when we use the timeline
// as a simple key-value store, ignoring the datadir layout. Log the
// error but continue.
//
// Suppress error when it's due to cancellation
if !self.cancel.is_cancelled() {
tracing::error!("could not compact, repartitioning keyspace failed: {err:?}");
}
}
};
Ok(())
}
/// Collect a bunch of Level 0 layer files, and compact and reshuffle them as
/// as Level 1 files.
async fn compact_level0(
self: &Arc<Self>,
target_file_size: u64,
ctx: &RequestContext,
) -> Result<(), CompactionError> {
let CompactLevel0Phase1Result {
new_layers,
deltas_to_compact,
} = {
let phase1_span = info_span!("compact_level0_phase1");
let ctx = ctx.attached_child();
let mut stats = CompactLevel0Phase1StatsBuilder {
version: Some(2),
tenant_id: Some(self.tenant_shard_id),
timeline_id: Some(self.timeline_id),
..Default::default()
};
let begin = tokio::time::Instant::now();
let phase1_layers_locked = Arc::clone(&self.layers).read_owned().await;
let now = tokio::time::Instant::now();
stats.read_lock_acquisition_micros =
DurationRecorder::Recorded(RecordedDuration(now - begin), now);
self.compact_level0_phase1(phase1_layers_locked, stats, target_file_size, &ctx)
.instrument(phase1_span)
.await?
};
if new_layers.is_empty() && deltas_to_compact.is_empty() {
// nothing to do
return Ok(());
}
self.finish_compact_batch(&new_layers, &Vec::new(), &deltas_to_compact)
.await?;
Ok(())
}
/// Level0 files first phase of compaction, explained in the [`Self::compact_legacy`] comment.
async fn compact_level0_phase1(
self: &Arc<Self>,
guard: tokio::sync::OwnedRwLockReadGuard<LayerManager>,
mut stats: CompactLevel0Phase1StatsBuilder,
target_file_size: u64,
ctx: &RequestContext,
) -> Result<CompactLevel0Phase1Result, CompactionError> {
stats.read_lock_held_spawn_blocking_startup_micros =
stats.read_lock_acquisition_micros.till_now(); // set by caller
let layers = guard.layer_map();
let level0_deltas = layers.get_level0_deltas()?;
let mut level0_deltas = level0_deltas
.into_iter()
.map(|x| guard.get_from_desc(&x))
.collect_vec();
stats.level0_deltas_count = Some(level0_deltas.len());
// Only compact if enough layers have accumulated.
let threshold = self.get_compaction_threshold();
if level0_deltas.is_empty() || level0_deltas.len() < threshold {
debug!(
level0_deltas = level0_deltas.len(),
threshold, "too few deltas to compact"
);
return Ok(CompactLevel0Phase1Result::default());
}
// This failpoint is used together with `test_duplicate_layers` integration test.
// It returns the compaction result exactly the same layers as input to compaction.
// We want to ensure that this will not cause any problem when updating the layer map
// after the compaction is finished.
//
// Currently, there are two rare edge cases that will cause duplicated layers being
// inserted.
// 1. The compaction job is inturrupted / did not finish successfully. Assume we have file 1, 2, 3, 4, which
// is compacted to 5, but the page server is shut down, next time we start page server we will get a layer
// map containing 1, 2, 3, 4, and 5, whereas 5 has the same content as 4. If we trigger L0 compation at this
// point again, it is likely that we will get a file 6 which has the same content and the key range as 5,
// and this causes an overwrite. This is acceptable because the content is the same, and we should do a
// layer replace instead of the normal remove / upload process.
// 2. The input workload pattern creates exactly n files that are sorted, non-overlapping and is of target file
// size length. Compaction will likely create the same set of n files afterwards.
//
// This failpoint is a superset of both of the cases.
if cfg!(feature = "testing") {
let active = (|| {
::fail::fail_point!("compact-level0-phase1-return-same", |_| true);
false
})();
if active {
let mut new_layers = Vec::with_capacity(level0_deltas.len());
for delta in &level0_deltas {
// we are just faking these layers as being produced again for this failpoint
new_layers.push(
delta
.download_and_keep_resident()
.await
.context("download layer for failpoint")?,
);
}
tracing::info!("compact-level0-phase1-return-same"); // so that we can check if we hit the failpoint
return Ok(CompactLevel0Phase1Result {
new_layers,
deltas_to_compact: level0_deltas,
});
}
}
// Gather the files to compact in this iteration.
//
// Start with the oldest Level 0 delta file, and collect any other
// level 0 files that form a contiguous sequence, such that the end
// LSN of previous file matches the start LSN of the next file.
//
// Note that if the files don't form such a sequence, we might
// "compact" just a single file. That's a bit pointless, but it allows
// us to get rid of the level 0 file, and compact the other files on
// the next iteration. This could probably made smarter, but such
// "gaps" in the sequence of level 0 files should only happen in case
// of a crash, partial download from cloud storage, or something like
// that, so it's not a big deal in practice.
level0_deltas.sort_by_key(|l| l.layer_desc().lsn_range.start);
let mut level0_deltas_iter = level0_deltas.iter();
let first_level0_delta = level0_deltas_iter.next().unwrap();
let mut prev_lsn_end = first_level0_delta.layer_desc().lsn_range.end;
let mut deltas_to_compact = Vec::with_capacity(level0_deltas.len());
deltas_to_compact.push(first_level0_delta.download_and_keep_resident().await?);
for l in level0_deltas_iter {
let lsn_range = &l.layer_desc().lsn_range;
if lsn_range.start != prev_lsn_end {
break;
}
deltas_to_compact.push(l.download_and_keep_resident().await?);
prev_lsn_end = lsn_range.end;
}
let lsn_range = Range {
start: deltas_to_compact
.first()
.unwrap()
.layer_desc()
.lsn_range
.start,
end: deltas_to_compact.last().unwrap().layer_desc().lsn_range.end,
};
info!(
"Starting Level0 compaction in LSN range {}-{} for {} layers ({} deltas in total)",
lsn_range.start,
lsn_range.end,
deltas_to_compact.len(),
level0_deltas.len()
);
for l in deltas_to_compact.iter() {
info!("compact includes {l}");
}
// We don't need the original list of layers anymore. Drop it so that
// we don't accidentally use it later in the function.
drop(level0_deltas);
stats.read_lock_held_prerequisites_micros = stats
.read_lock_held_spawn_blocking_startup_micros
.till_now();
// Determine N largest holes where N is number of compacted layers.
let max_holes = deltas_to_compact.len();
let last_record_lsn = self.get_last_record_lsn();
let min_hole_range = (target_file_size / page_cache::PAGE_SZ as u64) as i128;
let min_hole_coverage_size = 3; // TODO: something more flexible?
// min-heap (reserve space for one more element added before eviction)
let mut heap: BinaryHeap<Hole> = BinaryHeap::with_capacity(max_holes + 1);
let mut prev: Option<Key> = None;
let mut all_keys = Vec::new();
for l in deltas_to_compact.iter() {
all_keys.extend(l.load_keys(ctx).await?);
}
// FIXME: should spawn_blocking the rest of this function
// The current stdlib sorting implementation is designed in a way where it is
// particularly fast where the slice is made up of sorted sub-ranges.
all_keys.sort_by_key(|DeltaEntry { key, lsn, .. }| (*key, *lsn));
stats.read_lock_held_key_sort_micros = stats.read_lock_held_prerequisites_micros.till_now();
for &DeltaEntry { key: next_key, .. } in all_keys.iter() {
if let Some(prev_key) = prev {
// just first fast filter
if next_key.to_i128() - prev_key.to_i128() >= min_hole_range {
let key_range = prev_key..next_key;
// Measuring hole by just subtraction of i128 representation of key range boundaries
// has not so much sense, because largest holes will corresponds field1/field2 changes.
// But we are mostly interested to eliminate holes which cause generation of excessive image layers.
// That is why it is better to measure size of hole as number of covering image layers.
let coverage_size = layers.image_coverage(&key_range, last_record_lsn).len();
if coverage_size >= min_hole_coverage_size {
heap.push(Hole {
key_range,
coverage_size,
});
if heap.len() > max_holes {
heap.pop(); // remove smallest hole
}
}
}
}
prev = Some(next_key.next());
}
stats.read_lock_held_compute_holes_micros = stats.read_lock_held_key_sort_micros.till_now();
drop_rlock(guard);
stats.read_lock_drop_micros = stats.read_lock_held_compute_holes_micros.till_now();
let mut holes = heap.into_vec();
holes.sort_unstable_by_key(|hole| hole.key_range.start);
let mut next_hole = 0; // index of next hole in holes vector
// This iterator walks through all key-value pairs from all the layers
// we're compacting, in key, LSN order.
let all_values_iter = all_keys.iter();
// This iterator walks through all keys and is needed to calculate size used by each key
let mut all_keys_iter = all_keys
.iter()
.map(|DeltaEntry { key, lsn, size, .. }| (*key, *lsn, *size))
.coalesce(|mut prev, cur| {
// Coalesce keys that belong to the same key pair.
// This ensures that compaction doesn't put them
// into different layer files.
// Still limit this by the target file size,
// so that we keep the size of the files in
// check.
if prev.0 == cur.0 && prev.2 < target_file_size {
prev.2 += cur.2;
Ok(prev)
} else {
Err((prev, cur))
}
});
// Merge the contents of all the input delta layers into a new set
// of delta layers, based on the current partitioning.
//
// We split the new delta layers on the key dimension. We iterate through the key space, and for each key, check if including the next key to the current output layer we're building would cause the layer to become too large. If so, dump the current output layer and start new one.
// It's possible that there is a single key with so many page versions that storing all of them in a single layer file
// would be too large. In that case, we also split on the LSN dimension.
//
// LSN
// ^
// |
// | +-----------+ +--+--+--+--+
// | | | | | | | |
// | +-----------+ | | | | |
// | | | | | | | |
// | +-----------+ ==> | | | | |
// | | | | | | | |
// | +-----------+ | | | | |
// | | | | | | | |
// | +-----------+ +--+--+--+--+
// |
// +--------------> key
//
//
// If one key (X) has a lot of page versions:
//
// LSN
// ^
// | (X)
// | +-----------+ +--+--+--+--+
// | | | | | | | |
// | +-----------+ | | +--+ |
// | | | | | | | |
// | +-----------+ ==> | | | | |
// | | | | | +--+ |
// | +-----------+ | | | | |
// | | | | | | | |
// | +-----------+ +--+--+--+--+
// |
// +--------------> key
// TODO: this actually divides the layers into fixed-size chunks, not
// based on the partitioning.
//
// TODO: we should also opportunistically materialize and
// garbage collect what we can.
let mut new_layers = Vec::new();
let mut prev_key: Option<Key> = None;
let mut writer: Option<DeltaLayerWriter> = None;
let mut key_values_total_size = 0u64;
let mut dup_start_lsn: Lsn = Lsn::INVALID; // start LSN of layer containing values of the single key
let mut dup_end_lsn: Lsn = Lsn::INVALID; // end LSN of layer containing values of the single key
for &DeltaEntry {
key, lsn, ref val, ..
} in all_values_iter
{
let value = val.load(ctx).await?;
let same_key = prev_key.map_or(false, |prev_key| prev_key == key);
// We need to check key boundaries once we reach next key or end of layer with the same key
if !same_key || lsn == dup_end_lsn {
let mut next_key_size = 0u64;
let is_dup_layer = dup_end_lsn.is_valid();
dup_start_lsn = Lsn::INVALID;
if !same_key {
dup_end_lsn = Lsn::INVALID;
}
// Determine size occupied by this key. We stop at next key or when size becomes larger than target_file_size
for (next_key, next_lsn, next_size) in all_keys_iter.by_ref() {
next_key_size = next_size;
if key != next_key {
if dup_end_lsn.is_valid() {
// We are writting segment with duplicates:
// place all remaining values of this key in separate segment
dup_start_lsn = dup_end_lsn; // new segments starts where old stops
dup_end_lsn = lsn_range.end; // there are no more values of this key till end of LSN range
}
break;
}
key_values_total_size += next_size;
// Check if it is time to split segment: if total keys size is larger than target file size.
// We need to avoid generation of empty segments if next_size > target_file_size.
if key_values_total_size > target_file_size && lsn != next_lsn {
// Split key between multiple layers: such layer can contain only single key
dup_start_lsn = if dup_end_lsn.is_valid() {
dup_end_lsn // new segment with duplicates starts where old one stops
} else {
lsn // start with the first LSN for this key
};
dup_end_lsn = next_lsn; // upper LSN boundary is exclusive
break;
}
}
// handle case when loop reaches last key: in this case dup_end is non-zero but dup_start is not set.
if dup_end_lsn.is_valid() && !dup_start_lsn.is_valid() {
dup_start_lsn = dup_end_lsn;
dup_end_lsn = lsn_range.end;
}
if writer.is_some() {
let written_size = writer.as_mut().unwrap().size();
let contains_hole =
next_hole < holes.len() && key >= holes[next_hole].key_range.end;
// check if key cause layer overflow or contains hole...
if is_dup_layer
|| dup_end_lsn.is_valid()
|| written_size + key_values_total_size > target_file_size
|| contains_hole
{
// ... if so, flush previous layer and prepare to write new one
new_layers.push(
writer
.take()
.unwrap()
.finish(prev_key.unwrap().next(), self)
.await?,
);
writer = None;
if contains_hole {
// skip hole
next_hole += 1;
}
}
}
// Remember size of key value because at next iteration we will access next item
key_values_total_size = next_key_size;
}
fail_point!("delta-layer-writer-fail-before-finish", |_| {
Err(CompactionError::Other(anyhow::anyhow!(
"failpoint delta-layer-writer-fail-before-finish"
)))
});
if !self.shard_identity.is_key_disposable(&key) {
if writer.is_none() {
// Create writer if not initiaized yet
writer = Some(
DeltaLayerWriter::new(
self.conf,
self.timeline_id,
self.tenant_shard_id,
key,
if dup_end_lsn.is_valid() {
// this is a layer containing slice of values of the same key
debug!("Create new dup layer {}..{}", dup_start_lsn, dup_end_lsn);
dup_start_lsn..dup_end_lsn
} else {
debug!("Create new layer {}..{}", lsn_range.start, lsn_range.end);
lsn_range.clone()
},
)
.await?,
);
}
writer.as_mut().unwrap().put_value(key, lsn, value).await?;
} else {
debug!(
"Dropping key {} during compaction (it belongs on shard {:?})",
key,
self.shard_identity.get_shard_number(&key)
);
}
if !new_layers.is_empty() {
fail_point!("after-timeline-compacted-first-L1");
}
prev_key = Some(key);
}
if let Some(writer) = writer {
new_layers.push(writer.finish(prev_key.unwrap().next(), self).await?);
}
// Sync layers
if !new_layers.is_empty() {
// Print a warning if the created layer is larger than double the target size
// Add two pages for potential overhead. This should in theory be already
// accounted for in the target calculation, but for very small targets,
// we still might easily hit the limit otherwise.
let warn_limit = target_file_size * 2 + page_cache::PAGE_SZ as u64 * 2;
for layer in new_layers.iter() {
if layer.layer_desc().file_size > warn_limit {
warn!(
%layer,
"created delta file of size {} larger than double of target of {target_file_size}", layer.layer_desc().file_size
);
}
}
// The writer.finish() above already did the fsync of the inodes.
// We just need to fsync the directory in which these inodes are linked,
// which we know to be the timeline directory.
//
// We use fatal_err() below because the after writer.finish() returns with success,
// the in-memory state of the filesystem already has the layer file in its final place,
// and subsequent pageserver code could think it's durable while it really isn't.
let timeline_dir = VirtualFile::open(
&self
.conf
.timeline_path(&self.tenant_shard_id, &self.timeline_id),
)
.await
.fatal_err("VirtualFile::open for timeline dir fsync");
timeline_dir
.sync_all()
.await
.fatal_err("VirtualFile::sync_all timeline dir");
}
stats.write_layer_files_micros = stats.read_lock_drop_micros.till_now();
stats.new_deltas_count = Some(new_layers.len());
stats.new_deltas_size = Some(new_layers.iter().map(|l| l.layer_desc().file_size).sum());
match TryInto::<CompactLevel0Phase1Stats>::try_into(stats)
.and_then(|stats| serde_json::to_string(&stats).context("serde_json::to_string"))
{
Ok(stats_json) => {
info!(
stats_json = stats_json.as_str(),
"compact_level0_phase1 stats available"
)
}
Err(e) => {
warn!("compact_level0_phase1 stats failed to serialize: {:#}", e);
}
}
Ok(CompactLevel0Phase1Result {
new_layers,
deltas_to_compact: deltas_to_compact
.into_iter()
.map(|x| x.drop_eviction_guard())
.collect::<Vec<_>>(),
})
}
}
#[derive(Default)]
struct CompactLevel0Phase1Result {
new_layers: Vec<ResidentLayer>,
deltas_to_compact: Vec<Layer>,
}
#[derive(Default)]
struct CompactLevel0Phase1StatsBuilder {
version: Option<u64>,
tenant_id: Option<TenantShardId>,
timeline_id: Option<TimelineId>,
read_lock_acquisition_micros: DurationRecorder,
read_lock_held_spawn_blocking_startup_micros: DurationRecorder,
read_lock_held_key_sort_micros: DurationRecorder,
read_lock_held_prerequisites_micros: DurationRecorder,
read_lock_held_compute_holes_micros: DurationRecorder,
read_lock_drop_micros: DurationRecorder,
write_layer_files_micros: DurationRecorder,
level0_deltas_count: Option<usize>,
new_deltas_count: Option<usize>,
new_deltas_size: Option<u64>,
}
#[derive(serde::Serialize)]
struct CompactLevel0Phase1Stats {
version: u64,
tenant_id: TenantShardId,
timeline_id: TimelineId,
read_lock_acquisition_micros: RecordedDuration,
read_lock_held_spawn_blocking_startup_micros: RecordedDuration,
read_lock_held_key_sort_micros: RecordedDuration,
read_lock_held_prerequisites_micros: RecordedDuration,
read_lock_held_compute_holes_micros: RecordedDuration,
read_lock_drop_micros: RecordedDuration,
write_layer_files_micros: RecordedDuration,
level0_deltas_count: usize,
new_deltas_count: usize,
new_deltas_size: u64,
}
impl TryFrom<CompactLevel0Phase1StatsBuilder> for CompactLevel0Phase1Stats {
type Error = anyhow::Error;
fn try_from(value: CompactLevel0Phase1StatsBuilder) -> Result<Self, Self::Error> {
Ok(Self {
version: value.version.ok_or_else(|| anyhow!("version not set"))?,
tenant_id: value
.tenant_id
.ok_or_else(|| anyhow!("tenant_id not set"))?,
timeline_id: value
.timeline_id
.ok_or_else(|| anyhow!("timeline_id not set"))?,
read_lock_acquisition_micros: value
.read_lock_acquisition_micros
.into_recorded()
.ok_or_else(|| anyhow!("read_lock_acquisition_micros not set"))?,
read_lock_held_spawn_blocking_startup_micros: value
.read_lock_held_spawn_blocking_startup_micros
.into_recorded()
.ok_or_else(|| anyhow!("read_lock_held_spawn_blocking_startup_micros not set"))?,
read_lock_held_key_sort_micros: value
.read_lock_held_key_sort_micros
.into_recorded()
.ok_or_else(|| anyhow!("read_lock_held_key_sort_micros not set"))?,
read_lock_held_prerequisites_micros: value
.read_lock_held_prerequisites_micros
.into_recorded()
.ok_or_else(|| anyhow!("read_lock_held_prerequisites_micros not set"))?,
read_lock_held_compute_holes_micros: value
.read_lock_held_compute_holes_micros
.into_recorded()
.ok_or_else(|| anyhow!("read_lock_held_compute_holes_micros not set"))?,
read_lock_drop_micros: value
.read_lock_drop_micros
.into_recorded()
.ok_or_else(|| anyhow!("read_lock_drop_micros not set"))?,
write_layer_files_micros: value
.write_layer_files_micros
.into_recorded()
.ok_or_else(|| anyhow!("write_layer_files_micros not set"))?,
level0_deltas_count: value
.level0_deltas_count
.ok_or_else(|| anyhow!("level0_deltas_count not set"))?,
new_deltas_count: value
.new_deltas_count
.ok_or_else(|| anyhow!("new_deltas_count not set"))?,
new_deltas_size: value
.new_deltas_size
.ok_or_else(|| anyhow!("new_deltas_size not set"))?,
})
}
}
impl Timeline {
/// Entry point for new tiered compaction algorithm.
///
@@ -134,7 +830,6 @@ struct ResidentDeltaLayer(ResidentLayer);
#[derive(Clone)]
struct ResidentImageLayer(ResidentLayer);
#[async_trait]
impl CompactionJobExecutor for TimelineAdaptor {
type Key = crate::repository::Key;

View File

@@ -343,23 +343,6 @@ pub(super) async fn handle_walreceiver_connection(
modification.commit(&ctx).await?;
uncommitted_records = 0;
filtered_records = 0;
//
// We should check checkpoint distance after appending each ingest_batch_size bytes because otherwise
// layer size can become much larger than `checkpoint_distance`.
// It can append because wal-sender is sending WAL using 125kb chucks and some WAL records can cause writing large
// amount of data to key-value storage. So performing this check only after processing
// all WAL records in the chunk, can cause huge L0 layer files.
//
timeline
.check_checkpoint_distance()
.await
.with_context(|| {
format!(
"Failed to check checkpoint distance for timeline {}",
timeline.timeline_id
)
})?;
}
}
@@ -406,15 +389,16 @@ pub(super) async fn handle_walreceiver_connection(
}
}
timeline
.check_checkpoint_distance()
.await
.with_context(|| {
format!(
"Failed to check checkpoint distance for timeline {}",
timeline.timeline_id
)
})?;
{
// This is a hack. It piggybacks on the keepalive messages sent by the
// safekeeper in order to enforce `checkpoint_timeout` on the currently
// open layer. This hack doesn't provide a bound on the total size of
// in-memory layers on a pageserver. See https://github.com/neondatabase/neon/issues/6916.
let mut writer = timeline.writer().await;
if let Err(err) = writer.tick().await {
warn!("Timeline writer tick failed: {err}");
}
}
if let Some(last_lsn) = status_update {
let timeline_remote_consistent_lsn = timeline

View File

@@ -128,7 +128,7 @@ impl VectoredReadBuilder {
pub enum BlobFlag {
None,
Ignore,
Replaces,
ReplaceAll,
}
/// Planner for vectored blob reads.
@@ -170,7 +170,7 @@ impl VectoredReadPlanner {
/// incorrect data to the user.
///
/// The `flag` argument has two interesting values:
/// * [`BlobFlag::Replaces`]: The blob for this key should replace all existing blobs.
/// * [`BlobFlag::ReplaceAll`]: The blob for this key should replace all existing blobs.
/// This is used for WAL records that `will_init`.
/// * [`BlobFlag::Ignore`]: This blob should not be included in the read. This happens
/// if the blob is cached.
@@ -204,7 +204,7 @@ impl VectoredReadPlanner {
let blobs_for_key = self.blobs.entry(key).or_default();
blobs_for_key.push((lsn, start_offset, end_offset));
}
BlobFlag::Replaces => {
BlobFlag::ReplaceAll => {
let blobs_for_key = self.blobs.entry(key).or_default();
blobs_for_key.clear();
blobs_for_key.push((lsn, start_offset, end_offset));
@@ -411,10 +411,10 @@ mod tests {
let blob_descriptions = vec![
(first_key, lsn, 0, BlobFlag::None), // First in read 1
(first_key, lsn, 1024, BlobFlag::None), // Last in read 1
(second_key, lsn, 2 * 1024, BlobFlag::Replaces),
(second_key, lsn, 2 * 1024, BlobFlag::ReplaceAll),
(second_key, lsn, 3 * 1024, BlobFlag::None),
(second_key, lsn, 4 * 1024, BlobFlag::Replaces), // First in read 2
(second_key, lsn, 5 * 1024, BlobFlag::None), // Last in read 2
(second_key, lsn, 4 * 1024, BlobFlag::ReplaceAll), // First in read 2
(second_key, lsn, 5 * 1024, BlobFlag::None), // Last in read 2
];
let ranges = [&blob_descriptions[0..2], &blob_descriptions[4..]];

View File

@@ -17,20 +17,21 @@ use crate::tenant::TENANTS_SEGMENT_NAME;
use camino::{Utf8Path, Utf8PathBuf};
use once_cell::sync::OnceCell;
use pageserver_api::shard::TenantShardId;
use std::fs::{self, File};
use std::fs::File;
use std::io::{Error, ErrorKind, Seek, SeekFrom};
use tokio_epoll_uring::{BoundedBuf, IoBuf, IoBufMut, Slice};
use std::os::fd::{AsRawFd, FromRawFd, IntoRawFd, OwnedFd, RawFd};
use std::os::unix::fs::FileExt;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use tokio::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};
use tokio::time::Instant;
pub use pageserver_api::models::virtual_file as api;
pub(crate) mod io_engine;
mod metadata;
mod open_options;
pub(crate) use io_engine::IoEngineKind;
pub(crate) use metadata::Metadata;
pub(crate) use open_options::*;
///
@@ -435,13 +436,25 @@ impl VirtualFile {
/// Call File::sync_all() on the underlying File.
pub async fn sync_all(&self) -> Result<(), Error> {
with_file!(self, StorageIoOperation::Fsync, |file_guard| file_guard
.with_std_file(|std_file| std_file.sync_all()))
with_file!(self, StorageIoOperation::Fsync, |file_guard| {
let (_file_guard, res) = io_engine::get().sync_all(file_guard).await;
res
})
}
pub async fn metadata(&self) -> Result<fs::Metadata, Error> {
with_file!(self, StorageIoOperation::Metadata, |file_guard| file_guard
.with_std_file(|std_file| std_file.metadata()))
/// Call File::sync_data() on the underlying File.
pub async fn sync_data(&self) -> Result<(), Error> {
with_file!(self, StorageIoOperation::Fsync, |file_guard| {
let (_file_guard, res) = io_engine::get().sync_data(file_guard).await;
res
})
}
pub async fn metadata(&self) -> Result<Metadata, Error> {
with_file!(self, StorageIoOperation::Metadata, |file_guard| {
let (_file_guard, res) = io_engine::get().metadata(file_guard).await;
res
})
}
/// Helper function internal to `VirtualFile` that looks up the underlying File,
@@ -579,7 +592,7 @@ impl VirtualFile {
}
// Copied from https://doc.rust-lang.org/1.72.0/src/std/os/unix/fs.rs.html#219-235
pub async fn write_all_at<B: BoundedBuf>(
pub async fn write_all_at<B: BoundedBuf<Buf = Buf>, Buf: IoBuf + Send>(
&self,
buf: B,
mut offset: u64,
@@ -590,8 +603,9 @@ impl VirtualFile {
}
let mut buf = buf.slice(0..buf_len);
while !buf.is_empty() {
// TODO: push `buf` further down
match self.write_at(&buf, offset).await {
let res;
(buf, res) = self.write_at(buf, offset).await;
match res {
Ok(0) => {
return (
Slice::into_inner(buf),
@@ -605,7 +619,7 @@ impl VirtualFile {
buf = buf.slice(n..);
offset += n as u64;
}
Err(ref e) if e.kind() == std::io::ErrorKind::Interrupted => {}
Err(e) if e.kind() == std::io::ErrorKind::Interrupted => {}
Err(e) => return (Slice::into_inner(buf), Err(e)),
}
}
@@ -616,15 +630,19 @@ impl VirtualFile {
/// Returns the IoBuf that is underlying the BoundedBuf `buf`.
/// I.e., the returned value's `bytes_init()` method returns something different than the `bytes_init()` that was passed in.
/// It's quite brittle and easy to mis-use, so, we return the size in the Ok() variant.
pub async fn write_all<B: BoundedBuf>(&mut self, buf: B) -> (B::Buf, Result<usize, Error>) {
pub async fn write_all<B: BoundedBuf<Buf = Buf>, Buf: IoBuf + Send>(
&mut self,
buf: B,
) -> (B::Buf, Result<usize, Error>) {
let nbytes = buf.bytes_init();
if nbytes == 0 {
return (Slice::into_inner(buf.slice_full()), Ok(0));
}
let mut buf = buf.slice(0..nbytes);
while !buf.is_empty() {
// TODO: push `Slice` further down
match self.write(&buf).await {
let res;
(buf, res) = self.write(buf).await;
match res {
Ok(0) => {
return (
Slice::into_inner(buf),
@@ -644,11 +662,18 @@ impl VirtualFile {
(Slice::into_inner(buf), Ok(nbytes))
}
async fn write(&mut self, buf: &[u8]) -> Result<usize, std::io::Error> {
async fn write<B: IoBuf + Send>(
&mut self,
buf: Slice<B>,
) -> (Slice<B>, Result<usize, std::io::Error>) {
let pos = self.pos;
let n = self.write_at(buf, pos).await?;
let (buf, res) = self.write_at(buf, pos).await;
let n = match res {
Ok(n) => n,
Err(e) => return (buf, Err(e)),
};
self.pos += n as u64;
Ok(n)
(buf, Ok(n))
}
pub(crate) async fn read_at<B>(&self, buf: B, offset: u64) -> (B, Result<usize, Error>)
@@ -676,16 +701,30 @@ impl VirtualFile {
})
}
async fn write_at(&self, buf: &[u8], offset: u64) -> Result<usize, Error> {
let result = with_file!(self, StorageIoOperation::Write, |file_guard| {
file_guard.with_std_file(|std_file| std_file.write_at(buf, offset))
});
if let Ok(size) = result {
STORAGE_IO_SIZE
.with_label_values(&["write", &self.tenant_id, &self.shard_id, &self.timeline_id])
.add(size as i64);
}
result
async fn write_at<B: IoBuf + Send>(
&self,
buf: Slice<B>,
offset: u64,
) -> (Slice<B>, Result<usize, Error>) {
let file_guard = match self.lock_file().await {
Ok(file_guard) => file_guard,
Err(e) => return (buf, Err(e)),
};
observe_duration!(StorageIoOperation::Write, {
let ((_file_guard, buf), result) =
io_engine::get().write_at(file_guard, offset, buf).await;
if let Ok(size) = result {
STORAGE_IO_SIZE
.with_label_values(&[
"write",
&self.tenant_id,
&self.shard_id,
&self.timeline_id,
])
.add(size as i64);
}
(buf, result)
})
}
}
@@ -1083,6 +1122,7 @@ mod tests {
use rand::Rng;
use std::future::Future;
use std::io::Write;
use std::os::unix::fs::FileExt;
use std::sync::Arc;
enum MaybeVirtualFile {
@@ -1103,7 +1143,11 @@ mod tests {
MaybeVirtualFile::File(file) => file.read_exact_at(&mut buf, offset).map(|()| buf),
}
}
async fn write_all_at<B: BoundedBuf>(&self, buf: B, offset: u64) -> Result<(), Error> {
async fn write_all_at<B: BoundedBuf<Buf = Buf>, Buf: IoBuf + Send>(
&self,
buf: B,
offset: u64,
) -> Result<(), Error> {
match self {
MaybeVirtualFile::VirtualFile(file) => {
let (_buf, res) = file.write_all_at(buf, offset).await;
@@ -1124,7 +1168,10 @@ mod tests {
MaybeVirtualFile::File(file) => file.seek(pos),
}
}
async fn write_all<B: BoundedBuf>(&mut self, buf: B) -> Result<(), Error> {
async fn write_all<B: BoundedBuf<Buf = Buf>, Buf: IoBuf + Send>(
&mut self,
buf: B,
) -> Result<(), Error> {
match self {
MaybeVirtualFile::VirtualFile(file) => {
let (_buf, res) = file.write_all(buf).await;

View File

@@ -7,6 +7,9 @@
//!
//! Then use [`get`] and [`super::OpenOptions`].
use tokio_epoll_uring::{IoBuf, Slice};
use tracing::Instrument;
pub(crate) use super::api::IoEngineKind;
#[derive(Clone, Copy)]
#[repr(u8)]
@@ -61,7 +64,8 @@ pub(super) fn init(engine_kind: IoEngineKind) {
set(engine_kind);
}
pub(super) fn get() -> IoEngine {
/// Longer-term, this API should only be used by [`super::VirtualFile`].
pub(crate) fn get() -> IoEngine {
let cur = IoEngine::try_from(IO_ENGINE.load(Ordering::Relaxed)).unwrap();
if cfg!(test) {
let env_var_name = "NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IOENGINE";
@@ -98,7 +102,17 @@ use std::{
sync::atomic::{AtomicU8, Ordering},
};
use super::FileGuard;
use super::{FileGuard, Metadata};
#[cfg(target_os = "linux")]
fn epoll_uring_error_to_std(e: tokio_epoll_uring::Error<std::io::Error>) -> std::io::Error {
match e {
tokio_epoll_uring::Error::Op(e) => e,
tokio_epoll_uring::Error::System(system) => {
std::io::Error::new(std::io::ErrorKind::Other, system)
}
}
}
impl IoEngine {
pub(super) async fn read_at<B>(
@@ -133,16 +147,109 @@ impl IoEngine {
IoEngine::TokioEpollUring => {
let system = tokio_epoll_uring::thread_local_system().await;
let (resources, res) = system.read(file_guard, offset, buf).await;
(resources, res.map_err(epoll_uring_error_to_std))
}
}
}
pub(super) async fn sync_all(&self, file_guard: FileGuard) -> (FileGuard, std::io::Result<()>) {
match self {
IoEngine::NotSet => panic!("not initialized"),
IoEngine::StdFs => {
let res = file_guard.with_std_file(|std_file| std_file.sync_all());
(file_guard, res)
}
#[cfg(target_os = "linux")]
IoEngine::TokioEpollUring => {
let system = tokio_epoll_uring::thread_local_system().await;
let (resources, res) = system.fsync(file_guard).await;
(resources, res.map_err(epoll_uring_error_to_std))
}
}
}
pub(super) async fn sync_data(
&self,
file_guard: FileGuard,
) -> (FileGuard, std::io::Result<()>) {
match self {
IoEngine::NotSet => panic!("not initialized"),
IoEngine::StdFs => {
let res = file_guard.with_std_file(|std_file| std_file.sync_data());
(file_guard, res)
}
#[cfg(target_os = "linux")]
IoEngine::TokioEpollUring => {
let system = tokio_epoll_uring::thread_local_system().await;
let (resources, res) = system.fdatasync(file_guard).await;
(resources, res.map_err(epoll_uring_error_to_std))
}
}
}
pub(super) async fn metadata(
&self,
file_guard: FileGuard,
) -> (FileGuard, std::io::Result<Metadata>) {
match self {
IoEngine::NotSet => panic!("not initialized"),
IoEngine::StdFs => {
let res =
file_guard.with_std_file(|std_file| std_file.metadata().map(Metadata::from));
(file_guard, res)
}
#[cfg(target_os = "linux")]
IoEngine::TokioEpollUring => {
let system = tokio_epoll_uring::thread_local_system().await;
let (resources, res) = system.statx(file_guard).await;
(
resources,
res.map_err(|e| match e {
tokio_epoll_uring::Error::Op(e) => e,
tokio_epoll_uring::Error::System(system) => {
std::io::Error::new(std::io::ErrorKind::Other, system)
}
}),
res.map_err(epoll_uring_error_to_std).map(Metadata::from),
)
}
}
}
pub(super) async fn write_at<B: IoBuf + Send>(
&self,
file_guard: FileGuard,
offset: u64,
buf: Slice<B>,
) -> ((FileGuard, Slice<B>), std::io::Result<usize>) {
match self {
IoEngine::NotSet => panic!("not initialized"),
IoEngine::StdFs => {
let result = file_guard.with_std_file(|std_file| std_file.write_at(&buf, offset));
((file_guard, buf), result)
}
#[cfg(target_os = "linux")]
IoEngine::TokioEpollUring => {
let system = tokio_epoll_uring::thread_local_system().await;
let (resources, res) = system.write(file_guard, offset, buf).await;
(resources, res.map_err(epoll_uring_error_to_std))
}
}
}
/// If we switch a user of [`tokio::fs`] to use [`super::io_engine`],
/// they'd start blocking the executor thread if [`IoEngine::StdFs`] is configured
/// whereas before the switch to [`super::io_engine`], that wasn't the case.
/// This method helps avoid such a regression.
///
/// Panics if the `spawn_blocking` fails, see [`tokio::task::JoinError`] for reasons why that can happen.
pub(crate) async fn spawn_blocking_and_block_on_if_std<Fut, R>(&self, work: Fut) -> R
where
Fut: 'static + Send + std::future::Future<Output = R>,
R: 'static + Send,
{
match self {
IoEngine::NotSet => panic!("not initialized"),
IoEngine::StdFs => {
let span = tracing::info_span!("spawn_blocking_block_on_if_std");
tokio::task::spawn_blocking({
move || tokio::runtime::Handle::current().block_on(work.instrument(span))
})
.await
.expect("failed to join blocking code most likely it panicked, panicking as well")
}
#[cfg(target_os = "linux")]
IoEngine::TokioEpollUring => work.await,
}
}
}

View File

@@ -0,0 +1,30 @@
use std::fs;
pub enum Metadata {
StdFs(fs::Metadata),
#[cfg(target_os = "linux")]
TokioEpollUring(Box<tokio_epoll_uring::ops::statx::statx>),
}
#[cfg(target_os = "linux")]
impl From<Box<tokio_epoll_uring::ops::statx::statx>> for Metadata {
fn from(value: Box<tokio_epoll_uring::ops::statx::statx>) -> Self {
Metadata::TokioEpollUring(value)
}
}
impl From<std::fs::Metadata> for Metadata {
fn from(value: std::fs::Metadata) -> Self {
Metadata::StdFs(value)
}
}
impl Metadata {
pub fn len(&self) -> u64 {
match self {
Metadata::StdFs(metadata) => metadata.len(),
#[cfg(target_os = "linux")]
Metadata::TokioEpollUring(statx) => statx.stx_size,
}
}
}

View File

@@ -149,7 +149,7 @@ hnsw_check_available_memory(Size requested)
struct sysinfo si;
Size total;
if (sysinfo(&si) < 0)
elog(ERROR, "Failed to get amount of RAM: %n");
elog(ERROR, "Failed to get amount of RAM: %m");
total = si.totalram*si.mem_unit;
if ((Size)NBuffers*BLCKSZ + requested >= total)

View File

@@ -21,7 +21,7 @@ SHLIB_LINK_INTERNAL = $(libpq)
SHLIB_LINK = -lcurl
EXTENSION = neon
DATA = neon--1.0.sql neon--1.0--1.1.sql neon--1.1--1.2.sql neon--1.2--1.3.sql
DATA = neon--1.0.sql neon--1.0--1.1.sql neon--1.1--1.2.sql neon--1.2--1.3.sql neon--1.3--1.2.sql neon--1.2--1.1.sql neon--1.1--1.0.sql
PGFILEDESC = "neon - cloud storage for PostgreSQL"
EXTRA_CLEAN = \

View File

@@ -35,6 +35,7 @@
#include "utils/memutils.h"
#include "utils/jsonb.h"
#include "control_plane_connector.h"
#include "neon_utils.h"
static ProcessUtility_hook_type PreviousProcessUtilityHook = NULL;
@@ -113,6 +114,8 @@ ConstructDeltaMessage()
if (RootTable.db_table)
{
JsonbValue dbs;
HASH_SEQ_STATUS status;
DbEntry *entry;
dbs.type = jbvString;
dbs.val.string.val = "dbs";
@@ -120,9 +123,6 @@ ConstructDeltaMessage()
pushJsonbValue(&state, WJB_KEY, &dbs);
pushJsonbValue(&state, WJB_BEGIN_ARRAY, NULL);
HASH_SEQ_STATUS status;
DbEntry *entry;
hash_seq_init(&status, RootTable.db_table);
while ((entry = hash_seq_search(&status)) != NULL)
{
@@ -168,8 +168,9 @@ ConstructDeltaMessage()
#else
const char *logdetail;
#endif
char *encrypted_password;
PushKeyValue(&state, "password", (char *) entry->password);
char *encrypted_password = get_role_password(entry->name, &logdetail);
encrypted_password = get_role_password(entry->name, &logdetail);
if (encrypted_password)
{
@@ -831,7 +832,7 @@ NeonProcessUtility(
}
}
extern void
void
InitControlPlaneConnector()
{
PreviousProcessUtilityHook = ProcessUtility_hook;

View File

@@ -1,6 +1,6 @@
#ifndef CONTROL_PLANE_CONNECTOR_H
#define CONTROL_PLANE_CONNECTOR_H
void InitControlPlaneConnector();
void InitControlPlaneConnector(void);
#endif

View File

@@ -14,6 +14,7 @@
#include "utils/guc.h"
#include "extension_server.h"
#include "neon_utils.h"
static int extension_server_port = 0;
@@ -37,7 +38,6 @@ neon_download_extension_file_http(const char *filename, bool is_library)
CURLcode res;
char *compute_ctl_url;
char *postdata;
bool ret = false;
if (handle == NULL)

View File

@@ -0,0 +1,17 @@
/*-------------------------------------------------------------------------
*
* extension_server.h
* Request compute_ctl to download extension files.
*
* IDENTIFICATION
* contrib/neon/extension_server.h
*
*-------------------------------------------------------------------------
*/
#ifndef EXTENSION_SERVER_H
#define EXTENSION_SERVER_H
void pg_init_extension_server(void);
#endif /* EXTENSION_SERVER_H */

View File

@@ -316,6 +316,7 @@ pageserver_connect(shardno_t shard_no, int elevel)
static uint64_t delay_us = MIN_RECONNECT_INTERVAL_USEC;
TimestampTz now;
uint64_t us_since_last_connect;
bool broke_from_loop = false;
Assert(page_servers[shard_no].conn == NULL);
@@ -418,7 +419,9 @@ pageserver_connect(shardno_t shard_no, int elevel)
neon_shard_log(shard_no, elevel, "could not complete handshake with pageserver: %s",
msg);
return false;
/* Returning from inside PG_TRY is bad, so we break/return later */
broke_from_loop = true;
break;
}
}
}
@@ -431,6 +434,11 @@ pageserver_connect(shardno_t shard_no, int elevel)
}
PG_END_TRY();
if (broke_from_loop)
{
return false;
}
neon_shard_log(shard_no, LOG, "libpagestore: connected to '%s'", connstr);
page_servers[shard_no].conn = conn;
page_servers[shard_no].wes = wes;

View File

@@ -0,0 +1,6 @@
-- the order of operations is important here
-- because the view depends on the function
DROP VIEW IF EXISTS neon_lfc_stats CASCADE;
DROP FUNCTION IF EXISTS neon_get_lfc_stats CASCADE;

View File

@@ -0,0 +1 @@
DROP VIEW IF EXISTS NEON_STAT_FILE_CACHE CASCADE;

View File

@@ -0,0 +1 @@
DROP FUNCTION IF EXISTS approximate_working_set_size(bool) CASCADE;

View File

@@ -29,6 +29,7 @@
#include "utils/guc.h"
#include "utils/wait_event.h"
#include "extension_server.h"
#include "neon.h"
#include "walproposer.h"
#include "pagestore_client.h"
@@ -94,7 +95,6 @@ get_num_snap_files_lsn_threshold(void)
DIR *dirdesc;
struct dirent *de;
char *snap_path = "pg_logical/snapshots/";
int cnt = 0;
int lsns_allocated = 1024;
int lsns_num = 0;
XLogRecPtr *lsns;
@@ -160,9 +160,6 @@ get_num_snap_files_lsn_threshold(void)
PGDLLEXPORT void
LogicalSlotsMonitorMain(Datum main_arg)
{
TimestampTz now,
last_checked;
/* Establish signal handlers. */
pqsignal(SIGUSR1, procsignal_sigusr1_handler);
pqsignal(SIGHUP, SignalHandlerForConfigReload);

View File

@@ -25,12 +25,11 @@ extern int wal_acceptor_connection_timeout;
extern void pg_init_libpagestore(void);
extern void pg_init_walproposer(void);
extern void pg_init_extension_server(void);
extern uint64 BackpressureThrottlingTime(void);
extern void replication_feedback_get_lsns(XLogRecPtr *writeLsn, XLogRecPtr *flushLsn, XLogRecPtr *applyLsn);
extern void PGDLLEXPORT WalProposerSync(int argc, char *argv[]);
extern void PGDLLEXPORT WalProposerMain(Datum main_arg);
PGDLLEXPORT void LogicalSlotsMonitorMain(Datum main_arg);
#endif /* NEON_H */

View File

@@ -6,6 +6,7 @@
#include "postgres.h"
#include "neon_utils.h"
#include "lib/stringinfo.h"
#include "libpq/pqformat.h"
@@ -14,7 +15,7 @@
*
* Returns -1 if the character is not a hexadecimal digit.
*/
int
static int
HexDecodeChar(char c)
{
if (c >= '0' && c <= '9')

View File

@@ -12,7 +12,7 @@ uint32 pq_getmsgint32_le(StringInfo msg);
uint64 pq_getmsgint64_le(StringInfo msg);
void pq_sendint32_le(StringInfo buf, uint32 i);
void pq_sendint64_le(StringInfo buf, uint64 i);
extern void disable_core_dump();
void disable_core_dump(void);
#ifndef WALPROPOSER_LIB

View File

@@ -1888,7 +1888,6 @@ neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
int nblocks, bool skipFsync)
{
const PGAlignedBlock buffer = {0};
BlockNumber curblocknum = blocknum;
int remblocks = nblocks;
XLogRecPtr lsn = 0;

View File

@@ -1220,7 +1220,7 @@ PrepareAppendRequest(WalProposer *wp, AppendRequestHeader *req, XLogRecPtr begin
req->epochStartLsn = wp->propEpochStartLsn;
req->beginLsn = beginLsn;
req->endLsn = endLsn;
req->commitLsn = GetAcknowledgedByQuorumWALPosition(wp);
req->commitLsn = wp->commitLsn;
req->truncateLsn = wp->truncateLsn;
req->proposerId = wp->greetRequest.proposerId;
}
@@ -1405,7 +1405,7 @@ static bool
RecvAppendResponses(Safekeeper *sk)
{
WalProposer *wp = sk->wp;
XLogRecPtr minQuorumLsn;
XLogRecPtr newCommitLsn;
bool readAnything = false;
while (true)
@@ -1444,23 +1444,24 @@ RecvAppendResponses(Safekeeper *sk)
if (!readAnything)
return sk->state == SS_ACTIVE;
HandleSafekeeperResponse(wp);
/* update commit_lsn */
newCommitLsn = GetAcknowledgedByQuorumWALPosition(wp);
/*
* Also send the new commit lsn to all the safekeepers.
* Send the new value to all safekeepers.
*/
minQuorumLsn = GetAcknowledgedByQuorumWALPosition(wp);
if (minQuorumLsn > wp->lastSentCommitLsn)
if (newCommitLsn > wp->commitLsn)
{
wp->commitLsn = newCommitLsn;
BroadcastAppendRequest(wp);
wp->lastSentCommitLsn = minQuorumLsn;
}
HandleSafekeeperResponse(wp);
return sk->state == SS_ACTIVE;
}
/* Parse a PageserverFeedback message, or the PageserverFeedback part of an AppendResponse */
void
static void
ParsePageserverFeedbackMessage(WalProposer *wp, StringInfo reply_message, PageserverFeedback *rf)
{
uint8 nkeys;
@@ -1590,9 +1591,9 @@ GetAcknowledgedByQuorumWALPosition(WalProposer *wp)
Safekeeper *
GetDonor(WalProposer *wp, XLogRecPtr *donor_lsn)
{
*donor_lsn = InvalidXLogRecPtr;
Safekeeper *donor = NULL;
int i;
*donor_lsn = InvalidXLogRecPtr;
if (wp->n_votes < wp->quorum)
{
@@ -1632,11 +1633,9 @@ GetDonor(WalProposer *wp, XLogRecPtr *donor_lsn)
static void
HandleSafekeeperResponse(WalProposer *wp)
{
XLogRecPtr minQuorumLsn;
XLogRecPtr candidateTruncateLsn;
minQuorumLsn = GetAcknowledgedByQuorumWALPosition(wp);
wp->api.process_safekeeper_feedback(wp, minQuorumLsn);
wp->api.process_safekeeper_feedback(wp);
/*
* Try to advance truncateLsn -- the last record flushed to all
@@ -1649,7 +1648,7 @@ HandleSafekeeperResponse(WalProposer *wp)
* can't commit entries from previous term' in Raft); 2)
*/
candidateTruncateLsn = CalculateMinFlushLsn(wp);
candidateTruncateLsn = Min(candidateTruncateLsn, minQuorumLsn);
candidateTruncateLsn = Min(candidateTruncateLsn, wp->commitLsn);
if (candidateTruncateLsn > wp->truncateLsn)
{
wp->truncateLsn = candidateTruncateLsn;

View File

@@ -564,7 +564,7 @@ typedef struct walproposer_api
* backpressure feedback and to confirm WAL persistence (has been commited
* on the quorum of safekeepers).
*/
void (*process_safekeeper_feedback) (WalProposer *wp, XLogRecPtr commitLsn);
void (*process_safekeeper_feedback) (WalProposer *wp);
/*
* Write a log message to the internal log processor. This is used only
@@ -646,8 +646,8 @@ typedef struct WalProposer
/* WAL has been generated up to this point */
XLogRecPtr availableLsn;
/* last commitLsn broadcasted to safekeepers */
XLogRecPtr lastSentCommitLsn;
/* cached GetAcknowledgedByQuorumWALPosition result */
XLogRecPtr commitLsn;
ProposerGreeting greetRequest;

View File

@@ -68,6 +68,8 @@ static WalproposerShmemState *walprop_shared;
static WalProposerConfig walprop_config;
static XLogRecPtr sentPtr = InvalidXLogRecPtr;
static const walproposer_api walprop_pg;
static volatile sig_atomic_t got_SIGUSR2 = false;
static bool reported_sigusr2 = false;
static void nwp_shmem_startup_hook(void);
static void nwp_register_gucs(void);
@@ -101,6 +103,8 @@ static void add_nwr_event_set(Safekeeper *sk, uint32 events);
static void update_nwr_event_set(Safekeeper *sk, uint32 events);
static void rm_safekeeper_event_set(Safekeeper *to_remove, bool is_sk);
static void CheckGracefulShutdown(WalProposer *wp);
static XLogRecPtr GetLogRepRestartLSN(WalProposer *wp);
static void
@@ -398,7 +402,7 @@ walprop_pg_get_shmem_state(WalProposer *wp)
return walprop_shared;
}
void
static void
replication_feedback_set(PageserverFeedback *rf)
{
SpinLockAcquire(&walprop_shared->mutex);
@@ -492,6 +496,24 @@ walprop_pg_init_standalone_sync_safekeepers(void)
BackgroundWorkerUnblockSignals();
}
/*
* We pretend to be a walsender process, and the lifecycle of a walsender is
* slightly different than other procesess. At shutdown, walsender processes
* stay alive until the very end, after the checkpointer has written the
* shutdown checkpoint. When the checkpointer exits, the postmaster sends all
* remaining walsender processes SIGUSR2. On receiving SIGUSR2, we try to send
* the remaining WAL, and then exit. This ensures that the checkpoint record
* reaches durable storage (in safekeepers), before the server shuts down
* completely.
*/
static void
walprop_sigusr2(SIGNAL_ARGS)
{
got_SIGUSR2 = true;
SetLatch(MyLatch);
}
static void
walprop_pg_init_bgworker(void)
{
@@ -503,6 +525,7 @@ walprop_pg_init_bgworker(void)
pqsignal(SIGUSR1, procsignal_sigusr1_handler);
pqsignal(SIGHUP, SignalHandlerForConfigReload);
pqsignal(SIGTERM, die);
pqsignal(SIGUSR2, walprop_sigusr2);
BackgroundWorkerUnblockSignals();
@@ -1026,7 +1049,7 @@ static void
StartProposerReplication(WalProposer *wp, StartReplicationCmd *cmd)
{
XLogRecPtr FlushPtr;
TimeLineID currTLI;
__attribute__((unused)) TimeLineID currTLI;
#if PG_VERSION_NUM < 150000
if (ThisTimeLineID == 0)
@@ -1075,14 +1098,26 @@ StartProposerReplication(WalProposer *wp, StartReplicationCmd *cmd)
#endif
/*
* When we first start replication the standby will be behind the primary.
* For some applications, for example synchronous replication, it is
* important to have a clear state for this initial catchup mode, so we
* can trigger actions when we change streaming state later. We may stay
* in this state for a long time, which is exactly why we want to be able
* to monitor whether or not we are still here.
* XXX: Move straight to STOPPING state, skipping the STREAMING state.
*
* This is a bit weird. Normal walsenders stay in STREAMING state, until
* the checkpointer signals them that it is about to start writing the
* shutdown checkpoint. The walsenders acknowledge that they have received
* that signal by switching to STOPPING state. That tells the walsenders
* that they must not write any new WAL.
*
* However, we cannot easily intercept that signal from the checkpointer.
* It's sent by WalSndInitStopping(), using
* SendProcSignal(PROCSIGNAL_WALSND_INIT_STOPPING). It's received by
* HandleWalSndInitStopping, which sets a process-local got_STOPPING flag.
* However, that's all private to walsender.c.
*
* We don't need to do anything special upon receiving the signal, the
* walproposer doesn't write any WAL anyway, so we skip the STREAMING
* state and go directly to STOPPING mode. That way, the checkpointer
* won't wait for us.
*/
WalSndSetState(WALSNDSTATE_CATCHUP);
WalSndSetState(WALSNDSTATE_STOPPING);
/*
* Don't allow a request to stream from a future point in WAL that hasn't
@@ -1122,6 +1157,8 @@ StartProposerReplication(WalProposer *wp, StartReplicationCmd *cmd)
static void
WalSndLoop(WalProposer *wp)
{
XLogRecPtr flushPtr;
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
@@ -1130,9 +1167,6 @@ WalSndLoop(WalProposer *wp)
CHECK_FOR_INTERRUPTS();
XLogBroadcastWalProposer(wp);
if (MyWalSnd->state == WALSNDSTATE_CATCHUP)
WalSndSetState(WALSNDSTATE_STREAMING);
WalProposerPoll(wp);
}
}
@@ -1230,7 +1264,6 @@ WalProposerRecovery(WalProposer *wp, Safekeeper *sk)
TimeLineID timeline;
XLogRecPtr startpos;
XLogRecPtr endpos;
uint64 download_range_mb;
startpos = GetLogRepRestartLSN(wp);
if (startpos == InvalidXLogRecPtr)
@@ -1745,6 +1778,9 @@ walprop_pg_wait_event_set(WalProposer *wp, long timeout, Safekeeper **sk, uint32
{
ConditionVariableCancelSleep();
ResetLatch(MyLatch);
CheckGracefulShutdown(wp);
*events = WL_LATCH_SET;
return 1;
}
@@ -1798,6 +1834,41 @@ walprop_pg_finish_sync_safekeepers(WalProposer *wp, XLogRecPtr lsn)
exit(0);
}
/*
* Like vanilla walsender, on sigusr2 send all remaining WAL and exit.
*
* Note that unlike sync-safekeepers waiting here is not reliable: we
* don't check that majority of safekeepers received and persisted
* commit_lsn -- only that walproposer reached it (which immediately
* broadcasts new value). Doing that without incurring redundant control
* file syncing would need wp -> sk protocol change. OTOH unlike
* sync-safekeepers which must bump commit_lsn or basebackup will fail,
* this catchup is important only for tests where safekeepers/network
* don't crash on their own.
*/
static void
CheckGracefulShutdown(WalProposer *wp)
{
if (got_SIGUSR2)
{
if (!reported_sigusr2)
{
XLogRecPtr flushPtr = walprop_pg_get_flush_rec_ptr(wp);
wpg_log(LOG, "walproposer will send and wait for remaining WAL between %X/%X and %X/%X",
LSN_FORMAT_ARGS(wp->commitLsn), LSN_FORMAT_ARGS(flushPtr));
reported_sigusr2 = true;
}
if (wp->commitLsn >= walprop_pg_get_flush_rec_ptr(wp))
{
wpg_log(LOG, "walproposer sent all WAL up to %X/%X, exiting",
LSN_FORMAT_ARGS(wp->commitLsn));
proc_exit(0);
}
}
}
/*
* Choose most advanced PageserverFeedback and set it to *rf.
*/
@@ -1878,7 +1949,7 @@ CombineHotStanbyFeedbacks(HotStandbyFeedback *hs, WalProposer *wp)
* None of that is functional in sync-safekeepers.
*/
static void
walprop_pg_process_safekeeper_feedback(WalProposer *wp, XLogRecPtr commitLsn)
walprop_pg_process_safekeeper_feedback(WalProposer *wp)
{
HotStandbyFeedback hsFeedback;
XLogRecPtr oldDiskConsistentLsn;
@@ -1893,10 +1964,10 @@ walprop_pg_process_safekeeper_feedback(WalProposer *wp, XLogRecPtr commitLsn)
replication_feedback_set(&quorumFeedback.rf);
SetZenithCurrentClusterSize(quorumFeedback.rf.currentClusterSize);
if (commitLsn > quorumFeedback.flushLsn || oldDiskConsistentLsn != quorumFeedback.rf.disk_consistent_lsn)
if (wp->commitLsn > quorumFeedback.flushLsn || oldDiskConsistentLsn != quorumFeedback.rf.disk_consistent_lsn)
{
if (commitLsn > quorumFeedback.flushLsn)
quorumFeedback.flushLsn = commitLsn;
if (wp->commitLsn > quorumFeedback.flushLsn)
quorumFeedback.flushLsn = wp->commitLsn;
/*
* Advance the replication slot to commitLsn. WAL before it is
@@ -1929,6 +2000,8 @@ walprop_pg_process_safekeeper_feedback(WalProposer *wp, XLogRecPtr commitLsn)
XidFromFullTransactionId(hsFeedback.catalog_xmin),
EpochFromFullTransactionId(hsFeedback.catalog_xmin));
}
CheckGracefulShutdown(wp);
}
static XLogRecPtr

View File

@@ -182,8 +182,6 @@ test_consume_memory(PG_FUNCTION_ARGS)
Datum
test_release_memory(PG_FUNCTION_ARGS)
{
TimestampTz start;
if (PG_ARGISNULL(0))
{
if (consume_cxt)

View File

@@ -220,6 +220,9 @@ enter_seccomp_mode(void)
}
#endif /* HAVE_LIBSECCOMP */
PGDLLEXPORT void
WalRedoMain(int argc, char *argv[]);
/*
* Entry point for the WAL redo process.
*

View File

@@ -68,7 +68,6 @@ task-local-extensions.workspace = true
thiserror.workspace = true
tikv-jemallocator.workspace = true
tikv-jemalloc-ctl = { workspace = true, features = ["use_std"] }
tls-listener.workspace = true
tokio-postgres.workspace = true
tokio-rustls.workspace = true
tokio-util.workspace = true

View File

@@ -102,6 +102,8 @@ pub(super) async fn authenticate(
ctx.set_user(db_info.user.into());
ctx.set_project(db_info.aux.clone());
let cold_start_info = db_info.aux.cold_start_info.clone().unwrap_or_default();
info!(?cold_start_info, "woken up a compute node");
// Backwards compatibility. pg_sni_proxy uses "--" in domain names
// while direct connections do not. Once we migrate to pg_sni_proxy

View File

@@ -10,6 +10,7 @@ use itertools::Itertools;
use proxy::config::TlsServerEndPoint;
use proxy::context::RequestMonitoring;
use proxy::proxy::run_until_cancelled;
use rustls::pki_types::PrivateKeyDer;
use tokio::net::TcpListener;
use anyhow::{anyhow, bail, ensure, Context};
@@ -76,37 +77,40 @@ async fn main() -> anyhow::Result<()> {
(Some(key_path), Some(cert_path)) => {
let key = {
let key_bytes = std::fs::read(key_path).context("TLS key file")?;
let mut keys = rustls_pemfile::pkcs8_private_keys(&mut &key_bytes[..])
.context(format!("Failed to read TLS keys at '{key_path}'"))?;
let mut keys =
rustls_pemfile::pkcs8_private_keys(&mut &key_bytes[..]).collect_vec();
ensure!(keys.len() == 1, "keys.len() = {} (should be 1)", keys.len());
keys.pop().map(rustls::PrivateKey).unwrap()
PrivateKeyDer::Pkcs8(
keys.pop()
.unwrap()
.context(format!("Failed to read TLS keys at '{key_path}'"))?,
)
};
let cert_chain_bytes = std::fs::read(cert_path)
.context(format!("Failed to read TLS cert file at '{cert_path}.'"))?;
let cert_chain = {
let cert_chain: Vec<_> = {
rustls_pemfile::certs(&mut &cert_chain_bytes[..])
.context(format!(
"Failed to read TLS certificate chain from bytes from file at '{cert_path}'."
))?
.into_iter()
.map(rustls::Certificate)
.collect_vec()
.try_collect()
.with_context(|| {
format!("Failed to read TLS certificate chain from bytes from file at '{cert_path}'.")
})?
};
// needed for channel bindings
let first_cert = cert_chain.first().context("missing certificate")?;
let tls_server_end_point = TlsServerEndPoint::new(first_cert)?;
let tls_config = rustls::ServerConfig::builder()
.with_safe_default_cipher_suites()
.with_safe_default_kx_groups()
.with_protocol_versions(&[&rustls::version::TLS13, &rustls::version::TLS12])?
.with_no_client_auth()
.with_single_cert(cert_chain, key)?
.into();
let tls_config = rustls::ServerConfig::builder_with_protocol_versions(&[
&rustls::version::TLS13,
&rustls::version::TLS12,
])
.with_no_client_auth()
.with_single_cert(cert_chain, key)?
.into();
(tls_config, tls_server_end_point)
}

View File

@@ -1,6 +1,10 @@
use crate::{auth, rate_limiter::RateBucketInfo, serverless::GlobalConnPoolOptions};
use anyhow::{bail, ensure, Context, Ok};
use rustls::{sign, Certificate, PrivateKey};
use itertools::Itertools;
use rustls::{
crypto::ring::sign,
pki_types::{CertificateDer, PrivateKeyDer},
};
use sha2::{Digest, Sha256};
use std::{
collections::{HashMap, HashSet},
@@ -88,14 +92,14 @@ pub fn configure_tls(
let cert_resolver = Arc::new(cert_resolver);
let config = rustls::ServerConfig::builder()
.with_safe_default_cipher_suites()
.with_safe_default_kx_groups()
// allow TLS 1.2 to be compatible with older client libraries
.with_protocol_versions(&[&rustls::version::TLS13, &rustls::version::TLS12])?
.with_no_client_auth()
.with_cert_resolver(cert_resolver.clone())
.into();
// allow TLS 1.2 to be compatible with older client libraries
let config = rustls::ServerConfig::builder_with_protocol_versions(&[
&rustls::version::TLS13,
&rustls::version::TLS12,
])
.with_no_client_auth()
.with_cert_resolver(cert_resolver.clone())
.into();
Ok(TlsConfig {
config,
@@ -133,14 +137,14 @@ pub enum TlsServerEndPoint {
}
impl TlsServerEndPoint {
pub fn new(cert: &Certificate) -> anyhow::Result<Self> {
pub fn new(cert: &CertificateDer) -> anyhow::Result<Self> {
let sha256_oids = [
// I'm explicitly not adding MD5 or SHA1 here... They're bad.
oid_registry::OID_SIG_ECDSA_WITH_SHA256,
oid_registry::OID_PKCS1_SHA256WITHRSA,
];
let pem = x509_parser::parse_x509_certificate(&cert.0)
let pem = x509_parser::parse_x509_certificate(cert)
.context("Failed to parse PEM object from cerficiate")?
.1;
@@ -150,8 +154,7 @@ impl TlsServerEndPoint {
let oid = pem.signature_algorithm.oid();
let alg = reg.get(oid);
if sha256_oids.contains(oid) {
let tls_server_end_point: [u8; 32] =
Sha256::new().chain_update(&cert.0).finalize().into();
let tls_server_end_point: [u8; 32] = Sha256::new().chain_update(cert).finalize().into();
info!(subject = %pem.subject, signature_algorithm = alg.map(|a| a.description()), tls_server_end_point = %base64::encode(tls_server_end_point), "determined channel binding");
Ok(Self::Sha256(tls_server_end_point))
} else {
@@ -165,7 +168,7 @@ impl TlsServerEndPoint {
}
}
#[derive(Default)]
#[derive(Default, Debug)]
pub struct CertResolver {
certs: HashMap<String, (Arc<rustls::sign::CertifiedKey>, TlsServerEndPoint)>,
default: Option<(Arc<rustls::sign::CertifiedKey>, TlsServerEndPoint)>,
@@ -185,11 +188,14 @@ impl CertResolver {
let priv_key = {
let key_bytes = std::fs::read(key_path)
.context(format!("Failed to read TLS keys at '{key_path}'"))?;
let mut keys = rustls_pemfile::pkcs8_private_keys(&mut &key_bytes[..])
.context(format!("Failed to parse TLS keys at '{key_path}'"))?;
let mut keys = rustls_pemfile::pkcs8_private_keys(&mut &key_bytes[..]).collect_vec();
ensure!(keys.len() == 1, "keys.len() = {} (should be 1)", keys.len());
keys.pop().map(rustls::PrivateKey).unwrap()
PrivateKeyDer::Pkcs8(
keys.pop()
.unwrap()
.context(format!("Failed to parse TLS keys at '{key_path}'"))?,
)
};
let cert_chain_bytes = std::fs::read(cert_path)
@@ -197,14 +203,10 @@ impl CertResolver {
let cert_chain = {
rustls_pemfile::certs(&mut &cert_chain_bytes[..])
.try_collect()
.with_context(|| {
format!(
"Failed to read TLS certificate chain from bytes from file at '{cert_path}'."
)
format!("Failed to read TLS certificate chain from bytes from file at '{cert_path}'.")
})?
.into_iter()
.map(rustls::Certificate)
.collect()
};
self.add_cert(priv_key, cert_chain, is_default)
@@ -212,15 +214,15 @@ impl CertResolver {
pub fn add_cert(
&mut self,
priv_key: PrivateKey,
cert_chain: Vec<Certificate>,
priv_key: PrivateKeyDer<'static>,
cert_chain: Vec<CertificateDer<'static>>,
is_default: bool,
) -> anyhow::Result<()> {
let key = sign::any_supported_type(&priv_key).context("invalid private key")?;
let first_cert = &cert_chain[0];
let tls_server_end_point = TlsServerEndPoint::new(first_cert)?;
let pem = x509_parser::parse_x509_certificate(&first_cert.0)
let pem = x509_parser::parse_x509_certificate(first_cert)
.context("Failed to parse PEM object from cerficiate")?
.1;

View File

@@ -101,9 +101,10 @@ pub struct MetricsAuxInfo {
pub cold_start_info: Option<ColdStartInfo>,
}
#[derive(Debug, Serialize, Deserialize, Clone)]
#[derive(Debug, Default, Serialize, Deserialize, Clone)]
#[serde(rename_all = "snake_case")]
pub enum ColdStartInfo {
#[default]
Unknown = 0,
Warm = 1,
PoolHit = 2,

View File

@@ -73,7 +73,7 @@ pub mod errors {
// Status 406: endpoint is disabled (we don't allow connections).
format!("{REQUEST_FAILED}: endpoint is disabled")
}
http::StatusCode::LOCKED => {
http::StatusCode::LOCKED | http::StatusCode::UNPROCESSABLE_ENTITY => {
// Status 423: project might be in maintenance mode (or bad state), or quotas exceeded.
format!("{REQUEST_FAILED}: endpoint is temporary unavailable. check your quotas and/or contact our support")
}
@@ -91,6 +91,12 @@ pub mod errors {
status: http::StatusCode::NOT_FOUND | http::StatusCode::NOT_ACCEPTABLE,
..
} => crate::error::ErrorKind::User,
ApiError::Console {
status: http::StatusCode::UNPROCESSABLE_ENTITY,
text,
} if text.contains("compute time quota of non-primary branches is exceeded") => {
crate::error::ErrorKind::User
}
ApiError::Console {
status: http::StatusCode::LOCKED,
text,
@@ -120,6 +126,11 @@ pub mod errors {
status: http::StatusCode::BAD_REQUEST,
..
} => true,
// don't retry when quotas are exceeded
Self::Console {
status: http::StatusCode::UNPROCESSABLE_ENTITY,
ref text,
} => !text.contains("compute time quota of non-primary branches is exceeded"),
// locked can be returned when the endpoint was in transition
// or when quotas are exceeded. don't retry when quotas are exceeded
Self::Console {

View File

@@ -259,6 +259,9 @@ impl super::Api for Api {
}
let node = self.do_wake_compute(ctx, user_info).await?;
ctx.set_project(node.aux.clone());
let cold_start_info = node.aux.cold_start_info.clone().unwrap_or_default();
info!(?cold_start_info, "woken up a compute node");
let (_, cached) = self.caches.node_info.insert(key.clone(), node);
info!(key = &*key, "created a cache entry for compute node info");

View File

@@ -4,7 +4,7 @@ use ::metrics::{
register_int_gauge_vec, Histogram, HistogramVec, HyperLogLogVec, IntCounterPairVec,
IntCounterVec, IntGauge, IntGaugeVec,
};
use metrics::{register_int_counter_pair, IntCounterPair};
use metrics::{register_int_counter, register_int_counter_pair, IntCounter, IntCounterPair};
use once_cell::sync::Lazy;
use tokio::time;
@@ -303,3 +303,20 @@ pub static ENDPOINT_ERRORS_BY_KIND: Lazy<HyperLogLogVec<32>> = Lazy::new(|| {
)
.unwrap()
});
pub static REDIS_BROKEN_MESSAGES: Lazy<IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"proxy_redis_errors_total",
"Number of errors by a given classification",
&["channel"],
)
.unwrap()
});
pub static TLS_HANDSHAKE_FAILURES: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"proxy_tls_handshake_failures",
"Number of TLS handshake failures",
)
.unwrap()
});

View File

@@ -1,22 +1,27 @@
//! Proxy Protocol V2 implementation
use std::{
future::poll_fn,
future::Future,
future::{poll_fn, Future},
io,
net::SocketAddr,
pin::{pin, Pin},
sync::Mutex,
task::{ready, Context, Poll},
};
use bytes::{Buf, BytesMut};
use hyper::server::accept::Accept;
use hyper::server::conn::{AddrIncoming, AddrStream};
use metrics::IntCounterPairGuard;
use pin_project_lite::pin_project;
use tls_listener::AsyncAccept;
use tokio::io::{AsyncRead, AsyncReadExt, AsyncWrite, ReadBuf};
use uuid::Uuid;
use crate::{metrics::NUM_CLIENT_CONNECTION_GAUGE, serverless::tls_listener::AsyncAccept};
pub struct ProxyProtocolAccept {
pub incoming: AddrIncoming,
pub protocol: &'static str,
}
pin_project! {
@@ -327,7 +332,7 @@ impl<T: AsyncRead> AsyncRead for WithClientIp<T> {
}
impl AsyncAccept for ProxyProtocolAccept {
type Connection = WithClientIp<AddrStream>;
type Connection = WithConnectionGuard<WithClientIp<AddrStream>>;
type Error = io::Error;
@@ -336,11 +341,74 @@ impl AsyncAccept for ProxyProtocolAccept {
cx: &mut Context<'_>,
) -> Poll<Option<Result<Self::Connection, Self::Error>>> {
let conn = ready!(Pin::new(&mut self.incoming).poll_accept(cx)?);
tracing::info!(protocol = self.protocol, "accepted new TCP connection");
let Some(conn) = conn else {
return Poll::Ready(None);
};
Poll::Ready(Some(Ok(WithClientIp::new(conn))))
Poll::Ready(Some(Ok(WithConnectionGuard {
inner: WithClientIp::new(conn),
connection_id: Uuid::new_v4(),
gauge: Mutex::new(Some(
NUM_CLIENT_CONNECTION_GAUGE
.with_label_values(&[self.protocol])
.guard(),
)),
})))
}
}
pin_project! {
pub struct WithConnectionGuard<T> {
#[pin]
pub inner: T,
pub connection_id: Uuid,
pub gauge: Mutex<Option<IntCounterPairGuard>>,
}
}
impl<T: AsyncWrite> AsyncWrite for WithConnectionGuard<T> {
#[inline]
fn poll_write(
self: Pin<&mut Self>,
cx: &mut Context<'_>,
buf: &[u8],
) -> Poll<Result<usize, io::Error>> {
self.project().inner.poll_write(cx, buf)
}
#[inline]
fn poll_flush(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), io::Error>> {
self.project().inner.poll_flush(cx)
}
#[inline]
fn poll_shutdown(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), io::Error>> {
self.project().inner.poll_shutdown(cx)
}
#[inline]
fn poll_write_vectored(
self: Pin<&mut Self>,
cx: &mut Context<'_>,
bufs: &[io::IoSlice<'_>],
) -> Poll<Result<usize, io::Error>> {
self.project().inner.poll_write_vectored(cx, bufs)
}
#[inline]
fn is_write_vectored(&self) -> bool {
self.inner.is_write_vectored()
}
}
impl<T: AsyncRead> AsyncRead for WithConnectionGuard<T> {
fn poll_read(
self: Pin<&mut Self>,
cx: &mut Context<'_>,
buf: &mut ReadBuf<'_>,
) -> Poll<io::Result<()>> {
self.project().inner.poll_read(cx, buf)
}
}

View File

@@ -24,6 +24,7 @@ use crate::{
};
use futures::TryFutureExt;
use itertools::Itertools;
use metrics::IntCounterPairGuard;
use once_cell::sync::OnceCell;
use pq_proto::{BeMessage as Be, StartupMessageParams};
use regex::Regex;
@@ -78,10 +79,16 @@ pub async fn task_main(
{
let (socket, peer_addr) = accept_result?;
let conn_gauge = NUM_CLIENT_CONNECTION_GAUGE
.with_label_values(&["tcp"])
.guard();
let session_id = uuid::Uuid::new_v4();
let cancellation_handler = Arc::clone(&cancellation_handler);
let endpoint_rate_limiter = endpoint_rate_limiter.clone();
tracing::info!(protocol = "tcp", %session_id, "accepted new TCP connection");
connections.spawn(async move {
let mut socket = WithClientIp::new(socket);
let mut peer_addr = peer_addr.ip();
@@ -116,6 +123,7 @@ pub async fn task_main(
socket,
ClientMode::Tcp,
endpoint_rate_limiter,
conn_gauge,
)
.instrument(span.clone())
.await;
@@ -229,13 +237,11 @@ pub async fn handle_client<S: AsyncRead + AsyncWrite + Unpin>(
stream: S,
mode: ClientMode,
endpoint_rate_limiter: Arc<EndpointRateLimiter>,
conn_gauge: IntCounterPairGuard,
) -> Result<Option<ProxyPassthrough<S>>, ClientRequestError> {
info!("handling interactive connection from client");
let proto = ctx.protocol;
let _client_gauge = NUM_CLIENT_CONNECTION_GAUGE
.with_label_values(&[proto])
.guard();
let _request_gauge = NUM_CONNECTION_REQUESTS_GAUGE
.with_label_values(&[proto])
.guard();
@@ -325,7 +331,7 @@ pub async fn handle_client<S: AsyncRead + AsyncWrite + Unpin>(
aux: node.aux.clone(),
compute: node,
req: _request_gauge,
conn: _client_gauge,
conn: conn_gauge,
cancel: session,
}))
}
@@ -374,6 +380,11 @@ impl NeonOptions {
Self::parse_from_iter(StartupMessageParams::parse_options_raw(options))
}
pub fn is_ephemeral(&self) -> bool {
// Currently, neon endpoint options are all reserved for ephemeral endpoints.
!self.0.is_empty()
}
fn parse_from_iter<'a>(options: impl Iterator<Item = &'a str>) -> Self {
let mut options = options
.filter_map(neon_option)

View File

@@ -20,6 +20,7 @@ use crate::{http, sasl, scram};
use anyhow::{bail, Context};
use async_trait::async_trait;
use rstest::rstest;
use rustls::pki_types;
use tokio_postgres::config::SslMode;
use tokio_postgres::tls::{MakeTlsConnect, NoTls};
use tokio_postgres_rustls::{MakeRustlsConnect, RustlsStream};
@@ -28,7 +29,11 @@ use tokio_postgres_rustls::{MakeRustlsConnect, RustlsStream};
fn generate_certs(
hostname: &str,
common_name: &str,
) -> anyhow::Result<(rustls::Certificate, rustls::Certificate, rustls::PrivateKey)> {
) -> anyhow::Result<(
pki_types::CertificateDer<'static>,
pki_types::CertificateDer<'static>,
pki_types::PrivateKeyDer<'static>,
)> {
let ca = rcgen::Certificate::from_params({
let mut params = rcgen::CertificateParams::default();
params.is_ca = rcgen::IsCa::Ca(rcgen::BasicConstraints::Unconstrained);
@@ -45,9 +50,9 @@ fn generate_certs(
})?;
Ok((
rustls::Certificate(ca.serialize_der()?),
rustls::Certificate(cert.serialize_der_with_signer(&ca)?),
rustls::PrivateKey(cert.serialize_private_key_der()),
pki_types::CertificateDer::from(ca.serialize_der()?),
pki_types::CertificateDer::from(cert.serialize_der_with_signer(&ca)?),
pki_types::PrivateKeyDer::Pkcs8(cert.serialize_private_key_der().into()),
))
}
@@ -82,9 +87,8 @@ fn generate_tls_config<'a>(
let tls_config = {
let config = rustls::ServerConfig::builder()
.with_safe_defaults()
.with_no_client_auth()
.with_single_cert(vec![cert.clone()], key.clone())?
.with_single_cert(vec![cert.clone()], key.clone_key())?
.into();
let mut cert_resolver = CertResolver::new();
@@ -101,10 +105,9 @@ fn generate_tls_config<'a>(
let client_config = {
let config = rustls::ClientConfig::builder()
.with_safe_defaults()
.with_root_certificates({
let mut store = rustls::RootCertStore::empty();
store.add(&ca)?;
store.add(ca)?;
store
})
.with_no_client_auth();

View File

@@ -69,6 +69,12 @@ fn report_error(e: &WakeComputeError, retry: bool) {
{
"quota_exceeded"
}
WakeComputeError::ApiError(ApiError::Console {
status: StatusCode::UNPROCESSABLE_ENTITY,
ref text,
}) if text.contains("compute time quota of non-primary branches is exceeded") => {
"quota_exceeded"
}
WakeComputeError::ApiError(ApiError::Console {
status: StatusCode::LOCKED,
..

View File

@@ -10,6 +10,7 @@ use crate::{
cache::project_info::ProjectInfoCache,
cancellation::{CancelMap, CancellationHandler, NotificationsCancellationHandler},
intern::{ProjectIdInt, RoleNameInt},
metrics::REDIS_BROKEN_MESSAGES,
};
const CPLANE_CHANNEL_NAME: &str = "neondb-proxy-ws-updates";
@@ -115,6 +116,9 @@ impl<
let msg: Notification = match serde_json::from_str(&payload) {
Ok(msg) => msg,
Err(e) => {
REDIS_BROKEN_MESSAGES
.with_label_values(&[msg.get_channel_name()])
.inc();
tracing::error!("broken message: {e}");
return Ok(());
}

Some files were not shown because too many files have changed in this diff Show More