Compare commits

...

86 Commits

Author SHA1 Message Date
Arseny Sher
32d4e4914a Add wait events without query to metric. 2023-11-16 23:56:04 +01:00
Arseny Sher
d4d577e7ff Add query to pg_wait_sampling metric 2023-11-16 22:42:08 +01:00
Arseny Sher
f552aa05fa Add pg_wait_sampling metric for vms. 2023-11-16 22:04:29 +01:00
Arthur Petukhovsky
779badb7c5 Join postgres multiline logs 2023-11-16 20:54:02 +00:00
Arseny Sher
e6eb548491 create extension pg_wait_sampling in compute_ctl 2023-11-16 20:54:02 +00:00
Arseny Sher
16e9eb2832 Try to enable a custom postgres_exporter query. 2023-11-16 20:54:02 +00:00
Arseny Sher
042686183b Add pg_wait_sampling extension. 2023-11-16 20:54:02 +00:00
khanova
0c243faf96 Proxy log pid hack (#5869)
## Problem

Improve observability for the compute node.

## Summary of changes

Log pid from the compute node. Doesn't work with pgbouncer.
2023-11-16 20:46:23 +00:00
Em Sharnoff
d0a842a509 Update vm-builder to v0.19.0 and move its customization here (#5783)
ref neondatabase/autoscaling#600 for more
2023-11-16 18:17:42 +01:00
khanova
6b82f22ada Collect number of connections by sni type (#5867)
## Problem

We don't know the number of users with different kinds of
authentication: ["sni", "endpoint in options" (A and B from
[here](https://neon.tech/docs/connect/connection-errors)),
"password_hack"]

## Summary of changes

Collect metrics by sni kind.
2023-11-16 12:19:13 +00:00
John Spray
ab631e6792 pageserver: make TenantsMap shard-aware (#5819)
## Problem

When using TenantId as the key, we are unable to handle multiple tenant
shards attached to the same pageserver for the same tenant ID. This is
an expected scenario if we have e.g. 8 shards and 5 pageservers.

## Summary of changes

- TenantsMap is now a BTreeMap instead of a HashMap: this enables
looking up by range. In future, we will need this for page_service, as
incoming requests will just specify the Key, and we'll have to figure
out which shard to route it to.
- A new key type TenantShardId is introduced, to act as the key in
TenantsMap, and as the id type in external APIs. Its human readable
serialization is backward compatible with TenantId, and also
forward-compatible as long as sharding is not actually used (when we
construct a TenantShardId with ShardCount(0), it serializes to an
old-fashioned TenantId).
- Essential tenant APIs are updated to accept TenantShardIds:
tenant/timeline create, tenant delete, and /location_conf. These are the
APIs that will enable driving sharded tenants. Other apis like /attach
/detach /load /ignore will not work with sharding: those will soon be
deprecated and replaced with /location_conf as part of the live
migration work.

Closes: #5787
2023-11-15 23:20:21 +02:00
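The backward-compatible serialization described in #5819 can be sketched roughly as follows. These are simplified, illustrative stand-ins for the real pageserver types (which use dedicated `ShardNumber`/`ShardCount` newtypes), not the actual implementation:

```rust
use std::fmt;

// Hypothetical, simplified stand-ins for the real pageserver types.
struct TenantId([u8; 16]);

struct TenantShardId {
    tenant_id: TenantId,
    shard_number: u8,
    shard_count: u8,
}

impl fmt::Display for TenantId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for b in &self.0 {
            write!(f, "{:02x}", b)?;
        }
        Ok(())
    }
}

impl fmt::Display for TenantShardId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.shard_count == 0 {
            // Unsharded: serialize exactly like a legacy TenantId.
            write!(f, "{}", self.tenant_id)
        } else {
            // Sharded: append shard number and count as a hex suffix.
            write!(
                f,
                "{}-{:02x}{:02x}",
                self.tenant_id, self.shard_number, self.shard_count
            )
        }
    }
}
```

With `shard_count == 0` the output is byte-for-byte a legacy TenantId, which is what keeps existing parsers and external APIs working until sharding is actually used.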
Alexander Bayandin
f84ac2b98d Fix baseline commit and branch for code coverage (#5769)
## Problem

`HEAD` commit for a PR is a phantom merge commit which skews the baseline
commit for coverage reports.

See
https://github.com/neondatabase/neon/pull/5751#issuecomment-1790717867

## Summary of changes
- Use commit hash instead of `HEAD` for finding baseline commits for
code coverage
- Use the base branch for PRs or the current branch for pushes
2023-11-15 12:40:21 +01:00
dependabot[bot]
5cd5b93066 build(deps): bump aiohttp from 3.8.5 to 3.8.6 (#5864) 2023-11-15 11:08:49 +00:00
khanova
2f0d245c2a Proxy control plane rate limiter (#5785)
## Problem

Proxy might overload the control plane.

## Summary of changes

Implement rate limiter for proxy<->control plane connection.
Resolves https://github.com/neondatabase/neon/issues/5707

Used implementation ideas from https://github.com/conradludgate/squeeze/
2023-11-15 09:15:59 +00:00
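The PR credits implementation ideas to conradludgate/squeeze. Purely as an illustration of the concept (not the code the PR landed), a minimal std-only token-bucket limiter for proxy-to-control-plane calls might look like this:

```rust
use std::time::{Duration, Instant};

/// Minimal token bucket: allows at most `rate` requests per `per` interval.
struct RateLimiter {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
}

impl RateLimiter {
    fn new(rate: f64, per: Duration) -> Self {
        RateLimiter {
            capacity: rate,
            tokens: rate,
            refill_per_sec: rate / per.as_secs_f64(),
            last: Instant::now(),
        }
    }

    /// Returns true if the request may proceed, consuming one token.
    fn check(&mut self) -> bool {
        let now = Instant::now();
        // Refill tokens proportionally to elapsed time, capped at capacity.
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

A production limiter would additionally need to decide what to do when `check()` fails (queue, shed, or backpressure the caller).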
Joonas Koivunen
462f04d377 Smaller test addition and change (#5858)
- trivial serialization roundtrip test for
`pageserver::repository::Value`
- add missing `start_paused = true` to a 15s test, making it a <0s test
- completely unrelated future clippy lint avoidance (helps beta channel
users)
2023-11-14 18:04:34 +01:00
Arpad Müller
31a54d663c Migrate links from wiki to notion (#5862)
See the slack discussion:
https://neondb.slack.com/archives/C033A2WE6BZ/p1696429688621489?thread_ts=1695647103.117499
2023-11-14 15:36:47 +00:00
John Spray
7709c91fe5 neon_local: use remote storage by default, add cargo neon tenant migrate (#5760)
## Problem

Currently the only way to exercise tenant migration is via python test
code. We need a convenient way for developers to do it directly in a
neon local environment.

## Summary of changes

- Add a `--num-pageservers` argument to `cargo neon init` so that it's
easy to run with multiple pageservers
- Modify default pageserver overrides in neon_local to set up `LocalFs`
remote storage, as any migration/attach/detach stuff doesn't work in the
legacy local storage mode. This also unblocks removing the pageserver's
support for the legacy local mode.
- Add a new `cargo neon tenant migrate` command that orchestrates tenant
migration, including endpoints.
2023-11-14 09:51:51 +00:00
Arpad Müller
f7249b9018 Fix comment in find_lsn_for_timestamp (#5855)
We still subtract 1 from low to compute `commit_lsn`; the comment
moved/added by #5844 should point this out.
2023-11-11 00:32:00 +00:00
Joonas Koivunen
74d150ba45 build: upgrade ahash (#5851)
`cargo deny` was complaining that version 0.8.3 was yanked (for a
possible DoS attack [wiki]), but the latest version (0.8.5) also
includes aarch64 fixes which may or may not be relevant. Our usage of
ahash is limited to proxy, but I don't think we are at any risk.

[wiki]: https://github.com/tkaitchuck/aHash/wiki/Yanked-versions
2023-11-10 19:10:54 +00:00
Joonas Koivunen
b7f45204a2 build: deny async-std and friends (#5849)
Rationale: some crates pull these in by default; hopefully these hints
will mean less after-the-fact cleanup and Cargo.lock file watching.

follow-up to #5848.
2023-11-10 18:02:22 +01:00
Joonas Koivunen
a05f104cce build: remove async-std dependency (#5848)
Introduced by accident (missing `default-features = false`) in
e09d5ada6a. We directly need only `http_types::StatusCode`.
2023-11-10 16:05:21 +02:00
John Spray
d672e44eee pageserver: error type for collect_keyspace (#5846)
## Problem

This is a log hygiene fix, for an occasional test failure.

warn-level logging in imitate_timeline_cached_layer_accesses can't
distinguish actual errors from shutdown cases.

## Summary of changes

Replaced anyhow::Error with an explicit CollectKeySpaceError type, that
includes conversion from PageReconstructError::Cancelled.
2023-11-10 13:58:18 +00:00
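The error-type split described above follows a common Rust pattern: a dedicated error enum plus a `From` conversion that keeps cancellation distinguishable from genuine failures. A simplified sketch (these are illustrative stand-ins, not the exact pageserver definitions):

```rust
// Simplified stand-ins for the pageserver error types.
#[derive(Debug)]
enum PageReconstructError {
    Cancelled,
    Other(String),
}

#[derive(Debug)]
enum CollectKeySpaceError {
    Cancelled,
    Other(String),
}

impl From<PageReconstructError> for CollectKeySpaceError {
    fn from(e: PageReconstructError) -> Self {
        match e {
            // Preserve cancellation so callers can treat shutdown as expected.
            PageReconstructError::Cancelled => CollectKeySpaceError::Cancelled,
            PageReconstructError::Other(msg) => CollectKeySpaceError::Other(msg),
        }
    }
}

impl CollectKeySpaceError {
    /// Callers can now log shutdown at a lower level than real errors.
    fn is_cancelled(&self) -> bool {
        matches!(self, CollectKeySpaceError::Cancelled)
    }
}
```

Compare this with `anyhow::Error`, where the cancellation case would be erased and every failure would look warn-worthy in the logs.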
Rahul Modpur
a6f892e200 metric: add started and killed walredo processes counter (#5809)
In OOM situations, knowing exactly how many walredo processes there were
at a time would help afterwards to understand why the pageserver was
OOM-killed. Add the `pageserver_wal_redo_process_total` metric to keep
track of the total wal redo processes started, shut down, and killed
since pageserver start.

Closes #5722

---------

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <me@cschwarz.com>
2023-11-10 15:05:22 +02:00
Alexander Bayandin
71b380f90a Set BUILD_TAG for build-neon job (#5847)
## Problem

I've added `BUILD_TAG` to docker images.
(https://github.com/neondatabase/neon/pull/5812), but forgot to add it
to services that we build for tests

## Summary of changes
- Set `BUILD_TAG` in `build-neon` job
2023-11-10 12:49:52 +00:00
Alexander Bayandin
6e145a44fa workflows/neon_extra_builds: run check-codestyle-rust & build-neon on arm64 (#5832)
## Problem

Some developers use workstations with arm CPUs, and sometimes code that
works on x86-64 is not fully portable to them (for example,
https://github.com/neondatabase/neon/pull/5827).
Although we don't have arm CPUs in prod (yet?), it is worth having
some basic checks for this architecture to provide a better developer
experience.

Closes https://github.com/neondatabase/neon/issues/5829

## Summary of changes
- Run `check-codestyle-rust`-like & `build-neon`-like jobs on Arm runner
- Add `run-extra-build-*` label to run all available extra builds
2023-11-10 12:45:41 +00:00
Arpad Müller
8e5e3971ba find_lsn_for_timestamp fixes (#5844)
Includes the changes of #3689 that address point 1 of #3689, plus some
further improvements. In particular, this PR does:

* set `min_lsn` to a safe value to create branches from (and verify it
in tests)
* return `min_lsn` instead of `max_lsn` for `NoData` and `Past` (verify
it in test for `Past`, `NoData` is harder and not as important)
* return `commit_lsn` instead of `max_lsn` for Future (and verify it in
the tests)
* add some comments

Split out of #5686 to get something more minimal out to users.
2023-11-10 13:38:44 +01:00
Joonas Koivunen
8dd29f1e27 fix(pageserver): spawn all kinds of tenant shutdowns (#5841)
Minor bugfix, something noticed while manual code-review. Use the same
joinset for inprogress tenants so we can get the benefit of the
buffering logging just as we get for attached tenants, and no single
inprogress task can hold up shutdown of other tenants.
2023-11-09 21:36:57 +00:00
Joonas Koivunen
f5344fb85a temp: log all layer loading errors while we lose them (#5816)
Temporary workaround while some errors are not being logged.

Cc: #5815.
2023-11-09 21:31:53 +00:00
Arpad Müller
f95f001b8b Lsn for get_timestamp_of_lsn should be string, not integer (#5840)
The `get_timestamp_of_lsn` pageserver endpoint has been added in #5497,
but the yml it added was wrong: the lsn is expected in hex format, not
in integer (decimal) format.
2023-11-09 16:12:18 +00:00
John Spray
e0821e1eab pageserver: refined Timeline shutdown (#5833)
## Problem

We have observed the shutdown of a timeline taking a long time when a
deletion arrives at a busy time for the system. This suggests that we
are not respecting cancellation tokens promptly enough.

## Summary of changes

- Refactor timeline shutdown so that rather than having a shutdown()
function that takes a flag for optionally flushing, there are two
distinct functions, one for graceful flushing shutdown, and another that
does the "normal" shutdown where we're just setting a cancellation token
and then tearing down as fast as we can. This makes things a bit easier
to reason about, and enables us to remove the hand-written variant of
shutdown that was maintained in `delete.rs`
- Layer flush task checks cancellation token more carefully
- Logical size calculation's handling of cancellation tokens is
simplified: rather than passing one in, it respects the Timeline's
cancellation token.

This PR doesn't touch RemoteTimelineClient, which will be a key thing to
fix as well, so that a slow remote storage op doesn't hold up shutdown.
2023-11-09 16:02:59 +00:00
bojanserafimov
4469b1a62c Fix blob_io test (#5818) 2023-11-09 10:47:03 -05:00
Joonas Koivunen
842223b47f fix(metric): remove pageserver_wal_redo_wait_seconds (#5791)
the meaning of the values recorded in this histogram changed with #5560
and we never had it visualized as a histogram, just the
`increase(_sum)`. The histogram is not too interesting to look at, so
remove it per discussion in [slack
thread](https://neondb.slack.com/archives/C063LJFF26S/p1699008316109999?thread_ts=1698852436.637559&cid=C063LJFF26S).
2023-11-09 16:40:52 +02:00
Anna Stepanyan
893616051d Update epic-template.md (#5709)
Replace the checkbox list with a proper task list in the epic template.

NB: this PR does not change the code, it only touches the github issue
templates
2023-11-09 15:24:43 +01:00
Conrad Ludgate
7cdde285a5 proxy: limit concurrent wake_compute requests per endpoint (#5799)
## Problem

A user can perform many database connections at the same instant;
these will all miss the cache and materialise as requests to the
control plane. #5705

## Summary of changes

I am using a `DashMap` (a sharded `RwLock<HashMap>`) of endpoints ->
semaphores to apply a limiter. If the limiter is enabled (permits > 0),
the semaphore will be retrieved per endpoint and a permit will be
awaited before continuing to call the wake_compute endpoint.

### Important details

This dashmap would grow uncontrollably without maintenance. It's not a
cache so I don't think an LRU-based reclamation makes sense. Instead,
I've made use of the sharding functionality of DashMap to lock a single
shard and clear out unused semaphores periodically.

I ran a test in release, using 128 tokio tasks among 12 threads each
pushing 1000 entries into the map per second, clearing a shard every 2
seconds (64 second epoch with 32 shards). The endpoint names were
sampled from a gamma distribution to make sure some overlap would occur,
and each permit was held for 1ms. The histogram for time to clear each
shard settled between 256-512us without any variance in my testing.

Holding a lock for under a millisecond on one of the shards does not
concern me as a blocking hazard.
2023-11-09 14:14:30 +00:00
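The per-endpoint limiting described above can be sketched with std primitives. The PR uses `DashMap` and async semaphores; this simplified, illustrative version uses a mutex-guarded map of active-permit counts and a non-blocking `try_acquire`:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Limit concurrent wake_compute calls per endpoint (illustrative sketch).
struct WakeComputeLimiter {
    max_permits: usize,
    active: Mutex<HashMap<String, usize>>,
}

impl WakeComputeLimiter {
    fn new(max_permits: usize) -> Self {
        WakeComputeLimiter { max_permits, active: Mutex::new(HashMap::new()) }
    }

    /// Take a permit for `endpoint`; returns false if the endpoint is saturated.
    fn try_acquire(&self, endpoint: &str) -> bool {
        let mut active = self.active.lock().unwrap();
        let count = active.entry(endpoint.to_string()).or_insert(0);
        if *count < self.max_permits {
            *count += 1;
            true
        } else {
            false
        }
    }

    /// Release a permit; drop empty entries so the map does not grow unboundedly
    /// (the PR instead clears DashMap shards periodically).
    fn release(&self, endpoint: &str) {
        let mut active = self.active.lock().unwrap();
        if let Some(count) = active.get_mut(endpoint) {
            *count -= 1;
            if *count == 0 {
                active.remove(endpoint);
            }
        }
    }
}
```

The real implementation awaits a semaphore permit rather than failing fast, and shards the map so that cleanup only locks a fraction of the entries at a time.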
John Spray
9c30883c4b remote_storage: use S3 SDK's adaptive retry policy (#5813)
## Problem

Currently, we aren't doing any explicit slowdown in response to 429
responses. Recently, as we hit remote storage a bit harder (pageserver
does more ListObjectsV2 requests than it used to since #5580), we're
seeing storms of 429 responses that may be the result of not just doing
too many requests, but continuing to do those extra requests without
backing off any more than our usual backoff::exponential.

## Summary of changes

Switch from AWS's "Standard" retry policy to "Adaptive" -- docs describe
this as experimental but it has been around for a long time. The main
difference between Standard and Adaptive is that Adaptive rate-limits
the client in response to feedback from the server, which is meant to
avoid scenarios where the client would otherwise repeatedly hit
throttling responses.
2023-11-09 13:50:13 +00:00
Arthur Petukhovsky
0495798591 Fix walproposer build on aarch64 (#5827)
There was a compilation error due to `std::ffi::c_char` being a different type on different platforms. Clippy also complained for a similar reason.
2023-11-09 13:05:17 +00:00
Sasha Krassovsky
87389bc933 Add test simulating bad connection between pageserver and compute (#5728)
## Problem
We have a funny 3-day timeout for connections between the compute and
pageserver. We want to get rid of it, so to do that we need to make sure
the compute is resilient to connection failures.

Closes: https://github.com/neondatabase/neon/issues/5518

## Summary of changes
This test makes the pageserver randomly drop the connection if the
failpoint is enabled, and ensures we can keep querying the pageserver.

This PR also reduces the default timeout to 10 minutes from 3 days.
2023-11-08 19:48:57 +00:00
Arpad Müller
ea118a238a JWT logging improvements (#5823)
* lower level on auth success from info to debug (fixes #5820)
* don't log stacktraces on auth errors (as requested on slack). we do this by introducing an `AuthError` type instead of using `anyhow` and `bail`.
* return errors that have been censored for improved security.
2023-11-08 16:56:53 +00:00
Christian Schwarz
e9b227a11e cleanup unused RemoteStorage fields (#5830)
Found this while working on #5771
2023-11-08 16:54:33 +00:00
John Spray
40441f8ada pageserver: use Gate for stronger safety check in SlotGuard (#5793)
## Problem

#5711 and #5367 raced -- the `SlotGuard` type needs `Gate` to properly
enforce its invariant that we may not drop an `Arc<Tenant>` from a slot.

## Summary of changes

Replace the TODO with the intended check of Gate.
2023-11-08 13:00:11 +00:00
John Spray
a8a39cd464 test: de-flake test_deletion_queue_recovery (#5822)
## Problem

This test could fail if timing is unlucky, and the deletions in the test
land in two deletion lists instead of one.

## Summary of changes

We await _some_ validations instead of _all_ validations, because our
execution failpoint will prevent validation proceeding for any but the
first DeletionList. Usually the workload just generates one, but if it
generates two due to timing, then we must not expect the second one to
be validated.
2023-11-08 12:41:48 +00:00
John Spray
b989ad1922 extend test_change_pageserver for failure case, rework changing pageserver (#5693)
Reproducer for https://github.com/neondatabase/neon/issues/5692

The test change in this PR intentionally fails, to demonstrate the
issue.

---------

Co-authored-by: Sasha Krassovsky <krassovskysasha@gmail.com>
2023-11-08 11:26:56 +00:00
Em Sharnoff
acef742a6e vm-monitor: Remove dependency on workspace_hack (#5752)
neondatabase/autoscaling builds libs/vm-monitor during CI because it's a
necessary component of autoscaling.

workspace_hack includes a lot of crates that are not necessary for
vm-monitor, which artificially inflates the build time on the
autoscaling side, so hopefully removing the dependency should speed
things up.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 09:41:20 -08:00
duguorong009
11d9d801b5 pageserver: improve the shutdown log error (#5792)
## Problem
- Close #5784 

## Summary of changes
- Update the `GetActiveTenantError` -> `QueryError` conversion process
in `pageserver/src/page_service.rs`
- Update the pytest logging exceptions in
`./test_runner/regress/test_tenant_detach.py`
2023-11-07 16:57:26 +00:00
Andrew Rudenko
fc47af156f Passing neon options to the console (#5781)
The idea is to pass neon_*-prefixed options to the control plane. They
can be used by cplane to dynamically create timelines and computes. Such
options should also be excluded from what is passed to compute. Another
issue is how connection caching works now: since a compute's instance
depends not only on the hostname but probably on such options too, I
included them in the cache key.
2023-11-07 16:49:26 +01:00
Arpad Müller
e310533ed3 Support JWT key reload in pageserver (#5594)
## Problem

For quickly rotating JWT secrets, we want to be able to reload the JWT
public key file in the pageserver, and also support multiple JWT keys.

See #4897.

## Summary of changes

* Allow directories for the `auth_validation_public_key_path` config
param instead of just files. For the safekeepers, all of their config options
also support multiple JWT keys.
* For the pageservers, make the JWT public keys easily globally swappable
by using the `arc-swap` crate.
* Add an endpoint to the pageserver, triggered by a POST to
`/v1/reload_auth_validation_keys`, that reloads the JWT public keys from
the pre-configured path (for security reasons, you cannot upload any
keys yourself).

Fixes #4897

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 15:43:29 +01:00
John Spray
1d68f52b57 pageserver: move deletion failpoint inside backoff (#5814)
## Problem

When enabled, this failpoint would busy-spin in a loop that emits log
messages.

## Summary of changes

Move the failpoint inside a backoff::exponential block: it will still
spam the log, but at much lower rate.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 14:25:51 +00:00
Alexander Bayandin
4cd47b7d4b Dockerfile: Set BUILD_TAG for storage services (#5812)
## Problem

https://github.com/neondatabase/neon/pull/5576 added `build-tag`
reporting to `libmetrics_build_info`, but it's not reported because we
didn't set the corresponding env variable in the build process.

## Summary of changes
- Add `BUILD_TAG` env var while building services
2023-11-07 13:45:59 +00:00
Fernando Luz
0141c95788 build: Add warning when missing postgres submodule during the build (#5614)
I forked the project and wasn't able to compile it in my local repo; in
my search, I found the solution on the Neon forum. After a PR
discussion, I made a change in the makefile to warn about the missing
`git submodule update` step.

---------

Signed-off-by: Fernando Luz <prof.fernando.luz@gmail.com>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 12:13:05 +00:00
Shany Pozin
0ac4cf67a6 Use self.tenants instead of TENANTS (#5811) 2023-11-07 11:38:02 +00:00
Joonas Koivunen
4be6bc7251 refactor: remove unnecessary unsafe (#5802)
Unsafe impls for `Send` and `Sync` should not be added by default. In
the case of `SlotGuard`, removing them does not cause any issues, as the
compiler automatically derives those.

This PR adds a requirement to document the unsafety (see
[clippy::undocumented_unsafe_blocks]) and opportunistically adds
`#![deny(unsafe_code)]` to most places where we don't have unsafe code
right now.

TRPL on Send and Sync:
https://doc.rust-lang.org/book/ch16-04-extensible-concurrency-sync-and-send.html

[clippy::undocumented_unsafe_blocks]:
https://rust-lang.github.io/rust-clippy/master/#/undocumented_unsafe_blocks
2023-11-07 10:26:25 +00:00
John Spray
a394f49e0d pageserver: avoid converting an error to anyhow::Error (#5803)
This was preventing it getting cleanly converted to a
CalculateLogicalSizeError::Cancelled, resulting in "Logical size
calculation failed" errors in logs.
2023-11-07 09:35:45 +00:00
John Spray
c00651ff9b pageserver: start refactoring into TenantManager (#5797)
## Problem

See: https://github.com/neondatabase/neon/issues/5796

## Summary of changes

Completing the refactor is quite verbose and can be done in stages: each
interface that is currently called directly from a top-level mgr.rs
function can be moved into TenantManager once the relevant subsystems
have access to it.

Landing the initial change to create TenantManager is useful because
it enables new code to use it without having to be altered later, and
sets us up to incrementally fix the existing code to use an explicit
Arc<TenantManager> instead of relying on the static TENANTS.
2023-11-07 09:06:53 +00:00
Richy Wang
bea8efac24 Fix comments in 'receive_wal.rs'. (#5807)
## Problem
Some comments in 'receive_wal.rs' are inaccurate; they appear to have
been copied from 'send_wal.rs' and left unchanged.
## Summary of changes
This commit fixes two comments in the code:
Changed "/// Unregister walsender." to "/// Unregister walreceiver."
Changed "///Scope guard to access slot in WalSenders registry" to
"///Scope guard to access slot in WalReceivers registry."
2023-11-07 09:13:01 +01:00
Conrad Ludgate
ad5b02e175 proxy: remove unsafe (#5805)
## Problem

`unsafe {}`

## Summary of changes

`CStr` has a method to parse the bytes up to a null byte, so we don't
have to do it ourselves.
2023-11-06 17:44:44 +00:00
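The standard-library method referenced here is `CStr::from_bytes_until_nul` (stable since Rust 1.69), which replaces hand-rolled scanning for the NUL terminator:

```rust
use std::ffi::CStr;

/// Parse a C-style string out of a buffer without manual byte scanning.
/// Returns None if there is no NUL terminator or the bytes are not UTF-8.
fn parse_cstr(buf: &[u8]) -> Option<&str> {
    // Reads bytes up to (and requiring) the first NUL byte.
    CStr::from_bytes_until_nul(buf).ok()?.to_str().ok()
}
```

Unlike `CStr::from_bytes_with_nul`, this variant tolerates trailing bytes after the terminator, which suits parsing out of a larger protocol buffer.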
Arpad Müller
b09a851705 Make azure blob storage not do extra metadata requests (#5777)
Load the metadata from the returned `GetBlobResponse` and avoid
downloading it via a separate request.
As it turns out, the SDK does return the metadata:
https://github.com/Azure/azure-sdk-for-rust/issues/1439 .

This PR will reduce the number of requests to Azure caused by downloads.

Fixes #5571
2023-11-06 15:16:55 +00:00
John Spray
85cd97af61 pageserver: add InProgress tenant map state, use a sync lock for the map (#5367)
## Problem

Follows on from #5299 
- We didn't have a generic way to protect a tenant undergoing changes:
`Tenant` had states, but for our arbitrary transitions between
secondary/attached, we need a general way to say "reserve this tenant
ID, and don't allow any other ops on it, but don't try and report it as
being in any particular state".
- The TenantsMap structure was behind an async RwLock, but it was never
correct to hold it across await points: that would block any other
changes for all tenants.


## Summary of changes

- Add the `TenantSlot::InProgress` value.  This means:
  - Incoming administrative operations on the tenant should retry later
- Anything trying to read the live state of the tenant (e.g. a page
service reader) should retry later or block.
- Store TenantsMap in `std::sync::RwLock`
- Provide an extended `get_active_tenant_with_timeout` for page_service
to use, which will wait on InProgress slots as well as non-active
tenants.

Closes: https://github.com/neondatabase/neon/issues/5378

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-11-06 14:03:22 +00:00
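The slot-state scheme described above can be sketched as follows. These are hypothetical, simplified types (the real map is keyed by tenant ID and returns richer results), showing the core idea: `InProgress` reserves the ID and tells callers to retry, and the map lives behind a sync `RwLock` that is never held across await points:

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

// Simplified stand-ins for the real pageserver types.
struct Tenant;

enum TenantSlot {
    Attached(Tenant),
    Secondary,
    // Reserved: a transition is underway; no particular state to report.
    InProgress,
}

#[derive(Debug)]
enum GetTenantError {
    NotFound,
    /// The slot exists but is being mutated; retry later or wait.
    Retry,
}

struct TenantsMap {
    slots: RwLock<BTreeMap<String, TenantSlot>>,
}

impl TenantsMap {
    fn get(&self, tenant_id: &str) -> Result<(), GetTenantError> {
        // Sync lock: cheap to take, and cannot be held across an await point,
        // so one slow tenant cannot block changes to all the others.
        let slots = self.slots.read().unwrap();
        match slots.get(tenant_id) {
            None => Err(GetTenantError::NotFound),
            Some(TenantSlot::InProgress) => Err(GetTenantError::Retry),
            Some(_) => Ok(()),
        }
    }
}
```

Page-service readers would wrap `get` in a retry/wait loop (the PR's `get_active_tenant_with_timeout`), while administrative operations simply return "retry later" to the caller.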
Arpad Müller
e6470ee92e Add API description for safekeeper copy endpoint (#5770)
Adds a yaml API description for a new endpoint that allows creation of a
new timeline as the copy of an existing one.
 
Part of #5282
2023-11-06 15:00:07 +01:00
bojanserafimov
dc72567288 Layer flush minor speedup (#5765)
Convert keys to `i128` before sorting
2023-11-06 08:58:20 -05:00
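The speedup is the classic trick of packing a multi-field key into one wide integer once, so the sort's hot loop does single integer compares instead of field-by-field comparison. A sketch with a hypothetical key type (the real pageserver `Key` has different fields):

```rust
/// Hypothetical multi-field key, similar in spirit to pageserver's Key.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Key {
    field1: u32,
    field2: u32,
    field3: u32,
}

impl Key {
    /// Pack the fields, most significant first, into a single i128.
    fn to_i128(self) -> i128 {
        ((self.field1 as i128) << 64)
            | ((self.field2 as i128) << 32)
            | (self.field3 as i128)
    }
}

fn sort_keys(keys: &mut [Key]) {
    // sort_by_key caches nothing, but the conversion is a few shifts/ORs,
    // far cheaper than a chained field-by-field Ord implementation.
    keys.sort_by_key(|k| k.to_i128());
}
```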
John Spray
6defa2b5d5 pageserver: add Gate as a partner to CancellationToken for safe shutdown of Tenant & Timeline (#5711)
## Problem

When shutting down a Tenant, it isn't just important to cause any
background tasks to stop. It's also important to wait until they have
stopped before declaring shutdown complete, in cases where we may re-use
the tenant's local storage for something else, such as running in
secondary mode, or creating a new tenant with the same ID.

## Summary of changes

A `Gate` class is added, inspired by
[seastar::gate](https://docs.seastar.io/master/classseastar_1_1gate.html).
For types that have an important lifetime that corresponds to some
physical resource, use of a Gate as well as a CancellationToken provides
a robust pattern for async requests & shutdown:
- Requests must always acquire the gate as long as they are using the
object
- Shutdown must set the cancellation token, and then `close()` the gate
to wait for requests in progress before returning.

This is not for memory safety: it's for expressing the difference
between "Arc<Tenant> exists", and "This tenant's files on disk are
eligible to be read/written".

- Both Tenant and Timeline get a Gate & CancellationToken.
- The Timeline gate is held during eviction of layers, and during
page_service requests.
- Existing cancellation support in page_service is refined to use the
timeline-scope cancellation token instead of a process-scope
cancellation token. This replaces the use of `task_mgr::associate_with`:
tasks no longer change their tenant/timeline identity after being
spawned.

The Tenant's Gate is not yet used, but will be important for
Tenant-scoped operations in secondary mode, where we must ensure that
our secondary-mode downloads for a tenant are gated wrt the activity of
an attached Tenant.

This is part of a broader move away from using the global-state driven
`task_mgr` shutdown tokens:
- less global state where we rely on implicit knowledge of what task a
given function is running in, and more explicit references to the
cancellation token that a particular function/type will respect, making
shutdown easier to reason about.
- eventually avoid the big global TASKS mutex.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-06 12:39:20 +00:00
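The Gate contract described above can be sketched in blocking form. The real implementation is async; this std-only, illustrative version shows the two halves of the pattern: requests hold a guard while using the resource, and shutdown refuses new entries and then waits for outstanding guards to drop:

```rust
use std::sync::{Condvar, Mutex};

/// Counts in-flight operations; close() blocks until all guards are dropped.
struct Gate {
    state: Mutex<GateState>,
    cond: Condvar,
}

struct GateState {
    entered: usize,
    closed: bool,
}

struct GateGuard<'a> {
    gate: &'a Gate,
}

impl Gate {
    fn new() -> Self {
        Gate {
            state: Mutex::new(GateState { entered: 0, closed: false }),
            cond: Condvar::new(),
        }
    }

    /// Requests acquire the gate for as long as they use the resource.
    fn enter(&self) -> Option<GateGuard<'_>> {
        let mut s = self.state.lock().unwrap();
        if s.closed {
            return None; // shutting down: refuse new work
        }
        s.entered += 1;
        Some(GateGuard { gate: self })
    }

    /// Shutdown: refuse new entries, then wait for in-flight ones to finish.
    fn close(&self) {
        let mut s = self.state.lock().unwrap();
        s.closed = true;
        while s.entered > 0 {
            s = self.cond.wait(s).unwrap();
        }
    }
}

impl Drop for GateGuard<'_> {
    fn drop(&mut self) {
        let mut s = self.gate.state.lock().unwrap();
        s.entered -= 1;
        self.gate.cond.notify_all();
    }
}
```

As the commit says, this is not about memory safety: `close()` returning is what licenses reuse of the tenant's on-disk state, independently of how long any `Arc<Tenant>` lives.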
duguorong009
b3d3a2587d feat: improve the serde impl for several types(Lsn, TenantId, TimelineId ...) (#5335)
Improve the serde impl for several types (`Lsn`, `TenantId`,
`TimelineId`) by making them sensitive to
`Serializer::is_human_readable` (true for json, false for bincode).

Fixes #3511 by:
- Implement the custom serde for `Lsn`
- Implement the custom serde for `Id`
- Add the helper module `serde_as_u64` in `libs/utils/src/lsn.rs`
- Remove the unnecessary attr `#[serde_as(as = "DisplayFromStr")]` in
all possible structs

Additionally some safekeeper types gained serde tests.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-06 11:40:03 +02:00
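The human-readable form for `Lsn` is Postgres's `hi/lo` hex notation. The string handling itself can be sketched without the serde plumbing (simplified; a `Serialize` impl would emit this string when `serializer.is_human_readable()` is true and the raw u64 otherwise):

```rust
use std::fmt;

/// Simplified Lsn: a u64 shown as "hi/lo" hex for humans, raw u64 for bincode.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Lsn(u64);

impl fmt::Display for Lsn {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Upper 32 bits, a slash, then the lower 32 bits, both in hex.
        write!(f, "{:X}/{:X}", self.0 >> 32, self.0 & 0xFFFF_FFFF)
    }
}

impl Lsn {
    fn parse(s: &str) -> Option<Lsn> {
        let (hi, lo) = s.split_once('/')?;
        let hi = u64::from_str_radix(hi, 16).ok()?;
        let lo = u64::from_str_radix(lo, 16).ok()?;
        Some(Lsn((hi << 32) | lo))
    }
}
```

This is also why the `get_timestamp_of_lsn` API change elsewhere in this range matters: the LSN travels as a hex string like `1/2A`, not a decimal integer.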
Heikki Linnakangas
b85fc39bdb Update control plane API path for getting compute spec. (#5357)
We changed the path in the control plane. The old path is still accepted
for compatibility with existing computes, but we'd like to phase it out.
2023-11-06 09:26:09 +02:00
duguorong009
09b5954526 refactor: use streaming in safekeeper /v1/debug_dump http response (#5731)
- Update the handler for `/v1/debug_dump` http response in safekeeper
- Update the `debug_dump::build()` to use the streaming in JSON build
process
2023-11-05 10:16:54 +00:00
John Spray
306c4f9967 s3_scrubber: prepare for scrubbing buckets with generation-aware content (#5700)
## Problem

The scrubber didn't know how to find the latest index_part when
generations were in use.

## Summary of changes

- Teach the scrubber to do the same dance that pageserver does when
finding the latest index_part.json
- Teach the scrubber how to understand layer files with generation
suffixes.
- General improvement to testability: scan_metadata has a machine
readable output that the testing `S3Scrubber` wrapper can read.
- Existing test coverage of scrubber was false-passing because it just
didn't see any data due to prefixing of data in the bucket. Fix that.

This is incremental improvement: the more confidence we can have in the
scrubber, the more we can use it in integration tests to validate the
state of remote storage.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-11-03 17:36:02 +00:00
Konstantin Knizhnik
5ceccdc7de Logical replication startup fixes (#5750)
## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1698226491736459

## Summary of changes

Update WAL affected buffers when restoring WAL from safekeeper

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-11-03 18:40:27 +02:00
Conrad Ludgate
cdcaa329bf proxy: no more statements (#5747)
## Problem

My prepared statements change in tokio-postgres landed in the latest
release. It didn't work as we intended.

## Summary of changes

https://github.com/neondatabase/rust-postgres/pull/24
2023-11-03 08:30:58 +00:00
Joonas Koivunen
27bdbf5e36 chore(layer): restore logging, doc changes (#5766)
Some of the log messages were lost with #4938. This PR adds some of
them back, most notably:

- starting to on-demand download
- successful completion of on-demand download
- ability to see when there were many waiters for the layer download
- "unexpectedly on-demand downloading ..." is now `info!`

Additionally some rare events are logged as error, which should never
happen.
2023-11-02 19:05:33 +00:00
khanova
4c7fa12a2a Proxy introduce allowed ips (#5729)
## Problem

Proxy doesn't accept wake_compute responses with the allowed IPs.

## Summary of changes

Extend wake_compute api to be able to return allowed_ips.
2023-11-02 16:26:15 +00:00
Em Sharnoff
367971a0e9 vm-monitor: Remove support for file cache in tmpfs (#5617)
ref neondatabase/cloud#7516.

We switched everything over to file cache on disk, now time to remove
support for having it in tmpfs.
2023-11-02 16:06:16 +00:00
bojanserafimov
51570114ea Remove outdated and flaky perf test (#5762) 2023-11-02 10:43:59 -04:00
Joonas Koivunen
098d3111a5 fix(layer): get_and_upgrade and metrics (#5767)
When introducing `get_and_upgrade` I forgot that an `evict_and_wait`
would have already incremented the counter for started evictions, but an
upgrade would just "silently" cancel the eviction, as no drop would ever
run. These metrics are likely sources for alerts with the next release,
so it's important to keep them correct.
2023-11-02 13:06:14 +00:00
Joonas Koivunen
3737fe3a4b fix(layer): error out early if layer path is non-file (#5756)
In an earlier PR
(https://github.com/neondatabase/neon/pull/5743#discussion_r1378625244)
I added a FIXME, and @jcsp suggested a simple solution, so this
implements it. As for why I did not implement this originally: there is
no concept of a permanent failure, so this failure will happen quite
often. I don't think the frequency is a problem, however.

Sadly for std::fs::FileType there is only decimal and hex formatting, no
octal.
2023-11-02 11:03:38 +00:00
John Spray
5650138532 pageserver: helpers for explicitly dying on fatal I/O errors (#5651)
Following from discussion on
https://github.com/neondatabase/neon/pull/5436 where hacking an implicit
die-on-fatal-io behavior into an Error type was a source of disagreement
-- in this PR, dying on fatal I/O errors is explicit, with `fatal_err`
and `maybe_fatal_err` helpers in the `MaybeFatalIo` trait, which is
implemented for std::io::Result.

To enable this approach with `crashsafe_overwrite`, the return type of
that function is changed to std::io::Result -- the previous error enum
for this function was not used for any logic, and the utility of saying
exactly which step in the function failed is outweighed by the hygiene
of having an I/O funciton return an io::Result.

The initial use case for these helpers is the deletion queue.
2023-11-02 09:14:26 +00:00
Joonas Koivunen
2dca4c03fc feat(layer): cancellable get_or_maybe_download (#5744)
With the layer implementation as done in #4938, cancellation could cause
two concurrent downloads on the same path, due to how
`RemoteTimelineClient::download_remote_layer` handles tempfiles. Thread
the init semaphore through the spawned download task to make this
impossible.
2023-11-02 08:06:32 +00:00
bojanserafimov
0b790b6d00 Record wal size in import benchmark (#5755) 2023-11-01 17:02:58 -04:00
Joonas Koivunen
e82d1ad6b8 fix(layer): reinit on access before eviction happens (#5743)
Right before merging, I added a loop to
`LayerInner::get_or_maybe_download`, which was always supposed to be
there. However, I had forgotten to restart initialization instead of
waiting for the eviction to happen, as required by the original design
goal that "eviction should always lose to redownload (or init)". This
was wrong. After this fix, if the `spawn_blocking` queue is blocked on
something, nothing bad will happen.

Part of #5737.
2023-11-01 17:38:32 +02:00
Muhammet Yazici
4f0a8e92ad fix: Add bearer prefix to Authorization header (#5740)
## Problem

Some requests with `Authorization` header did not properly set the
`Bearer ` prefix. Problem explained here
https://github.com/neondatabase/cloud/issues/6390.

## Summary of changes

Added `Bearer ` prefix to missing requests.
2023-11-01 09:41:48 +03:00
Konstantin Knizhnik
5952f350cb Always handle POLLHUP in walredo error poll loop (#5716)
## Problem

test_stderr hangs on MacOS.

See https://neondb.slack.com/archives/C036U0GRMRB/p1698438997903919

## Summary of changes

Always handle POLLHUP to prevent infinite loop.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-10-31 20:57:03 +02:00
Tristan Partin
726c8e6730 Add docs for updating Postgres for new minor versions 2023-10-31 12:31:14 -05:00
Em Sharnoff
f7067a38b7 compute_ctl: Assume --vm-monitor-addr arg is always present (#5611)
It has a default value, so this should be sound.
Treating its presence as semantically significant was leading to
spurious warnings.
2023-10-31 10:00:23 -07:00
Joonas Koivunen
896347f307 refactor(layer): remove version checking with atomics (#5742)
The `LayerInner::version` never needed to be read in more than one
place. Clarified while fixing #5737, of which this is the first step.
This reduces possible incorrect atomics usage in Layer, but does not
really fix anything by itself.
2023-10-31 18:40:08 +02:00
John Spray
e5c81fef86 tests: minor improvements (#5674)
Minor changes from while I have been working on HA tests:
- Manual pytest executions came with some warnings from `log.warn()`
usage
- When something fails in a generations-enabled test, it is useful to
have a log from the attachment service of what attached when, and with
which generation.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-10-31 11:44:35 +00:00
Christian Schwarz
7ebe9ca1ac pageserver: /attach: clarify semantics of 409 (#5698)
context: https://app.incident.io/neondb/incidents/75
specifically:
https://neondb.slack.com/archives/C0634NXQ6E7/p1698422852902959?thread_ts=1698419362.155059&cid=C0634NXQ6E7
2023-10-31 09:32:58 +01:00
Shany Pozin
1588601503 Move release PR creation to Friday (#5721)
Prepare for a new release workflow
* Release PR is created on Fridays
* The discussion/approval happens during Friday
* On Sunday morning, the deployment will be done in central-il and perf
tests will be run
* Early Monday morning (IST), gradually start rolling out (starting from
US regions, as they are still in weekend time)

See slack for discussion:
https://neondb.slack.com/archives/C04P81J55LK/p1698565305607839?thread_ts=1698428241.031979&cid=C04P81J55LK
2023-10-30 22:10:24 +01:00
John Spray
9c35e1e6e5 pageserver: downgrade slow task warnings from warn to info (#5724)
## Problem

In #5658 we suppressed the first-iteration output from these logs, but
the volume of warnings is still problematic.

## Summary of changes

- Downgrade all slow task warnings to INFO. The information is still
there if we actively want to know about which tasks are running slowly,
without polluting the overall stream of warnings with situations that
are unsurprising to us.
- Revert the previous change so that we output on the first iteration as
we used to do. There is no reason to suppress these, now that the
severity is just info.
2023-10-30 18:32:30 +00:00
Conrad Ludgate
d8c21ec70d fix nightly 1.75 (#5719)
## Problem

Neon didn't compile on nightly and had numerous clippy complaints.

## Summary of changes

1. Fixed troublesome dependency
2. Fixed or ignored the lints where appropriate
2023-10-30 16:43:06 +00:00
194 changed files with 10259 additions and 5170 deletions

View File

@@ -22,5 +22,11 @@ platforms = [
     # "x86_64-pc-windows-msvc",
 ]
+
+[final-excludes]
+# vm_monitor benefits from the same Cargo.lock as the rest of our artifacts, but
+# it is built primarly in separate repo neondatabase/autoscaling and thus is excluded
+# from depending on workspace-hack because most of the dependencies are not used.
+workspace-members = ["vm_monitor"]
 
 # Write out exact versions rather than a semver range. (Defaults to false.)
 # exact-versions = true

View File

@@ -17,8 +17,9 @@ assignees: ''
 
 ## Implementation ideas
 
-## Tasks
-- [ ]
+```[tasklist]
+### Tasks
+```
 
 ## Other related tasks and Epics

View File

@@ -3,7 +3,7 @@
 **NB: this PR must be merged only by 'Create a merge commit'!**
 
 ### Checklist when preparing for release
-- [ ] Read or refresh [the release flow guide](https://github.com/neondatabase/cloud/wiki/Release:-general-flow)
+- [ ] Read or refresh [the release flow guide](https://www.notion.so/neondatabase/Release-general-flow-61f2e39fd45d4d14a70c7749604bd70b)
 - [ ] Ask in the [cloud Slack channel](https://neondb.slack.com/archives/C033A2WE6BZ) that you are going to rollout the release. Any blockers?
 - [ ] Does this release contain any db migrations? Destructive ones? What is the rollback plan?

View File

@@ -1,5 +1,7 @@
self-hosted-runner: self-hosted-runner:
labels: labels:
- arm64
- dev
- gen3 - gen3
- large - large
- small - small

View File

@@ -172,10 +172,10 @@ jobs:
       # https://github.com/EmbarkStudios/cargo-deny
       - name: Check rust licenses/bans/advisories/sources
         if: ${{ !cancelled() }}
-        run: cargo deny check
+        run: cargo deny check --hide-inclusion-graph
 
   build-neon:
-    needs: [ check-permissions ]
+    needs: [ check-permissions, tag ]
     runs-on: [ self-hosted, gen3, large ]
     container:
       image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
@@ -187,6 +187,7 @@ jobs:
     env:
       BUILD_TYPE: ${{ matrix.build_type }}
       GIT_VERSION: ${{ github.event.pull_request.head.sha || github.sha }}
+      BUILD_TAG: ${{ needs.tag.outputs.build-tag }}
 
     steps:
       - name: Fix git ownership
@@ -585,10 +586,13 @@ jobs:
         id: upload-coverage-report-new
         env:
           BUCKET: neon-github-public-dev
+          # A differential coverage report is available only for PRs.
+          # (i.e. for pushes into main/release branches we have a regular coverage report)
           COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
+          BASE_SHA: ${{ github.event.pull_request.base.sha || github.sha }}
         run: |
-          BASELINE="$(git merge-base HEAD origin/main)"
           CURRENT="${COMMIT_SHA}"
+          BASELINE="$(git merge-base $BASE_SHA $CURRENT)"
 
           cp /tmp/coverage/report/lcov.info ./${CURRENT}.info
@@ -723,6 +727,7 @@ jobs:
           --cache-repo 369495373322.dkr.ecr.eu-central-1.amazonaws.com/cache
           --context .
           --build-arg GIT_VERSION=${{ github.event.pull_request.head.sha || github.sha }}
+          --build-arg BUILD_TAG=${{ needs.tag.outputs.build-tag }}
           --build-arg REPOSITORY=369495373322.dkr.ecr.eu-central-1.amazonaws.com
           --destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/neon:${{needs.tag.outputs.build-tag}}
           --destination neondatabase/neon:${{needs.tag.outputs.build-tag}}
@@ -847,7 +852,7 @@ jobs:
     run:
       shell: sh -eu {0}
     env:
-      VM_BUILDER_VERSION: v0.18.5
+      VM_BUILDER_VERSION: v0.19.0
 
     steps:
       - name: Checkout
@@ -869,8 +874,7 @@ jobs:
       - name: Build vm image
         run: |
           ./vm-builder \
-            -enable-file-cache \
-            -cgroup-uid=postgres \
+            -spec=vm-image-spec.yaml \
             -src=369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
             -dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}

View File

@@ -21,7 +21,10 @@ env:
 jobs:
   check-macos-build:
-    if: github.ref_name == 'main' || contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos')
+    if: |
+      contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
+      contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
+      github.ref_name == 'main'
     timeout-minutes: 90
     runs-on: macos-latest
@@ -112,8 +115,182 @@ jobs:
       - name: Check that no warnings are produced
         run: ./run_clippy.sh
check-linux-arm-build:
timeout-minutes: 90
runs-on: [ self-hosted, dev, arm64 ]
env:
# Use release build only, to have less debug info around
# Hence keeping target/ (and general cache size) smaller
BUILD_TYPE: release
CARGO_FEATURES: --features testing
CARGO_FLAGS: --locked --release
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_DEV }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY_DEV }}
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
options: --init
steps:
- name: Fix git ownership
run: |
# Workaround for `fatal: detected dubious ownership in repository at ...`
#
# Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
# Ref https://github.com/actions/checkout/issues/785
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
- name: Checkout
uses: actions/checkout@v4
with:
submodules: true
fetch-depth: 1
- name: Set pg 14 revision for caching
id: pg_v14_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v14) >> $GITHUB_OUTPUT
- name: Set pg 15 revision for caching
id: pg_v15_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v15) >> $GITHUB_OUTPUT
- name: Set pg 16 revision for caching
id: pg_v16_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v16) >> $GITHUB_OUTPUT
- name: Set env variables
run: |
echo "CARGO_HOME=${GITHUB_WORKSPACE}/.cargo" >> $GITHUB_ENV
- name: Cache postgres v14 build
id: cache_pg_14
uses: actions/cache@v3
with:
path: pg_install/v14
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v14_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v15 build
id: cache_pg_15
uses: actions/cache@v3
with:
path: pg_install/v15
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v15_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v16 build
id: cache_pg_16
uses: actions/cache@v3
with:
path: pg_install/v16
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v16_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Build postgres v14
if: steps.cache_pg_14.outputs.cache-hit != 'true'
run: mold -run make postgres-v14 -j$(nproc)
- name: Build postgres v15
if: steps.cache_pg_15.outputs.cache-hit != 'true'
run: mold -run make postgres-v15 -j$(nproc)
- name: Build postgres v16
if: steps.cache_pg_16.outputs.cache-hit != 'true'
run: mold -run make postgres-v16 -j$(nproc)
- name: Build neon extensions
run: mold -run make neon-pg-ext -j$(nproc)
- name: Build walproposer-lib
run: mold -run make walproposer-lib -j$(nproc)
- name: Run cargo build
run: |
mold -run cargo build $CARGO_FLAGS $CARGO_FEATURES --bins --tests
- name: Run cargo test
run: |
cargo test $CARGO_FLAGS $CARGO_FEATURES
# Run separate tests for real S3
export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty
export REMOTE_STORAGE_S3_BUCKET=neon-github-public-dev
export REMOTE_STORAGE_S3_REGION=eu-central-1
# Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
cargo test $CARGO_FLAGS --package remote_storage --test test_real_s3
# Run separate tests for real Azure Blob Storage
# XXX: replace region with `eu-central-1`-like region
export ENABLE_REAL_AZURE_REMOTE_STORAGE=y
export AZURE_STORAGE_ACCOUNT="${{ secrets.AZURE_STORAGE_ACCOUNT_DEV }}"
export AZURE_STORAGE_ACCESS_KEY="${{ secrets.AZURE_STORAGE_ACCESS_KEY_DEV }}"
export REMOTE_STORAGE_AZURE_CONTAINER="${{ vars.REMOTE_STORAGE_AZURE_CONTAINER }}"
export REMOTE_STORAGE_AZURE_REGION="${{ vars.REMOTE_STORAGE_AZURE_REGION }}"
# Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
cargo test $CARGO_FLAGS --package remote_storage --test test_real_azure
check-codestyle-rust-arm:
timeout-minutes: 90
runs-on: [ self-hosted, dev, arm64 ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
options: --init
steps:
- name: Checkout
uses: actions/checkout@v4
with:
submodules: true
fetch-depth: 1
# Some of our rust modules use FFI and need those to be checked
- name: Get postgres headers
run: make postgres-headers -j$(nproc)
# cargo hack runs the given cargo subcommand (clippy in this case) for all feature combinations.
# This will catch compiler & clippy warnings in all feature combinations.
# TODO: use cargo hack for build and test as well, but, that's quite expensive.
# NB: keep clippy args in sync with ./run_clippy.sh
- run: |
CLIPPY_COMMON_ARGS="$( source .neon_clippy_args; echo "$CLIPPY_COMMON_ARGS")"
if [ "$CLIPPY_COMMON_ARGS" = "" ]; then
echo "No clippy args found in .neon_clippy_args"
exit 1
fi
echo "CLIPPY_COMMON_ARGS=${CLIPPY_COMMON_ARGS}" >> $GITHUB_ENV
- name: Run cargo clippy (debug)
run: cargo hack --feature-powerset clippy $CLIPPY_COMMON_ARGS
- name: Run cargo clippy (release)
run: cargo hack --feature-powerset clippy --release $CLIPPY_COMMON_ARGS
- name: Check documentation generation
run: cargo doc --workspace --no-deps --document-private-items
env:
RUSTDOCFLAGS: "-Dwarnings -Arustdoc::private_intra_doc_links"
# Use `${{ !cancelled() }}` to run quck tests after the longer clippy run
- name: Check formatting
if: ${{ !cancelled() }}
run: cargo fmt --all -- --check
# https://github.com/facebookincubator/cargo-guppy/tree/bec4e0eb29dcd1faac70b1b5360267fc02bf830e/tools/cargo-hakari#2-keep-the-workspace-hack-up-to-date-in-ci
- name: Check rust dependencies
if: ${{ !cancelled() }}
run: |
cargo hakari generate --diff # workspace-hack Cargo.toml is up-to-date
cargo hakari manage-deps --dry-run # all workspace crates depend on workspace-hack
# https://github.com/EmbarkStudios/cargo-deny
- name: Check rust licenses/bans/advisories/sources
if: ${{ !cancelled() }}
run: cargo deny check
   gather-rust-build-stats:
-    if: github.ref_name == 'main' || contains(github.event.pull_request.labels.*.name, 'run-extra-build-stats')
+    if: |
+      contains(github.event.pull_request.labels.*.name, 'run-extra-build-stats') ||
+      contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
+      github.ref_name == 'main'
     runs-on: [ self-hosted, gen3, large ]
     container:
       image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned

View File

@@ -2,7 +2,7 @@ name: Create Release Branch
 on:
   schedule:
-    - cron: '0 7 * * 2'
+    - cron: '0 7 * * 5'
   workflow_dispatch:
 
 jobs:

Cargo.lock (generated, 677 lines changed): diff suppressed because it is too large.

View File

@@ -36,6 +36,7 @@ license = "Apache-2.0"
 ## All dependency versions, used in the project
 [workspace.dependencies]
 anyhow = { version = "1.0", features = ["backtrace"] }
+arc-swap = "1.6"
 async-compression = { version = "0.4.0", features = ["tokio", "gzip"] }
 azure_core = "0.16"
 azure_identity = "0.16"
@@ -47,6 +48,7 @@ async-trait = "0.1"
 aws-config = { version = "0.56", default-features = false, features=["rustls"] }
 aws-sdk-s3 = "0.29"
 aws-smithy-http = "0.56"
+aws-smithy-async = { version = "0.56", default-features = false, features=["rt-tokio"] }
 aws-credential-types = "0.56"
 aws-types = "0.56"
 axum = { version = "0.6.20", features = ["ws"] }
@@ -65,7 +67,7 @@ comfy-table = "6.1"
 const_format = "0.2"
 crc32c = "0.6"
 crossbeam-utils = "0.8.5"
-dashmap = "5.5.0"
+dashmap = { version = "5.5.0", features = ["raw-api"] }
 either = "1.8"
 enum-map = "2.4.2"
 enumset = "1.0.12"
@@ -81,7 +83,7 @@ hex = "0.4"
 hex-literal = "0.4"
 hmac = "0.12.1"
 hostname = "0.3.1"
-http-types = "2"
+http-types = { version = "2", default-features = false }
 humantime = "2.1"
 humantime-serde = "1.1.1"
 hyper = "0.14"
@@ -124,6 +126,7 @@ sentry = { version = "0.31", default-features = false, features = ["backtrace",
 serde = { version = "1.0", features = ["derive"] }
 serde_json = "1"
 serde_with = "2.0"
+serde_assert = "0.5.0"
 sha2 = "0.10.2"
 signal-hook = "0.3"
 smallvec = "1.11"
@@ -133,6 +136,7 @@ strum_macros = "0.24"
 svg_fmt = "0.4.1"
 sync_wrapper = "0.1.2"
 tar = "0.4"
+task-local-extensions = "0.1.4"
 test-context = "0.1"
 thiserror = "1.0"
 tls-listener = { version = "0.7", features = ["rustls", "hyper-h1"] }
@@ -161,11 +165,11 @@ env_logger = "0.10"
 log = "0.4"
 
 ## Libraries from neondatabase/ git forks, ideally with changes to be upstreamed
-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="7434d9388965a17a6d113e5dfc0e65666a03b4c2" }
-postgres-native-tls = { git = "https://github.com/neondatabase/rust-postgres.git", rev="7434d9388965a17a6d113e5dfc0e65666a03b4c2" }
-postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="7434d9388965a17a6d113e5dfc0e65666a03b4c2" }
-postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", rev="7434d9388965a17a6d113e5dfc0e65666a03b4c2" }
-tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="7434d9388965a17a6d113e5dfc0e65666a03b4c2" }
+postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="6ce32f791526e27533cab0232a6bb243b2c32584" }
+postgres-native-tls = { git = "https://github.com/neondatabase/rust-postgres.git", rev="6ce32f791526e27533cab0232a6bb243b2c32584" }
+postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="6ce32f791526e27533cab0232a6bb243b2c32584" }
+postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", rev="6ce32f791526e27533cab0232a6bb243b2c32584" }
+tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="6ce32f791526e27533cab0232a6bb243b2c32584" }
 
 ## Other git libraries
 heapless = { default-features=false, features=[], git = "https://github.com/japaric/heapless.git", rev = "644653bf3b831c6bb4963be2de24804acf5e5001" } # upstream release pending
@@ -202,7 +206,7 @@ tonic-build = "0.9"
 # This is only needed for proxy's tests.
 # TODO: we should probably fork `tokio-postgres-rustls` instead.
-tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="7434d9388965a17a6d113e5dfc0e65666a03b4c2" }
+tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="6ce32f791526e27533cab0232a6bb243b2c32584" }
 
 ################# Binary contents sections

View File

@@ -27,6 +27,7 @@ RUN set -e \
 FROM $REPOSITORY/$IMAGE:$TAG AS build
 WORKDIR /home/nonroot
 ARG GIT_VERSION=local
+ARG BUILD_TAG
 
 # Enable https://github.com/paritytech/cachepot to cache Rust crates' compilation results in Docker builds.
 # Set up cachepot to use an AWS S3 bucket for cache results, to reuse it between `docker build` invocations.
@@ -78,9 +79,9 @@ COPY --from=build --chown=neon:neon /home/nonroot/target/release/pg_sni_router
 COPY --from=build --chown=neon:neon /home/nonroot/target/release/pageserver /usr/local/bin
 COPY --from=build --chown=neon:neon /home/nonroot/target/release/pagectl /usr/local/bin
 COPY --from=build --chown=neon:neon /home/nonroot/target/release/safekeeper /usr/local/bin
 COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_broker /usr/local/bin
 COPY --from=build --chown=neon:neon /home/nonroot/target/release/proxy /usr/local/bin
 COPY --from=build --chown=neon:neon /home/nonroot/target/release/neon_local /usr/local/bin
 COPY --from=pg-build /home/nonroot/pg_install/v14 /usr/local/v14/
 COPY --from=pg-build /home/nonroot/pg_install/v15 /usr/local/v15/

View File

@@ -714,6 +714,23 @@ RUN wget https://github.com/pksunkara/pgx_ulid/archive/refs/tags/v0.1.3.tar.gz -
     cargo pgrx install --release && \
     echo "trusted = true" >> /usr/local/pgsql/share/extension/ulid.control
 
+#########################################################################################
+#
+# Layer "pg-wait-sampling-pg-build"
+# compile pg_wait_sampling extension
+#
+#########################################################################################
+FROM build-deps AS pg-wait-sampling-pg-build
+COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
+
+ENV PATH "/usr/local/pgsql/bin/:$PATH"
+
+RUN wget https://github.com/postgrespro/pg_wait_sampling/archive/refs/tags/v1.1.5.tar.gz -O pg_wait_sampling.tar.gz && \
+    echo 'a03da6a413f5652ce470a3635ed6ebba528c74cb26aa4cfced8aff8a8441f81ec6dd657ff62cd6ce96a4e6ce02cad9f2519ae9525367ece60497aa20faafde5c pg_wait_sampling.tar.gz' | sha512sum -c && \
+    mkdir pg_wait_sampling-src && cd pg_wait_sampling-src && tar xvzf ../pg_wait_sampling.tar.gz --strip-components=1 -C . && \
+    make USE_PGXS=1 -j $(getconf _NPROCESSORS_ONLN) && \
+    make USE_PGXS=1 -j $(getconf _NPROCESSORS_ONLN) install && \
+    echo 'trusted = true' >> /usr/local/pgsql/share/extension/pg_wait_sampling.control
+
 #########################################################################################
 #
 # Layer "neon-pg-ext-build"
@@ -750,6 +767,7 @@ COPY --from=rdkit-pg-build /usr/local/pgsql/ /usr/local/pgsql/
 COPY --from=pg-uuidv7-pg-build /usr/local/pgsql/ /usr/local/pgsql/
 COPY --from=pg-roaringbitmap-pg-build /usr/local/pgsql/ /usr/local/pgsql/
 COPY --from=pg-embedding-pg-build /usr/local/pgsql/ /usr/local/pgsql/
+COPY --from=pg-wait-sampling-pg-build /usr/local/pgsql/ /usr/local/pgsql/
 COPY pgxn/ pgxn/
 
 RUN make -j $(getconf _NPROCESSORS_ONLN) \

View File

@@ -72,6 +72,10 @@ neon: postgres-headers walproposer-lib
 #
 $(POSTGRES_INSTALL_DIR)/build/%/config.status:
 	+@echo "Configuring Postgres $* build"
+	@test -s $(ROOT_PROJECT_DIR)/vendor/postgres-$*/configure || { \
+		echo "\nPostgres submodule not found in $(ROOT_PROJECT_DIR)/vendor/postgres-$*/, execute "; \
+		echo "'git submodule update --init --recursive --depth 2 --progress .' in project root.\n"; \
+		exit 1; }
 	mkdir -p $(POSTGRES_INSTALL_DIR)/build/$*
 	(cd $(POSTGRES_INSTALL_DIR)/build/$* && \
 	env PATH="$(EXTRA_PATH_OVERRIDES):$$PATH" $(ROOT_PROJECT_DIR)/vendor/postgres-$*/configure \

View File

@@ -278,32 +278,26 @@ fn main() -> Result<()> {
         if #[cfg(target_os = "linux")] {
             use std::env;
             use tokio_util::sync::CancellationToken;
-            use tracing::warn;
-            let vm_monitor_addr = matches.get_one::<String>("vm-monitor-addr");
+            let vm_monitor_addr = matches
+                .get_one::<String>("vm-monitor-addr")
+                .expect("--vm-monitor-addr should always be set because it has a default arg");
             let file_cache_connstr = matches.get_one::<String>("filecache-connstr");
             let cgroup = matches.get_one::<String>("cgroup");
-            let file_cache_on_disk = matches.get_flag("file-cache-on-disk");
 
             // Only make a runtime if we need to.
             // Note: it seems like you can make a runtime in an inner scope and
             // if you start a task in it it won't be dropped. However, make it
             // in the outermost scope just to be safe.
-            let rt = match (env::var_os("AUTOSCALING"), vm_monitor_addr) {
-                (None, None) => None,
-                (None, Some(_)) => {
-                    warn!("--vm-monitor-addr option set but AUTOSCALING env var not present");
-                    None
-                }
-                (Some(_), None) => {
-                    panic!("AUTOSCALING env var present but --vm-monitor-addr option not set")
-                }
-                (Some(_), Some(_)) => Some(
+            let rt = if env::var_os("AUTOSCALING").is_some() {
+                Some(
                     tokio::runtime::Builder::new_multi_thread()
                         .worker_threads(4)
                         .enable_all()
                         .build()
-                        .expect("failed to create tokio runtime for monitor"),
-                ),
+                        .expect("failed to create tokio runtime for monitor"),
+                )
+            } else {
+                None
             };
 
             // This token is used internally by the monitor to clean up all threads
@@ -314,8 +308,7 @@ fn main() -> Result<()> {
                 Box::leak(Box::new(vm_monitor::Args {
                     cgroup: cgroup.cloned(),
                     pgconnstr: file_cache_connstr.cloned(),
-                    addr: vm_monitor_addr.cloned().unwrap(),
-                    file_cache_on_disk,
+                    addr: vm_monitor_addr.clone(),
                 })),
                 token.clone(),
             ))
@@ -487,6 +480,8 @@ fn cli() -> clap::Command {
                 .value_name("FILECACHE_CONNSTR"),
         )
         .arg(
+            // DEPRECATED, NO LONGER DOES ANYTHING.
+            // See https://github.com/neondatabase/cloud/issues/7516
             Arg::new("file-cache-on-disk")
                 .long("file-cache-on-disk")
                 .action(clap::ArgAction::SetTrue),

View File

@@ -2,6 +2,7 @@ use std::collections::HashMap;
 use std::env;
 use std::fs;
 use std::io::BufRead;
+use std::io::Write;
 use std::os::unix::fs::PermissionsExt;
 use std::path::Path;
 use std::process::{Command, Stdio};
@@ -14,6 +15,7 @@ use chrono::{DateTime, Utc};
 use futures::future::join_all;
 use futures::stream::FuturesUnordered;
 use futures::StreamExt;
+use notify::event;
 use postgres::{Client, NoTls};
 use tokio;
 use tokio_postgres;
@@ -644,9 +646,30 @@ impl ComputeNode {
     } else {
         vec![]
     })
+    .stderr(Stdio::piped())
     .spawn()
     .expect("cannot start postgres process");
+
+let stderr = pg.stderr.take().unwrap();
+std::thread::spawn(move || {
+    let reader = std::io::BufReader::new(stderr);
+    let mut last_lines = vec![];
+    for line in reader.lines() {
+        if let Ok(line) = line {
+            if line.starts_with("2023-") {
+                // print all lines from the previous postgres instance
+                let combined = format!("PG:{}\n", last_lines.join("\u{200B}"));
+                let res = std::io::stderr().lock().write_all(combined.as_bytes());
+                if let Err(e) = res {
+                    error!("failed to write to stderr: {}", e);
+                }
+                last_lines.clear();
+            }
+            last_lines.push(line);
+        }
+    }
+});

 wait_for_postgres(&mut pg, pgdata_path)?;
 Ok(pg)
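The joining rule in the hunk above can be modeled as a pure function. This is a simplified sketch of the assumed behavior, not the exact compute_ctl code (the real thread additionally prefixes each flushed record with `PG:` and writes it to stderr): a stderr line beginning with a timestamp starts a new log record, and any other line is treated as a continuation glued on with a zero-width space.

```rust
// Simplified model of the multiline-log joining above (assumed behavior):
// a line starting with a "2023-" timestamp begins a new record; other
// lines are continuations joined with a zero-width space (U+200B) so a
// multiline Postgres message reaches the log collector as one line.
fn join_multiline(lines: &[&str]) -> Vec<String> {
    let mut records: Vec<String> = Vec::new();
    for line in lines {
        if line.starts_with("2023-") || records.is_empty() {
            records.push((*line).to_string());
        } else {
            let rec = records.last_mut().unwrap();
            rec.push('\u{200B}');
            rec.push_str(line);
        }
    }
    records
}

fn main() {
    let out = join_multiline(&[
        "2023-11-16 20:54:02 LOG:  statement: SELECT",
        "    1 + 2;",
        "2023-11-16 20:54:02 LOG:  duration: 0.1 ms",
    ]);
    // Two records: the continuation line was folded into the first one.
    assert_eq!(out.len(), 2);
    assert!(out[0].contains('\u{200B}'));
    println!("{} records", out.len());
}
```

Using a zero-width space as the separator keeps each record a single physical line for line-oriented collectors while remaining visually unobtrusive.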
@@ -710,8 +733,12 @@ impl ComputeNode {
 // `pg_ctl` for start / stop, so this just seems much easier to do as we already
 // have opened connection to Postgres and superuser access.
 #[instrument(skip_all)]
-fn pg_reload_conf(&self, client: &mut Client) -> Result<()> {
-    client.simple_query("SELECT pg_reload_conf()")?;
+fn pg_reload_conf(&self) -> Result<()> {
+    let pgctl_bin = Path::new(&self.pgbin).parent().unwrap().join("pg_ctl");
+    Command::new(pgctl_bin)
+        .args(["reload", "-D", &self.pgdata])
+        .output()
+        .expect("cannot run pg_ctl process");
     Ok(())
 }
@@ -724,9 +751,9 @@ impl ComputeNode {
 // Write new config
 let pgdata_path = Path::new(&self.pgdata);
 config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), &spec, None)?;
+self.pg_reload_conf()?;

 let mut client = Client::connect(self.connstr.as_str(), NoTls)?;
-self.pg_reload_conf(&mut client)?;

 // Proceed with post-startup configuration. Note, that order of operations is important.
 // Disable DDL forwarding because control plane already knows about these roles/databases.


@@ -78,7 +78,7 @@ use regex::Regex;
 use remote_storage::*;
 use serde_json;
 use std::io::Read;
-use std::num::{NonZeroU32, NonZeroUsize};
+use std::num::NonZeroUsize;
 use std::path::Path;
 use std::str;
 use tar::Archive;
@@ -133,45 +133,6 @@ fn parse_pg_version(human_version: &str) -> &str {
     panic!("Unsuported postgres version {human_version}");
 }

-#[cfg(test)]
-mod tests {
-    use super::parse_pg_version;
-
-    #[test]
-    fn test_parse_pg_version() {
-        assert_eq!(parse_pg_version("PostgreSQL 15.4"), "v15");
-        assert_eq!(parse_pg_version("PostgreSQL 15.14"), "v15");
-        assert_eq!(
-            parse_pg_version("PostgreSQL 15.4 (Ubuntu 15.4-0ubuntu0.23.04.1)"),
-            "v15"
-        );
-
-        assert_eq!(parse_pg_version("PostgreSQL 14.15"), "v14");
-        assert_eq!(parse_pg_version("PostgreSQL 14.0"), "v14");
-        assert_eq!(
-            parse_pg_version("PostgreSQL 14.9 (Debian 14.9-1.pgdg120+1"),
-            "v14"
-        );
-
-        assert_eq!(parse_pg_version("PostgreSQL 16devel"), "v16");
-        assert_eq!(parse_pg_version("PostgreSQL 16beta1"), "v16");
-        assert_eq!(parse_pg_version("PostgreSQL 16rc2"), "v16");
-        assert_eq!(parse_pg_version("PostgreSQL 16extra"), "v16");
-    }
-
-    #[test]
-    #[should_panic]
-    fn test_parse_pg_unsupported_version() {
-        parse_pg_version("PostgreSQL 13.14");
-    }
-
-    #[test]
-    #[should_panic]
-    fn test_parse_pg_incorrect_version_format() {
-        parse_pg_version("PostgreSQL 14");
-    }
-}
-
 // download the archive for a given extension,
 // unzip it, and place files in the appropriate locations (share/lib)
 pub async fn download_extension(
@@ -281,9 +242,46 @@ pub fn init_remote_storage(remote_ext_config: &str) -> anyhow::Result<GenericRem
     max_keys_per_list_response: None,
 };
 let config = RemoteStorageConfig {
-    max_concurrent_syncs: NonZeroUsize::new(100).expect("100 != 0"),
-    max_sync_errors: NonZeroU32::new(100).expect("100 != 0"),
     storage: RemoteStorageKind::AwsS3(config),
 };
 GenericRemoteStorage::from_config(&config)
 }

+#[cfg(test)]
+mod tests {
+    use super::parse_pg_version;
+
+    #[test]
+    fn test_parse_pg_version() {
+        assert_eq!(parse_pg_version("PostgreSQL 15.4"), "v15");
+        assert_eq!(parse_pg_version("PostgreSQL 15.14"), "v15");
+        assert_eq!(
+            parse_pg_version("PostgreSQL 15.4 (Ubuntu 15.4-0ubuntu0.23.04.1)"),
+            "v15"
+        );
+
+        assert_eq!(parse_pg_version("PostgreSQL 14.15"), "v14");
+        assert_eq!(parse_pg_version("PostgreSQL 14.0"), "v14");
+        assert_eq!(
+            parse_pg_version("PostgreSQL 14.9 (Debian 14.9-1.pgdg120+1"),
+            "v14"
+        );
+
+        assert_eq!(parse_pg_version("PostgreSQL 16devel"), "v16");
+        assert_eq!(parse_pg_version("PostgreSQL 16beta1"), "v16");
+        assert_eq!(parse_pg_version("PostgreSQL 16rc2"), "v16");
+        assert_eq!(parse_pg_version("PostgreSQL 16extra"), "v16");
+    }
+
+    #[test]
+    #[should_panic]
+    fn test_parse_pg_unsupported_version() {
+        parse_pg_version("PostgreSQL 13.14");
+    }
+
+    #[test]
+    #[should_panic]
+    fn test_parse_pg_incorrect_version_format() {
+        parse_pg_version("PostgreSQL 14");
+    }
+}
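The moved test module pins down the contract of `parse_pg_version` fairly tightly. This is a hedged sketch of an implementation consistent with those tests — the real function in extension_server.rs is not shown in this diff and may differ:

```rust
// Sketch of a parse_pg_version satisfying the tests above (assumption:
// the real implementation is not visible in this diff). v14/v15 require
// a dotted minor version; v16 accepts pre-release suffixes like
// "16devel"; anything else panics.
fn parse_pg_version(human_version: &str) -> &str {
    let rest = human_version
        .strip_prefix("PostgreSQL ")
        .unwrap_or(human_version);
    if rest.starts_with("14.") {
        "v14"
    } else if rest.starts_with("15.") {
        "v15"
    } else if rest.starts_with("16") {
        "v16"
    } else {
        panic!("Unsupported postgres version {human_version}")
    }
}

fn main() {
    assert_eq!(parse_pg_version("PostgreSQL 15.4"), "v15");
    assert_eq!(parse_pg_version("PostgreSQL 16devel"), "v16");
    assert_eq!(
        parse_pg_version("PostgreSQL 14.9 (Debian 14.9-1.pgdg120+1"),
        "v14"
    );
    println!("ok");
}
```

Note that "PostgreSQL 14" (no minor version) falls through to the panic arm, matching `test_parse_pg_incorrect_version_format`.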


@@ -1,7 +1,7 @@
 //!
 //! Various tools and helpers to handle cluster / compute node (Postgres)
 //! configuration.
-//!
+#![deny(unsafe_code)]
+#![deny(clippy::undocumented_unsafe_blocks)]

 pub mod checker;
 pub mod config;
 pub mod configurator;


@@ -24,7 +24,7 @@ fn do_control_plane_request(
 ) -> Result<ControlPlaneSpecResponse, (bool, String)> {
     let resp = reqwest::blocking::Client::new()
         .get(uri)
-        .header("Authorization", jwt)
+        .header("Authorization", format!("Bearer {}", jwt))
         .send()
         .map_err(|e| {
             (
@@ -68,7 +68,7 @@ pub fn get_spec_from_control_plane(
     base_uri: &str,
     compute_id: &str,
 ) -> Result<Option<ComputeSpec>> {
-    let cp_uri = format!("{base_uri}/management/api/v2/computes/{compute_id}/spec");
+    let cp_uri = format!("{base_uri}/compute/api/v2/computes/{compute_id}/spec");
     let jwt: String = match std::env::var("NEON_CONTROL_PLANE_TOKEN") {
         Ok(v) => v,
         Err(_) => "".to_string(),
@@ -670,6 +670,12 @@ pub fn handle_extensions(spec: &ComputeSpec, client: &mut Client) -> Result<()>
         info!("creating system extensions with query: {}", query);
         client.simple_query(query)?;
     }
+    if libs.contains("pg_wait_sampling") {
+        // Create extension only if this compute really needs it
+        let query = "CREATE EXTENSION IF NOT EXISTS pg_wait_sampling";
+        info!("creating system extensions with query: {}", query);
+        client.simple_query(query)?;
+    }
 }

 Ok(())


@@ -2,7 +2,6 @@ use crate::{background_process, local_env::LocalEnv};
 use anyhow::anyhow;
 use camino::Utf8PathBuf;
 use serde::{Deserialize, Serialize};
-use serde_with::{serde_as, DisplayFromStr};
 use std::{path::PathBuf, process::Child};
 use utils::id::{NodeId, TenantId};
@@ -10,14 +9,13 @@ pub struct AttachmentService {
     env: LocalEnv,
     listen: String,
     path: PathBuf,
+    client: reqwest::blocking::Client,
 }
 const COMMAND: &str = "attachment_service";

-#[serde_as]
 #[derive(Serialize, Deserialize)]
 pub struct AttachHookRequest {
-    #[serde_as(as = "DisplayFromStr")]
     pub tenant_id: TenantId,
     pub node_id: Option<NodeId>,
 }
@@ -27,6 +25,16 @@ pub struct AttachHookResponse {
     pub gen: Option<u32>,
 }

+#[derive(Serialize, Deserialize)]
+pub struct InspectRequest {
+    pub tenant_id: TenantId,
+}
+
+#[derive(Serialize, Deserialize)]
+pub struct InspectResponse {
+    pub attachment: Option<(u32, NodeId)>,
+}
+
 impl AttachmentService {
     pub fn from_env(env: &LocalEnv) -> Self {
         let path = env.base_data_dir.join("attachments.json");
@@ -45,6 +53,9 @@ impl AttachmentService {
             env: env.clone(),
             path,
             listen,
+            client: reqwest::blocking::ClientBuilder::new()
+                .build()
+                .expect("Failed to construct http client"),
         }
     }
@@ -87,16 +98,13 @@ impl AttachmentService {
         .unwrap()
         .join("attach-hook")
         .unwrap();

-    let client = reqwest::blocking::ClientBuilder::new()
-        .build()
-        .expect("Failed to construct http client");
     let request = AttachHookRequest {
         tenant_id,
         node_id: Some(pageserver_id),
     };

-    let response = client.post(url).json(&request).send()?;
+    let response = self.client.post(url).json(&request).send()?;
     if response.status() != StatusCode::OK {
         return Err(anyhow!("Unexpected status {}", response.status()));
     }
@@ -104,4 +112,26 @@ impl AttachmentService {
     let response = response.json::<AttachHookResponse>()?;
     Ok(response.gen)
 }

+    pub fn inspect(&self, tenant_id: TenantId) -> anyhow::Result<Option<(u32, NodeId)>> {
+        use hyper::StatusCode;
+
+        let url = self
+            .env
+            .control_plane_api
+            .clone()
+            .unwrap()
+            .join("inspect")
+            .unwrap();
+
+        let request = InspectRequest { tenant_id };
+
+        let response = self.client.post(url).json(&request).send()?;
+        if response.status() != StatusCode::OK {
+            return Err(anyhow!("Unexpected status {}", response.status()));
+        }
+
+        let response = response.json::<InspectResponse>()?;
+        Ok(response.attachment)
+    }
 }


@@ -262,7 +262,7 @@ where
     P: Into<Utf8PathBuf>,
 {
     let path: Utf8PathBuf = path.into();
-    // SAFETY
+    // SAFETY:
     // pre_exec is marked unsafe because it runs between fork and exec.
     // Why is that dangerous in various ways?
     // Long answer: https://github.com/rust-lang/rust/issues/39575


@@ -12,6 +12,7 @@ use hyper::{Body, Request, Response};
 use serde::{Deserialize, Serialize};
 use std::path::{Path, PathBuf};
 use std::{collections::HashMap, sync::Arc};
+use utils::http::endpoint::request_span;
 use utils::logging::{self, LogFormat};
 use utils::signals::{ShutdownSignals, Signal};
@@ -31,7 +32,9 @@ use pageserver_api::control_api::{
     ValidateResponseTenant,
 };

-use control_plane::attachment_service::{AttachHookRequest, AttachHookResponse};
+use control_plane::attachment_service::{
+    AttachHookRequest, AttachHookResponse, InspectRequest, InspectResponse,
+};

 #[derive(Parser)]
 #[command(author, version, about, long_about = None)]
@@ -221,8 +224,25 @@ async fn handle_attach_hook(mut req: Request<Body>) -> Result<Response<Body>, Ap
     generation: 0,
 });

-if attach_req.node_id.is_some() {
+if let Some(attaching_pageserver) = attach_req.node_id.as_ref() {
     tenant_state.generation += 1;
+    tracing::info!(
+        tenant_id = %attach_req.tenant_id,
+        ps_id = %attaching_pageserver,
+        generation = %tenant_state.generation,
+        "issuing",
+    );
+} else if let Some(ps_id) = tenant_state.pageserver {
+    tracing::info!(
+        tenant_id = %attach_req.tenant_id,
+        %ps_id,
+        generation = %tenant_state.generation,
+        "dropping",
+    );
+} else {
+    tracing::info!(
+        tenant_id = %attach_req.tenant_id,
+        "no-op: tenant already has no pageserver");
 }
 tenant_state.pageserver = attach_req.node_id;
 let generation = tenant_state.generation;
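The branching above encodes a simple generation rule. A minimal model of it (my reading of the hunk, not code from the PR): only attaching to a pageserver issues a new generation, while detaching leaves the counter untouched.

```rust
// Minimal model of the attach-hook generation rule above (a reading of
// the diff, not the PR's exact code): attaching to any pageserver bumps
// the tenant's generation; detaching (node_id = None) or a no-op keeps
// the current generation.
fn next_generation(current: u32, attach_to: Option<u64>) -> u32 {
    match attach_to {
        Some(_node_id) => current + 1, // "issuing" a new generation
        None => current,               // "dropping" or no-op
    }
}

fn main() {
    let g1 = next_generation(0, Some(1)); // attach to pageserver 1
    let g2 = next_generation(g1, None);   // detach
    let g3 = next_generation(g2, Some(2)); // attach to pageserver 2
    assert_eq!((g1, g2, g3), (1, 1, 2));
    println!("{g3}");
}
```

Monotonically increasing generations let the pageserver fleet distinguish a stale attachment from the current one, which is what the validate/re-attach endpoints rely on.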
@@ -237,12 +257,28 @@
     )
 }

+async fn handle_inspect(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
+    let inspect_req = json_request::<InspectRequest>(&mut req).await?;
+
+    let state = get_state(&req).inner.clone();
+    let locked = state.write().await;
+    let tenant_state = locked.tenants.get(&inspect_req.tenant_id);
+
+    json_response(
+        StatusCode::OK,
+        InspectResponse {
+            attachment: tenant_state.and_then(|s| s.pageserver.map(|ps| (s.generation, ps))),
+        },
+    )
+}
+
 fn make_router(persistent_state: PersistentState) -> RouterBuilder<hyper::Body, ApiError> {
     endpoint::make_router()
         .data(Arc::new(State::new(persistent_state)))
-        .post("/re-attach", handle_re_attach)
-        .post("/validate", handle_validate)
-        .post("/attach-hook", handle_attach_hook)
+        .post("/re-attach", |r| request_span(r, handle_re_attach))
+        .post("/validate", |r| request_span(r, handle_validate))
+        .post("/attach-hook", |r| request_span(r, handle_attach_hook))
+        .post("/inspect", |r| request_span(r, handle_inspect))
 }

 #[tokio::main]


@@ -11,13 +11,14 @@ use compute_api::spec::ComputeMode;
 use control_plane::attachment_service::AttachmentService;
 use control_plane::endpoint::ComputeControlPlane;
 use control_plane::local_env::LocalEnv;
-use control_plane::pageserver::PageServerNode;
+use control_plane::pageserver::{PageServerNode, PAGESERVER_REMOTE_STORAGE_DIR};
 use control_plane::safekeeper::SafekeeperNode;
+use control_plane::tenant_migration::migrate_tenant;
 use control_plane::{broker, local_env};
 use pageserver_api::models::TimelineInfo;
 use pageserver_api::{
-    DEFAULT_HTTP_LISTEN_ADDR as DEFAULT_PAGESERVER_HTTP_ADDR,
-    DEFAULT_PG_LISTEN_ADDR as DEFAULT_PAGESERVER_PG_ADDR,
+    DEFAULT_HTTP_LISTEN_PORT as DEFAULT_PAGESERVER_HTTP_PORT,
+    DEFAULT_PG_LISTEN_PORT as DEFAULT_PAGESERVER_PG_PORT,
 };
 use postgres_backend::AuthType;
 use safekeeper_api::{
@@ -46,8 +47,8 @@ const DEFAULT_PG_VERSION: &str = "15";

 const DEFAULT_PAGESERVER_CONTROL_PLANE_API: &str = "http://127.0.0.1:1234/";

-fn default_conf() -> String {
-    format!(
+fn default_conf(num_pageservers: u16) -> String {
+    let mut template = format!(
         r#"
 # Default built-in configuration, defined in main.rs
 control_plane_api = '{DEFAULT_PAGESERVER_CONTROL_PLANE_API}'
@@ -55,21 +56,33 @@ control_plane_api = '{DEFAULT_PAGESERVER_CONTROL_PLANE_API}'
 [broker]
 listen_addr = '{DEFAULT_BROKER_ADDR}'

-[[pageservers]]
-id = {DEFAULT_PAGESERVER_ID}
-listen_pg_addr = '{DEFAULT_PAGESERVER_PG_ADDR}'
-listen_http_addr = '{DEFAULT_PAGESERVER_HTTP_ADDR}'
-pg_auth_type = '{trust_auth}'
-http_auth_type = '{trust_auth}'
-
 [[safekeepers]]
 id = {DEFAULT_SAFEKEEPER_ID}
 pg_port = {DEFAULT_SAFEKEEPER_PG_PORT}
 http_port = {DEFAULT_SAFEKEEPER_HTTP_PORT}
 "#,
-        trust_auth = AuthType::Trust,
-    )
+    );
+
+    for i in 0..num_pageservers {
+        let pageserver_id = NodeId(DEFAULT_PAGESERVER_ID.0 + i as u64);
+        let pg_port = DEFAULT_PAGESERVER_PG_PORT + i;
+        let http_port = DEFAULT_PAGESERVER_HTTP_PORT + i;
+
+        template += &format!(
+            r#"
+[[pageservers]]
+id = {pageserver_id}
+listen_pg_addr = '127.0.0.1:{pg_port}'
+listen_http_addr = '127.0.0.1:{http_port}'
+pg_auth_type = '{trust_auth}'
+http_auth_type = '{trust_auth}'
+"#,
+            trust_auth = AuthType::Trust,
+        )
+    }
+    template
 }
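The loop above assigns each pageserver its own pg/http listen ports by offsetting the defaults. An illustrative sketch of that layout (the concrete port numbers below are placeholders, not the project's real `DEFAULT_PAGESERVER_*_PORT` values):

```rust
// Illustrative sketch of the per-pageserver port allocation used by
// default_conf above: pageserver i listens on the default pg/http
// ports offset by i. The default values here are placeholders.
fn pageserver_ports(default_pg: u16, default_http: u16, n: u16) -> Vec<(u16, u16)> {
    (0..n).map(|i| (default_pg + i, default_http + i)).collect()
}

fn main() {
    let ports = pageserver_ports(64000, 9898, 3);
    assert_eq!(ports, vec![(64000, 9898), (64001, 9899), (64002, 9900)]);
    for (i, (pg, http)) in ports.iter().enumerate() {
        println!("pageserver {}: pg={} http={}", i, pg, http);
    }
}
```

Sequential offsets keep the generated config deterministic, so tests can predict which port a given pageserver shard will use.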
 ///
@@ -295,6 +308,9 @@ fn parse_timeline_id(sub_match: &ArgMatches) -> anyhow::Result<Option<TimelineId
 }

 fn handle_init(init_match: &ArgMatches) -> anyhow::Result<LocalEnv> {
+    let num_pageservers = init_match
+        .get_one::<u16>("num-pageservers")
+        .expect("num-pageservers arg has a default");
     // Create config file
     let toml_file: String = if let Some(config_path) = init_match.get_one::<PathBuf>("config") {
         // load and parse the file
@@ -306,7 +322,7 @@ fn handle_init(init_match: &ArgMatches) -> anyhow::Result<LocalEnv> {
         })?
     } else {
         // Built-in default config
-        default_conf()
+        default_conf(*num_pageservers)
     };

     let pg_version = init_match
@@ -320,6 +336,9 @@ fn handle_init(init_match: &ArgMatches) -> anyhow::Result<LocalEnv> {
     env.init(pg_version, force)
         .context("Failed to initialize neon repository")?;

+    // Create remote storage location for default LocalFs remote storage
+    std::fs::create_dir_all(env.base_data_dir.join(PAGESERVER_REMOTE_STORAGE_DIR))?;
+
     // Initialize pageserver, create initial tenant and timeline.
     for ps_conf in &env.pageservers {
         PageServerNode::from_env(&env, ps_conf)
@@ -433,6 +452,15 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
         .with_context(|| format!("Tenant config failed for tenant with id {tenant_id}"))?;
     println!("tenant {tenant_id} successfully configured on the pageserver");
 }
+Some(("migrate", matches)) => {
+    let tenant_id = get_tenant_id(matches, env)?;
+    let new_pageserver = get_pageserver(env, matches)?;
+    let new_pageserver_id = new_pageserver.conf.id;
+    migrate_tenant(env, tenant_id, new_pageserver)?;
+    println!("tenant {tenant_id} migrated to {}", new_pageserver_id);
+}
 Some((sub_name, _)) => bail!("Unexpected tenant subcommand '{}'", sub_name),
 None => bail!("no tenant subcommand provided"),
 }
@@ -867,20 +895,20 @@ fn handle_mappings(sub_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Res
     }
 }

+fn get_pageserver(env: &local_env::LocalEnv, args: &ArgMatches) -> Result<PageServerNode> {
+    let node_id = if let Some(id_str) = args.get_one::<String>("pageserver-id") {
+        NodeId(id_str.parse().context("while parsing pageserver id")?)
+    } else {
+        DEFAULT_PAGESERVER_ID
+    };
+
+    Ok(PageServerNode::from_env(
+        env,
+        env.get_pageserver_conf(node_id)?,
+    ))
+}
+
 fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
-    fn get_pageserver(env: &local_env::LocalEnv, args: &ArgMatches) -> Result<PageServerNode> {
-        let node_id = if let Some(id_str) = args.get_one::<String>("pageserver-id") {
-            NodeId(id_str.parse().context("while parsing pageserver id")?)
-        } else {
-            DEFAULT_PAGESERVER_ID
-        };
-
-        Ok(PageServerNode::from_env(
-            env,
-            env.get_pageserver_conf(node_id)?,
-        ))
-    }
     match sub_match.subcommand() {
         Some(("start", subcommand_args)) => {
             if let Err(e) = get_pageserver(env, subcommand_args)?
@@ -917,6 +945,20 @@ fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
     }
 }

+Some(("migrate", subcommand_args)) => {
+    let pageserver = get_pageserver(env, subcommand_args)?;
+    //TODO what shutdown strategy should we use here?
+    if let Err(e) = pageserver.stop(false) {
+        eprintln!("pageserver stop failed: {}", e);
+        exit(1);
+    }
+
+    if let Err(e) = pageserver.start(&pageserver_config_overrides(subcommand_args)) {
+        eprintln!("pageserver start failed: {e}");
+        exit(1);
+    }
+}
+
 Some(("status", subcommand_args)) => {
     match get_pageserver(env, subcommand_args)?.check_status() {
         Ok(_) => println!("Page server is up and running"),
@@ -1224,6 +1266,13 @@ fn cli() -> Command {
     .help("Force initialization even if the repository is not empty")
     .required(false);

+let num_pageservers_arg = Arg::new("num-pageservers")
+    .value_parser(value_parser!(u16))
+    .long("num-pageservers")
+    .help("How many pageservers to create (default 1)")
+    .required(false)
+    .default_value("1");
+
 Command::new("Neon CLI")
     .arg_required_else_help(true)
     .version(GIT_VERSION)
@@ -1231,6 +1280,7 @@ fn cli() -> Command {
     Command::new("init")
         .about("Initialize a new Neon repository, preparing configs for services to start with")
         .arg(pageserver_config_args.clone())
+        .arg(num_pageservers_arg.clone())
         .arg(
             Arg::new("config")
                 .long("config")
@@ -1301,6 +1351,10 @@ fn cli() -> Command {
     .subcommand(Command::new("config")
         .arg(tenant_id_arg.clone())
         .arg(Arg::new("config").short('c').num_args(1).action(ArgAction::Append).required(false)))
+    .subcommand(Command::new("migrate")
+        .about("Migrate a tenant from one pageserver to another")
+        .arg(tenant_id_arg.clone())
+        .arg(pageserver_id_arg.clone()))
 )
 .subcommand(
     Command::new("pageserver")

@@ -46,7 +46,6 @@ use std::time::Duration;
 use anyhow::{anyhow, bail, Context, Result};
 use serde::{Deserialize, Serialize};
-use serde_with::{serde_as, DisplayFromStr};
 use utils::id::{NodeId, TenantId, TimelineId};

 use crate::local_env::LocalEnv;
@@ -57,13 +56,10 @@ use compute_api::responses::{ComputeState, ComputeStatus};
 use compute_api::spec::{Cluster, ComputeMode, ComputeSpec};

 // contents of a endpoint.json file
-#[serde_as]
 #[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
 pub struct EndpointConf {
     endpoint_id: String,
-    #[serde_as(as = "DisplayFromStr")]
     tenant_id: TenantId,
-    #[serde_as(as = "DisplayFromStr")]
     timeline_id: TimelineId,
     mode: ComputeMode,
     pg_port: u16,


@@ -1,11 +1,10 @@
-//
-// Local control plane.
-//
-// Can start, configure and stop postgres instances running as a local processes.
-//
-// Intended to be used in integration tests and in CLI tools for
-// local installations.
-//
+//! Local control plane.
+//!
+//! Can start, configure and stop postgres instances running as a local processes.
+//!
+//! Intended to be used in integration tests and in CLI tools for
+//! local installations.
+#![deny(clippy::undocumented_unsafe_blocks)]

 pub mod attachment_service;
 mod background_process;
@@ -15,3 +14,4 @@ pub mod local_env;
 pub mod pageserver;
 pub mod postgresql_conf;
 pub mod safekeeper;
+pub mod tenant_migration;


@@ -8,7 +8,6 @@ use anyhow::{bail, ensure, Context};
 use postgres_backend::AuthType;
 use reqwest::Url;
 use serde::{Deserialize, Serialize};
-use serde_with::{serde_as, DisplayFromStr};
 use std::collections::HashMap;
 use std::env;
 use std::fs;
@@ -33,7 +32,6 @@ pub const DEFAULT_PG_VERSION: u32 = 15;
 // to 'neon_local init --config=<path>' option. See control_plane/simple.conf for
 // an example.
 //
-#[serde_as]
 #[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
 pub struct LocalEnv {
     // Base directory for all the nodes (the pageserver, safekeepers and
@@ -59,7 +57,6 @@ pub struct LocalEnv {
     // Default tenant ID to use with the 'neon_local' command line utility, when
     // --tenant_id is not explicitly specified.
     #[serde(default)]
-    #[serde_as(as = "Option<DisplayFromStr>")]
     pub default_tenant_id: Option<TenantId>,

     // used to issue tokens during e.g pg start
@@ -84,7 +81,6 @@ pub struct LocalEnv {
     // A `HashMap<String, HashMap<TenantId, TimelineId>>` would be more appropriate here,
     // but deserialization into a generic toml object as `toml::Value::try_from` fails with an error.
     // https://toml.io/en/v1.0.0 does not contain a concept of "a table inside another table".
-    #[serde_as(as = "HashMap<_, Vec<(DisplayFromStr, DisplayFromStr)>>")]
     branch_name_mappings: HashMap<String, Vec<(TenantId, TimelineId)>>,
 }


@@ -15,7 +15,10 @@ use std::{io, result};
 use anyhow::{bail, Context};
 use camino::Utf8PathBuf;
-use pageserver_api::models::{self, TenantInfo, TimelineInfo};
+use pageserver_api::models::{
+    self, LocationConfig, TenantInfo, TenantLocationConfigRequest, TimelineInfo,
+};
+use pageserver_api::shard::TenantShardId;
 use postgres_backend::AuthType;
 use postgres_connection::{parse_host_port, PgConnectionConfig};
 use reqwest::blocking::{Client, RequestBuilder, Response};
@@ -31,6 +34,9 @@ use utils::{
 use crate::local_env::PageServerConf;
 use crate::{background_process, local_env::LocalEnv};

+/// Directory within .neon which will be used by default for LocalFs remote storage.
+pub const PAGESERVER_REMOTE_STORAGE_DIR: &str = "local_fs_remote_storage/pageserver";
+
 #[derive(Error, Debug)]
 pub enum PageserverHttpError {
     #[error("Reqwest error: {0}")]
@@ -98,8 +104,10 @@ impl PageServerNode {
     }
 }

-// pageserver conf overrides defined by neon_local configuration.
-fn neon_local_overrides(&self) -> Vec<String> {
+/// Merge overrides provided by the user on the command line with our default overides derived from neon_local configuration.
+///
+/// These all end up on the command line of the `pageserver` binary.
+fn neon_local_overrides(&self, cli_overrides: &[&str]) -> Vec<String> {
     let id = format!("id={}", self.conf.id);
     // FIXME: the paths should be shell-escaped to handle paths with spaces, quotas etc.
     let pg_distrib_dir_param = format!(
@@ -132,12 +140,25 @@ impl PageServerNode {
     ));
 }

+if !cli_overrides
+    .iter()
+    .any(|c| c.starts_with("remote_storage"))
+{
+    overrides.push(format!(
+        "remote_storage={{local_path='../{PAGESERVER_REMOTE_STORAGE_DIR}'}}"
+    ));
+}
+
 if self.conf.http_auth_type != AuthType::Trust || self.conf.pg_auth_type != AuthType::Trust
 {
     // Keys are generated in the toplevel repo dir, pageservers' workdirs
     // are one level below that, so refer to keys with ../
     overrides.push("auth_validation_public_key_path='../auth_public_key.pem'".to_owned());
 }

+// Apply the user-provided overrides
+overrides.extend(cli_overrides.iter().map(|&c| c.to_owned()));
+
 overrides
 }
@@ -203,9 +224,6 @@ impl PageServerNode {
} }
fn start_node(&self, config_overrides: &[&str], update_config: bool) -> anyhow::Result<Child> { fn start_node(&self, config_overrides: &[&str], update_config: bool) -> anyhow::Result<Child> {
let mut overrides = self.neon_local_overrides();
overrides.extend(config_overrides.iter().map(|&c| c.to_owned()));
let datadir = self.repo_path(); let datadir = self.repo_path();
print!( print!(
"Starting pageserver node {} at '{}' in {:?}", "Starting pageserver node {} at '{}' in {:?}",
@@ -248,8 +266,7 @@ impl PageServerNode {
) -> Vec<Cow<'a, str>> {
let mut args = vec![Cow::Borrowed("-D"), Cow::Borrowed(datadir_path_str)];
-let mut overrides = self.neon_local_overrides();
-overrides.extend(config_overrides.iter().map(|&c| c.to_owned()));
+let overrides = self.neon_local_overrides(config_overrides);
for config_override in overrides {
args.push(Cow::Borrowed("-c"));
args.push(Cow::Owned(config_override));
@@ -392,7 +409,7 @@ impl PageServerNode {
};
let request = models::TenantCreateRequest {
-new_tenant_id,
+new_tenant_id: TenantShardId::unsharded(new_tenant_id),
generation,
config,
};
@@ -501,6 +518,27 @@ impl PageServerNode {
Ok(())
}
pub fn location_config(
&self,
tenant_id: TenantId,
config: LocationConfig,
) -> anyhow::Result<()> {
let req_body = TenantLocationConfigRequest { tenant_id, config };
self.http_request(
Method::PUT,
format!(
"{}/tenant/{}/location_config",
self.http_base_url, tenant_id
),
)?
.json(&req_body)
.send()?
.error_from_body()?;
Ok(())
}
pub fn timeline_list(&self, tenant_id: &TenantId) -> anyhow::Result<Vec<TimelineInfo>> {
let timeline_infos: Vec<TimelineInfo> = self
.http_request(


@@ -0,0 +1,202 @@
//!
//! Functionality for migrating tenants across pageservers: unlike most of neon_local, this code
//! isn't scoped to a particular physical service, as it needs to update compute endpoints to
//! point to the new pageserver.
//!
use crate::local_env::LocalEnv;
use crate::{
attachment_service::AttachmentService, endpoint::ComputeControlPlane,
pageserver::PageServerNode,
};
use pageserver_api::models::{
LocationConfig, LocationConfigMode, LocationConfigSecondary, TenantConfig,
};
use std::collections::HashMap;
use std::time::Duration;
use utils::{
generation::Generation,
id::{TenantId, TimelineId},
lsn::Lsn,
};
/// Given an attached pageserver, retrieve the LSN for all timelines
fn get_lsns(
tenant_id: TenantId,
pageserver: &PageServerNode,
) -> anyhow::Result<HashMap<TimelineId, Lsn>> {
let timelines = pageserver.timeline_list(&tenant_id)?;
Ok(timelines
.into_iter()
.map(|t| (t.timeline_id, t.last_record_lsn))
.collect())
}
/// Wait for the timeline LSNs on `pageserver` to catch up with or overtake
/// `baseline`.
fn await_lsn(
tenant_id: TenantId,
pageserver: &PageServerNode,
baseline: HashMap<TimelineId, Lsn>,
) -> anyhow::Result<()> {
loop {
let latest = match get_lsns(tenant_id, pageserver) {
Ok(l) => l,
Err(e) => {
println!(
"🕑 Can't get LSNs on pageserver {} yet, waiting ({e})",
pageserver.conf.id
);
std::thread::sleep(Duration::from_millis(500));
continue;
}
};
let mut any_behind: bool = false;
for (timeline_id, baseline_lsn) in &baseline {
match latest.get(timeline_id) {
Some(latest_lsn) => {
println!("🕑 LSN origin {baseline_lsn} vs destination {latest_lsn}");
if latest_lsn < baseline_lsn {
any_behind = true;
}
}
None => {
// Expected timeline isn't yet visible on migration destination.
// (IRL we would have to account for timeline deletion, but this
// is just a test helper.)
any_behind = true;
}
}
}
if !any_behind {
println!("✅ LSN caught up. Proceeding...");
break;
} else {
std::thread::sleep(Duration::from_millis(500));
}
}
Ok(())
}
/// This function spans multiple services, to demonstrate live migration of a tenant
/// between pageservers:
/// - Coordinate attach/secondary/detach on pageservers
/// - call into attachment_service for generations
/// - reconfigure compute endpoints to point to new attached pageserver
pub fn migrate_tenant(
env: &LocalEnv,
tenant_id: TenantId,
dest_ps: PageServerNode,
) -> anyhow::Result<()> {
// Get a new generation
let attachment_service = AttachmentService::from_env(env);
let previous = attachment_service.inspect(tenant_id)?;
let mut baseline_lsns = None;
if let Some((generation, origin_ps_id)) = &previous {
let origin_ps = PageServerNode::from_env(env, env.get_pageserver_conf(*origin_ps_id)?);
if origin_ps_id == &dest_ps.conf.id {
println!("🔁 Already attached to {origin_ps_id}, freshening...");
let gen = attachment_service.attach_hook(tenant_id, dest_ps.conf.id)?;
let dest_conf = LocationConfig {
mode: LocationConfigMode::AttachedSingle,
generation: gen.map(Generation::new),
secondary_conf: None,
tenant_conf: TenantConfig::default(),
};
dest_ps.location_config(tenant_id, dest_conf)?;
println!("✅ Migration complete");
return Ok(());
}
println!("🔁 Switching origin pageserver {origin_ps_id} to stale mode");
let stale_conf = LocationConfig {
mode: LocationConfigMode::AttachedStale,
generation: Some(Generation::new(*generation)),
secondary_conf: None,
tenant_conf: TenantConfig::default(),
};
origin_ps.location_config(tenant_id, stale_conf)?;
baseline_lsns = Some(get_lsns(tenant_id, &origin_ps)?);
}
let gen = attachment_service.attach_hook(tenant_id, dest_ps.conf.id)?;
let dest_conf = LocationConfig {
mode: LocationConfigMode::AttachedMulti,
generation: gen.map(Generation::new),
secondary_conf: None,
tenant_conf: TenantConfig::default(),
};
println!("🔁 Attaching to pageserver {}", dest_ps.conf.id);
dest_ps.location_config(tenant_id, dest_conf)?;
if let Some(baseline) = baseline_lsns {
println!("🕑 Waiting for LSN to catch up...");
await_lsn(tenant_id, &dest_ps, baseline)?;
}
let cplane = ComputeControlPlane::load(env.clone())?;
for (endpoint_name, endpoint) in &cplane.endpoints {
if endpoint.tenant_id == tenant_id {
println!(
"🔁 Reconfiguring endpoint {} to use pageserver {}",
endpoint_name, dest_ps.conf.id
);
endpoint.reconfigure(Some(dest_ps.conf.id))?;
}
}
for other_ps_conf in &env.pageservers {
if other_ps_conf.id == dest_ps.conf.id {
continue;
}
let other_ps = PageServerNode::from_env(env, other_ps_conf);
let other_ps_tenants = other_ps.tenant_list()?;
// Check if this tenant is attached
let found = other_ps_tenants
.into_iter()
.map(|t| t.id)
.any(|i| i == tenant_id);
if !found {
continue;
}
// Downgrade to a secondary location
let secondary_conf = LocationConfig {
mode: LocationConfigMode::Secondary,
generation: None,
secondary_conf: Some(LocationConfigSecondary { warm: true }),
tenant_conf: TenantConfig::default(),
};
println!(
"💤 Switching to secondary mode on pageserver {}",
other_ps.conf.id
);
other_ps.location_config(tenant_id, secondary_conf)?;
}
println!(
"🔁 Switching to AttachedSingle mode on pageserver {}",
dest_ps.conf.id
);
let dest_conf = LocationConfig {
mode: LocationConfigMode::AttachedSingle,
generation: gen.map(Generation::new),
secondary_conf: None,
tenant_conf: TenantConfig::default(),
};
dest_ps.location_config(tenant_id, dest_conf)?;
println!("✅ Migration complete");
Ok(())
}


@@ -74,10 +74,30 @@ highlight = "all"
workspace-default-features = "allow"
external-default-features = "allow"
allow = []
deny = []
skip = []
skip-tree = []
[[bans.deny]]
# we use tokio, the same rationale applies for async-{io,waker,global-executor,executor,channel,lock}, smol
# if you find yourself here while adding a dependency, try "default-features = false", ask around on #rust
name = "async-std"
[[bans.deny]]
name = "async-io"
[[bans.deny]]
name = "async-waker"
[[bans.deny]]
name = "async-global-executor"
[[bans.deny]]
name = "async-executor"
[[bans.deny]]
name = "smol"
# This section is considered when running `cargo deny check sources`.
# More documentation about the 'sources' section can be found here:
# https://embarkstudios.github.io/cargo-deny/checks/sources/cfg.html


@@ -177,7 +177,7 @@ I e during migration create_branch can be called on old pageserver and newly cre
The difference between the simplistic approach and the one described above is that the simplistic one calls ignore on the source tenant first and then calls attach on the target pageserver. Doing it in that order opens a possibility for the race conditions we strive to avoid, which is why the approach above does it in the opposite order.
-The approach largely follows this guide: <https://github.com/neondatabase/cloud/wiki/Cloud:-Ad-hoc-tenant-relocation>
+The approach largely follows this guide: <https://www.notion.so/neondatabase/Cloud-Ad-hoc-tenant-relocation-f687474f7bfc42269e6214e3acba25c7>
The happy path sequence:

docs/updating-postgres.md Normal file

@@ -0,0 +1,108 @@
# Updating Postgres
## Minor Versions
When upgrading to a new minor version of Postgres, please follow these steps:
_Example: 15.4 is the new minor version to upgrade to from 15.3._
1. Clone the Neon Postgres repository if you have not done so already.
```shell
git clone git@github.com:neondatabase/postgres.git
```
1. Add the Postgres upstream remote.
```shell
git remote add upstream https://git.postgresql.org/git/postgresql.git
```
1. Create a new branch based on the stable branch you are updating.
```shell
git checkout -b my-branch REL_15_STABLE_neon
```
1. Tag the last commit on the stable branch you are updating.
```shell
git tag REL_15_3_neon
```
1. Push the new tag to the Neon Postgres repository.
```shell
git push origin REL_15_3_neon
```
1. Find the release tags you're looking for. They are of the form `REL_X_Y`.
1. Rebase the branch you created on the tag and resolve any conflicts.
```shell
git fetch upstream REL_15_4
git rebase REL_15_4
```
1. Run the Postgres test suite to make sure our commits have not affected
Postgres in a negative way.
```shell
make check
# OR
meson test -C builddir
```
1. Push your branch to the Neon Postgres repository.
```shell
git push origin my-branch
```
1. Clone the Neon repository if you have not done so already.
```shell
git clone git@github.com:neondatabase/neon.git
```
1. Create a new branch.
1. Change the `revisions.json` file to point at the HEAD of your Postgres
branch.
1. Update the Git submodule.
```shell
git submodule set-branch --branch my-branch vendor/postgres-v15
git submodule update --remote vendor/postgres-v15
```
1. Run the Neon test suite to make sure that Neon is still good to go on this
minor Postgres release.
```shell
./scripts/pytest -k pg15
```
1. Commit your changes.
1. Create a pull request, and wait for CI to go green.
1. Force push the rebased Postgres branches into the Neon Postgres repository.
```shell
git push --force origin my-branch:REL_15_STABLE_neon
```
It may require disabling various branch protections.
1. Update your Neon PR to point at the branches.
```shell
git submodule set-branch --branch REL_15_STABLE_neon vendor/postgres-v15
git commit --amend --no-edit
git push --force origin
```
1. Merge the pull request after getting approval(s) and CI completion.


@@ -1,3 +1,5 @@
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
pub mod requests;
pub mod responses;
pub mod spec;


@@ -6,7 +6,6 @@
use std::collections::HashMap;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
@@ -19,7 +18,6 @@ pub type PgIdent = String;
/// Cluster spec or configuration represented as an optional number of
/// delta operations + final cluster state description.
#[serde_as]
#[derive(Clone, Debug, Default, Deserialize, Serialize)]
pub struct ComputeSpec {
pub format_version: f32,
@@ -50,12 +48,12 @@ pub struct ComputeSpec {
// these, and instead set the "neon.tenant_id", "neon.timeline_id",
// etc. GUCs in cluster.settings. TODO: Once the control plane has been
// updated to fill these fields, we can make these non optional.
#[serde_as(as = "Option<DisplayFromStr>")]
pub tenant_id: Option<TenantId>,
#[serde_as(as = "Option<DisplayFromStr>")]
pub timeline_id: Option<TimelineId>,
#[serde_as(as = "Option<DisplayFromStr>")]
pub pageserver_connstring: Option<String>,
#[serde(default)]
pub safekeeper_connstrings: Vec<String>,
@@ -140,14 +138,13 @@ impl RemoteExtSpec {
}
}
#[serde_as]
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq, Deserialize, Serialize)]
pub enum ComputeMode {
/// A read-write node
#[default]
Primary,
/// A read-only node, pinned at a particular LSN
-Static(#[serde_as(as = "DisplayFromStr")] Lsn),
+Static(Lsn),
/// A read-only node that follows the tip of the branch in hot standby mode
///
/// Future versions may want to distinguish between replicas with hot standby


@@ -1,6 +1,6 @@
//!
//! Shared code for consumption metrics collection
-//!
+#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
use chrono::{DateTime, Utc};
use rand::Rng;
use serde::{Deserialize, Serialize};


@@ -2,6 +2,7 @@
//! make sure that we use the same dep version everywhere.
//! Otherwise, we might not see all metrics registered via
//! a default registry.
#![deny(clippy::undocumented_unsafe_blocks)]
use once_cell::sync::Lazy;
use prometheus::core::{AtomicU64, Collector, GenericGauge, GenericGaugeVec};
pub use prometheus::opts;


@@ -17,5 +17,9 @@ postgres_ffi.workspace = true
enum-map.workspace = true
strum.workspace = true
strum_macros.workspace = true
hex.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
bincode.workspace = true


@@ -4,7 +4,6 @@
//! See docs/rfcs/025-generation-numbers.md
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use utils::id::{NodeId, TenantId};
#[derive(Serialize, Deserialize)]
@@ -12,10 +11,8 @@ pub struct ReAttachRequest {
pub node_id: NodeId,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct ReAttachResponseTenant {
#[serde_as(as = "DisplayFromStr")]
pub id: TenantId,
pub gen: u32,
}
@@ -25,10 +22,8 @@ pub struct ReAttachResponse {
pub tenants: Vec<ReAttachResponseTenant>,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct ValidateRequestTenant {
#[serde_as(as = "DisplayFromStr")]
pub id: TenantId,
pub gen: u32,
}
@@ -43,10 +38,8 @@ pub struct ValidateResponse {
pub tenants: Vec<ValidateResponseTenant>,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct ValidateResponseTenant {
#[serde_as(as = "DisplayFromStr")]
pub id: TenantId,
pub valid: bool,
}


@@ -0,0 +1,142 @@
use anyhow::{bail, Result};
use byteorder::{ByteOrder, BE};
use serde::{Deserialize, Serialize};
use std::fmt;
/// Key used in the Repository kv-store.
///
/// The Repository treats this as an opaque struct, but see the code in pgdatadir_mapping.rs
/// for what we actually store in these fields.
#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq, Ord, PartialOrd, Serialize, Deserialize)]
pub struct Key {
pub field1: u8,
pub field2: u32,
pub field3: u32,
pub field4: u32,
pub field5: u8,
pub field6: u32,
}
pub const KEY_SIZE: usize = 18;
impl Key {
/// 'field2' is used to store tablespaceid for relations and small enum numbers for other relish.
/// As long as Neon does not support tablespace (because of lack of access to local file system),
/// we can assume that only some predefined namespace OIDs are used which can fit in u16
pub fn to_i128(&self) -> i128 {
assert!(self.field2 < 0xFFFF || self.field2 == 0xFFFFFFFF || self.field2 == 0x22222222);
(((self.field1 & 0xf) as i128) << 120)
| (((self.field2 & 0xFFFF) as i128) << 104)
| ((self.field3 as i128) << 72)
| ((self.field4 as i128) << 40)
| ((self.field5 as i128) << 32)
| self.field6 as i128
}
pub const fn from_i128(x: i128) -> Self {
Key {
field1: ((x >> 120) & 0xf) as u8,
field2: ((x >> 104) & 0xFFFF) as u32,
field3: (x >> 72) as u32,
field4: (x >> 40) as u32,
field5: (x >> 32) as u8,
field6: x as u32,
}
}
pub fn next(&self) -> Key {
self.add(1)
}
pub fn add(&self, x: u32) -> Key {
let mut key = *self;
let r = key.field6.overflowing_add(x);
key.field6 = r.0;
if r.1 {
let r = key.field5.overflowing_add(1);
key.field5 = r.0;
if r.1 {
let r = key.field4.overflowing_add(1);
key.field4 = r.0;
if r.1 {
let r = key.field3.overflowing_add(1);
key.field3 = r.0;
if r.1 {
let r = key.field2.overflowing_add(1);
key.field2 = r.0;
if r.1 {
let r = key.field1.overflowing_add(1);
key.field1 = r.0;
assert!(!r.1);
}
}
}
}
}
key
}
pub fn from_slice(b: &[u8]) -> Self {
Key {
field1: b[0],
field2: u32::from_be_bytes(b[1..5].try_into().unwrap()),
field3: u32::from_be_bytes(b[5..9].try_into().unwrap()),
field4: u32::from_be_bytes(b[9..13].try_into().unwrap()),
field5: b[13],
field6: u32::from_be_bytes(b[14..18].try_into().unwrap()),
}
}
pub fn write_to_byte_slice(&self, buf: &mut [u8]) {
buf[0] = self.field1;
BE::write_u32(&mut buf[1..5], self.field2);
BE::write_u32(&mut buf[5..9], self.field3);
BE::write_u32(&mut buf[9..13], self.field4);
buf[13] = self.field5;
BE::write_u32(&mut buf[14..18], self.field6);
}
}
impl fmt::Display for Key {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(
f,
"{:02X}{:08X}{:08X}{:08X}{:02X}{:08X}",
self.field1, self.field2, self.field3, self.field4, self.field5, self.field6
)
}
}
impl Key {
pub const MIN: Key = Key {
field1: u8::MIN,
field2: u32::MIN,
field3: u32::MIN,
field4: u32::MIN,
field5: u8::MIN,
field6: u32::MIN,
};
pub const MAX: Key = Key {
field1: u8::MAX,
field2: u32::MAX,
field3: u32::MAX,
field4: u32::MAX,
field5: u8::MAX,
field6: u32::MAX,
};
pub fn from_hex(s: &str) -> Result<Self> {
if s.len() != 36 {
bail!("parse error");
}
Ok(Key {
field1: u8::from_str_radix(&s[0..2], 16)?,
field2: u32::from_str_radix(&s[2..10], 16)?,
field3: u32::from_str_radix(&s[10..18], 16)?,
field4: u32::from_str_radix(&s[18..26], 16)?,
field5: u8::from_str_radix(&s[26..28], 16)?,
field6: u32::from_str_radix(&s[28..36], 16)?,
})
}
}
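The packing above places the six fields in disjoint bit ranges of an `i128`: 4 bits for `field1`, 16 for the truncated `field2`, then 32/32/8/32 bits for the remaining fields. A standalone sketch of the same round-trip (the `Key` struct here is a local stand-in with the shifts copied from `to_i128`/`from_i128`, not the `pageserver_api` type itself):

```rust
// Minimal stand-in for the pageserver Key, packed into i128.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Key {
    field1: u8,
    field2: u32,
    field3: u32,
    field4: u32,
    field5: u8,
    field6: u32,
}

// Pack the fields into disjoint bit ranges: 4 + 16 + 32 + 32 + 8 + 32 bits.
fn to_i128(k: &Key) -> i128 {
    (((k.field1 & 0xf) as i128) << 120)
        | (((k.field2 & 0xFFFF) as i128) << 104)
        | ((k.field3 as i128) << 72)
        | ((k.field4 as i128) << 40)
        | ((k.field5 as i128) << 32)
        | k.field6 as i128
}

// Unpack by shifting each field back down and truncating to its width.
fn from_i128(x: i128) -> Key {
    Key {
        field1: ((x >> 120) & 0xf) as u8,
        field2: ((x >> 104) & 0xFFFF) as u32,
        field3: (x >> 72) as u32,
        field4: (x >> 40) as u32,
        field5: (x >> 32) as u8,
        field6: x as u32,
    }
}

fn main() {
    // Round trip is exact as long as field2 fits in 16 bits, which the
    // assert in the real to_i128 is there to guarantee.
    let k = Key { field1: 3, field2: 0x1234, field3: 7, field4: 42, field5: 1, field6: 99 };
    assert_eq!(from_i128(to_i128(&k)), k);
    println!("ok");
}
```

Note that `field2` is truncated to 16 bits on the way in, which is why the real `to_i128` asserts that `field2` is either small or one of two sentinel values.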


@@ -1,9 +1,13 @@
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
use const_format::formatcp;
/// Public API types
pub mod control_api;
pub mod key;
pub mod models;
pub mod reltag;
pub mod shard;
pub const DEFAULT_PG_LISTEN_PORT: u16 = 64000;
pub const DEFAULT_PG_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_PG_LISTEN_PORT}");


@@ -6,7 +6,7 @@ use std::{
use byteorder::{BigEndian, ReadBytesExt};
use serde::{Deserialize, Serialize};
-use serde_with::{serde_as, DisplayFromStr};
+use serde_with::serde_as;
use strum_macros;
use utils::{
completion,
@@ -16,7 +16,7 @@ use utils::{
lsn::Lsn,
};
-use crate::reltag::RelTag;
+use crate::{reltag::RelTag, shard::TenantShardId};
use anyhow::bail;
use bytes::{BufMut, Bytes, BytesMut};
@@ -174,26 +174,20 @@ pub enum TimelineState {
Broken { reason: String, backtrace: String },
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct TimelineCreateRequest {
#[serde_as(as = "DisplayFromStr")]
pub new_timeline_id: TimelineId,
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub ancestor_timeline_id: Option<TimelineId>,
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub ancestor_start_lsn: Option<Lsn>,
pub pg_version: Option<u32>,
}
#[serde_as]
#[derive(Serialize, Deserialize, Debug)]
#[serde(deny_unknown_fields)]
pub struct TenantCreateRequest {
-#[serde_as(as = "DisplayFromStr")]
-pub new_tenant_id: TenantId,
+pub new_tenant_id: TenantShardId,
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub generation: Option<u32>,
@@ -201,7 +195,6 @@ pub struct TenantCreateRequest {
pub config: TenantConfig, // as we have a flattened field, we should reject all unknown fields in it
}
#[serde_as]
#[derive(Deserialize, Debug)]
#[serde(deny_unknown_fields)]
pub struct TenantLoadRequest {
@@ -278,31 +271,26 @@ pub struct LocationConfig {
pub tenant_conf: TenantConfig,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
#[serde(transparent)]
-pub struct TenantCreateResponse(#[serde_as(as = "DisplayFromStr")] pub TenantId);
+pub struct TenantCreateResponse(pub TenantId);
#[derive(Serialize)]
pub struct StatusResponse {
pub id: NodeId,
}
#[serde_as]
#[derive(Serialize, Deserialize, Debug)]
#[serde(deny_unknown_fields)]
pub struct TenantLocationConfigRequest {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantId,
#[serde(flatten)]
pub config: LocationConfig, // as we have a flattened field, we should reject all unknown fields in it
}
#[serde_as]
#[derive(Serialize, Deserialize, Debug)]
#[serde(deny_unknown_fields)]
pub struct TenantConfigRequest {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantId,
#[serde(flatten)]
pub config: TenantConfig, // as we have a flattened field, we should reject all unknown fields in it
@@ -374,10 +362,8 @@ pub enum TenantAttachmentStatus {
Failed { reason: String },
}
#[serde_as]
#[derive(Serialize, Deserialize, Clone)]
pub struct TenantInfo {
#[serde_as(as = "DisplayFromStr")]
pub id: TenantId,
// NB: intentionally not part of OpenAPI, we don't want to commit to a specific set of TenantState's
pub state: TenantState,
@@ -388,33 +374,22 @@ pub struct TenantInfo {
}
/// This represents the output of the "timeline_detail" and "timeline_list" API calls.
#[serde_as]
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct TimelineInfo {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantId,
#[serde_as(as = "DisplayFromStr")]
pub timeline_id: TimelineId,
#[serde_as(as = "Option<DisplayFromStr>")]
pub ancestor_timeline_id: Option<TimelineId>,
#[serde_as(as = "Option<DisplayFromStr>")]
pub ancestor_lsn: Option<Lsn>,
#[serde_as(as = "DisplayFromStr")]
pub last_record_lsn: Lsn,
#[serde_as(as = "Option<DisplayFromStr>")]
pub prev_record_lsn: Option<Lsn>,
#[serde_as(as = "DisplayFromStr")]
pub latest_gc_cutoff_lsn: Lsn,
#[serde_as(as = "DisplayFromStr")]
pub disk_consistent_lsn: Lsn,
/// The LSN that we have successfully uploaded to remote storage
#[serde_as(as = "DisplayFromStr")]
pub remote_consistent_lsn: Lsn,
/// The LSN that we are advertising to safekeepers
#[serde_as(as = "DisplayFromStr")]
pub remote_consistent_lsn_visible: Lsn,
pub current_logical_size: Option<u64>, // is None when timeline is Unloaded
@@ -426,7 +401,6 @@ pub struct TimelineInfo {
pub timeline_dir_layer_file_size_sum: Option<u64>,
pub wal_source_connstr: Option<String>,
#[serde_as(as = "Option<DisplayFromStr>")]
pub last_received_msg_lsn: Option<Lsn>,
/// the timestamp (in microseconds) of the last received message
pub last_received_msg_ts: Option<u128>,
@@ -523,23 +497,13 @@ pub struct LayerAccessStats {
pub residence_events_history: HistoryBufferWithDropCounter<LayerResidenceEvent, 16>,
}
#[serde_as]
#[derive(Debug, Clone, Serialize)]
#[serde(tag = "kind")]
pub enum InMemoryLayerInfo {
-Open {
-#[serde_as(as = "DisplayFromStr")]
-lsn_start: Lsn,
-},
-Frozen {
-#[serde_as(as = "DisplayFromStr")]
-lsn_start: Lsn,
-#[serde_as(as = "DisplayFromStr")]
-lsn_end: Lsn,
-},
+Open { lsn_start: Lsn },
+Frozen { lsn_start: Lsn, lsn_end: Lsn },
}
#[serde_as]
#[derive(Debug, Clone, Serialize)]
#[serde(tag = "kind")]
pub enum HistoricLayerInfo {
@@ -547,9 +511,7 @@ pub enum HistoricLayerInfo {
layer_file_name: String,
layer_file_size: u64,
#[serde_as(as = "DisplayFromStr")]
lsn_start: Lsn,
#[serde_as(as = "DisplayFromStr")]
lsn_end: Lsn,
remote: bool,
access_stats: LayerAccessStats,
@@ -558,7 +520,6 @@ pub enum HistoricLayerInfo {
layer_file_name: String,
layer_file_size: u64,
#[serde_as(as = "DisplayFromStr")]
lsn_start: Lsn,
remote: bool,
access_stats: LayerAccessStats,


@@ -0,0 +1,321 @@
use std::{ops::RangeInclusive, str::FromStr};
use hex::FromHex;
use serde::{Deserialize, Serialize};
use utils::id::TenantId;
#[derive(Ord, PartialOrd, Eq, PartialEq, Clone, Copy, Serialize, Deserialize, Debug)]
pub struct ShardNumber(pub u8);
#[derive(Ord, PartialOrd, Eq, PartialEq, Clone, Copy, Serialize, Deserialize, Debug)]
pub struct ShardCount(pub u8);
impl ShardCount {
pub const MAX: Self = Self(u8::MAX);
}
impl ShardNumber {
pub const MAX: Self = Self(u8::MAX);
}
/// TenantShardId identifies the units of work for the Pageserver.
///
/// These are written as `<tenant_id>-<shard_number><shard_count>`, for example:
///
/// # The second shard in a two-shard tenant
/// 072f1291a5310026820b2fe4b2968934-0102
///
/// Historically, tenants could not have multiple shards, and were identified
/// by TenantId. To support this, TenantShardId has a special legacy
/// mode where `shard_count` is equal to zero: this represents a single-sharded
/// tenant which should be written as a TenantId with no suffix.
///
/// The human-readable encoding of TenantShardId, such as used in API URLs,
/// is both forward and backward compatible: a legacy TenantId can be
/// decoded as a TenantShardId, and when re-encoded it will be parseable
/// as a TenantId.
///
/// Note that the binary encoding is _not_ backward compatible, because
/// at the time sharding is introduced, there are no existing binary structures
/// containing TenantId that we need to handle.
#[derive(Eq, PartialEq, PartialOrd, Ord, Clone, Copy)]
pub struct TenantShardId {
pub tenant_id: TenantId,
pub shard_number: ShardNumber,
pub shard_count: ShardCount,
}
impl TenantShardId {
pub fn unsharded(tenant_id: TenantId) -> Self {
Self {
tenant_id,
shard_number: ShardNumber(0),
shard_count: ShardCount(0),
}
}
/// The range of all TenantShardId that belong to a particular TenantId. This is useful when
/// you have a BTreeMap of TenantShardId, and are querying by TenantId.
pub fn tenant_range(tenant_id: TenantId) -> RangeInclusive<Self> {
RangeInclusive::new(
Self {
tenant_id,
shard_number: ShardNumber(0),
shard_count: ShardCount(0),
},
Self {
tenant_id,
shard_number: ShardNumber::MAX,
shard_count: ShardCount::MAX,
},
)
}
pub fn shard_slug(&self) -> String {
format!("{:02x}{:02x}", self.shard_number.0, self.shard_count.0)
}
}
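The point of `tenant_range` is that with `TenantShardId`'s ordering (tenant id first, then shard number and count), all shards of one tenant occupy a contiguous key range, so a `BTreeMap` can answer "all shards of tenant X" with a range query. A standalone sketch of that pattern, using plain `(u128, u8, u8)` tuples as a stand-in for `TenantShardId`:

```rust
use std::collections::BTreeMap;

fn main() {
    // Keys mimic (tenant_id, shard_number, shard_count); u128 stands in
    // for TenantId purely for illustration.
    let mut map: BTreeMap<(u128, u8, u8), &str> = BTreeMap::new();
    map.insert((7, 0, 2), "tenant 7, shard 0");
    map.insert((7, 1, 2), "tenant 7, shard 1");
    map.insert((8, 0, 0), "tenant 8, unsharded");

    // Equivalent of TenantShardId::tenant_range(tenant 7): from shard
    // (0, 0) up to (ShardNumber::MAX, ShardCount::MAX) inclusive.
    let range = (7u128, 0u8, 0u8)..=(7u128, u8::MAX, u8::MAX);
    let shards: Vec<_> = map.range(range).map(|(_, v)| *v).collect();

    // Only tenant 7's shards fall inside the range, in shard order.
    assert_eq!(shards, vec!["tenant 7, shard 0", "tenant 7, shard 1"]);
    println!("ok");
}
```

This is the lookup the PR description alludes to: `TenantsMap` becomes a `BTreeMap` precisely so page_service can later route a request for a tenant to the right shard via such a range scan.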
impl std::fmt::Display for TenantShardId {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
if self.shard_count != ShardCount(0) {
write!(
f,
"{}-{:02x}{:02x}",
self.tenant_id, self.shard_number.0, self.shard_count.0
)
} else {
// Legacy case (shard_count == 0) -- format as just the tenant id. Note that this
// is distinct from the normal single shard case (shard count == 1).
self.tenant_id.fmt(f)
}
}
}
impl std::fmt::Debug for TenantShardId {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
// Debug is the same as Display: the compact hex representation
write!(f, "{}", self)
}
}
impl std::str::FromStr for TenantShardId {
type Err = hex::FromHexError;
fn from_str(s: &str) -> Result<Self, Self::Err> {
// Expect format: 16 byte TenantId, '-', 1 byte shard number, 1 byte shard count
if s.len() == 32 {
// Legacy case: no shard specified
Ok(Self {
tenant_id: TenantId::from_str(s)?,
shard_number: ShardNumber(0),
shard_count: ShardCount(0),
})
} else if s.len() == 37 {
let bytes = s.as_bytes();
let tenant_id = TenantId::from_hex(&bytes[0..32])?;
let mut shard_parts: [u8; 2] = [0u8; 2];
hex::decode_to_slice(&bytes[33..37], &mut shard_parts)?;
Ok(Self {
tenant_id,
shard_number: ShardNumber(shard_parts[0]),
shard_count: ShardCount(shard_parts[1]),
})
} else {
Err(hex::FromHexError::InvalidStringLength)
}
}
}
impl From<[u8; 18]> for TenantShardId {
fn from(b: [u8; 18]) -> Self {
let tenant_id_bytes: [u8; 16] = b[0..16].try_into().unwrap();
Self {
tenant_id: TenantId::from(tenant_id_bytes),
shard_number: ShardNumber(b[16]),
shard_count: ShardCount(b[17]),
}
}
}
impl Serialize for TenantShardId {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
if serializer.is_human_readable() {
serializer.collect_str(self)
} else {
let mut packed: [u8; 18] = [0; 18];
packed[0..16].clone_from_slice(&self.tenant_id.as_arr());
packed[16] = self.shard_number.0;
packed[17] = self.shard_count.0;
packed.serialize(serializer)
}
}
}
impl<'de> Deserialize<'de> for TenantShardId {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
struct IdVisitor {
is_human_readable_deserializer: bool,
}
impl<'de> serde::de::Visitor<'de> for IdVisitor {
type Value = TenantShardId;
fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
if self.is_human_readable_deserializer {
formatter.write_str("value in form of hex string")
} else {
formatter.write_str("value in form of integer array([u8; 18])")
}
}
fn visit_seq<A>(self, seq: A) -> Result<Self::Value, A::Error>
where
A: serde::de::SeqAccess<'de>,
{
let s = serde::de::value::SeqAccessDeserializer::new(seq);
let id: [u8; 18] = Deserialize::deserialize(s)?;
Ok(TenantShardId::from(id))
}
fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
where
E: serde::de::Error,
{
TenantShardId::from_str(v).map_err(E::custom)
}
}
if deserializer.is_human_readable() {
deserializer.deserialize_str(IdVisitor {
is_human_readable_deserializer: true,
})
} else {
deserializer.deserialize_tuple(
18,
IdVisitor {
is_human_readable_deserializer: false,
},
)
}
}
}
#[cfg(test)]
mod tests {
use std::str::FromStr;
use bincode;
use utils::{id::TenantId, Hex};
use super::*;
const EXAMPLE_TENANT_ID: &str = "1f359dd625e519a1a4e8d7509690f6fc";
#[test]
fn tenant_shard_id_string() -> Result<(), hex::FromHexError> {
let example = TenantShardId {
tenant_id: TenantId::from_str(EXAMPLE_TENANT_ID).unwrap(),
shard_count: ShardCount(10),
shard_number: ShardNumber(7),
};
let encoded = format!("{example}");
let expected = format!("{EXAMPLE_TENANT_ID}-070a");
assert_eq!(&encoded, &expected);
let decoded = TenantShardId::from_str(&encoded)?;
assert_eq!(example, decoded);
Ok(())
}
#[test]
fn tenant_shard_id_binary() -> Result<(), hex::FromHexError> {
let example = TenantShardId {
tenant_id: TenantId::from_str(EXAMPLE_TENANT_ID).unwrap(),
shard_count: ShardCount(10),
shard_number: ShardNumber(7),
};
let encoded = bincode::serialize(&example).unwrap();
let expected: [u8; 18] = [
0x1f, 0x35, 0x9d, 0xd6, 0x25, 0xe5, 0x19, 0xa1, 0xa4, 0xe8, 0xd7, 0x50, 0x96, 0x90,
0xf6, 0xfc, 0x07, 0x0a,
];
assert_eq!(Hex(&encoded), Hex(&expected));
let decoded = bincode::deserialize(&encoded).unwrap();
assert_eq!(example, decoded);
Ok(())
}
#[test]
fn tenant_shard_id_backward_compat() -> Result<(), hex::FromHexError> {
// Test that TenantShardId can decode a TenantId in human
// readable form
let example = TenantId::from_str(EXAMPLE_TENANT_ID).unwrap();
let encoded = format!("{example}");
assert_eq!(&encoded, EXAMPLE_TENANT_ID);
let decoded = TenantShardId::from_str(&encoded)?;
assert_eq!(example, decoded.tenant_id);
assert_eq!(decoded.shard_count, ShardCount(0));
assert_eq!(decoded.shard_number, ShardNumber(0));
Ok(())
}
#[test]
fn tenant_shard_id_forward_compat() -> Result<(), hex::FromHexError> {
// Test that a legacy TenantShardId encodes into a form that
// can be decoded as TenantId
let example_tenant_id = TenantId::from_str(EXAMPLE_TENANT_ID).unwrap();
let example = TenantShardId::unsharded(example_tenant_id);
let encoded = format!("{example}");
assert_eq!(&encoded, EXAMPLE_TENANT_ID);
let decoded = TenantId::from_str(&encoded)?;
assert_eq!(example_tenant_id, decoded);
Ok(())
}
#[test]
fn tenant_shard_id_legacy_binary() -> Result<(), hex::FromHexError> {
// Unlike in human readable encoding, binary encoding does not
// do any special handling of legacy unsharded TenantIds: this test
// is equivalent to the main test for binary encoding, just verifying
// that the same behavior applies when we have used `unsharded()` to
// construct a TenantShardId.
let example = TenantShardId::unsharded(TenantId::from_str(EXAMPLE_TENANT_ID).unwrap());
let encoded = bincode::serialize(&example).unwrap();
let expected: [u8; 18] = [
0x1f, 0x35, 0x9d, 0xd6, 0x25, 0xe5, 0x19, 0xa1, 0xa4, 0xe8, 0xd7, 0x50, 0x96, 0x90,
0xf6, 0xfc, 0x00, 0x00,
];
assert_eq!(Hex(&encoded), Hex(&expected));
let decoded = bincode::deserialize::<TenantShardId>(&encoded).unwrap();
assert_eq!(example, decoded);
Ok(())
}
}
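The string encoding implemented by the `Display` and `FromStr` impls above can be sketched stand-alone. The helpers below are hypothetical stand-ins (plain `&str`/`u8` in place of `TenantId`/`ShardNumber`/`ShardCount`), mirroring the legacy-compatible format: a bare 32-hex-char tenant id for `shard_count == 0`, otherwise a `-NNCC` suffix of shard number then shard count, two hex digits each:

```rust
// Hypothetical stand-in for the TenantShardId string codec; the real
// implementation lives in the crate's Display/FromStr impls above.

fn format_tenant_shard(tenant_hex: &str, number: u8, count: u8) -> String {
    if count == 0 {
        // Legacy unsharded form: just the tenant id, no suffix
        tenant_hex.to_string()
    } else {
        format!("{tenant_hex}-{number:02x}{count:02x}")
    }
}

fn parse_tenant_shard(s: &str) -> Option<(String, u8, u8)> {
    match s.len() {
        // Legacy case: a bare TenantId decodes as shard 0 of count 0
        32 => Some((s.to_string(), 0, 0)),
        37 if s.as_bytes()[32] == b'-' => {
            let number = u8::from_str_radix(&s[33..35], 16).ok()?;
            let count = u8::from_str_radix(&s[35..37], 16).ok()?;
            Some((s[..32].to_string(), number, count))
        }
        _ => None,
    }
}

fn main() {
    let t = "1f359dd625e519a1a4e8d7509690f6fc";
    // Sharded form: shard number 7 of count 10 -> "-070a" suffix
    assert_eq!(format_tenant_shard(t, 7, 10), format!("{t}-070a"));
    // Backward compatibility: a plain TenantId still parses
    assert_eq!(parse_tenant_shard(t), Some((t.to_string(), 0, 0)));
    println!("ok");
}
```

This mirrors why the human-readable encoding is both forward and backward compatible while the binary `[u8; 18]` encoding needs no such special case.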

View File

@@ -2,6 +2,8 @@
 //! To use, create PostgresBackend and run() it, passing the Handler
 //! implementation determining how to process the queries. Currently its API
 //! is rather narrow, but we can extend it once required.
+#![deny(unsafe_code)]
+#![deny(clippy::undocumented_unsafe_blocks)]
 use anyhow::Context;
 use bytes::Bytes;
 use futures::pin_mut;
@@ -15,7 +17,7 @@ use std::{fmt, io};
 use std::{future::Future, str::FromStr};
 use tokio::io::{AsyncRead, AsyncWrite};
 use tokio_rustls::TlsAcceptor;
-use tracing::{debug, error, info, trace};
+use tracing::{debug, error, info, trace, warn};
 use pq_proto::framed::{ConnectionError, Framed, FramedReader, FramedWriter};
 use pq_proto::{
@@ -33,6 +35,11 @@ pub enum QueryError {
     /// We were instructed to shutdown while processing the query
     #[error("Shutting down")]
     Shutdown,
+    /// Authentication failure
+    #[error("Unauthorized: {0}")]
+    Unauthorized(std::borrow::Cow<'static, str>),
+    #[error("Simulated Connection Error")]
+    SimulatedConnectionError,
     /// Some other error
     #[error(transparent)]
     Other(#[from] anyhow::Error),
@@ -47,8 +54,9 @@ impl From<io::Error> for QueryError {
 impl QueryError {
     pub fn pg_error_code(&self) -> &'static [u8; 5] {
         match self {
-            Self::Disconnected(_) => b"08006", // connection failure
+            Self::Disconnected(_) | Self::SimulatedConnectionError => b"08006", // connection failure
             Self::Shutdown => SQLSTATE_ADMIN_SHUTDOWN,
+            Self::Unauthorized(_) => SQLSTATE_INTERNAL_ERROR,
             Self::Other(_) => SQLSTATE_INTERNAL_ERROR, // internal error
         }
     }
@@ -608,7 +616,7 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> PostgresBackend<IO> {
         if let Err(e) = handler.check_auth_jwt(self, jwt_response) {
             self.write_message_noflush(&BeMessage::ErrorResponse(
-                &e.to_string(),
+                &short_error(&e),
                 Some(e.pg_error_code()),
             ))?;
             return Err(e);
@@ -728,12 +736,20 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> PostgresBackend<IO> {
                 trace!("got query {query_string:?}");
                 if let Err(e) = handler.process_query(self, query_string).await {
-                    log_query_error(query_string, &e);
-                    let short_error = short_error(&e);
-                    self.write_message_noflush(&BeMessage::ErrorResponse(
-                        &short_error,
-                        Some(e.pg_error_code()),
-                    ))?;
+                    match e {
+                        QueryError::Shutdown => return Ok(ProcessMsgResult::Break),
+                        QueryError::SimulatedConnectionError => {
+                            return Err(QueryError::SimulatedConnectionError)
+                        }
+                        e => {
+                            log_query_error(query_string, &e);
+                            let short_error = short_error(&e);
+                            self.write_message_noflush(&BeMessage::ErrorResponse(
+                                &short_error,
+                                Some(e.pg_error_code()),
+                            ))?;
+                        }
+                    }
                 }
                 self.write_message_noflush(&BeMessage::ReadyForQuery)?;
             }
@@ -959,6 +975,8 @@ pub fn short_error(e: &QueryError) -> String {
     match e {
         QueryError::Disconnected(connection_error) => connection_error.to_string(),
         QueryError::Shutdown => "shutdown".to_string(),
+        QueryError::Unauthorized(_e) => "JWT authentication error".to_string(),
+        QueryError::SimulatedConnectionError => "simulated connection error".to_string(),
         QueryError::Other(e) => format!("{e:#}"),
     }
 }
@@ -975,9 +993,15 @@ fn log_query_error(query: &str, e: &QueryError) {
         QueryError::Disconnected(other_connection_error) => {
             error!("query handler for '{query}' failed with connection error: {other_connection_error:?}")
         }
+        QueryError::SimulatedConnectionError => {
+            error!("query handler for query '{query}' failed due to a simulated connection error")
+        }
         QueryError::Shutdown => {
             info!("query handler for '{query}' cancelled during tenant shutdown")
         }
+        QueryError::Unauthorized(e) => {
+            warn!("query handler for '{query}' failed with authentication error: {e}");
+        }
         QueryError::Other(e) => {
             error!("query handler for '{query}' failed: {e:?}");
         }

View File

@@ -1,3 +1,5 @@
+#![deny(unsafe_code)]
+#![deny(clippy::undocumented_unsafe_blocks)]
 use anyhow::{bail, Context};
 use itertools::Itertools;
 use std::borrow::Cow;

View File

@@ -8,6 +8,7 @@
 // modules included with the postgres_ffi macro depend on the types of the specific version's
 // types, and trigger a too eager lint.
 #![allow(clippy::duplicate_mod)]
+#![deny(clippy::undocumented_unsafe_blocks)]
 use bytes::Bytes;
 use utils::bin_ser::SerializeError;
@@ -20,6 +21,7 @@ macro_rules! postgres_ffi {
         pub mod bindings {
             // bindgen generates bindings for a lot of stuff we don't need
             #![allow(dead_code)]
+            #![allow(clippy::undocumented_unsafe_blocks)]
             use serde::{Deserialize, Serialize};
             include!(concat!(

View File

@@ -14,6 +14,7 @@ macro_rules! xlog_utils_test {
     ($version:ident) => {
         #[path = "."]
         mod $version {
+            #[allow(unused_imports)]
             pub use postgres_ffi::$version::wal_craft_test_export::*;
             #[allow(clippy::duplicate_mod)]
             #[cfg(test)]

View File

@@ -1,6 +1,7 @@
 //! Postgres protocol messages serialization-deserialization. See
 //! <https://www.postgresql.org/docs/devel/protocol-message-formats.html>
 //! on message formats.
+#![deny(clippy::undocumented_unsafe_blocks)]
 pub mod framed;

View File

@@ -8,6 +8,7 @@ license.workspace = true
 anyhow.workspace = true
 async-trait.workspace = true
 once_cell.workspace = true
+aws-smithy-async.workspace = true
 aws-smithy-http.workspace = true
 aws-types.workspace = true
 aws-config.workspace = true

View File

@@ -1,21 +1,18 @@
 //! Azure Blob Storage wrapper
+use std::collections::HashMap;
 use std::env;
 use std::num::NonZeroU32;
 use std::sync::Arc;
-use std::{borrow::Cow, collections::HashMap, io::Cursor};
+use std::{borrow::Cow, io::Cursor};
 use super::REMOTE_STORAGE_PREFIX_SEPARATOR;
 use anyhow::Result;
 use azure_core::request_options::{MaxResults, Metadata, Range};
-use azure_core::Header;
 use azure_identity::DefaultAzureCredential;
 use azure_storage::StorageCredentials;
 use azure_storage_blobs::prelude::ClientBuilder;
-use azure_storage_blobs::{
-    blob::operations::GetBlobBuilder,
-    prelude::{BlobClient, ContainerClient},
-};
+use azure_storage_blobs::{blob::operations::GetBlobBuilder, prelude::ContainerClient};
 use futures_util::StreamExt;
 use http_types::StatusCode;
 use tokio::io::AsyncRead;
@@ -112,16 +109,19 @@ impl AzureBlobStorage {
     async fn download_for_builder(
         &self,
-        metadata: StorageMetadata,
         builder: GetBlobBuilder,
     ) -> Result<Download, DownloadError> {
         let mut response = builder.into_stream();
+        let mut metadata = HashMap::new();
         // TODO give proper streaming response instead of buffering into RAM
         // https://github.com/neondatabase/neon/issues/5563
         let mut buf = Vec::new();
         while let Some(part) = response.next().await {
             let part = part.map_err(to_download_error)?;
+            if let Some(blob_meta) = part.blob.metadata {
+                metadata.extend(blob_meta.iter().map(|(k, v)| (k.to_owned(), v.to_owned())));
+            }
             let data = part
                 .data
                 .collect()
@@ -131,28 +131,9 @@ impl AzureBlobStorage {
         }
         Ok(Download {
             download_stream: Box::pin(Cursor::new(buf)),
-            metadata: Some(metadata),
+            metadata: Some(StorageMetadata(metadata)),
         })
     }
-    // TODO get rid of this function once we have metadata included in the response
-    // https://github.com/Azure/azure-sdk-for-rust/issues/1439
-    async fn get_metadata(
-        &self,
-        blob_client: &BlobClient,
-    ) -> Result<StorageMetadata, DownloadError> {
-        let builder = blob_client.get_metadata();
-        let response = builder.into_future().await.map_err(to_download_error)?;
-        let mut map = HashMap::new();
-        for md in response.metadata.iter() {
-            map.insert(
-                md.name().as_str().to_string(),
-                md.value().as_str().to_string(),
-            );
-        }
-        Ok(StorageMetadata(map))
-    }
     async fn permit(&self, kind: RequestKind) -> tokio::sync::SemaphorePermit<'_> {
         self.concurrency_limiter
@@ -269,11 +250,9 @@ impl RemoteStorage for AzureBlobStorage {
         let _permit = self.permit(RequestKind::Get).await;
         let blob_client = self.client.blob_client(self.relative_path_to_name(from));
-        let metadata = self.get_metadata(&blob_client).await?;
         let builder = blob_client.get();
-        self.download_for_builder(metadata, builder).await
+        self.download_for_builder(builder).await
     }
     async fn download_byte_range(
@@ -285,8 +264,6 @@ impl RemoteStorage for AzureBlobStorage {
         let _permit = self.permit(RequestKind::Get).await;
         let blob_client = self.client.blob_client(self.relative_path_to_name(from));
-        let metadata = self.get_metadata(&blob_client).await?;
         let mut builder = blob_client.get();
         if let Some(end_exclusive) = end_exclusive {
@@ -301,7 +278,7 @@ impl RemoteStorage for AzureBlobStorage {
             builder = builder.range(Range::new(start_inclusive, end_exclusive));
         }
-        self.download_for_builder(metadata, builder).await
+        self.download_for_builder(builder).await
     }
     async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {

View File

@@ -6,19 +6,15 @@
 //! * [`s3_bucket`] uses AWS S3 bucket as an external storage
 //! * [`azure_blob`] allows to use Azure Blob storage as an external storage
 //!
+#![deny(unsafe_code)]
+#![deny(clippy::undocumented_unsafe_blocks)]
 mod azure_blob;
 mod local_fs;
 mod s3_bucket;
 mod simulate_failures;
-use std::{
-    collections::HashMap,
-    fmt::Debug,
-    num::{NonZeroU32, NonZeroUsize},
-    pin::Pin,
-    sync::Arc,
-};
+use std::{collections::HashMap, fmt::Debug, num::NonZeroUsize, pin::Pin, sync::Arc};
 use anyhow::{bail, Context};
 use camino::{Utf8Path, Utf8PathBuf};
@@ -34,12 +30,6 @@ pub use self::{
 };
 use s3_bucket::RequestKind;
-/// How many different timelines can be processed simultaneously when synchronizing layers with the remote storage.
-/// During regular work, pageserver produces one layer file per timeline checkpoint, with bursts of concurrency
-/// during start (where local and remote timelines are compared and initial sync tasks are scheduled) and timeline attach.
-/// Both cases may trigger timeline download, that might download a lot of layers. This concurrency is limited by the clients internally, if needed.
-pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS: usize = 50;
-pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10;
 /// Currently, sync happens with AWS S3, that has two limits on requests per second:
 /// ~200 RPS for IAM services
 /// <https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/UsingWithRDS.IAMDBAuth.html>
@@ -441,10 +431,6 @@ pub struct StorageMetadata(HashMap<String, String>);
 /// External backup storage configuration, enough for creating a client for that storage.
 #[derive(Debug, Clone, PartialEq, Eq)]
 pub struct RemoteStorageConfig {
-    /// Max allowed number of concurrent sync operations between the API user and the remote storage.
-    pub max_concurrent_syncs: NonZeroUsize,
-    /// Max allowed errors before the sync task is considered failed and evicted.
-    pub max_sync_errors: NonZeroU32,
     /// The storage connection configuration.
     pub storage: RemoteStorageKind,
 }
@@ -540,18 +526,6 @@ impl RemoteStorageConfig {
         let use_azure = container_name.is_some() && container_region.is_some();
-        let max_concurrent_syncs = NonZeroUsize::new(
-            parse_optional_integer("max_concurrent_syncs", toml)?
-                .unwrap_or(DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS),
-        )
-        .context("Failed to parse 'max_concurrent_syncs' as a positive integer")?;
-        let max_sync_errors = NonZeroU32::new(
-            parse_optional_integer("max_sync_errors", toml)?
-                .unwrap_or(DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS),
-        )
-        .context("Failed to parse 'max_sync_errors' as a positive integer")?;
         let default_concurrency_limit = if use_azure {
             DEFAULT_REMOTE_STORAGE_AZURE_CONCURRENCY_LIMIT
         } else {
@@ -633,11 +607,7 @@ impl RemoteStorageConfig {
             }
         };
-        Ok(Some(RemoteStorageConfig {
-            max_concurrent_syncs,
-            max_sync_errors,
-            storage,
-        }))
+        Ok(Some(RemoteStorageConfig { storage }))
     }
 }

View File

@@ -4,23 +4,27 @@
 //! allowing multiple api users to independently work with the same S3 bucket, if
 //! their bucket prefixes are both specified and different.
-use std::borrow::Cow;
+use std::{borrow::Cow, sync::Arc};
 use anyhow::Context;
 use aws_config::{
     environment::credentials::EnvironmentVariableCredentialsProvider,
-    imds::credentials::ImdsCredentialsProvider, meta::credentials::CredentialsProviderChain,
-    provider_config::ProviderConfig, web_identity_token::WebIdentityTokenCredentialsProvider,
+    imds::credentials::ImdsCredentialsProvider,
+    meta::credentials::CredentialsProviderChain,
+    provider_config::ProviderConfig,
+    retry::{RetryConfigBuilder, RetryMode},
+    web_identity_token::WebIdentityTokenCredentialsProvider,
 };
 use aws_credential_types::cache::CredentialsCache;
 use aws_sdk_s3::{
-    config::{Config, Region},
+    config::{AsyncSleep, Config, Region, SharedAsyncSleep},
     error::SdkError,
     operation::get_object::GetObjectError,
     primitives::ByteStream,
     types::{Delete, ObjectIdentifier},
     Client,
 };
+use aws_smithy_async::rt::sleep::TokioSleep;
 use aws_smithy_http::body::SdkBody;
 use hyper::Body;
 use scopeguard::ScopeGuard;
@@ -83,10 +87,23 @@ impl S3Bucket {
                 .or_else("imds", ImdsCredentialsProvider::builder().build())
         };
+        // AWS SDK requires us to specify how the RetryConfig should sleep when it wants to back off
+        let sleep_impl: Arc<dyn AsyncSleep> = Arc::new(TokioSleep::new());
+        // We do our own retries (see [`backoff::retry`]). However, for the AWS SDK to enable rate limiting in response to throttling
+        // responses (e.g. 429 on too many ListObjectsv2 requests), we must provide a retry config. We set it to use at most one
+        // attempt, and enable 'Adaptive' mode, which causes rate limiting to be enabled.
+        let mut retry_config = RetryConfigBuilder::new();
+        retry_config
+            .set_max_attempts(Some(1))
+            .set_mode(Some(RetryMode::Adaptive));
         let mut config_builder = Config::builder()
             .region(region)
             .credentials_cache(CredentialsCache::lazy())
-            .credentials_provider(credentials_provider);
+            .credentials_provider(credentials_provider)
+            .sleep_impl(SharedAsyncSleep::from(sleep_impl))
+            .retry_config(retry_config.build());
         if let Some(custom_endpoint) = aws_config.endpoint.clone() {
             config_builder = config_builder
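The "we do our own retries" comment above refers to an application-level retry helper layered over the SDK. A generic sketch of that pattern (hypothetical `retry_with_backoff` stand-in; the real code uses the crate's `backoff::retry` helper, and async runtimes would sleep non-blockingly):

```rust
use std::thread::sleep;
use std::time::Duration;

// Retry a fallible operation with exponential backoff, up to max_attempts.
// This is a simplified, synchronous stand-in for an async backoff helper.
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
    base_delay_ms: u64,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                // Exponential backoff: base * 2^attempt
                sleep(Duration::from_millis(base_delay_ms << attempt));
                attempt += 1;
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    // Simulate an operation that is throttled twice, then succeeds
    let result: Result<u32, &str> = retry_with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err("throttled") } else { Ok(42) }
        },
        5,
        1,
    );
    assert_eq!(result, Ok(42));
    assert_eq!(calls, 3);
    println!("ok after {calls} calls");
}
```

Setting the SDK's own retry config to a single attempt, as the diff does, avoids double-retrying: the outer helper owns retries while the SDK's 'Adaptive' mode only contributes client-side rate limiting.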

View File

@@ -1,6 +1,6 @@
 use std::collections::HashSet;
 use std::env;
-use std::num::{NonZeroU32, NonZeroUsize};
+use std::num::NonZeroUsize;
 use std::ops::ControlFlow;
 use std::path::PathBuf;
 use std::sync::Arc;
@@ -469,8 +469,6 @@ fn create_azure_client(
     let random = rand::thread_rng().gen::<u32>();
     let remote_storage_config = RemoteStorageConfig {
-        max_concurrent_syncs: NonZeroUsize::new(100).unwrap(),
-        max_sync_errors: NonZeroU32::new(5).unwrap(),
         storage: RemoteStorageKind::AzureContainer(AzureConfig {
             container_name: remote_storage_azure_container,
             container_region: remote_storage_azure_region,

View File

@@ -1,6 +1,6 @@
 use std::collections::HashSet;
 use std::env;
-use std::num::{NonZeroU32, NonZeroUsize};
+use std::num::NonZeroUsize;
 use std::ops::ControlFlow;
 use std::path::PathBuf;
 use std::sync::Arc;
@@ -396,8 +396,6 @@ fn create_s3_client(
     let random = rand::thread_rng().gen::<u32>();
     let remote_storage_config = RemoteStorageConfig {
-        max_concurrent_syncs: NonZeroUsize::new(100).unwrap(),
-        max_sync_errors: NonZeroU32::new(5).unwrap(),
         storage: RemoteStorageKind::AwsS3(S3Config {
             bucket_name: remote_storage_s3_bucket,
             bucket_region: remote_storage_s3_region,

View File

@@ -1,3 +1,5 @@
+#![deny(unsafe_code)]
+#![deny(clippy::undocumented_unsafe_blocks)]
 use const_format::formatcp;
 /// Public API types

View File

@@ -1,23 +1,18 @@
 use serde::{Deserialize, Serialize};
-use serde_with::{serde_as, DisplayFromStr};
 use utils::{
     id::{NodeId, TenantId, TimelineId},
     lsn::Lsn,
 };
-#[serde_as]
 #[derive(Serialize, Deserialize)]
 pub struct TimelineCreateRequest {
-    #[serde_as(as = "DisplayFromStr")]
     pub tenant_id: TenantId,
-    #[serde_as(as = "DisplayFromStr")]
     pub timeline_id: TimelineId,
     pub peer_ids: Option<Vec<NodeId>>,
     pub pg_version: u32,
     pub system_id: Option<u64>,
     pub wal_seg_size: Option<u32>,
-    #[serde_as(as = "DisplayFromStr")]
     pub commit_lsn: Lsn,
     // If not passed, it is assigned to the beginning of commit_lsn segment.
     pub local_start_lsn: Option<Lsn>,
@@ -28,7 +23,6 @@ fn lsn_invalid() -> Lsn {
 }
 /// Data about safekeeper's timeline, mirrors broker.proto.
-#[serde_as]
 #[derive(Debug, Clone, Deserialize, Serialize)]
 pub struct SkTimelineInfo {
     /// Term.
@@ -36,25 +30,19 @@ pub struct SkTimelineInfo {
     /// Term of the last entry.
     pub last_log_term: Option<u64>,
     /// LSN of the last record.
-    #[serde_as(as = "DisplayFromStr")]
     #[serde(default = "lsn_invalid")]
     pub flush_lsn: Lsn,
     /// Up to which LSN safekeeper regards its WAL as committed.
-    #[serde_as(as = "DisplayFromStr")]
     #[serde(default = "lsn_invalid")]
     pub commit_lsn: Lsn,
     /// LSN up to which safekeeper has backed WAL.
-    #[serde_as(as = "DisplayFromStr")]
     #[serde(default = "lsn_invalid")]
     pub backup_lsn: Lsn,
     /// LSN of last checkpoint uploaded by pageserver.
-    #[serde_as(as = "DisplayFromStr")]
     #[serde(default = "lsn_invalid")]
     pub remote_consistent_lsn: Lsn,
-    #[serde_as(as = "DisplayFromStr")]
     #[serde(default = "lsn_invalid")]
     pub peer_horizon_lsn: Lsn,
-    #[serde_as(as = "DisplayFromStr")]
     #[serde(default = "lsn_invalid")]
     pub local_start_lsn: Lsn,
     /// A connection string to use for WAL receiving.

View File

@@ -1,4 +1,6 @@
 //! Synthetic size calculation
+#![deny(unsafe_code)]
+#![deny(clippy::undocumented_unsafe_blocks)]
 mod calculation;
 pub mod svg;

View File

@@ -32,6 +32,8 @@
 //!         .init();
 //! }
 //! ```
+#![deny(unsafe_code)]
+#![deny(clippy::undocumented_unsafe_blocks)]
 use opentelemetry::sdk::Resource;
 use opentelemetry::KeyValue;

View File

@@ -5,6 +5,7 @@ edition.workspace = true
 license.workspace = true
 [dependencies]
+arc-swap.workspace = true
 sentry.workspace = true
 async-trait.workspace = true
 anyhow.workspace = true
@@ -55,6 +56,7 @@ bytes.workspace = true
 criterion.workspace = true
 hex-literal.workspace = true
 camino-tempfile.workspace = true
+serde_assert.workspace = true
 [[bench]]
 name = "benchmarks"

View File

@@ -1,7 +1,8 @@
// For details about authentication see docs/authentication.md // For details about authentication see docs/authentication.md
use arc_swap::ArcSwap;
use serde; use serde;
use std::fs; use std::{borrow::Cow, fmt::Display, fs, sync::Arc};
use anyhow::Result; use anyhow::Result;
use camino::Utf8Path; use camino::Utf8Path;
@@ -9,9 +10,8 @@ use jsonwebtoken::{
decode, encode, Algorithm, DecodingKey, EncodingKey, Header, TokenData, Validation, decode, encode, Algorithm, DecodingKey, EncodingKey, Header, TokenData, Validation,
}; };
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use crate::id::TenantId; use crate::{http::error::ApiError, id::TenantId};
/// Algorithm to use. We require EdDSA. /// Algorithm to use. We require EdDSA.
const STORAGE_TOKEN_ALGORITHM: Algorithm = Algorithm::EdDSA; const STORAGE_TOKEN_ALGORITHM: Algorithm = Algorithm::EdDSA;
@@ -32,11 +32,9 @@ pub enum Scope {
}
/// JWT payload. See docs/authentication.md for the format
-#[serde_as]
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq)]
pub struct Claims {
#[serde(default)]
-#[serde_as(as = "Option<DisplayFromStr>")]
pub tenant_id: Option<TenantId>,
pub scope: Scope,
}
@@ -47,31 +45,106 @@ impl Claims {
}
}
pub struct SwappableJwtAuth(ArcSwap<JwtAuth>);
impl SwappableJwtAuth {
pub fn new(jwt_auth: JwtAuth) -> Self {
SwappableJwtAuth(ArcSwap::new(Arc::new(jwt_auth)))
}
pub fn swap(&self, jwt_auth: JwtAuth) {
self.0.swap(Arc::new(jwt_auth));
}
pub fn decode(&self, token: &str) -> std::result::Result<TokenData<Claims>, AuthError> {
self.0.load().decode(token)
}
}
impl std::fmt::Debug for SwappableJwtAuth {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "Swappable({:?})", self.0.load())
}
}
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct AuthError(pub Cow<'static, str>);
impl Display for AuthError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}", self.0)
}
}
impl From<AuthError> for ApiError {
fn from(_value: AuthError) -> Self {
// Don't pass on the value of the AuthError as a precautionary measure.
// Being intentionally vague in public error communication hurts debugability
// but it is more secure.
ApiError::Forbidden("JWT authentication error".to_string())
}
}
pub struct JwtAuth {
-decoding_key: DecodingKey,
+decoding_keys: Vec<DecodingKey>,
validation: Validation,
}
impl JwtAuth {
-pub fn new(decoding_key: DecodingKey) -> Self {
+pub fn new(decoding_keys: Vec<DecodingKey>) -> Self {
let mut validation = Validation::default();
validation.algorithms = vec![STORAGE_TOKEN_ALGORITHM];
// The default 'required_spec_claims' is 'exp'. But we don't want to require
// expiration.
validation.required_spec_claims = [].into();
Self {
-decoding_key,
+decoding_keys,
validation,
}
}
pub fn from_key_path(key_path: &Utf8Path) -> Result<Self> {
-let public_key = fs::read(key_path)?;
-Ok(Self::new(DecodingKey::from_ed_pem(&public_key)?))
+let metadata = key_path.metadata()?;
+let decoding_keys = if metadata.is_dir() {
let mut keys = Vec::new();
for entry in fs::read_dir(key_path)? {
let path = entry?.path();
if !path.is_file() {
// Ignore directories (don't recurse)
continue;
}
let public_key = fs::read(path)?;
keys.push(DecodingKey::from_ed_pem(&public_key)?);
}
keys
} else if metadata.is_file() {
let public_key = fs::read(key_path)?;
vec![DecodingKey::from_ed_pem(&public_key)?]
} else {
anyhow::bail!("path is neither a directory or a file")
};
if decoding_keys.is_empty() {
anyhow::bail!("Configured for JWT auth with zero decoding keys. All JWT gated requests would be rejected.");
}
Ok(Self::new(decoding_keys))
}
-pub fn decode(&self, token: &str) -> Result<TokenData<Claims>> {
-Ok(decode(token, &self.decoding_key, &self.validation)?)
+/// Attempt to decode the token with the internal decoding keys.
+///
/// The function tries the stored decoding keys in succession,
/// and returns the first yielding a successful result.
/// If there is no working decoding key, it returns the last error.
pub fn decode(&self, token: &str) -> std::result::Result<TokenData<Claims>, AuthError> {
let mut res = None;
for decoding_key in &self.decoding_keys {
res = Some(decode(token, decoding_key, &self.validation));
if let Some(Ok(res)) = res {
return Ok(res);
}
}
if let Some(res) = res {
res.map_err(|e| AuthError(Cow::Owned(e.to_string())))
} else {
Err(AuthError(Cow::Borrowed("no JWT decoding keys configured")))
}
}
}
@@ -111,9 +184,9 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
"#;
#[test]
-fn test_decode() -> Result<(), anyhow::Error> {
+fn test_decode() {
let expected_claims = Claims {
-tenant_id: Some(TenantId::from_str("3d1f7595b468230304e0b73cecbcb081")?),
+tenant_id: Some(TenantId::from_str("3d1f7595b468230304e0b73cecbcb081").unwrap()),
scope: Scope::Tenant,
};
let encoded_eddsa = "eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJzY29wZSI6InRlbmFudCIsInRlbmFudF9pZCI6IjNkMWY3NTk1YjQ2ODIzMDMwNGUwYjczY2VjYmNiMDgxIiwiaXNzIjoibmVvbi5jb250cm9scGxhbmUiLCJleHAiOjE3MDkyMDA4NzksImlhdCI6MTY3ODQ0MjQ3OX0.U3eA8j-uU-JnhzeO3EDHRuXLwkAUFCPxtGHEgw6p7Ccc3YRbFs2tmCdbD9PZEXP-XsxSeBQi1FY0YPcT3NXADw";
// Check it can be validated with the public key
-let auth = JwtAuth::new(DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519)?);
+let auth = JwtAuth::new(vec![DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519).unwrap()]);
-let claims_from_token = auth.decode(encoded_eddsa)?.claims;
+let claims_from_token = auth.decode(encoded_eddsa).unwrap().claims;
assert_eq!(claims_from_token, expected_claims);
-Ok(())
}
#[test]
-fn test_encode() -> Result<(), anyhow::Error> {
+fn test_encode() {
let claims = Claims {
-tenant_id: Some(TenantId::from_str("3d1f7595b468230304e0b73cecbcb081")?),
+tenant_id: Some(TenantId::from_str("3d1f7595b468230304e0b73cecbcb081").unwrap()),
scope: Scope::Tenant,
};
-let encoded = encode_from_key_file(&claims, TEST_PRIV_KEY_ED25519)?;
+let encoded = encode_from_key_file(&claims, TEST_PRIV_KEY_ED25519).unwrap();
// decode it back
-let auth = JwtAuth::new(DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519)?);
+let auth = JwtAuth::new(vec![DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519).unwrap()]);
-let decoded = auth.decode(&encoded)?;
+let decoded = auth.decode(&encoded).unwrap();
assert_eq!(decoded.claims, claims);
-Ok(())
}
}
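The reworked `JwtAuth::decode` tries each configured key in turn, returning the first success and surfacing the last error only when every key fails. A std-only sketch of that control flow (the generic `check` closure stands in for jsonwebtoken's `decode`; names are illustrative, not the real API):

```rust
// Try each key; return the first success, otherwise the last error.
// Err(None) corresponds to the "no JWT decoding keys configured" branch.
fn decode_with_any<K, T, E>(
    keys: &[K],
    check: impl Fn(&K) -> Result<T, E>,
) -> Result<T, Option<E>> {
    let mut last_err = None;
    for key in keys {
        match check(key) {
            Ok(v) => return Ok(v),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err)
}

fn main() {
    // Two "keys": only key 7 validates the token.
    let keys = [3u8, 7u8];
    let res = decode_with_any(&keys, |k| if *k == 7 { Ok("claims") } else { Err("bad sig") });
    assert_eq!(res, Ok("claims"));

    // No key matches: the last error is surfaced.
    let res = decode_with_any(&keys, |_: &u8| Err::<&str, _>("bad sig"));
    assert_eq!(res, Err(Some("bad sig")));

    // Empty key set: there is no error to report.
    let res = decode_with_any::<u8, &str, &str>(&[], |_| Err("unreachable"));
    assert_eq!(res, Err(None));
}
```

Note the trade-off visible in the real code as well: with multiple keys, a signature failure can only report the *last* key's error, so the error message no longer identifies which key was expected.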


@@ -7,7 +7,7 @@ use serde::{Deserialize, Serialize};
///
/// See docs/rfcs/025-generation-numbers.md for detail on how generation
/// numbers are used.
-#[derive(Copy, Clone, Eq, PartialEq, PartialOrd, Ord)]
+#[derive(Copy, Clone, Eq, PartialEq, PartialOrd, Ord, Hash)]
pub enum Generation {
// Generations with this magic value will not add a suffix to S3 keys, and will not
// be included in persisted index_part.json. This value is only to be used

libs/utils/src/hex.rs Normal file

@@ -0,0 +1,41 @@
/// Useful type for asserting that expected bytes match reporting the bytes more readable
/// array-syntax compatible hex bytes.
///
/// # Usage
///
/// ```
/// use utils::Hex;
///
/// let actual = serialize_something();
/// let expected = [0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64];
///
/// // the type implements PartialEq and on mismatch, both sides are printed in 16 wide multiline
/// // output suffixed with an array style length for easier comparisons.
/// assert_eq!(Hex(&actual), Hex(&expected));
///
/// // with `let expected = [0x68];` the error would had been:
/// // assertion `left == right` failed
/// // left: [0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64; 11]
/// // right: [0x68; 1]
/// # fn serialize_something() -> Vec<u8> { "hello world".as_bytes().to_vec() }
/// ```
#[derive(PartialEq)]
pub struct Hex<'a>(pub &'a [u8]);
impl std::fmt::Debug for Hex<'_> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "[")?;
for (i, c) in self.0.chunks(16).enumerate() {
if i > 0 && !c.is_empty() {
writeln!(f, ", ")?;
}
for (j, b) in c.iter().enumerate() {
if j > 0 {
write!(f, ", ")?;
}
write!(f, "0x{b:02x}")?;
}
}
write!(f, "; {}]", self.0.len())
}
}
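`Hex` depends only on std, so its `Debug` output can be sanity-checked in isolation. Copying the type verbatim from the new file above:

```rust
// Verbatim copy of utils::Hex from the diff above, plus a small check.
#[derive(PartialEq)]
pub struct Hex<'a>(pub &'a [u8]);

impl std::fmt::Debug for Hex<'_> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "[")?;
        // 16 bytes per line, comma-separated, with a trailing array-style length.
        for (i, c) in self.0.chunks(16).enumerate() {
            if i > 0 && !c.is_empty() {
                writeln!(f, ", ")?;
            }
            for (j, b) in c.iter().enumerate() {
                if j > 0 {
                    write!(f, ", ")?;
                }
                write!(f, "0x{b:02x}")?;
            }
        }
        write!(f, "; {}]", self.0.len())
    }
}

fn main() {
    // b"hi" is [0x68, 0x69]; the length suffix makes mismatched sizes obvious.
    assert_eq!(format!("{:?}", Hex(b"hi")), "[0x68, 0x69; 2]");
    assert_eq!(format!("{:?}", Hex(&[])), "[; 0]");
}
```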


@@ -1,4 +1,4 @@
-use crate::auth::{Claims, JwtAuth};
+use crate::auth::{AuthError, Claims, SwappableJwtAuth};
use crate::http::error::{api_error_handler, route_error_handler, ApiError};
use anyhow::Context;
use hyper::header::{HeaderName, AUTHORIZATION};
@@ -14,6 +14,11 @@ use tracing::{self, debug, info, info_span, warn, Instrument};
use std::future::Future;
use std::str::FromStr;
use bytes::{Bytes, BytesMut};
use std::io::Write as _;
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;
static SERVE_METRICS_COUNT: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"libmetrics_metric_handler_requests_total",
@@ -146,94 +151,89 @@ impl Drop for RequestCancelled {
}
}
/// An [`std::io::Write`] implementation on top of a channel sending [`bytes::Bytes`] chunks.
pub struct ChannelWriter {
buffer: BytesMut,
pub tx: mpsc::Sender<std::io::Result<Bytes>>,
written: usize,
}
impl ChannelWriter {
pub fn new(buf_len: usize, tx: mpsc::Sender<std::io::Result<Bytes>>) -> Self {
assert_ne!(buf_len, 0);
ChannelWriter {
// split about half off the buffer from the start, because we flush depending on
// capacity. first flush will come sooner than without this, but now resizes will
// have better chance of picking up the "other" half. not guaranteed of course.
buffer: BytesMut::with_capacity(buf_len).split_off(buf_len / 2),
tx,
written: 0,
}
}
pub fn flush0(&mut self) -> std::io::Result<usize> {
let n = self.buffer.len();
if n == 0 {
return Ok(0);
}
tracing::trace!(n, "flushing");
let ready = self.buffer.split().freeze();
// not ideal to call from blocking code to block_on, but we are sure that this
// operation does not spawn_blocking other tasks
let res: Result<(), ()> = tokio::runtime::Handle::current().block_on(async {
self.tx.send(Ok(ready)).await.map_err(|_| ())?;
// throttle sending to allow reuse of our buffer in `write`.
self.tx.reserve().await.map_err(|_| ())?;
// now the response task has picked up the buffer and hopefully started
// sending it to the client.
Ok(())
});
if res.is_err() {
return Err(std::io::ErrorKind::BrokenPipe.into());
}
self.written += n;
Ok(n)
}
pub fn flushed_bytes(&self) -> usize {
self.written
}
}
impl std::io::Write for ChannelWriter {
fn write(&mut self, mut buf: &[u8]) -> std::io::Result<usize> {
let remaining = self.buffer.capacity() - self.buffer.len();
let out_of_space = remaining < buf.len();
let original_len = buf.len();
if out_of_space {
let can_still_fit = buf.len() - remaining;
self.buffer.extend_from_slice(&buf[..can_still_fit]);
buf = &buf[can_still_fit..];
self.flush0()?;
}
// assume that this will often under normal operation just move the pointer back to the
// beginning of allocation, because previous split off parts are already sent and
// dropped.
self.buffer.extend_from_slice(buf);
Ok(original_len)
}
fn flush(&mut self) -> std::io::Result<()> {
self.flush0().map(|_| ())
}
}
async fn prometheus_metrics_handler(_req: Request<Body>) -> Result<Response<Body>, ApiError> {
use bytes::{Bytes, BytesMut};
use std::io::Write as _;
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;
SERVE_METRICS_COUNT.inc();
/// An [`std::io::Write`] implementation on top of a channel sending [`bytes::Bytes`] chunks.
struct ChannelWriter {
buffer: BytesMut,
tx: mpsc::Sender<std::io::Result<Bytes>>,
written: usize,
}
impl ChannelWriter {
fn new(buf_len: usize, tx: mpsc::Sender<std::io::Result<Bytes>>) -> Self {
assert_ne!(buf_len, 0);
ChannelWriter {
// split about half off the buffer from the start, because we flush depending on
// capacity. first flush will come sooner than without this, but now resizes will
// have better chance of picking up the "other" half. not guaranteed of course.
buffer: BytesMut::with_capacity(buf_len).split_off(buf_len / 2),
tx,
written: 0,
}
}
fn flush0(&mut self) -> std::io::Result<usize> {
let n = self.buffer.len();
if n == 0 {
return Ok(0);
}
tracing::trace!(n, "flushing");
let ready = self.buffer.split().freeze();
// not ideal to call from blocking code to block_on, but we are sure that this
// operation does not spawn_blocking other tasks
let res: Result<(), ()> = tokio::runtime::Handle::current().block_on(async {
self.tx.send(Ok(ready)).await.map_err(|_| ())?;
// throttle sending to allow reuse of our buffer in `write`.
self.tx.reserve().await.map_err(|_| ())?;
// now the response task has picked up the buffer and hopefully started
// sending it to the client.
Ok(())
});
if res.is_err() {
return Err(std::io::ErrorKind::BrokenPipe.into());
}
self.written += n;
Ok(n)
}
fn flushed_bytes(&self) -> usize {
self.written
}
}
impl std::io::Write for ChannelWriter {
fn write(&mut self, mut buf: &[u8]) -> std::io::Result<usize> {
let remaining = self.buffer.capacity() - self.buffer.len();
let out_of_space = remaining < buf.len();
let original_len = buf.len();
if out_of_space {
let can_still_fit = buf.len() - remaining;
self.buffer.extend_from_slice(&buf[..can_still_fit]);
buf = &buf[can_still_fit..];
self.flush0()?;
}
// assume that this will often under normal operation just move the pointer back to the
// beginning of allocation, because previous split off parts are already sent and
// dropped.
self.buffer.extend_from_slice(buf);
Ok(original_len)
}
fn flush(&mut self) -> std::io::Result<()> {
self.flush0().map(|_| ())
}
}
let started_at = std::time::Instant::now();
let (tx, rx) = mpsc::channel(1);
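The `ChannelWriter` moved above keeps a fixed-capacity buffer, tops it up to capacity, flushes the full prefix, then buffers the remainder. A simplified, synchronous sketch of that split-and-flush idea (the real type flushes `Bytes` chunks into a tokio mpsc channel and uses `BytesMut::split_off`; here the "channel" is just a `Vec` of flushed chunks, an assumption for the sketch):

```rust
// Synchronous stand-in for ChannelWriter: flushes go into `flushed`
// instead of an mpsc channel.
struct SketchWriter {
    buffer: Vec<u8>,
    cap: usize,
    flushed: Vec<Vec<u8>>,
}

impl SketchWriter {
    fn new(cap: usize) -> Self {
        assert_ne!(cap, 0);
        SketchWriter { buffer: Vec::with_capacity(cap), cap, flushed: Vec::new() }
    }

    // Emit the current buffer as one chunk, like ChannelWriter::flush0.
    fn flush0(&mut self) -> usize {
        let n = self.buffer.len();
        if n > 0 {
            self.flushed.push(std::mem::take(&mut self.buffer));
        }
        n
    }

    fn write(&mut self, mut buf: &[u8]) -> usize {
        let original_len = buf.len();
        let remaining = self.cap - self.buffer.len().min(self.cap);
        if remaining < buf.len() {
            // Fill the buffer to capacity, flush it, then buffer the rest.
            self.buffer.extend_from_slice(&buf[..remaining]);
            buf = &buf[remaining..];
            self.flush0();
        }
        self.buffer.extend_from_slice(buf);
        original_len
    }
}

fn main() {
    let mut w = SketchWriter::new(4);
    assert_eq!(w.write(b"hello world"), 11);
    w.flush0();
    // "hell" filled the 4-byte buffer and was flushed; the tail followed.
    assert_eq!(w.flushed, vec![b"hell".to_vec(), b"o world".to_vec()]);
}
```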
@@ -389,7 +389,7 @@ fn parse_token(header_value: &str) -> Result<&str, ApiError> {
}
pub fn auth_middleware<B: hyper::body::HttpBody + Send + Sync + 'static>(
-provide_auth: fn(&Request<Body>) -> Option<&JwtAuth>,
+provide_auth: fn(&Request<Body>) -> Option<&SwappableJwtAuth>,
) -> Middleware<B, ApiError> {
Middleware::pre(move |req| async move {
if let Some(auth) = provide_auth(&req) {
@@ -400,9 +400,11 @@ pub fn auth_middleware<B: hyper::body::HttpBody + Send + Sync + 'static>(
})?;
let token = parse_token(header_value)?;
-let data = auth
-.decode(token)
-.map_err(|_| ApiError::Unauthorized("malformed jwt token".to_string()))?;
+let data = auth.decode(token).map_err(|err| {
+warn!("Authentication error: {err}");
+// Rely on From<AuthError> for ApiError impl
+err
+})?;
req.set_context(data.claims);
}
None => {
@@ -450,12 +452,11 @@ where
pub fn check_permission_with(
req: &Request<Body>,
-check_permission: impl Fn(&Claims) -> Result<(), anyhow::Error>,
+check_permission: impl Fn(&Claims) -> Result<(), AuthError>,
) -> Result<(), ApiError> {
match req.context::<Claims>() {
-Some(claims) => {
-Ok(check_permission(&claims).map_err(|err| ApiError::Forbidden(err.to_string()))?)
-}
+Some(claims) => Ok(check_permission(&claims)
+.map_err(|_err| ApiError::Forbidden("JWT authentication error".to_string()))?),
None => Ok(()), // claims is None because auth is disabled
}
}


@@ -3,7 +3,7 @@ use serde::{Deserialize, Serialize};
use std::borrow::Cow;
use std::error::Error as StdError;
use thiserror::Error;
-use tracing::{error, info};
+use tracing::{error, info, warn};
#[derive(Debug, Error)]
pub enum ApiError {
@@ -118,6 +118,9 @@ pub fn api_error_handler(api_error: ApiError) -> Response<Body> {
// Print a stack trace for Internal Server errors
match api_error {
ApiError::Forbidden(_) | ApiError::Unauthorized(_) => {
warn!("Error processing HTTP request: {api_error:#}")
}
ApiError::ResourceUnavailable(_) => info!("Error processing HTTP request: {api_error:#}"),
ApiError::NotFound(_) => info!("Error processing HTTP request: {api_error:#}"),
ApiError::InternalServerError(_) => error!("Error processing HTTP request: {api_error:?}"),


@@ -3,6 +3,7 @@ use std::{fmt, str::FromStr};
use anyhow::Context;
use hex::FromHex;
use rand::Rng;
use serde::de::Visitor;
use serde::{Deserialize, Serialize};
use thiserror::Error;
@@ -17,12 +18,74 @@ pub enum IdError {
///
/// NOTE: It (de)serializes as an array of hex bytes, so the string representation would look
/// like `[173,80,132,115,129,226,72,254,170,201,135,108,199,26,228,24]`.
-///
-/// Use `#[serde_as(as = "DisplayFromStr")]` to (de)serialize it as hex string instead: `ad50847381e248feaac9876cc71ae418`.
-/// Check the `serde_with::serde_as` documentation for options for more complex types.
-#[derive(Clone, Copy, PartialEq, Eq, Hash, Serialize, Deserialize, PartialOrd, Ord)]
+#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct Id([u8; 16]);
impl Serialize for Id {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
if serializer.is_human_readable() {
serializer.collect_str(self)
} else {
self.0.serialize(serializer)
}
}
}
impl<'de> Deserialize<'de> for Id {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
struct IdVisitor {
is_human_readable_deserializer: bool,
}
impl<'de> Visitor<'de> for IdVisitor {
type Value = Id;
fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
if self.is_human_readable_deserializer {
formatter.write_str("value in form of hex string")
} else {
formatter.write_str("value in form of integer array([u8; 16])")
}
}
fn visit_seq<A>(self, seq: A) -> Result<Self::Value, A::Error>
where
A: serde::de::SeqAccess<'de>,
{
let s = serde::de::value::SeqAccessDeserializer::new(seq);
let id: [u8; 16] = Deserialize::deserialize(s)?;
Ok(Id::from(id))
}
fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
where
E: serde::de::Error,
{
Id::from_str(v).map_err(E::custom)
}
}
if deserializer.is_human_readable() {
deserializer.deserialize_str(IdVisitor {
is_human_readable_deserializer: true,
})
} else {
deserializer.deserialize_tuple(
16,
IdVisitor {
is_human_readable_deserializer: false,
},
)
}
}
}
impl Id {
pub fn get_from_buf(buf: &mut impl bytes::Buf) -> Id {
let mut arr = [0u8; 16];
@@ -57,6 +120,8 @@ impl Id {
chunk[0] = HEX[((b >> 4) & 0xf) as usize];
chunk[1] = HEX[(b & 0xf) as usize];
}
+// SAFETY: vec constructed out of `HEX`, it can only be ascii
unsafe { String::from_utf8_unchecked(buf) }
}
}
@@ -308,3 +373,112 @@ impl fmt::Display for NodeId {
write!(f, "{}", self.0)
}
}
#[cfg(test)]
mod tests {
use serde_assert::{Deserializer, Serializer, Token, Tokens};
use crate::bin_ser::BeSer;
use super::*;
#[test]
fn test_id_serde_non_human_readable() {
let original_id = Id([
173, 80, 132, 115, 129, 226, 72, 254, 170, 201, 135, 108, 199, 26, 228, 24,
]);
let expected_tokens = Tokens(vec![
Token::Tuple { len: 16 },
Token::U8(173),
Token::U8(80),
Token::U8(132),
Token::U8(115),
Token::U8(129),
Token::U8(226),
Token::U8(72),
Token::U8(254),
Token::U8(170),
Token::U8(201),
Token::U8(135),
Token::U8(108),
Token::U8(199),
Token::U8(26),
Token::U8(228),
Token::U8(24),
Token::TupleEnd,
]);
let serializer = Serializer::builder().is_human_readable(false).build();
let serialized_tokens = original_id.serialize(&serializer).unwrap();
assert_eq!(serialized_tokens, expected_tokens);
let mut deserializer = Deserializer::builder()
.is_human_readable(false)
.tokens(serialized_tokens)
.build();
let deserialized_id = Id::deserialize(&mut deserializer).unwrap();
assert_eq!(deserialized_id, original_id);
}
#[test]
fn test_id_serde_human_readable() {
let original_id = Id([
173, 80, 132, 115, 129, 226, 72, 254, 170, 201, 135, 108, 199, 26, 228, 24,
]);
let expected_tokens = Tokens(vec![Token::Str(String::from(
"ad50847381e248feaac9876cc71ae418",
))]);
let serializer = Serializer::builder().is_human_readable(true).build();
let serialized_tokens = original_id.serialize(&serializer).unwrap();
assert_eq!(serialized_tokens, expected_tokens);
let mut deserializer = Deserializer::builder()
.is_human_readable(true)
.tokens(Tokens(vec![Token::Str(String::from(
"ad50847381e248feaac9876cc71ae418",
))]))
.build();
assert_eq!(Id::deserialize(&mut deserializer).unwrap(), original_id);
}
macro_rules! roundtrip_type {
($type:ty, $expected_bytes:expr) => {{
let expected_bytes: [u8; 16] = $expected_bytes;
let original_id = <$type>::from(expected_bytes);
let ser_bytes = original_id.ser().unwrap();
assert_eq!(ser_bytes, expected_bytes);
let des_id = <$type>::des(&ser_bytes).unwrap();
assert_eq!(des_id, original_id);
}};
}
#[test]
fn test_id_bincode_serde() {
let expected_bytes = [
173, 80, 132, 115, 129, 226, 72, 254, 170, 201, 135, 108, 199, 26, 228, 24,
];
roundtrip_type!(Id, expected_bytes);
}
#[test]
fn test_tenant_id_bincode_serde() {
let expected_bytes = [
173, 80, 132, 115, 129, 226, 72, 254, 170, 201, 135, 108, 199, 26, 228, 24,
];
roundtrip_type!(TenantId, expected_bytes);
}
#[test]
fn test_timeline_id_bincode_serde() {
let expected_bytes = [
173, 80, 132, 115, 129, 226, 72, 254, 170, 201, 135, 108, 199, 26, 228, 24,
];
roundtrip_type!(TimelineId, expected_bytes);
}
}
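The human-readable serde path added above ultimately round-trips through `Id`'s string form: 16 raw bytes to a 32-character lowercase hex string and back. A std-only sketch of that mapping (helper names are illustrative; the real type goes through `Display`/`FromStr`):

```rust
// 16 bytes -> 32-char lowercase hex string.
fn id_to_hex(id: &[u8; 16]) -> String {
    id.iter().map(|b| format!("{b:02x}")).collect()
}

// 32-char hex string -> 16 bytes; None on bad length or non-hex input.
fn id_from_hex(s: &str) -> Option<[u8; 16]> {
    if s.len() != 32 {
        return None;
    }
    let mut out = [0u8; 16];
    for (i, chunk) in s.as_bytes().chunks(2).enumerate() {
        let hi = (chunk[0] as char).to_digit(16)?;
        let lo = (chunk[1] as char).to_digit(16)?;
        out[i] = (hi * 16 + lo) as u8;
    }
    Some(out)
}

fn main() {
    // Same bytes as the tests above.
    let id = [0xad, 0x50, 0x84, 0x73, 0x81, 0xe2, 0x48, 0xfe,
              0xaa, 0xc9, 0x87, 0x6c, 0xc7, 0x1a, 0xe4, 0x18];
    let s = id_to_hex(&id);
    assert_eq!(s, "ad50847381e248feaac9876cc71ae418");
    assert_eq!(id_from_hex(&s), Some(id));
    assert_eq!(id_from_hex("too short"), None);
}
```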


@@ -1,5 +1,6 @@
//! `utils` is intended to be a place to put code that is shared
//! between other crates in this repository.
#![deny(clippy::undocumented_unsafe_blocks)]
pub mod backoff;
@@ -24,6 +25,10 @@ pub mod auth;
// utility functions and helper traits for unified unique id generation/serialization etc.
pub mod id;
mod hex;
pub use hex::Hex;
// http endpoint utils
pub mod http;
@@ -73,6 +78,9 @@ pub mod completion;
/// Reporting utilities
pub mod error;
/// async timeout helper
pub mod timeout;
pub mod sync;
/// This is a shortcut to embed git sha into binaries and avoid copying the same build script to all packages


@@ -1,7 +1,7 @@
#![warn(missing_docs)]
use camino::Utf8Path;
-use serde::{Deserialize, Serialize};
+use serde::{de::Visitor, Deserialize, Serialize};
use std::fmt;
use std::ops::{Add, AddAssign};
use std::str::FromStr;
@@ -13,10 +13,114 @@ use crate::seqwait::MonotonicCounter;
pub const XLOG_BLCKSZ: u32 = 8192;
/// A Postgres LSN (Log Sequence Number), also known as an XLogRecPtr
-#[derive(Clone, Copy, Eq, Ord, PartialEq, PartialOrd, Hash, Serialize, Deserialize)]
-#[serde(transparent)]
+#[derive(Clone, Copy, Eq, Ord, PartialEq, PartialOrd, Hash)]
pub struct Lsn(pub u64);
impl Serialize for Lsn {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
if serializer.is_human_readable() {
serializer.collect_str(self)
} else {
self.0.serialize(serializer)
}
}
}
impl<'de> Deserialize<'de> for Lsn {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
struct LsnVisitor {
is_human_readable_deserializer: bool,
}
impl<'de> Visitor<'de> for LsnVisitor {
type Value = Lsn;
fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
if self.is_human_readable_deserializer {
formatter.write_str(
"value in form of hex string({upper_u32_hex}/{lower_u32_hex}) representing u64 integer",
)
} else {
formatter.write_str("value in form of integer(u64)")
}
}
fn visit_u64<E>(self, v: u64) -> Result<Self::Value, E>
where
E: serde::de::Error,
{
Ok(Lsn(v))
}
fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
where
E: serde::de::Error,
{
Lsn::from_str(v).map_err(|e| E::custom(e))
}
}
if deserializer.is_human_readable() {
deserializer.deserialize_str(LsnVisitor {
is_human_readable_deserializer: true,
})
} else {
deserializer.deserialize_u64(LsnVisitor {
is_human_readable_deserializer: false,
})
}
}
}
/// Allows (de)serialization of an `Lsn` always as `u64`.
///
/// ### Example
///
/// ```rust
/// # use serde::{Serialize, Deserialize};
/// use utils::lsn::Lsn;
///
/// #[derive(PartialEq, Serialize, Deserialize, Debug)]
/// struct Foo {
/// #[serde(with = "utils::lsn::serde_as_u64")]
/// always_u64: Lsn,
/// }
///
/// let orig = Foo { always_u64: Lsn(1234) };
///
/// let res = serde_json::to_string(&orig).unwrap();
/// assert_eq!(res, r#"{"always_u64":1234}"#);
///
/// let foo = serde_json::from_str::<Foo>(&res).unwrap();
/// assert_eq!(foo, orig);
/// ```
///
pub mod serde_as_u64 {
use super::Lsn;
/// Serializes the Lsn as u64 disregarding the human readability of the format.
///
/// Meant to be used via `#[serde(with = "...")]` or `#[serde(serialize_with = "...")]`.
pub fn serialize<S: serde::Serializer>(lsn: &Lsn, serializer: S) -> Result<S::Ok, S::Error> {
use serde::Serialize;
lsn.0.serialize(serializer)
}
/// Deserializes the Lsn as u64 disregarding the human readability of the format.
///
/// Meant to be used via `#[serde(with = "...")]` or `#[serde(deserialize_with = "...")]`.
pub fn deserialize<'de, D: serde::Deserializer<'de>>(deserializer: D) -> Result<Lsn, D::Error> {
use serde::Deserialize;
u64::deserialize(deserializer).map(Lsn)
}
}
/// We tried to parse an LSN from a string, but failed
#[derive(Debug, PartialEq, Eq, thiserror::Error)]
#[error("LsnParseError")]
@@ -264,8 +368,13 @@ impl MonotonicCounter<Lsn> for RecordLsn {
#[cfg(test)]
mod tests {
use crate::bin_ser::BeSer;
use super::*;
use serde::ser::Serialize;
use serde_assert::{Deserializer, Serializer, Token, Tokens};
#[test]
fn test_lsn_strings() {
assert_eq!("12345678/AAAA5555".parse(), Ok(Lsn(0x12345678AAAA5555)));
@@ -341,4 +450,95 @@ mod tests {
assert_eq!(lsn.fetch_max(Lsn(6000)), Lsn(5678));
assert_eq!(lsn.fetch_max(Lsn(5000)), Lsn(6000));
}
#[test]
fn test_lsn_serde() {
let original_lsn = Lsn(0x0123456789abcdef);
let expected_readable_tokens = Tokens(vec![Token::U64(0x0123456789abcdef)]);
let expected_non_readable_tokens =
Tokens(vec![Token::Str(String::from("1234567/89ABCDEF"))]);
// Testing human_readable ser/de
let serializer = Serializer::builder().is_human_readable(false).build();
let readable_ser_tokens = original_lsn.serialize(&serializer).unwrap();
assert_eq!(readable_ser_tokens, expected_readable_tokens);
let mut deserializer = Deserializer::builder()
.is_human_readable(false)
.tokens(readable_ser_tokens)
.build();
let des_lsn = Lsn::deserialize(&mut deserializer).unwrap();
assert_eq!(des_lsn, original_lsn);
// Testing NON human_readable ser/de
let serializer = Serializer::builder().is_human_readable(true).build();
let non_readable_ser_tokens = original_lsn.serialize(&serializer).unwrap();
assert_eq!(non_readable_ser_tokens, expected_non_readable_tokens);
let mut deserializer = Deserializer::builder()
.is_human_readable(true)
.tokens(non_readable_ser_tokens)
.build();
let des_lsn = Lsn::deserialize(&mut deserializer).unwrap();
assert_eq!(des_lsn, original_lsn);
// Testing mismatching ser/de
let serializer = Serializer::builder().is_human_readable(false).build();
let non_readable_ser_tokens = original_lsn.serialize(&serializer).unwrap();
let mut deserializer = Deserializer::builder()
.is_human_readable(true)
.tokens(non_readable_ser_tokens)
.build();
Lsn::deserialize(&mut deserializer).unwrap_err();
let serializer = Serializer::builder().is_human_readable(true).build();
let readable_ser_tokens = original_lsn.serialize(&serializer).unwrap();
let mut deserializer = Deserializer::builder()
.is_human_readable(false)
.tokens(readable_ser_tokens)
.build();
Lsn::deserialize(&mut deserializer).unwrap_err();
}
#[test]
fn test_lsn_ensure_roundtrip() {
let original_lsn = Lsn(0xaaaabbbb);
let serializer = Serializer::builder().is_human_readable(false).build();
let ser_tokens = original_lsn.serialize(&serializer).unwrap();
let mut deserializer = Deserializer::builder()
.is_human_readable(false)
.tokens(ser_tokens)
.build();
let des_lsn = Lsn::deserialize(&mut deserializer).unwrap();
assert_eq!(des_lsn, original_lsn);
}
#[test]
fn test_lsn_bincode_serde() {
let lsn = Lsn(0x0123456789abcdef);
let expected_bytes = [0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef];
let ser_bytes = lsn.ser().unwrap();
assert_eq!(ser_bytes, expected_bytes);
let des_lsn = Lsn::des(&ser_bytes).unwrap();
assert_eq!(des_lsn, lsn);
}
#[test]
fn test_lsn_bincode_ensure_roundtrip() {
let original_lsn = Lsn(0x01_02_03_04_05_06_07_08);
let expected_bytes = vec![0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08];
let ser_bytes = original_lsn.ser().unwrap();
assert_eq!(ser_bytes, expected_bytes);
let des_lsn = Lsn::des(&ser_bytes).unwrap();
assert_eq!(des_lsn, original_lsn);
}
}
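The human-readable `Lsn` form exercised by these tests is Postgres's `{hi:X}/{lo:X}` split of the u64 (compare the `"1234567/89ABCDEF"` token above). A std-only sketch of that pair, assuming the format matches those test strings:

```rust
// u64 LSN -> "hi/lo" in uppercase hex, Postgres style.
fn lsn_to_string(lsn: u64) -> String {
    format!("{:X}/{:X}", lsn >> 32, lsn & 0xffff_ffff)
}

// "hi/lo" hex string -> u64 LSN; None on malformed input.
fn lsn_from_str(s: &str) -> Option<u64> {
    let (hi, lo) = s.split_once('/')?;
    let hi = u64::from_str_radix(hi, 16).ok()?;
    let lo = u64::from_str_radix(lo, 16).ok()?;
    Some((hi << 32) | lo)
}

fn main() {
    // Same values as the serde and parse tests above.
    assert_eq!(lsn_to_string(0x0123456789abcdef), "1234567/89ABCDEF");
    assert_eq!(lsn_from_str("12345678/AAAA5555"), Some(0x12345678AAAA5555));
    assert_eq!(lsn_from_str("not an lsn"), None);
}
```

This also shows why the mismatched ser/de test fails: a `u64` token cannot be parsed by the string visitor, and vice versa, because the two representations share no common shape.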


@@ -3,7 +3,6 @@ use std::time::{Duration, SystemTime};
use bytes::{Buf, BufMut, Bytes, BytesMut};
use pq_proto::{read_cstr, PG_EPOCH};
use serde::{Deserialize, Serialize};
-use serde_with::{serde_as, DisplayFromStr};
use tracing::{trace, warn};
use crate::lsn::Lsn;
@@ -15,21 +14,17 @@ use crate::lsn::Lsn;
///
/// serde Serialize is used only for human readable dump to json (e.g. in
/// safekeepers debug_dump).
-#[serde_as]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub struct PageserverFeedback {
/// Last known size of the timeline. Used to enforce timeline size limit.
pub current_timeline_size: u64,
/// LSN last received and ingested by the pageserver. Controls backpressure.
-#[serde_as(as = "DisplayFromStr")]
pub last_received_lsn: Lsn,
/// LSN up to which data is persisted by the pageserver to its local disc.
/// Controls backpressure.
-#[serde_as(as = "DisplayFromStr")]
pub disk_consistent_lsn: Lsn,
/// LSN up to which data is persisted by the pageserver on s3; safekeepers
/// consider WAL before it can be removed.
-#[serde_as(as = "DisplayFromStr")]
pub remote_consistent_lsn: Lsn,
// Serialize with RFC3339 format.
#[serde(with = "serde_systemtime")]


@@ -125,6 +125,9 @@ where
 // Wake everyone with an error.
 let mut internal = self.internal.lock().unwrap();
+// Block any future waiters from starting
+internal.shutdown = true;
 // This will steal the entire waiters map.
 // When we drop it all waiters will be woken.
 mem::take(&mut internal.waiters)


@@ -1,6 +1,7 @@
 /// Immediately terminate the calling process without calling
 /// atexit callbacks, C runtime destructors etc. We mainly use
 /// this to protect coverage data from concurrent writes.
-pub fn exit_now(code: u8) {
+pub fn exit_now(code: u8) -> ! {
+    // SAFETY: exiting is safe, the ffi is not safe
     unsafe { nix::libc::_exit(code as _) };
 }


@@ -1 +1,3 @@
 pub mod heavier_once_cell;
+pub mod gate;

libs/utils/src/sync/gate.rs (new file, 158 lines)

@@ -0,0 +1,158 @@
use std::{sync::Arc, time::Duration};
/// Gates are a concurrency helper, primarily used for implementing safe shutdown.
///
/// Users of a resource call `enter()` to acquire a GateGuard, and the owner of
/// the resource calls `close()` when they want to ensure that all holders of guards
/// have released them, and that no future guards will be issued.
pub struct Gate {
/// Each caller of enter() takes one unit from the semaphore. In close(), we
/// take all the units to ensure all GateGuards are destroyed.
sem: Arc<tokio::sync::Semaphore>,
/// For observability only: a name that will be used to log warnings if a particular
/// gate is holding up shutdown
name: String,
}
/// RAII guard for a [`Gate`]: as long as this exists, calls to [`Gate::close`] will
/// not complete.
#[derive(Debug)]
pub struct GateGuard(tokio::sync::OwnedSemaphorePermit);
/// Observability helper: every `warn_period`, emit a log warning that we're still waiting on this gate
async fn warn_if_stuck<Fut: std::future::Future>(
fut: Fut,
name: &str,
warn_period: std::time::Duration,
) -> <Fut as std::future::Future>::Output {
let started = std::time::Instant::now();
let mut fut = std::pin::pin!(fut);
loop {
match tokio::time::timeout(warn_period, &mut fut).await {
Ok(ret) => return ret,
Err(_) => {
tracing::warn!(
gate = name,
elapsed_ms = started.elapsed().as_millis(),
"still waiting, taking longer than expected..."
);
}
}
}
}
#[derive(Debug)]
pub enum GateError {
GateClosed,
}
impl Gate {
const MAX_UNITS: u32 = u32::MAX;
pub fn new(name: String) -> Self {
Self {
sem: Arc::new(tokio::sync::Semaphore::new(Self::MAX_UNITS as usize)),
name,
}
}
/// Acquire a guard that will prevent close() calls from completing. If close()
/// was already called, this will return an error which should be interpreted
/// as "shutting down".
///
/// This function would typically be used from e.g. request handlers. While holding
/// the guard returned from this function, it is important to respect a CancellationToken
/// to avoid blocking close() indefinitely: typically types that contain a Gate will
/// also contain a CancellationToken.
pub fn enter(&self) -> Result<GateGuard, GateError> {
self.sem
.clone()
.try_acquire_owned()
.map(GateGuard)
.map_err(|_| GateError::GateClosed)
}
/// Types with a shutdown() method and a gate should call this method at the
/// end of shutdown, to ensure that all GateGuard holders are done.
///
/// This will wait for all guards to be destroyed. For this to complete promptly, it is
/// important that the holders of such guards are respecting a CancellationToken which has
/// been cancelled before entering this function.
pub async fn close(&self) {
warn_if_stuck(self.do_close(), &self.name, Duration::from_millis(1000)).await
}
/// Check if [`Self::close()`] has finished waiting for all [`Self::enter()`] users to finish. This
/// is usually analogous to "Did shutdown finish?" for types that include a Gate, whereas checking
/// the CancellationToken on such types is analogous to "Did shutdown start?"
pub fn close_complete(&self) -> bool {
self.sem.is_closed()
}
async fn do_close(&self) {
tracing::debug!(gate = self.name, "Closing Gate...");
match self.sem.acquire_many(Self::MAX_UNITS).await {
Ok(_units) => {
// While holding all units, close the semaphore. All subsequent calls to enter() will fail.
self.sem.close();
}
Err(_) => {
// Semaphore closed: we are the only function that can do this, so it indicates a double-call.
// This is legal. Timeline::shutdown for example is not protected from being called more than
// once.
tracing::debug!(gate = self.name, "Double close")
}
}
tracing::debug!(gate = self.name, "Closed Gate.")
}
}
#[cfg(test)]
mod tests {
use futures::FutureExt;
use super::*;
#[tokio::test]
async fn test_idle_gate() {
// Having taken no gates, we should not be blocked in close
let gate = Gate::new("test".to_string());
gate.close().await;
// If a guard is dropped before entering, close should not be blocked
let gate = Gate::new("test".to_string());
let guard = gate.enter().unwrap();
drop(guard);
gate.close().await;
// Entering a closed guard fails
gate.enter().expect_err("enter should fail after close");
}
#[tokio::test]
async fn test_busy_gate() {
let gate = Gate::new("test".to_string());
let guard = gate.enter().unwrap();
let mut close_fut = std::pin::pin!(gate.close());
// Close should be blocked
assert!(close_fut.as_mut().now_or_never().is_none());
// Attempting to enter() should fail, even though close isn't done yet.
gate.enter()
.expect_err("enter should fail after entering close");
drop(guard);
// Guard is gone, close should finish
assert!(close_fut.as_mut().now_or_never().is_some());
// Attempting to enter() is still forbidden
gate.enter().expect_err("enter should fail finishing close");
}
}
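The real `Gate` above rides on a tokio semaphore and awaits in `close()`. As a rough std-only sketch of the same enter/close contract (`MiniGate` and `MiniGuard` are illustrative names, and this version blocks a thread where the real type awaits):

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Std-only sketch of the Gate pattern: enter() hands out RAII guards,
/// close() refuses new guards and waits until all existing ones are dropped.
struct MiniGate {
    // (number of live guards, closed flag)
    state: Arc<(Mutex<(usize, bool)>, Condvar)>,
}

struct MiniGuard {
    state: Arc<(Mutex<(usize, bool)>, Condvar)>,
}

impl MiniGate {
    fn new() -> Self {
        MiniGate {
            state: Arc::new((Mutex::new((0, false)), Condvar::new())),
        }
    }

    /// Hand out a guard unless the gate is already closing.
    fn enter(&self) -> Result<MiniGuard, ()> {
        let mut st = self.state.0.lock().unwrap();
        if st.1 {
            // close() already ran: caller should treat this as "shutting down"
            return Err(());
        }
        st.0 += 1;
        Ok(MiniGuard {
            state: Arc::clone(&self.state),
        })
    }

    /// Refuse new guards, then block until every existing guard is dropped.
    fn close(&self) {
        let (lock, cvar) = &*self.state;
        let mut st = lock.lock().unwrap();
        st.1 = true;
        while st.0 > 0 {
            st = cvar.wait(st).unwrap();
        }
    }
}

impl Drop for MiniGuard {
    fn drop(&mut self) {
        let (lock, cvar) = &*self.state;
        let mut st = lock.lock().unwrap();
        st.0 -= 1;
        cvar.notify_all();
    }
}
```

As in the real type, guard holders must still respect some cancellation signal, or `close()` can block indefinitely.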


@@ -1,4 +1,7 @@
-use std::sync::{Arc, Mutex, MutexGuard};
+use std::sync::{
+    atomic::{AtomicUsize, Ordering},
+    Arc, Mutex, MutexGuard,
+};
 use tokio::sync::Semaphore;
 /// Custom design like [`tokio::sync::OnceCell`] but using [`OwnedSemaphorePermit`] instead of
@@ -10,6 +13,7 @@ use tokio::sync::Semaphore;
 /// [`OwnedSemaphorePermit`]: tokio::sync::OwnedSemaphorePermit
 pub struct OnceCell<T> {
     inner: Mutex<Inner<T>>,
+    initializers: AtomicUsize,
 }
 impl<T> Default for OnceCell<T> {
@@ -17,6 +21,7 @@ impl<T> Default for OnceCell<T> {
     fn default() -> Self {
         Self {
             inner: Default::default(),
+            initializers: AtomicUsize::new(0),
         }
     }
 }
@@ -49,6 +54,7 @@ impl<T> OnceCell<T> {
                 init_semaphore: Arc::new(sem),
                 value: Some(value),
             }),
+            initializers: AtomicUsize::new(0),
         }
     }
@@ -60,8 +66,8 @@ impl<T> OnceCell<T> {
     /// Initialization is panic-safe and cancellation-safe.
     pub async fn get_or_init<F, Fut, E>(&self, factory: F) -> Result<Guard<'_, T>, E>
     where
-        F: FnOnce() -> Fut,
-        Fut: std::future::Future<Output = Result<T, E>>,
+        F: FnOnce(InitPermit) -> Fut,
+        Fut: std::future::Future<Output = Result<(T, InitPermit), E>>,
     {
         let sem = {
             let guard = self.inner.lock().unwrap();
@@ -71,29 +77,61 @@ impl<T> OnceCell<T> {
             guard.init_semaphore.clone()
         };
-        let permit = sem.acquire_owned().await;
-        if permit.is_err() {
-            let guard = self.inner.lock().unwrap();
-            assert!(
-                guard.value.is_some(),
-                "semaphore got closed, must be initialized"
-            );
-            return Ok(Guard(guard));
-        } else {
-            // now we try
-            let value = factory().await?;
-            let mut guard = self.inner.lock().unwrap();
-            assert!(
-                guard.value.is_none(),
-                "we won permit, must not be initialized"
-            );
-            guard.value = Some(value);
-            guard.init_semaphore.close();
-            Ok(Guard(guard))
+        let permit = {
+            // increment the count for the duration of queued
+            let _guard = CountWaitingInitializers::start(self);
+            sem.acquire_owned().await
+        };
+        match permit {
+            Ok(permit) => {
+                let permit = InitPermit(permit);
+                let (value, _permit) = factory(permit).await?;
+                let guard = self.inner.lock().unwrap();
+                Ok(Self::set0(value, guard))
+            }
+            Err(_closed) => {
+                let guard = self.inner.lock().unwrap();
+                assert!(
+                    guard.value.is_some(),
+                    "semaphore got closed, must be initialized"
+                );
+                return Ok(Guard(guard));
+            }
         }
     }
+    /// Assuming a permit is held after previous call to [`Guard::take_and_deinit`], it can be used
+    /// to complete initializing the inner value.
+    ///
+    /// # Panics
+    ///
+    /// If the inner has already been initialized.
+    pub fn set(&self, value: T, _permit: InitPermit) -> Guard<'_, T> {
+        let guard = self.inner.lock().unwrap();
+        // cannot assert that this permit is for self.inner.semaphore, but we can assert it cannot
+        // give more permits right now.
+        if guard.init_semaphore.try_acquire().is_ok() {
+            drop(guard);
+            panic!("permit is of wrong origin");
+        }
+        Self::set0(value, guard)
+    }
+    fn set0(value: T, mut guard: std::sync::MutexGuard<'_, Inner<T>>) -> Guard<'_, T> {
+        if guard.value.is_some() {
+            drop(guard);
+            unreachable!("we won permit, must not be initialized");
+        }
+        guard.value = Some(value);
+        guard.init_semaphore.close();
+        Guard(guard)
+    }
     /// Returns a guard to an existing initialized value, if any.
     pub fn get(&self) -> Option<Guard<'_, T>> {
         let guard = self.inner.lock().unwrap();
@@ -103,6 +141,28 @@ impl<T> OnceCell<T> {
             None
         }
     }
+    /// Return the number of [`Self::get_or_init`] calls waiting for initialization to complete.
+    pub fn initializer_count(&self) -> usize {
+        self.initializers.load(Ordering::Relaxed)
+    }
+}
+/// DropGuard counter for queued tasks waiting to initialize, mainly accessible for the
+/// initializing task for example at the end of initialization.
+struct CountWaitingInitializers<'a, T>(&'a OnceCell<T>);
+impl<'a, T> CountWaitingInitializers<'a, T> {
+    fn start(target: &'a OnceCell<T>) -> Self {
+        target.initializers.fetch_add(1, Ordering::Relaxed);
+        CountWaitingInitializers(target)
+    }
+}
+impl<'a, T> Drop for CountWaitingInitializers<'a, T> {
+    fn drop(&mut self) {
+        self.0.initializers.fetch_sub(1, Ordering::Relaxed);
+    }
 }
 /// Uninteresting guard object to allow short-lived access to inspect or clone the held,
@@ -135,7 +195,7 @@ impl<'a, T> Guard<'a, T> {
     ///
     /// The permit will be on a semaphore part of the new internal value, and any following
     /// [`OnceCell::get_or_init`] will wait on it to complete.
-    pub fn take_and_deinit(&mut self) -> (T, tokio::sync::OwnedSemaphorePermit) {
+    pub fn take_and_deinit(&mut self) -> (T, InitPermit) {
         let mut swapped = Inner::default();
         let permit = swapped
             .init_semaphore
@@ -145,11 +205,14 @@ impl<'a, T> Guard<'a, T> {
         std::mem::swap(&mut *self.0, &mut swapped);
         swapped
             .value
-            .map(|v| (v, permit))
+            .map(|v| (v, InitPermit(permit)))
             .expect("guard is not created unless value has been initialized")
     }
 }
+/// Type held by OnceCell (de)initializing task.
+pub struct InitPermit(tokio::sync::OwnedSemaphorePermit);
 #[cfg(test)]
 mod tests {
     use super::*;
@@ -185,11 +248,11 @@ mod tests {
         barrier.wait().await;
         let won = {
             let g = cell
-                .get_or_init(|| {
+                .get_or_init(|permit| {
                     counters.factory_got_to_run.fetch_add(1, Ordering::Relaxed);
                     async {
                         counters.future_polled.fetch_add(1, Ordering::Relaxed);
-                        Ok::<_, Infallible>(i)
+                        Ok::<_, Infallible>((i, permit))
                     }
                 })
                 .await
@@ -243,7 +306,7 @@ mod tests {
         deinitialization_started.wait().await;
         let started_at = tokio::time::Instant::now();
-        cell.get_or_init(|| async { Ok::<_, Infallible>(reinit) })
+        cell.get_or_init(|permit| async { Ok::<_, Infallible>((reinit, permit)) })
             .await
             .unwrap();
@@ -258,18 +321,32 @@ mod tests {
         assert_eq!(*cell.get().unwrap(), reinit);
     }
+    #[test]
+    fn reinit_with_deinit_permit() {
+        let cell = Arc::new(OnceCell::new(42));
+        let (mol, permit) = cell.get().unwrap().take_and_deinit();
+        cell.set(5, permit);
+        assert_eq!(*cell.get().unwrap(), 5);
+        let (five, permit) = cell.get().unwrap().take_and_deinit();
+        assert_eq!(5, five);
+        cell.set(mol, permit);
+        assert_eq!(*cell.get().unwrap(), 42);
+    }
     #[tokio::test]
     async fn initialization_attemptable_until_ok() {
         let cell = OnceCell::default();
         for _ in 0..10 {
-            cell.get_or_init(|| async { Err("whatever error") })
+            cell.get_or_init(|_permit| async { Err("whatever error") })
                 .await
                 .unwrap_err();
         }
         let g = cell
-            .get_or_init(|| async { Ok::<_, Infallible>("finally success") })
+            .get_or_init(|permit| async { Ok::<_, Infallible>(("finally success", permit)) })
            .await
            .unwrap();
         assert_eq!(*g, "finally success");
@@ -281,11 +358,11 @@ mod tests {
         let barrier = tokio::sync::Barrier::new(2);
-        let initializer = cell.get_or_init(|| async {
+        let initializer = cell.get_or_init(|permit| async {
             barrier.wait().await;
             futures::future::pending::<()>().await;
-            Ok::<_, Infallible>("never reached")
+            Ok::<_, Infallible>(("never reached", permit))
         });
         tokio::select! {
@@ -298,7 +375,7 @@ mod tests {
         assert!(cell.get().is_none());
         let g = cell
-            .get_or_init(|| async { Ok::<_, Infallible>("now initialized") })
+            .get_or_init(|permit| async { Ok::<_, Infallible>(("now initialized", permit)) })
             .await
             .unwrap();
         assert_eq!(*g, "now initialized");
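The `CountWaitingInitializers` type in the diff above is a plain RAII counter. A self-contained sketch of the same pattern (`CountGuard` is an illustrative name): constructing the guard increments the shared counter and dropping it decrements, so the count tracks in-flight waiters even across panics and early returns.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// RAII counter guard: increments on creation, decrements on drop.
struct CountGuard<'a>(&'a AtomicUsize);

impl<'a> CountGuard<'a> {
    fn start(counter: &'a AtomicUsize) -> Self {
        counter.fetch_add(1, Ordering::Relaxed);
        CountGuard(counter)
    }
}

impl Drop for CountGuard<'_> {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::Relaxed);
    }
}
```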

libs/utils/src/timeout.rs (new file, 37 lines)

@@ -0,0 +1,37 @@
use std::time::Duration;
use tokio_util::sync::CancellationToken;
pub enum TimeoutCancellableError {
Timeout,
Cancelled,
}
/// Wrap [`tokio::time::timeout`] with a CancellationToken.
///
/// This wrapper is appropriate for any long running operation in a task
/// that ought to respect a CancellationToken (which means most tasks).
///
/// The only time you should use a bare tokio::timeout is when the future `F`
/// itself respects a CancellationToken: otherwise, always use this wrapper
/// with your CancellationToken to ensure that your task does not hold up
/// graceful shutdown.
pub async fn timeout_cancellable<F>(
duration: Duration,
cancel: &CancellationToken,
future: F,
) -> Result<F::Output, TimeoutCancellableError>
where
F: std::future::Future,
{
tokio::select!(
r = tokio::time::timeout(duration, future) => {
r.map_err(|_| TimeoutCancellableError::Timeout)
},
_ = cancel.cancelled() => {
Err(TimeoutCancellableError::Cancelled)
}
)
}
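`timeout_cancellable` needs a tokio runtime to demonstrate. To show the same two-failure-mode contract without one, here is a std-only analogue (all names are illustrative: an mpsc channel stands in for the CancellationToken, and a short polling loop approximates what `tokio::select!` achieves without polling):

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::{Duration, Instant};

#[derive(Debug)]
enum WaitError {
    Timeout,
    Cancelled,
}

/// Wait for a value on `work`, giving up early when `timeout` elapses or a
/// cancellation message arrives, mirroring TimeoutCancellableError above.
fn recv_cancellable<T>(
    timeout: Duration,
    cancel: &mpsc::Receiver<()>,
    work: &mpsc::Receiver<T>,
) -> Result<T, WaitError> {
    let deadline = Instant::now() + timeout;
    loop {
        // Wait for the work result in small slices so cancellation and the
        // overall deadline are both observed promptly.
        match work.recv_timeout(Duration::from_millis(10)) {
            Ok(v) => return Ok(v),
            Err(RecvTimeoutError::Timeout) => {}
            Err(RecvTimeoutError::Disconnected) => return Err(WaitError::Cancelled),
        }
        if cancel.try_recv().is_ok() {
            return Err(WaitError::Cancelled);
        }
        if Instant::now() >= deadline {
            return Err(WaitError::Timeout);
        }
    }
}
```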


@@ -19,13 +19,12 @@ inotify.workspace = true
 serde.workspace = true
 serde_json.workspace = true
 sysinfo.workspace = true
-tokio.workspace = true
+tokio = { workspace = true, features = ["rt-multi-thread"] }
 tokio-postgres.workspace = true
 tokio-stream.workspace = true
 tokio-util.workspace = true
 tracing.workspace = true
 tracing-subscriber.workspace = true
-workspace_hack = { version = "0.1", path = "../../workspace_hack" }
 [target.'cfg(target_os = "linux")'.dependencies]
 cgroups-rs = "0.3.3"


@@ -21,11 +21,6 @@ pub struct FileCacheState {
 #[derive(Debug)]
 pub struct FileCacheConfig {
-    /// Whether the file cache is *actually* stored in memory (e.g. by writing to
-    /// a tmpfs or shmem file). If true, the size of the file cache will be counted against the
-    /// memory available for the cgroup.
-    pub(crate) in_memory: bool,
     /// The size of the file cache, in terms of the size of the resource it consumes
     /// (currently: only memory)
     ///
@@ -59,22 +54,9 @@ pub struct FileCacheConfig {
     spread_factor: f64,
 }
-impl FileCacheConfig {
-    pub fn default_in_memory() -> Self {
+impl Default for FileCacheConfig {
+    fn default() -> Self {
         Self {
-            in_memory: true,
-            // 75 %
-            resource_multiplier: 0.75,
-            // 640 MiB; (512 + 128)
-            min_remaining_after_cache: NonZeroU64::new(640 * MiB).unwrap(),
-            // ensure any increase in file cache size is split 90-10 with 10% to other memory
-            spread_factor: 0.1,
-        }
-    }
-    pub fn default_on_disk() -> Self {
-        Self {
-            in_memory: false,
             resource_multiplier: 0.75,
             // 256 MiB - lower than when in memory because overcommitting is safe; if we don't have
             // memory, the kernel will just evict from its page cache, rather than e.g. killing
@@ -83,7 +65,9 @@ impl FileCacheConfig {
             spread_factor: 0.1,
         }
     }
+}
+impl FileCacheConfig {
     /// Make sure fields of the config are consistent.
     pub fn validate(&self) -> anyhow::Result<()> {
         // Single field validity


@@ -1,3 +1,5 @@
+#![deny(unsafe_code)]
+#![deny(clippy::undocumented_unsafe_blocks)]
 #![cfg(target_os = "linux")]
 use anyhow::Context;
@@ -39,16 +41,6 @@ pub struct Args {
     #[arg(short, long)]
     pub pgconnstr: Option<String>,
-    /// Flag to signal that the Postgres file cache is on disk (i.e. not in memory aside from the
-    /// kernel's page cache), and therefore should not count against available memory.
-    //
-    // NB: Ideally this flag would directly refer to whether the file cache is in memory (rather
-    // than a roundabout way, via whether it's on disk), but in order to be backwards compatible
-    // during the switch away from an in-memory file cache, we had to default to the previous
-    // behavior.
-    #[arg(long)]
-    pub file_cache_on_disk: bool,
     /// The address we should listen on for connection requests. For the
     /// agent, this is 0.0.0.0:10301. For the informant, this is 127.0.0.1:10369.
     #[arg(short, long)]


@@ -156,10 +156,7 @@ impl Runner {
         // memory limits.
         if let Some(connstr) = &args.pgconnstr {
             info!("initializing file cache");
-            let config = match args.file_cache_on_disk {
-                true => FileCacheConfig::default_on_disk(),
-                false => FileCacheConfig::default_in_memory(),
-            };
+            let config = FileCacheConfig::default();
             let mut file_cache = FileCacheState::new(connstr, config, token.clone())
                 .await
@@ -187,10 +184,7 @@ impl Runner {
                 info!("file cache size actually got set to {actual_size}")
             }
-            if args.file_cache_on_disk {
-                file_cache_disk_size = actual_size;
-            }
+            file_cache_disk_size = actual_size;
             state.filecache = Some(file_cache);
         }
@@ -239,17 +233,11 @@ impl Runner {
         let requested_mem = target.mem;
         let usable_system_memory = requested_mem.saturating_sub(self.config.sys_buffer_bytes);
-        let (expected_file_cache_size, expected_file_cache_disk_size) = self
+        let expected_file_cache_size = self
             .filecache
             .as_ref()
-            .map(|file_cache| {
-                let size = file_cache.config.calculate_cache_size(usable_system_memory);
-                match file_cache.config.in_memory {
-                    true => (size, 0),
-                    false => (size, size),
-                }
-            })
-            .unwrap_or((0, 0));
+            .map(|file_cache| file_cache.config.calculate_cache_size(usable_system_memory))
+            .unwrap_or(0);
         if let Some(cgroup) = &self.cgroup {
             let (last_time, last_history) = *cgroup.watcher.borrow();
@@ -273,7 +261,7 @@ impl Runner {
             let new_threshold = self
                 .config
-                .cgroup_threshold(usable_system_memory, expected_file_cache_disk_size);
+                .cgroup_threshold(usable_system_memory, expected_file_cache_size);
             let current = last_history.avg_non_reclaimable;
@@ -300,13 +288,10 @@ impl Runner {
                 .set_file_cache_size(expected_file_cache_size)
                 .await
                 .context("failed to set file cache size")?;
-            if !file_cache.config.in_memory {
-                file_cache_disk_size = actual_usage;
-            }
+            file_cache_disk_size = actual_usage;
             let message = format!(
-                "set file cache size to {} MiB (in memory = {})",
+                "set file cache size to {} MiB",
                 bytes_to_mebibytes(actual_usage),
-                file_cache.config.in_memory,
             );
             info!("downscale: {message}");
             status.push(message);
@@ -357,9 +342,7 @@ impl Runner {
                 .set_file_cache_size(expected_usage)
                 .await
                 .context("failed to set file cache size")?;
-            if !file_cache.config.in_memory {
-                file_cache_disk_size = actual_usage;
-            }
+            file_cache_disk_size = actual_usage;
             if actual_usage != expected_usage {
                 warn!(


@@ -188,6 +188,7 @@ extern "C" fn recovery_download(
     }
 }
+#[allow(clippy::unnecessary_cast)]
 extern "C" fn wal_read(
     sk: *mut Safekeeper,
     buf: *mut ::std::os::raw::c_char,
@@ -421,6 +422,7 @@ impl std::fmt::Display for Level {
 }
 /// Take ownership of `Vec<u8>` from StringInfoData.
+#[allow(clippy::unnecessary_cast)]
 pub(crate) fn take_vec_u8(pg: &mut StringInfoData) -> Option<Vec<u8>> {
     if pg.data.is_null() {
         return None;


@@ -186,7 +186,7 @@ impl Wrapper {
         .unwrap()
         .into_bytes_with_nul();
     assert!(safekeepers_list_vec.len() == safekeepers_list_vec.capacity());
-    let safekeepers_list = safekeepers_list_vec.as_mut_ptr() as *mut i8;
+    let safekeepers_list = safekeepers_list_vec.as_mut_ptr() as *mut std::ffi::c_char;
     let callback_data = Box::into_raw(Box::new(api)) as *mut ::std::os::raw::c_void;


@@ -1,22 +1,21 @@
-use anyhow::{bail, Result};
-use utils::auth::{Claims, Scope};
+use utils::auth::{AuthError, Claims, Scope};
 use utils::id::TenantId;
-pub fn check_permission(claims: &Claims, tenant_id: Option<TenantId>) -> Result<()> {
+pub fn check_permission(claims: &Claims, tenant_id: Option<TenantId>) -> Result<(), AuthError> {
     match (&claims.scope, tenant_id) {
-        (Scope::Tenant, None) => {
-            bail!("Attempt to access management api with tenant scope. Permission denied")
-        }
+        (Scope::Tenant, None) => Err(AuthError(
+            "Attempt to access management api with tenant scope. Permission denied".into(),
+        )),
         (Scope::Tenant, Some(tenant_id)) => {
             if claims.tenant_id.unwrap() != tenant_id {
-                bail!("Tenant id mismatch. Permission denied")
+                return Err(AuthError("Tenant id mismatch. Permission denied".into()));
             }
             Ok(())
         }
         (Scope::PageServerApi, None) => Ok(()), // access to management api for PageServerApi scope
         (Scope::PageServerApi, Some(_)) => Ok(()), // access to tenant api using PageServerApi scope
-        (Scope::SafekeeperData, _) => {
-            bail!("SafekeeperData scope makes no sense for Pageserver")
-        }
+        (Scope::SafekeeperData, _) => Err(AuthError(
+            "SafekeeperData scope makes no sense for Pageserver".into(),
+        )),
     }
 }
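The change above swaps anyhow's `bail!` for a typed `AuthError` so callers can handle denial distinctly. A std-only sketch of the same shape (`AuthError`, `Scope`, and `check` here are simplified stand-ins for the utils-crate types, with tenant ids reduced to u64 for brevity):

```rust
/// Simplified stand-in for utils::auth::AuthError.
#[derive(Debug)]
struct AuthError(String);

enum Scope {
    Tenant,
    PageServerApi,
    SafekeeperData,
}

/// Scope-based permission check returning a typed error instead of anyhow.
fn check(
    scope: Scope,
    claims_tenant: Option<u64>,
    tenant_id: Option<u64>,
) -> Result<(), AuthError> {
    match (scope, tenant_id) {
        (Scope::Tenant, None) => Err(AuthError(
            "Attempt to access management api with tenant scope. Permission denied".into(),
        )),
        (Scope::Tenant, Some(t)) if claims_tenant != Some(t) => {
            Err(AuthError("Tenant id mismatch. Permission denied".into()))
        }
        (Scope::Tenant, Some(_)) => Ok(()),
        (Scope::PageServerApi, _) => Ok(()),
        (Scope::SafekeeperData, _) => Err(AuthError(
            "SafekeeperData scope makes no sense for Pageserver".into(),
        )),
    }
}
```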


@@ -34,8 +34,11 @@ use postgres_backend::AuthType;
 use utils::logging::TracingErrorLayerEnablement;
 use utils::signals::ShutdownSignals;
 use utils::{
-    auth::JwtAuth, logging, project_build_tag, project_git_version, sentry_init::init_sentry,
-    signals::Signal, tcp_listener,
+    auth::{JwtAuth, SwappableJwtAuth},
+    logging, project_build_tag, project_git_version,
+    sentry_init::init_sentry,
+    signals::Signal,
+    tcp_listener,
 };
 project_git_version!(GIT_VERSION);
@@ -321,13 +324,12 @@ fn start_pageserver(
     let http_auth;
     let pg_auth;
     if conf.http_auth_type == AuthType::NeonJWT || conf.pg_auth_type == AuthType::NeonJWT {
-        // unwrap is ok because check is performed when creating config, so path is set and file exists
+        // unwrap is ok because check is performed when creating config, so path is set and exists
         let key_path = conf.auth_validation_public_key_path.as_ref().unwrap();
-        info!(
-            "Loading public key for verifying JWT tokens from {:#?}",
-            key_path
-        );
-        let auth: Arc<JwtAuth> = Arc::new(JwtAuth::from_key_path(key_path)?);
+        info!("Loading public key(s) for verifying JWT tokens from {key_path:?}");
+        let jwt_auth = JwtAuth::from_key_path(key_path)?;
+        let auth: Arc<SwappableJwtAuth> = Arc::new(SwappableJwtAuth::new(jwt_auth));
         http_auth = match &conf.http_auth_type {
             AuthType::Trust => None,
@@ -410,7 +412,7 @@ fn start_pageserver(
     // Scan the local 'tenants/' directory and start loading the tenants
     let deletion_queue_client = deletion_queue.new_client();
-    BACKGROUND_RUNTIME.block_on(mgr::init_tenant_mgr(
+    let tenant_manager = BACKGROUND_RUNTIME.block_on(mgr::init_tenant_mgr(
         conf,
         TenantSharedResources {
             broker_client: broker_client.clone(),
@@ -420,6 +422,7 @@ fn start_pageserver(
         order,
         shutdown_pageserver.clone(),
     ))?;
+    let tenant_manager = Arc::new(tenant_manager);
     BACKGROUND_RUNTIME.spawn({
         let init_done_rx = init_done_rx;
@@ -548,6 +551,7 @@ fn start_pageserver(
     let router_state = Arc::new(
         http::routes::State::new(
             conf,
+            tenant_manager,
             http_auth.clone(),
             remote_storage.clone(),
             broker_client.clone(),


@@ -161,7 +161,7 @@ pub struct PageServerConf {
     pub http_auth_type: AuthType,
     /// authentication method for libpq connections from compute
     pub pg_auth_type: AuthType,
-    /// Path to a file containing public key for verifying JWT tokens.
+    /// Path to a file or directory containing public key(s) for verifying JWT tokens.
     /// Used for both mgmt and compute auth, if enabled.
     pub auth_validation_public_key_path: Option<Utf8PathBuf>,
@@ -1314,12 +1314,6 @@ broker_endpoint = '{broker_endpoint}'
         assert_eq!(
             parsed_remote_storage_config,
             RemoteStorageConfig {
-                max_concurrent_syncs: NonZeroUsize::new(
-                    remote_storage::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS
-                )
-                .unwrap(),
-                max_sync_errors: NonZeroU32::new(remote_storage::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS)
-                    .unwrap(),
                 storage: RemoteStorageKind::LocalFs(local_storage_path.clone()),
             },
             "Remote storage config should correctly parse the local FS config and fill other storage defaults"
@@ -1380,8 +1374,6 @@ broker_endpoint = '{broker_endpoint}'
         assert_eq!(
             parsed_remote_storage_config,
             RemoteStorageConfig {
-                max_concurrent_syncs,
-                max_sync_errors,
                 storage: RemoteStorageKind::AwsS3(S3Config {
                     bucket_name: bucket_name.clone(),
                     bucket_region: bucket_region.clone(),

@@ -266,7 +266,7 @@ async fn calculate_synthetic_size_worker(
             continue;
         }
-        if let Ok(tenant) = mgr::get_tenant(tenant_id, true).await {
+        if let Ok(tenant) = mgr::get_tenant(tenant_id, true) {
             // TODO should we use concurrent_background_tasks_rate_limit() here, like the other background tasks?
             // We can put in some prioritization for consumption metrics.
             // Same for the loop that fetches computed metrics.

@@ -3,7 +3,6 @@ use anyhow::Context;
 use chrono::{DateTime, Utc};
 use consumption_metrics::EventType;
 use futures::stream::StreamExt;
-use serde_with::serde_as;
 use std::{sync::Arc, time::SystemTime};
 use utils::{
     id::{TenantId, TimelineId},
@@ -42,13 +41,10 @@ pub(super) enum Name {
 ///
 /// This is a denormalization done at the MetricsKey const methods; these should not be constructed
 /// elsewhere.
-#[serde_with::serde_as]
 #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, serde::Serialize, serde::Deserialize)]
 pub(crate) struct MetricsKey {
-    #[serde_as(as = "serde_with::DisplayFromStr")]
     pub(super) tenant_id: TenantId,
-    #[serde_as(as = "Option<serde_with::DisplayFromStr>")]
     #[serde(skip_serializing_if = "Option::is_none")]
     pub(super) timeline_id: Option<TimelineId>,
@@ -206,7 +202,6 @@ pub(super) async fn collect_all_metrics(
                 None
             } else {
                 crate::tenant::mgr::get_tenant(id, true)
-                    .await
                     .ok()
                     .map(|tenant| (id, tenant))
             }

@@ -1,5 +1,4 @@
 use consumption_metrics::{Event, EventChunk, IdempotencyKey, CHUNK_SIZE};
-use serde_with::serde_as;
 use tokio_util::sync::CancellationToken;
 use tracing::Instrument;
@@ -7,12 +6,9 @@ use super::{metrics::Name, Cache, MetricsKey, RawMetric};
 use utils::id::{TenantId, TimelineId};

 /// How the metrics from pageserver are identified.
-#[serde_with::serde_as]
 #[derive(serde::Serialize, serde::Deserialize, Debug, Clone, Copy, PartialEq)]
 struct Ids {
-    #[serde_as(as = "serde_with::DisplayFromStr")]
     pub(super) tenant_id: TenantId,
-    #[serde_as(as = "Option<serde_with::DisplayFromStr>")]
     #[serde(skip_serializing_if = "Option::is_none")]
     pub(super) timeline_id: Option<TimelineId>,
 }

@@ -57,7 +57,10 @@ impl ControlPlaneClient {
         if let Some(jwt) = &conf.control_plane_api_token {
             let mut headers = hyper::HeaderMap::new();
-            headers.insert("Authorization", jwt.get_contents().parse().unwrap());
+            headers.insert(
+                "Authorization",
+                format!("Bearer {}", jwt.get_contents()).parse().unwrap(),
+            );
             client = client.default_headers(headers);
         }
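The fix above prepends the `Bearer ` scheme to the token before setting the `Authorization` header, instead of sending the bare JWT. A minimal sketch of the difference, using plain strings rather than hyper's `HeaderMap`:

```rust
// Build an RFC 6750-style Authorization header value from a raw token.
// The control plane rejects a bare token; it expects "Bearer <token>".
fn authorization_header(token: &str) -> String {
    format!("Bearer {}", token)
}

fn main() {
    let header = authorization_header("secret-jwt");
    assert_eq!(header, "Bearer secret-jwt");
    println!("{header}");
}
```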

@@ -10,6 +10,7 @@ use crate::control_plane_client::ControlPlaneGenerationsApi;
 use crate::metrics;
 use crate::tenant::remote_timeline_client::remote_layer_path;
 use crate::tenant::remote_timeline_client::remote_timeline_path;
+use crate::virtual_file::MaybeFatalIo;
 use crate::virtual_file::VirtualFile;
 use anyhow::Context;
 use camino::Utf8PathBuf;
@@ -17,7 +18,6 @@ use hex::FromHex;
 use remote_storage::{GenericRemoteStorage, RemotePath};
 use serde::Deserialize;
 use serde::Serialize;
-use serde_with::serde_as;
 use thiserror::Error;
 use tokio;
 use tokio_util::sync::CancellationToken;
@@ -214,7 +214,6 @@ where
 /// during recovery as startup.
 const TEMP_SUFFIX: &str = "tmp";

-#[serde_as]
 #[derive(Debug, Serialize, Deserialize)]
 struct DeletionList {
     /// Serialization version, for future use
@@ -243,7 +242,6 @@ struct DeletionList {
     validated: bool,
 }

-#[serde_as]
 #[derive(Debug, Serialize, Deserialize)]
 struct DeletionHeader {
     /// Serialization version, for future use
@@ -271,7 +269,9 @@ impl DeletionHeader {
         let temp_path = path_with_suffix_extension(&header_path, TEMP_SUFFIX);
         VirtualFile::crashsafe_overwrite(&header_path, &temp_path, &header_bytes)
             .await
-            .map_err(Into::into)
+            .maybe_fatal_err("save deletion header")?;
+        Ok(())
     }
 }
@@ -360,6 +360,7 @@ impl DeletionList {
         let bytes = serde_json::to_vec(self).expect("Failed to serialize deletion list");
         VirtualFile::crashsafe_overwrite(&path, &temp_path, &bytes)
             .await
+            .maybe_fatal_err("save deletion list")
             .map_err(Into::into)
     }
 }
@@ -892,14 +893,6 @@ mod test {
         std::fs::create_dir_all(remote_fs_dir)?;
         let remote_fs_dir = harness.conf.workdir.join("remote_fs").canonicalize_utf8()?;
         let storage_config = RemoteStorageConfig {
-            max_concurrent_syncs: std::num::NonZeroUsize::new(
-                remote_storage::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS,
-            )
-            .unwrap(),
-            max_sync_errors: std::num::NonZeroU32::new(
-                remote_storage::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS,
-            )
-            .unwrap(),
             storage: RemoteStorageKind::LocalFs(remote_fs_dir.clone()),
         };
         let storage = GenericRemoteStorage::from_config(&storage_config).unwrap();
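The `VirtualFile::crashsafe_overwrite` calls above write through a temp path and then replace the target. A minimal stdlib sketch of that write-temp-then-rename pattern (function and file names here are illustrative, not the pageserver API; the real implementation also fsyncs the file and directory):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Write `bytes` to `path` so a crash mid-write never leaves a truncated
// file: write a sibling temp file first, then rename it over the target.
// On POSIX, rename within one filesystem is atomic, so readers observe
// either the complete old content or the complete new content.
fn crashsafe_overwrite(path: &Path, tmp: &Path, bytes: &[u8]) -> io::Result<()> {
    fs::write(tmp, bytes)?;
    fs::rename(tmp, path)?;
    Ok(())
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir();
    let target = dir.join("deletion.header");
    let tmp = dir.join("deletion.header.tmp");
    crashsafe_overwrite(&target, &tmp, b"v1")?;
    crashsafe_overwrite(&target, &tmp, b"v2")?;
    assert_eq!(fs::read(&target)?, b"v2");
    Ok(())
}
```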

@@ -55,21 +55,24 @@ impl Deleter {
     /// Wrap the remote `delete_objects` with a failpoint
     async fn remote_delete(&self) -> Result<(), anyhow::Error> {
-        fail::fail_point!("deletion-queue-before-execute", |_| {
-            info!("Skipping execution, failpoint set");
-            metrics::DELETION_QUEUE
-                .remote_errors
-                .with_label_values(&["failpoint"])
-                .inc();
-            Err(anyhow::anyhow!("failpoint hit"))
-        });
-
         // A backoff::retry is used here for two reasons:
         // - To provide a backoff rather than busy-polling the API on errors
         // - To absorb transient 429/503 conditions without hitting our error
         //   logging path for issues deleting objects.
         backoff::retry(
-            || async { self.remote_storage.delete_objects(&self.accumulator).await },
+            || async {
+                fail::fail_point!("deletion-queue-before-execute", |_| {
+                    info!("Skipping execution, failpoint set");
+                    metrics::DELETION_QUEUE
+                        .remote_errors
+                        .with_label_values(&["failpoint"])
+                        .inc();
+                    Err(anyhow::anyhow!("failpoint: deletion-queue-before-execute"))
+                });
+                self.remote_storage.delete_objects(&self.accumulator).await
+            },
             |_| false,
             3,
             10,
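The change above moves the failpoint inside the retried closure, so it is evaluated on every attempt rather than once before the retry loop starts. A self-contained sketch of why the placement matters, with a stand-in failpoint that trips on the first two attempts (the retry helper here is illustrative, not `backoff::retry`):

```rust
// Retry a fallible operation up to `max_attempts` times. Because the
// failure check lives *inside* the closure, it runs on every attempt --
// mirroring the move of fail_point! into the retried closure above.
fn retry<F: FnMut() -> Result<u32, String>>(mut op: F, max_attempts: u32) -> Result<u32, String> {
    let mut last_err = String::new();
    for _ in 0..max_attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}

fn main() {
    let mut attempts = 0;
    let result = retry(
        || {
            attempts += 1;
            if attempts <= 2 {
                // stand-in for the "deletion-queue-before-execute" failpoint
                Err(format!("failpoint hit on attempt {attempts}"))
            } else {
                Ok(attempts)
            }
        },
        3,
    );
    assert_eq!(result, Ok(3));
    assert_eq!(attempts, 3);
}
```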

@@ -34,6 +34,8 @@ use crate::deletion_queue::TEMP_SUFFIX;
 use crate::metrics;
 use crate::tenant::remote_timeline_client::remote_layer_path;
 use crate::tenant::storage_layer::LayerFileName;
+use crate::virtual_file::on_fatal_io_error;
+use crate::virtual_file::MaybeFatalIo;

 // The number of keys in a DeletionList before we will proactively persist it
 // (without reaching a flush deadline). This aims to deliver objects of the order
@@ -195,7 +197,7 @@ impl ListWriter {
                 debug!("Deletion header {header_path} not found, first start?");
                 Ok(None)
             } else {
-                Err(anyhow::anyhow!(e))
+                on_fatal_io_error(&e, "reading deletion header");
             }
         }
     }
@@ -216,16 +218,9 @@ impl ListWriter {
         self.pending.sequence = validated_sequence + 1;

         let deletion_directory = self.conf.deletion_prefix();
-        let mut dir = match tokio::fs::read_dir(&deletion_directory).await {
-            Ok(d) => d,
-            Err(e) => {
-                warn!("Failed to open deletion list directory {deletion_directory}: {e:#}");
-                // Give up: if we can't read the deletion list directory, we probably can't
-                // write lists into it later, so the queue won't work.
-                return Err(e.into());
-            }
-        };
+        let mut dir = tokio::fs::read_dir(&deletion_directory)
+            .await
+            .fatal_err("read deletion directory");

         let list_name_pattern =
             Regex::new("(?<sequence>[a-zA-Z0-9]{16})-(?<version>[a-zA-Z0-9]{2}).list").unwrap();
@@ -233,7 +228,7 @@ impl ListWriter {
         let temp_extension = format!(".{TEMP_SUFFIX}");
         let header_path = self.conf.deletion_header_path();
         let mut seqs: Vec<u64> = Vec::new();
-        while let Some(dentry) = dir.next_entry().await? {
+        while let Some(dentry) = dir.next_entry().await.fatal_err("read deletion dentry") {
             let file_name = dentry.file_name();
             let dentry_str = file_name.to_string_lossy();
@@ -246,11 +241,9 @@ impl ListWriter {
                 info!("Cleaning up temporary file {dentry_str}");
                 let absolute_path =
                     deletion_directory.join(dentry.file_name().to_str().expect("non-Unicode path"));
-                if let Err(e) = tokio::fs::remove_file(&absolute_path).await {
-                    // Non-fatal error: we will just leave the file behind but not
-                    // try and load it.
-                    warn!("Failed to clean up temporary file {absolute_path}: {e:#}");
-                }
+                tokio::fs::remove_file(&absolute_path)
+                    .await
+                    .fatal_err("delete temp file");

                 continue;
             }
@@ -290,7 +283,9 @@ impl ListWriter {
         for s in seqs {
             let list_path = self.conf.deletion_list_path(s);

-            let list_bytes = tokio::fs::read(&list_path).await?;
+            let list_bytes = tokio::fs::read(&list_path)
+                .await
+                .fatal_err("read deletion list");

             let mut deletion_list = match serde_json::from_slice::<DeletionList>(&list_bytes) {
                 Ok(l) => l,
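The `fatal_err`/`maybe_fatal_err` calls introduced above come from the pageserver's `MaybeFatalIo` helper: unexpected local I/O errors are treated as process-fatal instead of being bubbled up and retried. A minimal stdlib sketch of that extension-trait pattern (illustrative, not the actual pageserver implementation):

```rust
use std::io;

// Extension trait in the spirit of MaybeFatalIo: unwrap the Ok value,
// or abort the process with context on an I/O error. Local-disk failures
// on a pageserver are not recoverable by the caller, so crashing loudly
// beats limping along with a broken deletion queue.
trait FatalIo<T> {
    fn fatal_err(self, context: &str) -> T;
}

impl<T> FatalIo<T> for io::Result<T> {
    fn fatal_err(self, context: &str) -> T {
        match self {
            Ok(v) => v,
            Err(e) => {
                eprintln!("Fatal I/O error: {context}: {e}");
                std::process::exit(1);
            }
        }
    }
}

fn main() {
    // Succeeds, so fatal_err just unwraps the value.
    let data: io::Result<Vec<u8>> = Ok(b"deletion list".to_vec());
    let bytes = data.fatal_err("read deletion list");
    assert_eq!(bytes, b"deletion list");
}
```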

@@ -28,6 +28,7 @@ use crate::config::PageServerConf;
 use crate::control_plane_client::ControlPlaneGenerationsApi;
 use crate::control_plane_client::RetryForeverError;
 use crate::metrics;
+use crate::virtual_file::MaybeFatalIo;

 use super::deleter::DeleterMessage;
 use super::DeletionHeader;
@@ -287,16 +288,9 @@ where
     async fn cleanup_lists(&mut self, list_paths: Vec<Utf8PathBuf>) {
         for list_path in list_paths {
             debug!("Removing deletion list {list_path}");
-            if let Err(e) = tokio::fs::remove_file(&list_path).await {
-                // Unexpected: we should have permissions and nothing else should
-                // be touching these files. We will leave the file behind. Subsequent
-                // pageservers will try and load it again: hopefully whatever storage
-                // issue (probably permissions) has been fixed by then.
-                tracing::error!("Failed to delete {list_path}: {e:#}");
-                metrics::DELETION_QUEUE.unexpected_errors.inc();
-                break;
-            }
+            tokio::fs::remove_file(&list_path)
+                .await
+                .fatal_err("remove deletion list");
         }
     }

@@ -403,7 +403,7 @@ pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
             return (evicted_bytes, evictions_failed);
         };

-        let results = timeline.evict_layers(&batch, &cancel).await;
+        let results = timeline.evict_layers(&batch).await;

         match results {
             Ok(results) => {
@@ -545,7 +545,7 @@ async fn collect_eviction_candidates(
         if cancel.is_cancelled() {
             return Ok(EvictionCandidates::Cancelled);
         }
-        let tenant = match tenant::mgr::get_tenant(*tenant_id, true).await {
+        let tenant = match tenant::mgr::get_tenant(*tenant_id, true) {
             Ok(tenant) => tenant,
             Err(e) => {
                 // this can happen if tenant has lifecycle transition after we fetched it
@@ -554,6 +554,11 @@ async fn collect_eviction_candidates(
             }
         };

+        if tenant.cancel.is_cancelled() {
+            info!(%tenant_id, "Skipping tenant for eviction, it is shutting down");
+            continue;
+        }
+
         // collect layers from all timelines in this tenant
         //
         // If one of the timelines becomes `!is_active()` during the iteration,

@@ -52,6 +52,31 @@ paths:
             schema:
               type: object

+  /v1/reload_auth_validation_keys:
+    post:
+      description: Reloads the JWT public keys from their pre-configured location on disk.
+      responses:
+        "200":
+          description: The reload completed successfully.
+        "401":
+          description: Unauthorized Error
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/UnauthorizedError"
+        "403":
+          description: Forbidden Error
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/ForbiddenError"
+        "500":
+          description: Generic operation error (also hits if no keys were found)
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/Error"
+
   /v1/tenant/{tenant_id}:
     parameters:
       - name: tenant_id
@@ -327,7 +352,8 @@ paths:
           in: query
           required: true
           schema:
-            type: integer
+            type: string
+            format: hex
           description: A LSN to get the timestamp
       responses:
         "200":
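The spec change above declares the `lsn` query parameter as a hex string rather than an integer. Assuming pageserver LSNs are rendered in their usual `hi/lo` hex form (e.g. `0/16B5A50`), a stdlib sketch of parsing that shape into a 64-bit value (illustrative, not the `utils::lsn::Lsn` implementation):

```rust
// Parse an LSN of the form "hi/lo" (both hex) into a u64, mirroring how
// pageserver LSNs are rendered as strings. Returns None on bad input.
fn parse_lsn(s: &str) -> Option<u64> {
    let (hi, lo) = s.split_once('/')?;
    let hi = u64::from_str_radix(hi, 16).ok()?;
    let lo = u64::from_str_radix(lo, 16).ok()?;
    Some((hi << 32) | lo)
}

fn main() {
    assert_eq!(parse_lsn("0/16B5A50"), Some(0x16B5A50));
    assert_eq!(parse_lsn("1/0"), Some(1u64 << 32));
    assert_eq!(parse_lsn("42"), None); // missing the '/' separator
}
```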
@@ -569,7 +595,17 @@ paths:
                 schema:
                   $ref: "#/components/schemas/NotFoundError"
         "409":
-          description: Tenant download is already in progress
+          description: |
+            The tenant is already known to Pageserver in some way,
+            and hence this `/attach` call has been rejected.
+            Some examples of how this can happen:
+            - tenant was created on this pageserver
+            - tenant attachment was started by an earlier call to `/attach`.
+            Callers should poll the tenant status's `attachment_status` field,
+            like for status 202. See the longer description for `POST /attach`
+            for details.
           content:
             application/json:
               schema:

@@ -16,11 +16,12 @@ use pageserver_api::models::{
     DownloadRemoteLayersTaskSpawnRequest, LocationConfigMode, TenantAttachRequest,
     TenantLoadRequest, TenantLocationConfigRequest,
 };
+use pageserver_api::shard::TenantShardId;
 use remote_storage::GenericRemoteStorage;
-use serde_with::{serde_as, DisplayFromStr};
 use tenant_size_model::{SizeResult, StorageModel};
 use tokio_util::sync::CancellationToken;
 use tracing::*;
+use utils::auth::JwtAuth;
 use utils::http::endpoint::request_span;
 use utils::http::json::json_request_or_empty_body;
 use utils::http::request::{get_request_param, must_get_query_param, parse_query_param};
@@ -36,7 +37,8 @@ use crate::pgdatadir_mapping::LsnForTimestamp;
 use crate::task_mgr::TaskKind;
 use crate::tenant::config::{LocationConf, TenantConfOpt};
 use crate::tenant::mgr::{
-    GetTenantError, SetNewTenantConfigError, TenantMapInsertError, TenantStateError,
+    GetTenantError, SetNewTenantConfigError, TenantManager, TenantMapError, TenantMapInsertError,
+    TenantSlotError, TenantSlotUpsertError, TenantStateError,
 };
 use crate::tenant::size::ModelInputs;
 use crate::tenant::storage_layer::LayerAccessStatsReset;
@@ -45,7 +47,7 @@ use crate::tenant::{LogicalSizeCalculationCause, PageReconstructError, TenantSha
 use crate::{config::PageServerConf, tenant::mgr};
 use crate::{disk_usage_eviction_task, tenant};
 use utils::{
-    auth::JwtAuth,
+    auth::SwappableJwtAuth,
     generation::Generation,
     http::{
         endpoint::{self, attach_openapi_ui, auth_middleware, check_permission_with},
@@ -63,7 +65,8 @@ use super::models::ConfigureFailpointsRequest;
 pub struct State {
     conf: &'static PageServerConf,
-    auth: Option<Arc<JwtAuth>>,
+    tenant_manager: Arc<TenantManager>,
+    auth: Option<Arc<SwappableJwtAuth>>,
     allowlist_routes: Vec<Uri>,
     remote_storage: Option<GenericRemoteStorage>,
     broker_client: storage_broker::BrokerClientChannel,
@@ -74,7 +77,8 @@ pub struct State {
 impl State {
     pub fn new(
         conf: &'static PageServerConf,
-        auth: Option<Arc<JwtAuth>>,
+        tenant_manager: Arc<TenantManager>,
+        auth: Option<Arc<SwappableJwtAuth>>,
         remote_storage: Option<GenericRemoteStorage>,
         broker_client: storage_broker::BrokerClientChannel,
         disk_usage_eviction_state: Arc<disk_usage_eviction_task::State>,
@@ -86,6 +90,7 @@ impl State {
             .collect::<Vec<_>>();
         Ok(Self {
             conf,
+            tenant_manager,
             auth,
             allowlist_routes,
             remote_storage,
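The `TenantShardId` introduced here is the new key type for the shard-aware `TenantsMap`, which per the PR summary becomes a `BTreeMap` so that all shards of one tenant can be found with a range lookup. A minimal sketch of that pattern with an illustrative tuple key (not the real `TenantShardId`):

```rust
use std::collections::BTreeMap;

// Keying a BTreeMap by a (tenant, shard) pair lets us find every shard
// of one tenant with a range scan -- something a HashMap keyed by tenant
// id alone cannot do once multiple shards of the same tenant are
// attached to one pageserver.
type TenantId = u32;
type ShardNumber = u8;

fn shards_of(map: &BTreeMap<(TenantId, ShardNumber), String>, tenant: TenantId) -> Vec<ShardNumber> {
    map.range((tenant, ShardNumber::MIN)..=(tenant, ShardNumber::MAX))
        .map(|((_, shard), _)| *shard)
        .collect()
}

fn main() {
    let mut map = BTreeMap::new();
    map.insert((7, 0), "tenant 7 shard 0".to_string());
    map.insert((7, 3), "tenant 7 shard 3".to_string());
    map.insert((9, 1), "tenant 9 shard 1".to_string());
    assert_eq!(shards_of(&map, 7), vec![0, 3]);
    assert_eq!(shards_of(&map, 9), vec![1]);
    assert!(shards_of(&map, 8).is_empty());
}
```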
@@ -147,28 +152,59 @@ impl From<PageReconstructError> for ApiError {
 impl From<TenantMapInsertError> for ApiError {
     fn from(tmie: TenantMapInsertError) -> ApiError {
         match tmie {
-            TenantMapInsertError::StillInitializing | TenantMapInsertError::ShuttingDown => {
-                ApiError::ResourceUnavailable(format!("{tmie}").into())
-            }
-            TenantMapInsertError::TenantAlreadyExists(id, state) => {
-                ApiError::Conflict(format!("tenant {id} already exists, state: {state:?}"))
-            }
-            TenantMapInsertError::TenantExistsSecondary(id) => {
-                ApiError::Conflict(format!("tenant {id} already exists as secondary"))
-            }
+            TenantMapInsertError::SlotError(e) => e.into(),
+            TenantMapInsertError::SlotUpsertError(e) => e.into(),
             TenantMapInsertError::Other(e) => ApiError::InternalServerError(e),
         }
     }
 }

+impl From<TenantSlotError> for ApiError {
+    fn from(e: TenantSlotError) -> ApiError {
+        use TenantSlotError::*;
+        match e {
+            NotFound(tenant_id) => {
+                ApiError::NotFound(anyhow::anyhow!("NotFound: tenant {tenant_id}").into())
+            }
+            e @ (AlreadyExists(_, _) | Conflict(_)) => ApiError::Conflict(format!("{e}")),
+            InProgress => {
+                ApiError::ResourceUnavailable("Tenant is being modified concurrently".into())
+            }
+            MapState(e) => e.into(),
+        }
+    }
+}
+
+impl From<TenantSlotUpsertError> for ApiError {
+    fn from(e: TenantSlotUpsertError) -> ApiError {
+        use TenantSlotUpsertError::*;
+        match e {
+            InternalError(e) => ApiError::InternalServerError(anyhow::anyhow!("{e}")),
+            MapState(e) => e.into(),
+        }
+    }
+}
+
+impl From<TenantMapError> for ApiError {
+    fn from(e: TenantMapError) -> ApiError {
+        use TenantMapError::*;
+        match e {
+            StillInitializing | ShuttingDown => {
+                ApiError::ResourceUnavailable(format!("{e}").into())
+            }
+        }
+    }
+}
+
 impl From<TenantStateError> for ApiError {
     fn from(tse: TenantStateError) -> ApiError {
         match tse {
-            TenantStateError::NotFound(tid) => ApiError::NotFound(anyhow!("tenant {}", tid).into()),
             TenantStateError::IsStopping(_) => {
                 ApiError::ResourceUnavailable("Tenant is stopping".into())
             }
-            _ => ApiError::InternalServerError(anyhow::Error::new(tse)),
+            TenantStateError::SlotError(e) => e.into(),
+            TenantStateError::SlotUpsertError(e) => e.into(),
+            TenantStateError::Other(e) => ApiError::InternalServerError(anyhow!(e)),
         }
     }
 }
@@ -189,6 +225,7 @@ impl From<GetTenantError> for ApiError {
                 // (We can produce this variant only in `mgr::get_tenant(..., active=true)` calls).
                 ApiError::ResourceUnavailable("Tenant not yet active".into())
             }
+            GetTenantError::MapState(e) => ApiError::ResourceUnavailable(format!("{e}").into()),
         }
     }
 }
@@ -243,6 +280,9 @@ impl From<crate::tenant::delete::DeleteTenantError> for ApiError {
             Get(g) => ApiError::from(g),
             e @ AlreadyInProgress => ApiError::Conflict(e.to_string()),
             Timeline(t) => ApiError::from(t),
+            NotAttached => ApiError::NotFound(anyhow::anyhow!("Tenant is not attached").into()),
+            SlotError(e) => e.into(),
+            SlotUpsertError(e) => e.into(),
             Other(o) => ApiError::InternalServerError(o),
             e @ InvalidState(_) => ApiError::PreconditionFailed(e.to_string().into_boxed_str()),
         }
@@ -264,11 +304,7 @@
         // we're executing this function, we will outlive the timeline on-disk state.
         info.current_logical_size_non_incremental = Some(
             timeline
-                .get_current_logical_size_non_incremental(
-                    info.last_record_lsn,
-                    CancellationToken::new(),
-                    ctx,
-                )
+                .get_current_logical_size_non_incremental(info.last_record_lsn, ctx)
                 .await?,
         );
     }
@@ -354,13 +390,39 @@
     json_response(StatusCode::OK, StatusResponse { id: config.id })
 }

+async fn reload_auth_validation_keys_handler(
+    request: Request<Body>,
+    _cancel: CancellationToken,
+) -> Result<Response<Body>, ApiError> {
+    check_permission(&request, None)?;
+    let config = get_config(&request);
+    let state = get_state(&request);
+    let Some(shared_auth) = &state.auth else {
+        return json_response(StatusCode::BAD_REQUEST, ());
+    };
+    // unwrap is ok because check is performed when creating config, so path is set and exists
+    let key_path = config.auth_validation_public_key_path.as_ref().unwrap();
+    info!("Reloading public key(s) for verifying JWT tokens from {key_path:?}");
+
+    match JwtAuth::from_key_path(key_path) {
+        Ok(new_auth) => {
+            shared_auth.swap(new_auth);
+            json_response(StatusCode::OK, ())
+        }
+        Err(e) => {
+            warn!("Error reloading public keys from {key_path:?}: {e:}");
+            json_response(StatusCode::INTERNAL_SERVER_ERROR, ())
+        }
+    }
+}
+
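The reload handler above swaps a freshly loaded `JwtAuth` into a shared `SwappableJwtAuth` slot, so in-flight requests keep the verifier they started with while new requests see the new keys. A minimal stdlib sketch of that swap pattern, with a `String` standing in for the real verifier:

```rust
use std::sync::{Arc, RwLock};

// Sketch of the SwappableJwtAuth idea: handlers hold a shared slot and a
// reload endpoint swaps in a new value without restarting the server.
// This is an illustrative sketch, not the utils::auth implementation.
struct Swappable {
    inner: RwLock<Arc<String>>,
}

impl Swappable {
    fn new(v: String) -> Self {
        Swappable { inner: RwLock::new(Arc::new(v)) }
    }
    fn swap(&self, v: String) {
        *self.inner.write().unwrap() = Arc::new(v);
    }
    fn current(&self) -> Arc<String> {
        self.inner.read().unwrap().clone()
    }
}

fn main() {
    let auth = Swappable::new("old-public-key".to_string());
    let before = auth.current();
    auth.swap("new-public-key".to_string());
    assert_eq!(*before, "old-public-key"); // existing readers keep their snapshot
    assert_eq!(*auth.current(), "new-public-key"); // new readers see the reload
}
```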
 async fn timeline_create_handler(
     mut request: Request<Body>,
     _cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
     let request_data: TimelineCreateRequest = json_request(&mut request).await?;
-    check_permission(&request, Some(tenant_id))?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;

     let new_timeline_id = request_data.new_timeline_id;
@@ -369,7 +431,7 @@ async fn timeline_create_handler(
     let state = get_state(&request);

     async {
-        let tenant = mgr::get_tenant(tenant_id, true).await?;
+        let tenant = state.tenant_manager.get_attached_tenant_shard(tenant_shard_id, true)?;
         match tenant.create_timeline(
             new_timeline_id,
             request_data.ancestor_timeline_id.map(TimelineId::from),
@@ -397,10 +459,16 @@ async fn timeline_create_handler(
             Err(e @ tenant::CreateTimelineError::AncestorNotActive) => {
                 json_response(StatusCode::SERVICE_UNAVAILABLE, HttpErrorBody::from_msg(e.to_string()))
             }
+            Err(tenant::CreateTimelineError::ShuttingDown) => {
+                json_response(StatusCode::SERVICE_UNAVAILABLE, HttpErrorBody::from_msg("tenant shutting down".to_string()))
+            }
             Err(tenant::CreateTimelineError::Other(err)) => Err(ApiError::InternalServerError(err)),
         }
     }
-    .instrument(info_span!("timeline_create", %tenant_id, timeline_id = %new_timeline_id, lsn=?request_data.ancestor_start_lsn, pg_version=?request_data.pg_version))
+    .instrument(info_span!("timeline_create",
+        tenant_id = %tenant_shard_id.tenant_id,
+        shard = %tenant_shard_id.shard_slug(),
+        timeline_id = %new_timeline_id, lsn=?request_data.ancestor_start_lsn, pg_version=?request_data.pg_version))
     .await
 }
@@ -416,7 +484,7 @@ async fn timeline_list_handler(
     let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);

     let response_data = async {
-        let tenant = mgr::get_tenant(tenant_id, true).await?;
+        let tenant = mgr::get_tenant(tenant_id, true)?;
         let timelines = tenant.list_timelines();

         let mut response_data = Vec::with_capacity(timelines.len());
@@ -455,7 +523,7 @@ async fn timeline_detail_handler(
     let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);

     let timeline_info = async {
-        let tenant = mgr::get_tenant(tenant_id, true).await?;
+        let tenant = mgr::get_tenant(tenant_id, true)?;

         let timeline = tenant
             .get_timeline(timeline_id, false)
@@ -499,10 +567,8 @@ async fn get_lsn_by_timestamp_handler(
     let result = timeline.find_lsn_for_timestamp(timestamp_pg, &ctx).await?;

     if version.unwrap_or(0) > 1 {
-        #[serde_as]
         #[derive(serde::Serialize)]
         struct Result {
-            #[serde_as(as = "DisplayFromStr")]
             lsn: Lsn,
             kind: &'static str,
         }
@@ -598,14 +664,15 @@ async fn timeline_delete_handler(
     request: Request<Body>,
     _cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
     let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;

     let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Warn);
+    let state = get_state(&request);

-    mgr::delete_timeline(tenant_id, timeline_id, &ctx)
-        .instrument(info_span!("timeline_delete", %tenant_id, %timeline_id))
+    state.tenant_manager.delete_timeline(tenant_shard_id, timeline_id, &ctx)
+        .instrument(info_span!("timeline_delete", tenant_id=%tenant_shard_id.tenant_id, shard=%tenant_shard_id.shard_slug(), %timeline_id))
         .await?;

     json_response(StatusCode::ACCEPTED, ())
@@ -619,11 +686,14 @@ async fn tenant_detach_handler(
     check_permission(&request, Some(tenant_id))?;
     let detach_ignored: Option<bool> = parse_query_param(&request, "detach_ignored")?;

+    // This is a legacy API (`/location_conf` is the replacement). It only supports unsharded tenants
+    let tenant_shard_id = TenantShardId::unsharded(tenant_id);
+
     let state = get_state(&request);
     let conf = state.conf;
     mgr::detach_tenant(
         conf,
-        tenant_id,
+        tenant_shard_id,
         detach_ignored.unwrap_or(false),
         &state.deletion_queue_client,
     )
@@ -713,7 +783,7 @@ async fn tenant_status(
     check_permission(&request, Some(tenant_id))?;

     let tenant_info = async {
-        let tenant = mgr::get_tenant(tenant_id, false).await?;
+        let tenant = mgr::get_tenant(tenant_id, false)?;

         // Calculate total physical size of all timelines
         let mut current_physical_size = 0;
@@ -740,13 +810,16 @@ async fn tenant_delete_handler(
     _cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
     // TODO openapi spec
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;

     let state = get_state(&request);

-    mgr::delete_tenant(state.conf, state.remote_storage.clone(), tenant_id)
-        .instrument(info_span!("tenant_delete_handler", %tenant_id))
+    mgr::delete_tenant(state.conf, state.remote_storage.clone(), tenant_shard_id)
+        .instrument(info_span!("tenant_delete_handler",
+            tenant_id = %tenant_shard_id.tenant_id,
+            shard = tenant_shard_id.shard_slug()
+        ))
         .await?;

     json_response(StatusCode::ACCEPTED, ())
@@ -776,7 +849,7 @@ async fn tenant_size_handler(
     let headers = request.headers();
     let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);

-    let tenant = mgr::get_tenant(tenant_id, true).await?;
+    let tenant = mgr::get_tenant(tenant_id, true)?;

     // this can be long operation
     let inputs = tenant
@@ -811,10 +884,8 @@ async fn tenant_size_handler(
 }

 /// The type resides in the pageserver not to expose `ModelInputs`.
-#[serde_with::serde_as]
 #[derive(serde::Serialize)]
 struct TenantHistorySize {
-    #[serde_as(as = "serde_with::DisplayFromStr")]
id: TenantId, id: TenantId,
/// Size is a mixture of WAL and logical size, so the unit is bytes. /// Size is a mixture of WAL and logical size, so the unit is bytes.
/// ///
@@ -1035,7 +1106,7 @@ async fn get_tenant_config_handler(
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?; let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?; check_permission(&request, Some(tenant_id))?;
let tenant = mgr::get_tenant(tenant_id, false).await?; let tenant = mgr::get_tenant(tenant_id, false)?;
let response = HashMap::from([ let response = HashMap::from([
( (
@@ -1078,9 +1149,10 @@ async fn put_tenant_location_config_handler(
mut request: Request<Body>, mut request: Request<Body>,
_cancel: CancellationToken, _cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> { ) -> Result<Response<Body>, ApiError> {
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
let request_data: TenantLocationConfigRequest = json_request(&mut request).await?; let request_data: TenantLocationConfigRequest = json_request(&mut request).await?;
let tenant_id = request_data.tenant_id; check_permission(&request, Some(tenant_shard_id.tenant_id))?;
check_permission(&request, Some(tenant_id))?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Warn); let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Warn);
let state = get_state(&request); let state = get_state(&request);
@@ -1089,12 +1161,16 @@ async fn put_tenant_location_config_handler(
// The `Detached` state is special, it doesn't upsert a tenant, it removes // The `Detached` state is special, it doesn't upsert a tenant, it removes
// its local disk content and drops it from memory. // its local disk content and drops it from memory.
if let LocationConfigMode::Detached = request_data.config.mode { if let LocationConfigMode::Detached = request_data.config.mode {
if let Err(e) = mgr::detach_tenant(conf, tenant_id, true, &state.deletion_queue_client) if let Err(e) =
.instrument(info_span!("tenant_detach", %tenant_id)) mgr::detach_tenant(conf, tenant_shard_id, true, &state.deletion_queue_client)
.await .instrument(info_span!("tenant_detach",
tenant_id = %tenant_shard_id.tenant_id,
shard = tenant_shard_id.shard_slug()
))
.await
{ {
match e { match e {
TenantStateError::NotFound(_) => { TenantStateError::SlotError(TenantSlotError::NotFound(_)) => {
// This API is idempotent: a NotFound on a detach is fine. // This API is idempotent: a NotFound on a detach is fine.
} }
_ => return Err(e.into()), _ => return Err(e.into()),
@@ -1106,20 +1182,14 @@ async fn put_tenant_location_config_handler(
let location_conf = let location_conf =
LocationConf::try_from(&request_data.config).map_err(ApiError::BadRequest)?; LocationConf::try_from(&request_data.config).map_err(ApiError::BadRequest)?;
mgr::upsert_location( state
state.conf, .tenant_manager
tenant_id, .upsert_location(tenant_shard_id, location_conf, &ctx)
location_conf, .await
state.broker_client.clone(), // TODO: badrequest assumes the caller was asking for something unreasonable, but in
state.remote_storage.clone(), // principle we might have hit something like concurrent API calls to the same tenant,
state.deletion_queue_client.clone(), // which is not a 400 but a 409.
&ctx, .map_err(ApiError::BadRequest)?;
)
.await
// TODO: badrequest assumes the caller was asking for something unreasonable, but in
// principle we might have hit something like concurrent API calls to the same tenant,
// which is not a 400 but a 409.
.map_err(ApiError::BadRequest)?;
json_response(StatusCode::OK, ()) json_response(StatusCode::OK, ())
} }
@@ -1132,7 +1202,6 @@ async fn handle_tenant_break(
let tenant_id: TenantId = parse_request_param(&r, "tenant_id")?; let tenant_id: TenantId = parse_request_param(&r, "tenant_id")?;
let tenant = crate::tenant::mgr::get_tenant(tenant_id, true) let tenant = crate::tenant::mgr::get_tenant(tenant_id, true)
.await
.map_err(|_| ApiError::Conflict(String::from("no active tenant found")))?; .map_err(|_| ApiError::Conflict(String::from("no active tenant found")))?;
tenant.set_broken("broken from test".to_owned()).await; tenant.set_broken("broken from test".to_owned()).await;
@@ -1425,7 +1494,7 @@ async fn timeline_collect_keyspace(
let keys = timeline let keys = timeline
.collect_keyspace(at_lsn, &ctx) .collect_keyspace(at_lsn, &ctx)
.await .await
.map_err(ApiError::InternalServerError)?; .map_err(|e| ApiError::InternalServerError(e.into()))?;
json_response(StatusCode::OK, Partitioning { keys, at_lsn }) json_response(StatusCode::OK, Partitioning { keys, at_lsn })
} }
@@ -1437,7 +1506,7 @@ async fn active_timeline_of_active_tenant(
tenant_id: TenantId, tenant_id: TenantId,
timeline_id: TimelineId, timeline_id: TimelineId,
) -> Result<Arc<Timeline>, ApiError> { ) -> Result<Arc<Timeline>, ApiError> {
let tenant = mgr::get_tenant(tenant_id, true).await?; let tenant = mgr::get_tenant(tenant_id, true)?;
tenant tenant
.get_timeline(timeline_id, true) .get_timeline(timeline_id, true)
.map_err(|e| ApiError::NotFound(e.into())) .map_err(|e| ApiError::NotFound(e.into()))
@@ -1614,6 +1683,8 @@ where
); );
match handle.await { match handle.await {
// TODO: never actually return Err from here, always Ok(...) so that we can log
// spanned errors. Call api_error_handler instead and return appropriate Body.
Ok(result) => result, Ok(result) => result,
Err(e) => { Err(e) => {
// The handler task panicked. We have a global panic handler that logs the // The handler task panicked. We have a global panic handler that logs the
@@ -1662,7 +1733,7 @@ where
pub fn make_router( pub fn make_router(
state: Arc<State>, state: Arc<State>,
launch_ts: &'static LaunchTimestamp, launch_ts: &'static LaunchTimestamp,
auth: Option<Arc<JwtAuth>>, auth: Option<Arc<SwappableJwtAuth>>,
) -> anyhow::Result<RouterBuilder<hyper::Body, ApiError>> { ) -> anyhow::Result<RouterBuilder<hyper::Body, ApiError>> {
let spec = include_bytes!("openapi_spec.yml"); let spec = include_bytes!("openapi_spec.yml");
let mut router = attach_openapi_ui(endpoint::make_router(), spec, "/swagger.yml", "/v1/doc"); let mut router = attach_openapi_ui(endpoint::make_router(), spec, "/swagger.yml", "/v1/doc");
@@ -1691,10 +1762,13 @@ pub fn make_router(
.put("/v1/failpoints", |r| { .put("/v1/failpoints", |r| {
testing_api_handler("manage failpoints", r, failpoints_handler) testing_api_handler("manage failpoints", r, failpoints_handler)
}) })
.post("/v1/reload_auth_validation_keys", |r| {
api_handler(r, reload_auth_validation_keys_handler)
})
.get("/v1/tenant", |r| api_handler(r, tenant_list_handler)) .get("/v1/tenant", |r| api_handler(r, tenant_list_handler))
.post("/v1/tenant", |r| api_handler(r, tenant_create_handler)) .post("/v1/tenant", |r| api_handler(r, tenant_create_handler))
.get("/v1/tenant/:tenant_id", |r| api_handler(r, tenant_status)) .get("/v1/tenant/:tenant_id", |r| api_handler(r, tenant_status))
.delete("/v1/tenant/:tenant_id", |r| { .delete("/v1/tenant/:tenant_shard_id", |r| {
api_handler(r, tenant_delete_handler) api_handler(r, tenant_delete_handler)
}) })
.get("/v1/tenant/:tenant_id/synthetic_size", |r| { .get("/v1/tenant/:tenant_id/synthetic_size", |r| {
@@ -1706,13 +1780,13 @@ pub fn make_router(
.get("/v1/tenant/:tenant_id/config", |r| { .get("/v1/tenant/:tenant_id/config", |r| {
api_handler(r, get_tenant_config_handler) api_handler(r, get_tenant_config_handler)
}) })
.put("/v1/tenant/:tenant_id/location_config", |r| { .put("/v1/tenant/:tenant_shard_id/location_config", |r| {
api_handler(r, put_tenant_location_config_handler) api_handler(r, put_tenant_location_config_handler)
}) })
.get("/v1/tenant/:tenant_id/timeline", |r| { .get("/v1/tenant/:tenant_id/timeline", |r| {
api_handler(r, timeline_list_handler) api_handler(r, timeline_list_handler)
}) })
.post("/v1/tenant/:tenant_id/timeline", |r| { .post("/v1/tenant/:tenant_shard_id/timeline", |r| {
api_handler(r, timeline_create_handler) api_handler(r, timeline_create_handler)
}) })
.post("/v1/tenant/:tenant_id/attach", |r| { .post("/v1/tenant/:tenant_id/attach", |r| {
@@ -1756,7 +1830,7 @@ pub fn make_router(
"/v1/tenant/:tenant_id/timeline/:timeline_id/download_remote_layers", "/v1/tenant/:tenant_id/timeline/:timeline_id/download_remote_layers",
|r| api_handler(r, timeline_download_remote_layers_handler_get), |r| api_handler(r, timeline_download_remote_layers_handler_get),
) )
.delete("/v1/tenant/:tenant_id/timeline/:timeline_id", |r| { .delete("/v1/tenant/:tenant_shard_id/timeline/:timeline_id", |r| {
api_handler(r, timeline_delete_handler) api_handler(r, timeline_delete_handler)
}) })
.get("/v1/tenant/:tenant_id/timeline/:timeline_id/layer", |r| { .get("/v1/tenant/:tenant_id/timeline/:timeline_id/layer", |r| {
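The PR summary above notes that `TenantsMap` became a `BTreeMap` keyed by a shard-aware id precisely so that all shards of one tenant can be found with a range lookup. A minimal sketch of that idea, with hypothetical stand-in types (not the pageserver's actual `TenantShardId` definition):

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for the real TenantId (a 128-bit identifier).
type TenantId = u128;

// Ordering by tenant_id first keeps all shards of one tenant contiguous.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct TenantShardId {
    tenant_id: TenantId,
    shard_number: u8,
}

// Fetch every shard of a tenant with a single BTreeMap range query.
fn shards_of(map: &BTreeMap<TenantShardId, String>, tenant_id: TenantId) -> Vec<&String> {
    let lo = TenantShardId { tenant_id, shard_number: 0 };
    let hi = TenantShardId { tenant_id, shard_number: u8::MAX };
    map.range(lo..=hi).map(|(_, v)| v).collect()
}

fn main() {
    let mut map = BTreeMap::new();
    map.insert(TenantShardId { tenant_id: 7, shard_number: 0 }, "shard 0".to_string());
    map.insert(TenantShardId { tenant_id: 7, shard_number: 1 }, "shard 1".to_string());
    map.insert(TenantShardId { tenant_id: 9, shard_number: 0 }, "other tenant".to_string());
    assert_eq!(shards_of(&map, 7).len(), 2);
    assert_eq!(shards_of(&map, 9).len(), 1);
}
```

This is why a `HashMap` no longer suffices: hashing scatters the shards, while a sorted map keeps them adjacent for exactly the kind of "which shards of tenant X live here" query the routing layer needs.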

View File

@@ -1,3 +1,5 @@
+#![deny(clippy::undocumented_unsafe_blocks)]
 mod auth;
 pub mod basebackup;
 pub mod config;
@@ -61,14 +63,6 @@ pub async fn shutdown_pageserver(deletion_queue: Option<DeletionQueue>, exit_cod
     )
     .await;
-    // Shut down any page service tasks.
-    timed(
-        task_mgr::shutdown_tasks(Some(TaskKind::PageRequestHandler), None, None),
-        "shutdown PageRequestHandlers",
-        Duration::from_secs(1),
-    )
-    .await;
     // Shut down all the tenants. This flushes everything to disk and kills
     // the checkpoint and GC tasks.
     timed(
@@ -78,6 +72,15 @@ pub async fn shutdown_pageserver(deletion_queue: Option<DeletionQueue>, exit_cod
     )
     .await;
+    // Shut down any page service tasks: any in-progress work for particular timelines or tenants
+    // should already have been canclled via mgr::shutdown_all_tenants
+    timed(
+        task_mgr::shutdown_tasks(Some(TaskKind::PageRequestHandler), None, None),
+        "shutdown PageRequestHandlers",
+        Duration::from_secs(1),
+    )
+    .await;
     // Best effort to persist any outstanding deletions, to avoid leaking objects
     if let Some(mut deletion_queue) = deletion_queue {
         deletion_queue.shutdown(Duration::from_secs(5)).await;
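The hunk above reorders shutdown so tenants drain before page-service tasks, with each phase bounded by `timed(...)`. The general pattern of giving every shutdown phase a deadline so that one stuck phase cannot hang the whole sequence can be sketched with std-only primitives (`timed_phase` here is a hypothetical helper, not the pageserver's actual `timed`):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run `phase` on a worker thread and wait at most `deadline` for it to finish.
// Returns true if the phase completed in time, false if we gave up waiting.
fn timed_phase(name: &str, deadline: Duration, phase: impl FnOnce() + Send + 'static) -> bool {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        phase();
        let _ = tx.send(()); // receiver may have given up already; ignore the error
    });
    match rx.recv_timeout(deadline) {
        Ok(()) => true,
        Err(_) => {
            eprintln!("shutdown phase {name:?} did not finish in {deadline:?}, continuing");
            false
        }
    }
}

fn main() {
    // Phases run strictly in order; a slow phase delays but never blocks forever.
    assert!(timed_phase("shutdown tenants", Duration::from_secs(1), || {}));
    assert!(!timed_phase("shutdown PageRequestHandlers", Duration::from_millis(10), || {
        thread::sleep(Duration::from_secs(5));
    }));
}
```

The ordering matters because per-timeline work is cancelled as part of tenant shutdown, so by the time the page-service tasks are reaped they should already be idle.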

View File

@@ -962,6 +962,32 @@ static REMOTE_TIMELINE_CLIENT_BYTES_FINISHED_COUNTER: Lazy<IntCounterVec> = Lazy
     .expect("failed to define a metric")
 });
+pub(crate) struct TenantManagerMetrics {
+    pub(crate) tenant_slots: UIntGauge,
+    pub(crate) tenant_slot_writes: IntCounter,
+    pub(crate) unexpected_errors: IntCounter,
+}
+pub(crate) static TENANT_MANAGER: Lazy<TenantManagerMetrics> = Lazy::new(|| {
+    TenantManagerMetrics {
+        tenant_slots: register_uint_gauge!(
+            "pageserver_tenant_manager_slots",
+            "How many slots currently exist, including all attached, secondary and in-progress operations",
+        )
+        .expect("failed to define a metric"),
+        tenant_slot_writes: register_int_counter!(
+            "pageserver_tenant_manager_slot_writes",
+            "Writes to a tenant slot, including all of create/attach/detach/delete"
+        )
+        .expect("failed to define a metric"),
+        unexpected_errors: register_int_counter!(
+            "pageserver_tenant_manager_unexpected_errors_total",
+            "Number of unexpected conditions encountered: nonzero value indicates a non-fatal bug."
+        )
+        .expect("failed to define a metric"),
+    }
+});
 pub(crate) struct DeletionQueueMetrics {
     pub(crate) keys_submitted: IntCounter,
     pub(crate) keys_dropped: IntCounter,
@@ -1199,15 +1225,6 @@ pub(crate) static WAL_REDO_TIME: Lazy<Histogram> = Lazy::new(|| {
     .expect("failed to define a metric")
 });
-pub(crate) static WAL_REDO_WAIT_TIME: Lazy<Histogram> = Lazy::new(|| {
-    register_histogram!(
-        "pageserver_wal_redo_wait_seconds",
-        "Time spent waiting for access to the Postgres WAL redo process",
-        redo_histogram_time_buckets!(),
-    )
-    .expect("failed to define a metric")
-});
 pub(crate) static WAL_REDO_RECORDS_HISTOGRAM: Lazy<Histogram> = Lazy::new(|| {
     register_histogram!(
         "pageserver_wal_redo_records_histogram",
@@ -1235,6 +1252,46 @@ pub(crate) static WAL_REDO_RECORD_COUNTER: Lazy<IntCounter> = Lazy::new(|| {
     .unwrap()
 });
+pub(crate) struct WalRedoProcessCounters {
+    pub(crate) started: IntCounter,
+    pub(crate) killed_by_cause: enum_map::EnumMap<WalRedoKillCause, IntCounter>,
+}
+#[derive(Debug, enum_map::Enum, strum_macros::IntoStaticStr)]
+pub(crate) enum WalRedoKillCause {
+    WalRedoProcessDrop,
+    NoLeakChildDrop,
+    Startup,
+}
+impl Default for WalRedoProcessCounters {
+    fn default() -> Self {
+        let started = register_int_counter!(
+            "pageserver_wal_redo_process_started_total",
+            "Number of WAL redo processes started",
+        )
+        .unwrap();
+        let killed = register_int_counter_vec!(
+            "pageserver_wal_redo_process_stopped_total",
+            "Number of WAL redo processes stopped",
+            &["cause"],
+        )
+        .unwrap();
+        Self {
+            started,
+            killed_by_cause: EnumMap::from_array(std::array::from_fn(|i| {
+                let cause = <WalRedoKillCause as enum_map::Enum>::from_usize(i);
+                let cause_str: &'static str = cause.into();
+                killed.with_label_values(&[cause_str])
+            })),
+        }
+    }
+}
+pub(crate) static WAL_REDO_PROCESS_COUNTERS: Lazy<WalRedoProcessCounters> =
+    Lazy::new(WalRedoProcessCounters::default);
 /// Similar to `prometheus::HistogramTimer` but does not record on drop.
 pub struct StorageTimeMetricsTimer {
     metrics: StorageTimeMetrics,
@@ -1884,6 +1941,9 @@ pub fn preinitialize_metrics() {
     // Deletion queue stats
     Lazy::force(&DELETION_QUEUE);
+    // Tenant manager stats
+    Lazy::force(&TENANT_MANAGER);
     // countervecs
     [&BACKGROUND_LOOP_PERIOD_OVERRUN_COUNT]
         .into_iter()
@@ -1899,7 +1959,6 @@ pub fn preinitialize_metrics() {
         &READ_NUM_FS_LAYERS,
         &WAIT_LSN_TIME,
         &WAL_REDO_TIME,
-        &WAL_REDO_WAIT_TIME,
         &WAL_REDO_RECORDS_HISTOGRAM,
         &WAL_REDO_BYTES_HISTOGRAM,
     ]
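`WalRedoProcessCounters` above pre-resolves one labeled child counter per kill cause at construction time, so the hot path does an array index instead of a label-string lookup. The same technique can be sketched std-only, with atomics standing in for the `prometheus` counters and a plain array standing in for `enum_map::EnumMap` (all names here are illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative stand-in for WalRedoKillCause.
#[derive(Clone, Copy)]
enum KillCause {
    ProcessDrop = 0,
    NoLeakChildDrop = 1,
    Startup = 2,
}
const NUM_CAUSES: usize = 3;

// One pre-created counter per enum variant, resolved once at construction.
struct KillCounters([AtomicU64; NUM_CAUSES]);

impl KillCounters {
    fn new() -> Self {
        KillCounters(std::array::from_fn(|_| AtomicU64::new(0)))
    }
    // Hot path: a plain array index, no hash or label lookup.
    fn inc(&self, cause: KillCause) {
        self.0[cause as usize].fetch_add(1, Ordering::Relaxed);
    }
    fn get(&self, cause: KillCause) -> u64 {
        self.0[cause as usize].load(Ordering::Relaxed)
    }
}

fn main() {
    let counters = KillCounters::new();
    counters.inc(KillCause::Startup);
    counters.inc(KillCause::Startup);
    counters.inc(KillCause::ProcessDrop);
    assert_eq!(counters.get(KillCause::Startup), 2);
    assert_eq!(counters.get(KillCause::ProcessDrop), 1);
    assert_eq!(counters.get(KillCause::NoLeakChildDrop), 0);
}
```

Pre-binding label values also guarantees every series exists from startup with a zero value, which is the same motivation as the `preinitialize_metrics` / `Lazy::force` calls in the hunk above.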

View File

@@ -40,7 +40,7 @@ use tracing::field;
 use tracing::*;
 use utils::id::ConnectionId;
 use utils::{
-    auth::{Claims, JwtAuth, Scope},
+    auth::{Claims, Scope, SwappableJwtAuth},
     id::{TenantId, TimelineId},
     lsn::Lsn,
     simple_rcu::RcuReadGuard,
@@ -55,16 +55,20 @@ use crate::metrics;
 use crate::metrics::LIVE_CONNECTIONS_COUNT;
 use crate::task_mgr;
 use crate::task_mgr::TaskKind;
+use crate::tenant;
 use crate::tenant::debug_assert_current_span_has_tenant_and_timeline_id;
 use crate::tenant::mgr;
-use crate::tenant::mgr::GetTenantError;
-use crate::tenant::{Tenant, Timeline};
+use crate::tenant::mgr::get_active_tenant_with_timeout;
+use crate::tenant::mgr::GetActiveTenantError;
+use crate::tenant::Timeline;
 use crate::trace::Tracer;
 use postgres_ffi::pg_constants::DEFAULTTABLESPACE_OID;
 use postgres_ffi::BLCKSZ;
+// How long we may block waiting for a [`TenantSlot::InProgress`]` and/or a [`Tenant`] which
+// is not yet in state [`TenantState::Active`].
+const ACTIVE_TENANT_TIMEOUT: Duration = Duration::from_millis(5000);
 /// Read the end of a tar archive.
 ///
 /// A tar archive normally ends with two consecutive blocks of zeros, 512 bytes each.
@@ -118,7 +122,7 @@ async fn read_tar_eof(mut reader: (impl AsyncRead + Unpin)) -> anyhow::Result<()
 pub async fn libpq_listener_main(
     conf: &'static PageServerConf,
     broker_client: storage_broker::BrokerClientChannel,
-    auth: Option<Arc<JwtAuth>>,
+    auth: Option<Arc<SwappableJwtAuth>>,
     listener: TcpListener,
     auth_type: AuthType,
     listener_ctx: RequestContext,
@@ -186,7 +190,7 @@ pub async fn libpq_listener_main(
 async fn page_service_conn_main(
     conf: &'static PageServerConf,
     broker_client: storage_broker::BrokerClientChannel,
-    auth: Option<Arc<JwtAuth>>,
+    auth: Option<Arc<SwappableJwtAuth>>,
     socket: tokio::net::TcpStream,
     auth_type: AuthType,
     connection_ctx: RequestContext,
@@ -214,22 +218,34 @@ async fn page_service_conn_main(
     // no write timeout is used, because the kernel is assumed to error writes after some time.
     let mut socket = tokio_io_timeout::TimeoutReader::new(socket);
-    // timeout should be lower, but trying out multiple days for
-    // <https://github.com/neondatabase/neon/issues/4205>
-    socket.set_timeout(Some(std::time::Duration::from_secs(60 * 60 * 24 * 3)));
+    let default_timeout_ms = 10 * 60 * 1000; // 10 minutes by default
+    let socket_timeout_ms = (|| {
+        fail::fail_point!("simulated-bad-compute-connection", |avg_timeout_ms| {
+            // Exponential distribution for simulating
+            // poor network conditions, expect about avg_timeout_ms to be around 15
+            // in tests
+            if let Some(avg_timeout_ms) = avg_timeout_ms {
+                let avg = avg_timeout_ms.parse::<i64>().unwrap() as f32;
+                let u = rand::random::<f32>();
+                ((1.0 - u).ln() / (-avg)) as u64
+            } else {
+                default_timeout_ms
+            }
+        });
+        default_timeout_ms
+    })();
+    // A timeout here does not mean the client died, it can happen if it's just idle for
+    // a while: we will tear down this PageServerHandler and instantiate a new one if/when
+    // they reconnect.
+    socket.set_timeout(Some(std::time::Duration::from_millis(socket_timeout_ms)));
     let socket = std::pin::pin!(socket);
     // XXX: pgbackend.run() should take the connection_ctx,
     // and create a child per-query context when it invokes process_query.
     // But it's in a shared crate, so, we store connection_ctx inside PageServerHandler
     // and create the per-query context in process_query ourselves.
-    let mut conn_handler = PageServerHandler::new(
-        conf,
-        broker_client,
-        auth,
-        connection_ctx,
-        task_mgr::shutdown_token(),
-    );
+    let mut conn_handler = PageServerHandler::new(conf, broker_client, auth, connection_ctx);
     let pgbackend = PostgresBackend::new_from_io(socket, peer_addr, auth_type, None)?;
     match pgbackend
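The failpoint in the hunk above draws a simulated timeout from an exponential distribution by inverse-transform sampling: if `u` is uniform on [0, 1), then `-ln(1 - u) / λ` is exponentially distributed with rate `λ` (mean `1/λ`). A small std-only sketch of that transform, written with the mean as the parameter and deterministic inputs in place of `rand`:

```rust
// Inverse-CDF (inverse-transform) sampling of an exponential distribution:
// for u uniform in [0, 1), x = -ln(1 - u) * mean is exponential with that mean.
fn exp_sample(u: f64, mean: f64) -> f64 {
    -(1.0 - u).ln() * mean
}

fn main() {
    let mean = 15.0;
    // The median of an exponential is mean * ln 2 (about 0.693 * mean).
    let median = exp_sample(0.5, mean);
    assert!((median - mean * std::f64::consts::LN_2).abs() < 1e-9);
    // The transform is monotone: larger u maps to a larger sample.
    assert!(exp_sample(0.9, mean) > exp_sample(0.1, mean));
    // u = 0 maps to 0: no timeout shorter than zero.
    assert_eq!(exp_sample(0.0, mean), 0.0);
    println!("median sample at u=0.5: {median:.3}");
}
```

An exponential distribution is a natural choice for simulating flaky connections because most simulated timeouts are short while a long tail of slow drops still occurs.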
@@ -255,7 +271,7 @@ async fn page_service_conn_main(
struct PageServerHandler { struct PageServerHandler {
_conf: &'static PageServerConf, _conf: &'static PageServerConf,
broker_client: storage_broker::BrokerClientChannel, broker_client: storage_broker::BrokerClientChannel,
auth: Option<Arc<JwtAuth>>, auth: Option<Arc<SwappableJwtAuth>>,
claims: Option<Claims>, claims: Option<Claims>,
/// The context created for the lifetime of the connection /// The context created for the lifetime of the connection
@@ -263,19 +279,14 @@ struct PageServerHandler {
/// For each query received over the connection, /// For each query received over the connection,
/// `process_query` creates a child context from this one. /// `process_query` creates a child context from this one.
connection_ctx: RequestContext, connection_ctx: RequestContext,
/// A token that should fire when the tenant transitions from
/// attached state, or when the pageserver is shutting down.
cancel: CancellationToken,
} }
impl PageServerHandler { impl PageServerHandler {
pub fn new( pub fn new(
conf: &'static PageServerConf, conf: &'static PageServerConf,
broker_client: storage_broker::BrokerClientChannel, broker_client: storage_broker::BrokerClientChannel,
auth: Option<Arc<JwtAuth>>, auth: Option<Arc<SwappableJwtAuth>>,
connection_ctx: RequestContext, connection_ctx: RequestContext,
cancel: CancellationToken,
) -> Self { ) -> Self {
PageServerHandler { PageServerHandler {
_conf: conf, _conf: conf,
@@ -283,7 +294,6 @@ impl PageServerHandler {
auth, auth,
claims: None, claims: None,
connection_ctx, connection_ctx,
cancel,
} }
} }
@@ -291,7 +301,11 @@ impl PageServerHandler {
/// this rather than naked flush() in order to shut down promptly. Without this, we would /// this rather than naked flush() in order to shut down promptly. Without this, we would
/// block shutdown of a tenant if a postgres client was failing to consume bytes we send /// block shutdown of a tenant if a postgres client was failing to consume bytes we send
/// in the flush. /// in the flush.
async fn flush_cancellable<IO>(&self, pgb: &mut PostgresBackend<IO>) -> Result<(), QueryError> async fn flush_cancellable<IO>(
&self,
pgb: &mut PostgresBackend<IO>,
cancel: &CancellationToken,
) -> Result<(), QueryError>
where where
IO: AsyncRead + AsyncWrite + Send + Sync + Unpin, IO: AsyncRead + AsyncWrite + Send + Sync + Unpin,
{ {
@@ -299,7 +313,7 @@ impl PageServerHandler {
flush_r = pgb.flush() => { flush_r = pgb.flush() => {
Ok(flush_r?) Ok(flush_r?)
}, },
_ = self.cancel.cancelled() => { _ = cancel.cancelled() => {
Err(QueryError::Shutdown) Err(QueryError::Shutdown)
} }
) )
@@ -308,6 +322,7 @@ impl PageServerHandler {
fn copyin_stream<'a, IO>( fn copyin_stream<'a, IO>(
&'a self, &'a self,
pgb: &'a mut PostgresBackend<IO>, pgb: &'a mut PostgresBackend<IO>,
cancel: &'a CancellationToken,
) -> impl Stream<Item = io::Result<Bytes>> + 'a ) -> impl Stream<Item = io::Result<Bytes>> + 'a
where where
IO: AsyncRead + AsyncWrite + Send + Sync + Unpin, IO: AsyncRead + AsyncWrite + Send + Sync + Unpin,
@@ -317,7 +332,7 @@ impl PageServerHandler {
let msg = tokio::select! { let msg = tokio::select! {
biased; biased;
_ = self.cancel.cancelled() => { _ = cancel.cancelled() => {
// We were requested to shut down. // We were requested to shut down.
let msg = "pageserver is shutting down"; let msg = "pageserver is shutting down";
let _ = pgb.write_message_noflush(&BeMessage::ErrorResponse(msg, None)); let _ = pgb.write_message_noflush(&BeMessage::ErrorResponse(msg, None));
@@ -357,7 +372,7 @@ impl PageServerHandler {
let query_error = QueryError::Disconnected(ConnectionError::Io(io::Error::new(io::ErrorKind::ConnectionReset, msg))); let query_error = QueryError::Disconnected(ConnectionError::Io(io::Error::new(io::ErrorKind::ConnectionReset, msg)));
// error can't happen here, ErrorResponse serialization should be always ok // error can't happen here, ErrorResponse serialization should be always ok
pgb.write_message_noflush(&BeMessage::ErrorResponse(msg, Some(query_error.pg_error_code()))).map_err(|e| e.into_io_error())?; pgb.write_message_noflush(&BeMessage::ErrorResponse(msg, Some(query_error.pg_error_code()))).map_err(|e| e.into_io_error())?;
self.flush_cancellable(pgb).await.map_err(|e| io::Error::new(io::ErrorKind::Other, e.to_string()))?; self.flush_cancellable(pgb, cancel).await.map_err(|e| io::Error::new(io::ErrorKind::Other, e.to_string()))?;
Err(io::Error::new(io::ErrorKind::ConnectionReset, msg))?; Err(io::Error::new(io::ErrorKind::ConnectionReset, msg))?;
} }
Err(QueryError::Disconnected(ConnectionError::Io(io_error))) => { Err(QueryError::Disconnected(ConnectionError::Io(io_error))) => {
@@ -384,12 +399,13 @@ impl PageServerHandler {
{ {
debug_assert_current_span_has_tenant_and_timeline_id(); debug_assert_current_span_has_tenant_and_timeline_id();
// NOTE: pagerequests handler exits when connection is closed,
// so there is no need to reset the association
task_mgr::associate_with(Some(tenant_id), Some(timeline_id));
// Make request tracer if needed // Make request tracer if needed
let tenant = get_active_tenant_with_timeout(tenant_id, &ctx).await?; let tenant = mgr::get_active_tenant_with_timeout(
tenant_id,
ACTIVE_TENANT_TIMEOUT,
&task_mgr::shutdown_token(),
)
.await?;
let mut tracer = if tenant.get_trace_read_requests() { let mut tracer = if tenant.get_trace_read_requests() {
let connection_id = ConnectionId::generate(); let connection_id = ConnectionId::generate();
let path = tenant let path = tenant
@@ -405,9 +421,14 @@ impl PageServerHandler {
.get_timeline(timeline_id, true) .get_timeline(timeline_id, true)
.map_err(|e| anyhow::anyhow!(e))?; .map_err(|e| anyhow::anyhow!(e))?;
// Avoid starting new requests if the timeline has already started shutting down,
// and block timeline shutdown until this request is complete, or drops out due
// to cancellation.
let _timeline_guard = timeline.gate.enter().map_err(|_| QueryError::Shutdown)?;
// switch client to COPYBOTH // switch client to COPYBOTH
pgb.write_message_noflush(&BeMessage::CopyBothResponse)?; pgb.write_message_noflush(&BeMessage::CopyBothResponse)?;
self.flush_cancellable(pgb).await?; self.flush_cancellable(pgb, &timeline.cancel).await?;
let metrics = metrics::SmgrQueryTimePerTimeline::new(&tenant_id, &timeline_id); let metrics = metrics::SmgrQueryTimePerTimeline::new(&tenant_id, &timeline_id);
@@ -415,7 +436,7 @@ impl PageServerHandler {
let msg = tokio::select! { let msg = tokio::select! {
biased; biased;
_ = self.cancel.cancelled() => { _ = timeline.cancel.cancelled() => {
// We were requested to shut down. // We were requested to shut down.
info!("shutdown request received in page handler"); info!("shutdown request received in page handler");
return Err(QueryError::Shutdown) return Err(QueryError::Shutdown)
@@ -490,9 +511,24 @@ impl PageServerHandler {
} }
}; };
if let Err(e) = &response {
// Requests may fail as soon as we are Stopping, even if the Timeline's cancellation token wasn't fired yet,
// because wait_lsn etc will drop out
// is_stopping(): [`Timeline::flush_and_shutdown`] has entered
// is_canceled(): [`Timeline::shutdown`]` has entered
if timeline.cancel.is_cancelled() || timeline.is_stopping() {
// If we fail to fulfil a request during shutdown, which may be _because_ of
// shutdown, then do not send the error to the client. Instead just drop the
// connection.
span.in_scope(|| info!("dropped response during shutdown: {e:#}"));
return Err(QueryError::Shutdown);
}
}
let response = response.unwrap_or_else(|e| { let response = response.unwrap_or_else(|e| {
// print the all details to the log with {:#}, but for the client the // print the all details to the log with {:#}, but for the client the
// error message is enough // error message is enough. Do not log if shutting down, as the anyhow::Error
// here includes cancellation which is not an error.
span.in_scope(|| error!("error reading relation or page version: {:#}", e)); span.in_scope(|| error!("error reading relation or page version: {:#}", e));
PagestreamBeMessage::Error(PagestreamErrorResponse { PagestreamBeMessage::Error(PagestreamErrorResponse {
message: e.to_string(), message: e.to_string(),
@@ -500,7 +536,7 @@ impl PageServerHandler {
}); });
pgb.write_message_noflush(&BeMessage::CopyData(&response.serialize()))?; pgb.write_message_noflush(&BeMessage::CopyData(&response.serialize()))?;
self.flush_cancellable(pgb).await?; self.flush_cancellable(pgb, &timeline.cancel).await?;
} }
Ok(()) Ok(())
} }
@@ -522,10 +558,14 @@ impl PageServerHandler {
{ {
debug_assert_current_span_has_tenant_and_timeline_id(); debug_assert_current_span_has_tenant_and_timeline_id();
task_mgr::associate_with(Some(tenant_id), Some(timeline_id));
// Create empty timeline // Create empty timeline
info!("creating new timeline"); info!("creating new timeline");
let tenant = get_active_tenant_with_timeout(tenant_id, &ctx).await?; let tenant = get_active_tenant_with_timeout(
tenant_id,
ACTIVE_TENANT_TIMEOUT,
&task_mgr::shutdown_token(),
)
.await?;
let timeline = tenant let timeline = tenant
.create_empty_timeline(timeline_id, base_lsn, pg_version, &ctx) .create_empty_timeline(timeline_id, base_lsn, pg_version, &ctx)
.await?; .await?;
@@ -543,9 +583,9 @@ impl PageServerHandler {
// Import basebackup provided via CopyData // Import basebackup provided via CopyData
info!("importing basebackup"); info!("importing basebackup");
pgb.write_message_noflush(&BeMessage::CopyInResponse)?; pgb.write_message_noflush(&BeMessage::CopyInResponse)?;
self.flush_cancellable(pgb).await?; self.flush_cancellable(pgb, &tenant.cancel).await?;
let mut copyin_reader = pin!(StreamReader::new(self.copyin_stream(pgb))); let mut copyin_reader = pin!(StreamReader::new(self.copyin_stream(pgb, &tenant.cancel)));
timeline timeline
.import_basebackup_from_tar( .import_basebackup_from_tar(
&mut copyin_reader, &mut copyin_reader,
@@ -582,9 +622,10 @@ impl PageServerHandler
         IO: AsyncRead + AsyncWrite + Send + Sync + Unpin,
     {
         debug_assert_current_span_has_tenant_and_timeline_id();
-        task_mgr::associate_with(Some(tenant_id), Some(timeline_id));

-        let timeline = get_active_tenant_timeline(tenant_id, timeline_id, &ctx).await?;
+        let timeline = self
+            .get_active_tenant_timeline(tenant_id, timeline_id)
+            .await?;
         let last_record_lsn = timeline.get_last_record_lsn();
         if last_record_lsn != start_lsn {
             return Err(QueryError::Other(
@@ -598,8 +639,8 @@ impl PageServerHandler
         // Import wal provided via CopyData
         info!("importing wal");
         pgb.write_message_noflush(&BeMessage::CopyInResponse)?;
-        self.flush_cancellable(pgb).await?;
+        self.flush_cancellable(pgb, &timeline.cancel).await?;

-        let mut copyin_reader = pin!(StreamReader::new(self.copyin_stream(pgb)));
+        let mut copyin_reader = pin!(StreamReader::new(self.copyin_stream(pgb, &timeline.cancel)));
         import_wal_from_tar(&timeline, &mut copyin_reader, start_lsn, end_lsn, &ctx).await?;
         info!("wal import complete");
@@ -792,7 +833,9 @@ impl PageServerHandler
         let started = std::time::Instant::now();

         // check that the timeline exists
-        let timeline = get_active_tenant_timeline(tenant_id, timeline_id, &ctx).await?;
+        let timeline = self
+            .get_active_tenant_timeline(tenant_id, timeline_id)
+            .await?;
         let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
         if let Some(lsn) = lsn {
             // Backup was requested at a particular LSN. Wait for it to arrive.
@@ -807,7 +850,7 @@ impl PageServerHandler
         // switch client to COPYOUT
         pgb.write_message_noflush(&BeMessage::CopyOutResponse)?;
-        self.flush_cancellable(pgb).await?;
+        self.flush_cancellable(pgb, &timeline.cancel).await?;

         // Send a tarball of the latest layer on the timeline. Compress if not
         // fullbackup. TODO Compress in that case too (tests need to be updated)

@@ -859,7 +902,7 @@ impl PageServerHandler
         }
         pgb.write_message_noflush(&BeMessage::CopyDone)?;
-        self.flush_cancellable(pgb).await?;
+        self.flush_cancellable(pgb, &timeline.cancel).await?;

         let basebackup_after = started
             .elapsed()
@@ -877,7 +920,7 @@ impl PageServerHandler
     // when accessing management api supply None as an argument
     // when using to authorize tenant pass corresponding tenant id
-    fn check_permission(&self, tenant_id: Option<TenantId>) -> anyhow::Result<()> {
+    fn check_permission(&self, tenant_id: Option<TenantId>) -> Result<(), QueryError> {
         if self.auth.is_none() {
             // auth is set to Trust, nothing to check so just return ok
             return Ok(());
@@ -889,7 +932,26 @@ impl PageServerHandler
             .claims
             .as_ref()
             .expect("claims presence already checked");
-        check_permission(claims, tenant_id)
+        check_permission(claims, tenant_id).map_err(|e| QueryError::Unauthorized(e.0))
     }
+
+    /// Shorthand for getting a reference to a Timeline of an Active tenant.
+    async fn get_active_tenant_timeline(
+        &self,
+        tenant_id: TenantId,
+        timeline_id: TimelineId,
+    ) -> Result<Arc<Timeline>, GetActiveTimelineError> {
+        let tenant = get_active_tenant_with_timeout(
+            tenant_id,
+            ACTIVE_TENANT_TIMEOUT,
+            &task_mgr::shutdown_token(),
+        )
+        .await
+        .map_err(GetActiveTimelineError::Tenant)?;
+        let timeline = tenant
+            .get_timeline(timeline_id, true)
+            .map_err(|e| GetActiveTimelineError::Timeline(anyhow::anyhow!(e)))?;
+        Ok(timeline)
+    }
 }
@@ -909,16 +971,17 @@ where
             .auth
             .as_ref()
             .unwrap()
-            .decode(str::from_utf8(jwt_response).context("jwt response is not UTF-8")?)?;
+            .decode(str::from_utf8(jwt_response).context("jwt response is not UTF-8")?)
+            .map_err(|e| QueryError::Unauthorized(e.0))?;

         if matches!(data.claims.scope, Scope::Tenant) && data.claims.tenant_id.is_none() {
-            return Err(QueryError::Other(anyhow::anyhow!(
-                "jwt token scope is Tenant, but tenant id is missing"
-            )));
+            return Err(QueryError::Unauthorized(
+                "jwt token scope is Tenant, but tenant id is missing".into(),
+            ));
         }

-        info!(
-            "jwt auth succeeded for scope: {:#?} by tenant id: {:?}",
+        debug!(
+            "jwt scope check succeeded for scope: {:#?} by tenant id: {:?}",
             data.claims.scope, data.claims.tenant_id,
         );
@@ -940,9 +1003,13 @@ where
         pgb: &mut PostgresBackend<IO>,
         query_string: &str,
     ) -> Result<(), QueryError> {
+        fail::fail_point!("simulated-bad-compute-connection", |_| {
+            info!("Hit failpoint for bad connection");
+            Err(QueryError::SimulatedConnectionError)
+        });
+
         let ctx = self.connection_ctx.attached_child();
         debug!("process query {query_string:?}");

         if query_string.starts_with("pagestream ") {
             let (_, params_raw) = query_string.split_at("pagestream ".len());
             let params = params_raw.split(' ').collect::<Vec<_>>();
@@ -1048,7 +1115,9 @@ where
                 .record("timeline_id", field::display(timeline_id));

             self.check_permission(Some(tenant_id))?;
-            let timeline = get_active_tenant_timeline(tenant_id, timeline_id, &ctx).await?;
+            let timeline = self
+                .get_active_tenant_timeline(tenant_id, timeline_id)
+                .await?;

             let end_of_timeline = timeline.get_last_record_rlsn();
@@ -1232,7 +1301,12 @@ where
             self.check_permission(Some(tenant_id))?;

-            let tenant = get_active_tenant_with_timeout(tenant_id, &ctx).await?;
+            let tenant = get_active_tenant_with_timeout(
+                tenant_id,
+                ACTIVE_TENANT_TIMEOUT,
+                &task_mgr::shutdown_token(),
+            )
+            .await?;
             pgb.write_message_noflush(&BeMessage::RowDescription(&[
                 RowDescriptor::int8_col(b"checkpoint_distance"),
                 RowDescriptor::int8_col(b"checkpoint_timeout"),
@@ -1278,67 +1352,16 @@ where
     }
 }

-#[derive(thiserror::Error, Debug)]
-enum GetActiveTenantError {
-    #[error(
-        "Timed out waiting {wait_time:?} for tenant active state. Latest state: {latest_state:?}"
-    )]
-    WaitForActiveTimeout {
-        latest_state: TenantState,
-        wait_time: Duration,
-    },
-    #[error(transparent)]
-    NotFound(GetTenantError),
-    #[error(transparent)]
-    WaitTenantActive(tenant::WaitToBecomeActiveError),
-}
-
 impl From<GetActiveTenantError> for QueryError {
     fn from(e: GetActiveTenantError) -> Self {
         match e {
             GetActiveTenantError::WaitForActiveTimeout { .. } => QueryError::Disconnected(
                 ConnectionError::Io(io::Error::new(io::ErrorKind::TimedOut, e.to_string())),
             ),
-            GetActiveTenantError::WaitTenantActive(e) => QueryError::Other(anyhow::Error::new(e)),
-            GetActiveTenantError::NotFound(e) => QueryError::Other(anyhow::Error::new(e)),
+            GetActiveTenantError::WillNotBecomeActive(TenantState::Stopping { .. }) => {
+                QueryError::Shutdown
+            }
+            e => QueryError::Other(anyhow::anyhow!(e)),
         }
     }
 }
-
-/// Get active tenant.
-///
-/// If the tenant is Loading, waits for it to become Active, for up to 30 s. That
-/// ensures that queries don't fail immediately after pageserver startup, because
-/// all tenants are still loading.
-async fn get_active_tenant_with_timeout(
-    tenant_id: TenantId,
-    _ctx: &RequestContext, /* require get a context to support cancellation in the future */
-) -> Result<Arc<Tenant>, GetActiveTenantError> {
-    let tenant = match mgr::get_tenant(tenant_id, false).await {
-        Ok(tenant) => tenant,
-        Err(e @ GetTenantError::NotFound(_)) => return Err(GetActiveTenantError::NotFound(e)),
-        Err(GetTenantError::NotActive(_)) => {
-            unreachable!("we're calling get_tenant with active_only=false")
-        }
-        Err(GetTenantError::Broken(_)) => {
-            unreachable!("we're calling get_tenant with active_only=false")
-        }
-    };
-    let wait_time = Duration::from_secs(30);
-    match tokio::time::timeout(wait_time, tenant.wait_to_become_active()).await {
-        Ok(Ok(())) => Ok(tenant),
-        // no .context(), the error message is good enough and some tests depend on it
-        Ok(Err(e)) => Err(GetActiveTenantError::WaitTenantActive(e)),
-        Err(_) => {
-            let latest_state = tenant.current_state();
-            if latest_state == TenantState::Active {
-                Ok(tenant)
-            } else {
-                Err(GetActiveTenantError::WaitForActiveTimeout {
-                    latest_state,
-                    wait_time,
-                })
-            }
-        }
-    }
-}
@@ -1359,18 +1382,3 @@ impl From<GetActiveTimelineError> for QueryError {
         }
     }
 }
-
-/// Shorthand for getting a reference to a Timeline of an Active tenant.
-async fn get_active_tenant_timeline(
-    tenant_id: TenantId,
-    timeline_id: TimelineId,
-    ctx: &RequestContext,
-) -> Result<Arc<Timeline>, GetActiveTimelineError> {
-    let tenant = get_active_tenant_with_timeout(tenant_id, ctx)
-        .await
-        .map_err(GetActiveTimelineError::Tenant)?;
-    let timeline = tenant
-        .get_timeline(timeline_id, true)
-        .map_err(|e| GetActiveTimelineError::Timeline(anyhow::anyhow!(e)))?;
-    Ok(timeline)
-}

View File

@@ -21,8 +21,8 @@ use serde::{Deserialize, Serialize};
 use std::collections::{hash_map, HashMap, HashSet};
 use std::ops::ControlFlow;
 use std::ops::Range;
-use tokio_util::sync::CancellationToken;
 use tracing::{debug, trace, warn};
+use utils::bin_ser::DeserializeError;
 use utils::{bin_ser::BeSer, lsn::Lsn};

 /// Block number within a relation or SLRU. This matches PostgreSQL's BlockNumber type.
@@ -30,9 +30,33 @@ pub type BlockNumber = u32;
 #[derive(Debug)]
 pub enum LsnForTimestamp {
+    /// Found commits both before and after the given timestamp
     Present(Lsn),
+
+    /// Found no commits after the given timestamp, this means
+    /// that the newest data in the branch is older than the given
+    /// timestamp.
+    ///
+    /// All commits <= LSN happened before the given timestamp
     Future(Lsn),
+
+    /// The queried timestamp is past our horizon we look back at (PITR)
+    ///
+    /// All commits > LSN happened after the given timestamp,
+    /// but any commits < LSN might have happened before or after
+    /// the given timestamp. We don't know because no data before
+    /// the given lsn is available.
     Past(Lsn),
+
+    /// We have found no commit with a timestamp,
+    /// so we can't return anything meaningful.
+    ///
+    /// The associated LSN is the lower bound value we can safely
+    /// create branches on, but no statement is made if it is
+    /// older or newer than the timestamp.
+    ///
+    /// This variant can e.g. be returned right after a
+    /// cluster import.
     NoData(Lsn),
 }
@@ -44,6 +68,36 @@ pub enum CalculateLogicalSizeError {
     Other(#[from] anyhow::Error),
 }

+#[derive(Debug, thiserror::Error)]
+pub(crate) enum CollectKeySpaceError {
+    #[error(transparent)]
+    Decode(#[from] DeserializeError),
+    #[error(transparent)]
+    PageRead(PageReconstructError),
+    #[error("cancelled")]
+    Cancelled,
+}
+
+impl From<PageReconstructError> for CollectKeySpaceError {
+    fn from(err: PageReconstructError) -> Self {
+        match err {
+            PageReconstructError::Cancelled => Self::Cancelled,
+            err => Self::PageRead(err),
+        }
+    }
+}
+
+impl From<PageReconstructError> for CalculateLogicalSizeError {
+    fn from(pre: PageReconstructError) -> Self {
+        match pre {
+            PageReconstructError::AncestorStopping(_) | PageReconstructError::Cancelled => {
+                Self::Cancelled
+            }
+            _ => Self::Other(pre.into()),
+        }
+    }
+}
+
 #[derive(Debug, thiserror::Error)]
 pub enum RelationError {
     #[error("Relation Already Exists")]
@@ -314,7 +368,11 @@ impl Timeline {
         ctx: &RequestContext,
     ) -> Result<LsnForTimestamp, PageReconstructError> {
         let gc_cutoff_lsn_guard = self.get_latest_gc_cutoff_lsn();
-        let min_lsn = *gc_cutoff_lsn_guard;
+        // We use this method to figure out the branching LSN for the new branch, but the
+        // GC cutoff could be before the branching point and we cannot create a new branch
+        // with LSN < `ancestor_lsn`. Thus, pick the maximum of these two to be
+        // on the safe side.
+        let min_lsn = std::cmp::max(*gc_cutoff_lsn_guard, self.get_ancestor_lsn());
         let max_lsn = self.get_last_record_lsn();

         // LSNs are always 8-byte aligned. low/mid/high represent the
@@ -344,30 +402,33 @@ impl Timeline {
                 low = mid + 1;
             }
         }
+
+        // If `found_smaller == true`, `low = t + 1` where `t` is the target LSN,
+        // so the LSN of the last commit record before or at `search_timestamp`.
+        // Remove one from `low` to get `t`.
+        //
+        // FIXME: it would be better to get the LSN of the previous commit.
+        // Otherwise, if you restore to the returned LSN, the database will
+        // include physical changes from later commits that will be marked
+        // as aborted, and will need to be vacuumed away.
+        let commit_lsn = Lsn((low - 1) * 8);
         match (found_smaller, found_larger) {
             (false, false) => {
                 // This can happen if no commit records have been processed yet, e.g.
                 // just after importing a cluster.
-                Ok(LsnForTimestamp::NoData(max_lsn))
-            }
-            (true, false) => {
-                // Didn't find any commit timestamps larger than the request
-                Ok(LsnForTimestamp::Future(max_lsn))
+                Ok(LsnForTimestamp::NoData(min_lsn))
             }
             (false, true) => {
                 // Didn't find any commit timestamps smaller than the request
-                Ok(LsnForTimestamp::Past(max_lsn))
+                Ok(LsnForTimestamp::Past(min_lsn))
             }
-            (true, true) => {
-                // low is the LSN of the first commit record *after* the search_timestamp,
-                // Back off by one to get to the point just before the commit.
-                //
-                // FIXME: it would be better to get the LSN of the previous commit.
-                // Otherwise, if you restore to the returned LSN, the database will
-                // include physical changes from later commits that will be marked
-                // as aborted, and will need to be vacuumed away.
-                Ok(LsnForTimestamp::Present(Lsn((low - 1) * 8)))
+            (true, false) => {
+                // Only found commits with timestamps smaller than the request.
+                // It's still a valid case for branch creation, return it.
+                // And `update_gc_info()` ignores LSN for a `LsnForTimestamp::Future`
+                // case, anyway.
+                Ok(LsnForTimestamp::Future(commit_lsn))
             }
+            (true, true) => Ok(LsnForTimestamp::Present(commit_lsn)),
         }
     }
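The arithmetic this hunk hoists into `commit_lsn` can be exercised in isolation. A minimal sketch, not the pageserver code: `low`/`high` count 8-byte-aligned LSN units, and the binary search is replaced here by `partition_point` over an in-memory commit list, ending in the same `low` state the comment describes, so `(low - 1) * 8` recovers the LSN of the last commit at or before the search timestamp.

```rust
fn main() {
    // Hypothetical commit records: (lsn, timestamp), sorted; LSNs 8-byte aligned.
    let commits: &[(u64, u64)] = &[(0x10, 100), (0x20, 200), (0x30, 300)];
    let search_ts = 250;

    // Number of commits with timestamp <= search_ts; n > 0 means "found_smaller".
    let n = commits.partition_point(|&(_, ts)| ts <= search_ts);
    assert!(n > 0);

    // End state of the search: `low` is one past the unit (lsn / 8) of the
    // last commit at or before search_ts.
    let low = commits[n - 1].0 / 8 + 1;

    // Reconstruct the aligned LSN exactly as the patch does: (low - 1) * 8.
    let commit_lsn = (low - 1) * 8;
    assert_eq!(commit_lsn, 0x20);
    println!("commit_lsn = {:#x}", commit_lsn);
}
```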
@@ -567,30 +628,22 @@ impl Timeline {
     pub async fn get_current_logical_size_non_incremental(
         &self,
         lsn: Lsn,
-        cancel: CancellationToken,
         ctx: &RequestContext,
     ) -> Result<u64, CalculateLogicalSizeError> {
         crate::tenant::debug_assert_current_span_has_tenant_and_timeline_id();

         // Fetch list of database dirs and iterate them
-        let buf = self.get(DBDIR_KEY, lsn, ctx).await.context("read dbdir")?;
+        let buf = self.get(DBDIR_KEY, lsn, ctx).await?;
         let dbdir = DbDirectory::des(&buf).context("deserialize db directory")?;

         let mut total_size: u64 = 0;
         for (spcnode, dbnode) in dbdir.dbdirs.keys() {
-            for rel in self
-                .list_rels(*spcnode, *dbnode, lsn, ctx)
-                .await
-                .context("list rels")?
-            {
-                if cancel.is_cancelled() {
+            for rel in self.list_rels(*spcnode, *dbnode, lsn, ctx).await? {
+                if self.cancel.is_cancelled() {
                     return Err(CalculateLogicalSizeError::Cancelled);
                 }
                 let relsize_key = rel_size_to_key(rel);
-                let mut buf = self
-                    .get(relsize_key, lsn, ctx)
-                    .await
-                    .with_context(|| format!("read relation size of {rel:?}"))?;
+                let mut buf = self.get(relsize_key, lsn, ctx).await?;
                 let relsize = buf.get_u32_le();

                 total_size += relsize as u64;
@@ -603,11 +656,11 @@ impl Timeline {
     /// Get a KeySpace that covers all the Keys that are in use at the given LSN.
     /// Anything that's not listed maybe removed from the underlying storage (from
     /// that LSN forwards).
-    pub async fn collect_keyspace(
+    pub(crate) async fn collect_keyspace(
         &self,
         lsn: Lsn,
         ctx: &RequestContext,
-    ) -> anyhow::Result<KeySpace> {
+    ) -> Result<KeySpace, CollectKeySpaceError> {
         // Iterate through key ranges, greedily packing them into partitions
         let mut result = KeySpaceAccum::new();

@@ -616,7 +669,7 @@ impl Timeline {
         // Fetch list of database dirs and iterate them
         let buf = self.get(DBDIR_KEY, lsn, ctx).await?;
-        let dbdir = DbDirectory::des(&buf).context("deserialization failure")?;
+        let dbdir = DbDirectory::des(&buf)?;

         let mut dbs: Vec<(Oid, Oid)> = dbdir.dbdirs.keys().cloned().collect();
         dbs.sort_unstable();

@@ -649,7 +702,7 @@ impl Timeline {
             let slrudir_key = slru_dir_to_key(kind);
             result.add_key(slrudir_key);
             let buf = self.get(slrudir_key, lsn, ctx).await?;
-            let dir = SlruSegmentDirectory::des(&buf).context("deserialization failure")?;
+            let dir = SlruSegmentDirectory::des(&buf)?;
             let mut segments: Vec<u32> = dir.segments.iter().cloned().collect();
             segments.sort_unstable();
             for segno in segments {

@@ -667,7 +720,7 @@ impl Timeline {
         // Then pg_twophase
         result.add_key(TWOPHASEDIR_KEY);
         let buf = self.get(TWOPHASEDIR_KEY, lsn, ctx).await?;
-        let twophase_dir = TwoPhaseDirectory::des(&buf).context("deserialization failure")?;
+        let twophase_dir = TwoPhaseDirectory::des(&buf)?;
         let mut xids: Vec<TransactionId> = twophase_dir.xids.iter().cloned().collect();
         xids.sort_unstable();
         for xid in xids {

View File

@@ -1,106 +1,11 @@
 use crate::walrecord::NeonWalRecord;
-use anyhow::{bail, Result};
-use byteorder::{ByteOrder, BE};
+use anyhow::Result;
 use bytes::Bytes;
 use serde::{Deserialize, Serialize};
-use std::fmt;
 use std::ops::{AddAssign, Range};
 use std::time::Duration;

+pub use pageserver_api::key::{Key, KEY_SIZE};
-/// Key used in the Repository kv-store.
-///
-/// The Repository treats this as an opaque struct, but see the code in pgdatadir_mapping.rs
-/// for what we actually store in these fields.
-#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq, Ord, PartialOrd, Serialize, Deserialize)]
-pub struct Key {
-    pub field1: u8,
-    pub field2: u32,
-    pub field3: u32,
-    pub field4: u32,
-    pub field5: u8,
-    pub field6: u32,
-}
-
-pub const KEY_SIZE: usize = 18;
-
-impl Key {
-    /// 'field2' is used to store tablespaceid for relations and small enum numbers for other relish.
-    /// As long as Neon does not support tablespace (because of lack of access to local file system),
-    /// we can assume that only some predefined namespace OIDs are used which can fit in u16
-    pub fn to_i128(&self) -> i128 {
-        assert!(self.field2 < 0xFFFF || self.field2 == 0xFFFFFFFF || self.field2 == 0x22222222);
-        (((self.field1 & 0xf) as i128) << 120)
-            | (((self.field2 & 0xFFFF) as i128) << 104)
-            | ((self.field3 as i128) << 72)
-            | ((self.field4 as i128) << 40)
-            | ((self.field5 as i128) << 32)
-            | self.field6 as i128
-    }
-
-    pub const fn from_i128(x: i128) -> Self {
-        Key {
-            field1: ((x >> 120) & 0xf) as u8,
-            field2: ((x >> 104) & 0xFFFF) as u32,
-            field3: (x >> 72) as u32,
-            field4: (x >> 40) as u32,
-            field5: (x >> 32) as u8,
-            field6: x as u32,
-        }
-    }
-
-    pub fn next(&self) -> Key {
-        self.add(1)
-    }
-
-    pub fn add(&self, x: u32) -> Key {
-        let mut key = *self;
-        let r = key.field6.overflowing_add(x);
-        key.field6 = r.0;
-        if r.1 {
-            let r = key.field5.overflowing_add(1);
-            key.field5 = r.0;
-            if r.1 {
-                let r = key.field4.overflowing_add(1);
-                key.field4 = r.0;
-                if r.1 {
-                    let r = key.field3.overflowing_add(1);
-                    key.field3 = r.0;
-                    if r.1 {
-                        let r = key.field2.overflowing_add(1);
-                        key.field2 = r.0;
-                        if r.1 {
-                            let r = key.field1.overflowing_add(1);
-                            key.field1 = r.0;
-                            assert!(!r.1);
-                        }
-                    }
-                }
-            }
-        }
-        key
-    }
-
-    pub fn from_slice(b: &[u8]) -> Self {
-        Key {
-            field1: b[0],
-            field2: u32::from_be_bytes(b[1..5].try_into().unwrap()),
-            field3: u32::from_be_bytes(b[5..9].try_into().unwrap()),
-            field4: u32::from_be_bytes(b[9..13].try_into().unwrap()),
-            field5: b[13],
-            field6: u32::from_be_bytes(b[14..18].try_into().unwrap()),
-        }
-    }
-
-    pub fn write_to_byte_slice(&self, buf: &mut [u8]) {
-        buf[0] = self.field1;
-        BE::write_u32(&mut buf[1..5], self.field2);
-        BE::write_u32(&mut buf[5..9], self.field3);
-        BE::write_u32(&mut buf[9..13], self.field4);
-        buf[13] = self.field5;
-        BE::write_u32(&mut buf[14..18], self.field6);
-    }
-}

 pub fn key_range_size(key_range: &Range<Key>) -> u32 {
     let start = key_range.start;
@@ -129,51 +34,9 @@ pub fn singleton_range(key: Key) -> Range<Key> {
     key..key.next()
 }

-impl fmt::Display for Key {
-    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
-        write!(
-            f,
-            "{:02X}{:08X}{:08X}{:08X}{:02X}{:08X}",
-            self.field1, self.field2, self.field3, self.field4, self.field5, self.field6
-        )
-    }
-}
-
-impl Key {
-    pub const MIN: Key = Key {
-        field1: u8::MIN,
-        field2: u32::MIN,
-        field3: u32::MIN,
-        field4: u32::MIN,
-        field5: u8::MIN,
-        field6: u32::MIN,
-    };
-    pub const MAX: Key = Key {
-        field1: u8::MAX,
-        field2: u32::MAX,
-        field3: u32::MAX,
-        field4: u32::MAX,
-        field5: u8::MAX,
-        field6: u32::MAX,
-    };
-
-    pub fn from_hex(s: &str) -> Result<Self> {
-        if s.len() != 36 {
-            bail!("parse error");
-        }
-        Ok(Key {
-            field1: u8::from_str_radix(&s[0..2], 16)?,
-            field2: u32::from_str_radix(&s[2..10], 16)?,
-            field3: u32::from_str_radix(&s[10..18], 16)?,
-            field4: u32::from_str_radix(&s[18..26], 16)?,
-            field5: u8::from_str_radix(&s[26..28], 16)?,
-            field6: u32::from_str_radix(&s[28..36], 16)?,
-        })
-    }
-}
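For reference, the `to_i128`/`from_i128` packing moved out of this file (the type now lives in `pageserver_api::key`) roundtrips as long as `field2` fits in 16 bits, per the assert in the original. A self-contained sketch with the two methods copied out; the local `Key` struct here is a stand-in, not the real crate type:

```rust
// Local stand-in for the Key type, with the packing methods copied verbatim.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Key {
    field1: u8,
    field2: u32,
    field3: u32,
    field4: u32,
    field5: u8,
    field6: u32,
}

impl Key {
    // Packs the six fields into 128 bits: field1 keeps 4 bits, field2 is
    // truncated to 16 bits (hence the precondition), the rest keep full width.
    fn to_i128(&self) -> i128 {
        (((self.field1 & 0xf) as i128) << 120)
            | (((self.field2 & 0xFFFF) as i128) << 104)
            | ((self.field3 as i128) << 72)
            | ((self.field4 as i128) << 40)
            | ((self.field5 as i128) << 32)
            | self.field6 as i128
    }

    const fn from_i128(x: i128) -> Self {
        Key {
            field1: ((x >> 120) & 0xf) as u8,
            field2: ((x >> 104) & 0xFFFF) as u32,
            field3: (x >> 72) as u32,
            field4: (x >> 40) as u32,
            field5: (x >> 32) as u8,
            field6: x as u32,
        }
    }
}

fn main() {
    // field2 must fit in 16 bits for the roundtrip to hold.
    let key = Key { field1: 1, field2: 0x1234, field3: 3, field4: 4, field5: 5, field6: 6 };
    assert_eq!(Key::from_i128(key.to_i128()), key);
    println!("roundtrip ok");
}
```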
 /// A 'value' stored for a one Key.
 #[derive(Debug, Clone, Serialize, Deserialize)]
+#[cfg_attr(test, derive(PartialEq))]
 pub enum Value {
     /// An Image value contains a full copy of the value
     Image(Bytes),
@@ -197,6 +60,70 @@ impl Value {
     }
 }

+#[cfg(test)]
+mod test {
+    use super::*;
+    use bytes::Bytes;
+    use utils::bin_ser::BeSer;
+
+    macro_rules! roundtrip {
+        ($orig:expr, $expected:expr) => {{
+            let orig: Value = $orig;
+
+            let actual = Value::ser(&orig).unwrap();
+            let expected: &[u8] = &$expected;
+
+            assert_eq!(utils::Hex(&actual), utils::Hex(expected));
+
+            let deser = Value::des(&actual).unwrap();
+
+            assert_eq!(orig, deser);
+        }};
+    }
+
+    #[test]
+    fn image_roundtrip() {
+        let image = Bytes::from_static(b"foobar");
+        let image = Value::Image(image);
+
+        #[rustfmt::skip]
+        let expected = [
+            // top level discriminator of 4 bytes
+            0x00, 0x00, 0x00, 0x00,
+            // 8 byte length
+            0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x06,
+            // foobar
+            0x66, 0x6f, 0x6f, 0x62, 0x61, 0x72
+        ];
+
+        roundtrip!(image, expected);
+    }
+
+    #[test]
+    fn walrecord_postgres_roundtrip() {
+        let rec = NeonWalRecord::Postgres {
+            will_init: true,
+            rec: Bytes::from_static(b"foobar"),
+        };
+        let rec = Value::WalRecord(rec);
+
+        #[rustfmt::skip]
+        let expected = [
+            // flattened discriminator of total 8 bytes
+            0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00,
+            // will_init
+            0x01,
+            // 8 byte length
+            0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x06,
+            // foobar
+            0x66, 0x6f, 0x6f, 0x62, 0x61, 0x72
+        ];
+
+        roundtrip!(rec, expected);
+    }
+}
+
 ///
 /// Result of performing GC
 ///

View File

@@ -299,10 +299,6 @@ pub enum TaskKind {
 #[derive(Default)]
 struct MutableTaskState {
-    /// Tenant and timeline that this task is associated with.
-    tenant_id: Option<TenantId>,
-    timeline_id: Option<TimelineId>,
-
     /// Handle for waiting for the task to exit. It can be None, if the
     /// the task has already exited.
     join_handle: Option<JoinHandle<()>>,

@@ -319,6 +315,11 @@ struct PageServerTask {
     // To request task shutdown, just cancel this token.
     cancel: CancellationToken,

+    /// Tasks may optionally be launched for a particular tenant/timeline, enabling
+    /// later cancelling tasks for that tenant/timeline in [`shutdown_tasks`]
+    tenant_id: Option<TenantId>,
+    timeline_id: Option<TimelineId>,
+
     mutable: Mutex<MutableTaskState>,
 }

@@ -344,11 +345,9 @@ where
         kind,
         name: name.to_string(),
         cancel: cancel.clone(),
-        mutable: Mutex::new(MutableTaskState {
-            tenant_id,
-            timeline_id,
-            join_handle: None,
-        }),
+        tenant_id,
+        timeline_id,
+        mutable: Mutex::new(MutableTaskState { join_handle: None }),
     });

     TASKS.lock().unwrap().insert(task_id, Arc::clone(&task));
@@ -418,8 +417,6 @@ async fn task_finish(
     let mut shutdown_process = false;
     {
-        let task_mut = task.mutable.lock().unwrap();
-
         match result {
             Ok(Ok(())) => {
                 debug!("Task '{}' exited normally", task_name);

@@ -428,13 +425,13 @@ async fn task_finish(
                 if shutdown_process_on_error {
                     error!(
                         "Shutting down: task '{}' tenant_id: {:?}, timeline_id: {:?} exited with error: {:?}",
-                        task_name, task_mut.tenant_id, task_mut.timeline_id, err
+                        task_name, task.tenant_id, task.timeline_id, err
                     );
                     shutdown_process = true;
                 } else {
                     error!(
                         "Task '{}' tenant_id: {:?}, timeline_id: {:?} exited with error: {:?}",
-                        task_name, task_mut.tenant_id, task_mut.timeline_id, err
+                        task_name, task.tenant_id, task.timeline_id, err
                     );
                 }
             }

@@ -442,13 +439,13 @@ async fn task_finish(
                 if shutdown_process_on_error {
                     error!(
                         "Shutting down: task '{}' tenant_id: {:?}, timeline_id: {:?} panicked: {:?}",
-                        task_name, task_mut.tenant_id, task_mut.timeline_id, err
+                        task_name, task.tenant_id, task.timeline_id, err
                     );
                     shutdown_process = true;
                 } else {
                     error!(
                         "Task '{}' tenant_id: {:?}, timeline_id: {:?} panicked: {:?}",
-                        task_name, task_mut.tenant_id, task_mut.timeline_id, err
+                        task_name, task.tenant_id, task.timeline_id, err
                     );
                 }
             }
@@ -460,17 +457,6 @@ async fn task_finish(
     }
 }

-// expected to be called from the task of the given id.
-pub fn associate_with(tenant_id: Option<TenantId>, timeline_id: Option<TimelineId>) {
-    CURRENT_TASK.with(|ct| {
-        let mut task_mut = ct.mutable.lock().unwrap();
-        task_mut.tenant_id = tenant_id;
-        task_mut.timeline_id = timeline_id;
-    });
-}
-
-/// Is there a task running that matches the criteria
-
 /// Signal and wait for tasks to shut down.
 ///
 ///

@@ -493,17 +479,16 @@ pub async fn shutdown_tasks(
     {
         let tasks = TASKS.lock().unwrap();
         for task in tasks.values() {
-            let task_mut = task.mutable.lock().unwrap();
             if (kind.is_none() || Some(task.kind) == kind)
-                && (tenant_id.is_none() || task_mut.tenant_id == tenant_id)
-                && (timeline_id.is_none() || task_mut.timeline_id == timeline_id)
+                && (tenant_id.is_none() || task.tenant_id == tenant_id)
+                && (timeline_id.is_none() || task.timeline_id == timeline_id)
             {
                 task.cancel.cancel();
                 victim_tasks.push((
                     Arc::clone(task),
                     task.kind,
-                    task_mut.tenant_id,
-                    task_mut.timeline_id,
+                    task.tenant_id,
+                    task.timeline_id,
                 ));
             }
         }
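The point of this refactor is that `tenant_id`/`timeline_id` never change after spawn, so moving them out of the `Mutex`-protected `MutableTaskState` lets `shutdown_tasks` filter candidates without taking each task's lock (and removes the need for `associate_with`). A minimal sketch of the pattern; plain `u64` stands in for `TenantId`, and `Task`/`matching` are illustrative, not the real API:

```rust
use std::sync::{Arc, Mutex};

// Immutable-at-spawn metadata lives directly on the task; only genuinely
// mutable state (a join handle in the real code) stays behind the Mutex.
struct Task {
    tenant_id: Option<u64>,
    timeline_id: Option<u64>,
    mutable: Mutex<Option<String>>,
}

// Filtering by tenant needs no per-task lock acquisition.
fn matching<'a>(tasks: &'a [Arc<Task>], tenant_id: Option<u64>) -> Vec<&'a Arc<Task>> {
    tasks
        .iter()
        .filter(|t| tenant_id.is_none() || t.tenant_id == tenant_id)
        .collect()
}

fn main() {
    let tasks = vec![
        Arc::new(Task { tenant_id: Some(1), timeline_id: None, mutable: Mutex::new(None) }),
        Arc::new(Task { tenant_id: Some(2), timeline_id: None, mutable: Mutex::new(None) }),
    ];
    assert_eq!(matching(&tasks, Some(1)).len(), 1); // only tenant 1's task
    assert_eq!(matching(&tasks, None).len(), 2); // no filter: all tasks
    println!("ok");
}
```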

View File

@@ -26,6 +26,7 @@ use tracing::*;
 use utils::completion;
 use utils::crashsafe::path_with_suffix_extension;
 use utils::fs_ext;
+use utils::sync::gate::Gate;

 use std::cmp::min;
 use std::collections::hash_map::Entry;

@@ -54,6 +55,8 @@ use self::config::TenantConf;
 use self::delete::DeleteTenantFlow;
 use self::metadata::LoadMetadataError;
 use self::metadata::TimelineMetadata;
+use self::mgr::GetActiveTenantError;
+use self::mgr::GetTenantError;
 use self::mgr::TenantsMap;
 use self::remote_timeline_client::RemoteTimelineClient;
 use self::timeline::uninit::TimelineUninitMark;
@@ -252,6 +255,20 @@ pub struct Tenant {
     eviction_task_tenant_state: tokio::sync::Mutex<EvictionTaskTenantState>,

     pub(crate) delete_progress: Arc<tokio::sync::Mutex<DeleteTenantFlow>>,
+
+    // Cancellation token fires when we have entered shutdown(). This is a parent of
+    // Timelines' cancellation token.
+    pub(crate) cancel: CancellationToken,
+
+    // Users of the Tenant such as the page service must take this Gate to avoid
+    // trying to use a Tenant which is shutting down.
+    pub(crate) gate: Gate,
+}
+
+impl std::fmt::Debug for Tenant {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        write!(f, "{} ({})", self.tenant_id, self.current_state())
+    }
 }

 pub(crate) enum WalRedoManager {
@@ -359,34 +376,6 @@ impl Debug for SetStoppingError {
    }
}

-#[derive(Debug, thiserror::Error)]
-pub(crate) enum WaitToBecomeActiveError {
-    WillNotBecomeActive {
-        tenant_id: TenantId,
-        state: TenantState,
-    },
-    TenantDropped {
-        tenant_id: TenantId,
-    },
-}
-
-impl std::fmt::Display for WaitToBecomeActiveError {
-    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
-        match self {
-            WaitToBecomeActiveError::WillNotBecomeActive { tenant_id, state } => {
-                write!(
-                    f,
-                    "Tenant {} will not become active. Current state: {:?}",
-                    tenant_id, state
-                )
-            }
-            WaitToBecomeActiveError::TenantDropped { tenant_id } => {
-                write!(f, "Tenant {tenant_id} will not become active (dropped)")
-            }
-        }
-    }
-}

#[derive(thiserror::Error, Debug)]
pub enum CreateTimelineError {
    #[error("a timeline with the given ID already exists")]
@@ -395,6 +384,8 @@ pub enum CreateTimelineError {
    AncestorLsn(anyhow::Error),
    #[error("ancestor timeline is not active")]
    AncestorNotActive,
+    #[error("tenant shutting down")]
+    ShuttingDown,
    #[error(transparent)]
    Other(#[from] anyhow::Error),
}
@@ -526,7 +517,7 @@ impl Tenant {
        resources: TenantSharedResources,
        attached_conf: AttachedTenantConf,
        init_order: Option<InitializationOrder>,
-        tenants: &'static tokio::sync::RwLock<TenantsMap>,
+        tenants: &'static std::sync::RwLock<TenantsMap>,
        mode: SpawnMode,
        ctx: &RequestContext,
    ) -> anyhow::Result<Arc<Tenant>> {
@@ -1524,6 +1515,11 @@ impl Tenant {
            )));
        }

+        let _gate = self
+            .gate
+            .enter()
+            .map_err(|_| CreateTimelineError::ShuttingDown)?;

        if let Ok(existing) = self.get_timeline(new_timeline_id, false) {
            debug!("timeline {new_timeline_id} already exists");
@@ -1808,6 +1804,7 @@ impl Tenant {
        freeze_and_flush: bool,
    ) -> Result<(), completion::Barrier> {
        span::debug_assert_current_span_has_tenant_id();
        // Set tenant (and its timelines) to Stopping state.
        //
        // Since we can only transition into Stopping state after activation is complete,
@@ -1833,6 +1830,7 @@ impl Tenant {
            }
            Err(SetStoppingError::AlreadyStopping(other)) => {
                // give caller the option to wait for this shutdown
+                info!("Tenant::shutdown: AlreadyStopping");
                return Err(other);
            }
        };
@@ -1843,9 +1841,16 @@ impl Tenant {
            timelines.values().for_each(|timeline| {
                let timeline = Arc::clone(timeline);
                let span = Span::current();
-                js.spawn(async move { timeline.shutdown(freeze_and_flush).instrument(span).await });
+                js.spawn(async move {
+                    if freeze_and_flush {
+                        timeline.flush_and_shutdown().instrument(span).await
+                    } else {
+                        timeline.shutdown().instrument(span).await
+                    }
+                });
            })
        };
+        tracing::info!("Waiting for timelines...");
        while let Some(res) = js.join_next().await {
            match res {
                Ok(()) => {}
@@ -1855,12 +1860,21 @@ impl Tenant {
            }
        }

+        // We cancel the Tenant's cancellation token _after_ the timelines have all shut down. This permits
+        // them to continue to do work during their shutdown methods, e.g. flushing data.
+        tracing::debug!("Cancelling CancellationToken");
+        self.cancel.cancel();
+
        // shutdown all tenant and timeline tasks: gc, compaction, page service
        // No new tasks will be started for this tenant because it's in `Stopping` state.
        //
        // this will additionally shutdown and await all timeline tasks.
+        tracing::debug!("Waiting for tasks...");
        task_mgr::shutdown_tasks(None, Some(self.tenant_id), None).await;

+        // Wait for any in-flight operations to complete
+        self.gate.close().await;
+
        Ok(())
    }
@@ -2021,7 +2035,7 @@ impl Tenant {
        self.state.subscribe()
    }

-    pub(crate) async fn wait_to_become_active(&self) -> Result<(), WaitToBecomeActiveError> {
+    pub(crate) async fn wait_to_become_active(&self) -> Result<(), GetActiveTenantError> {
        let mut receiver = self.state.subscribe();
        loop {
            let current_state = receiver.borrow_and_update().clone();
@@ -2029,11 +2043,9 @@ impl Tenant {
                TenantState::Loading | TenantState::Attaching | TenantState::Activating(_) => {
                    // in these states, there's a chance that we can reach ::Active
                    receiver.changed().await.map_err(
-                        |_e: tokio::sync::watch::error::RecvError| {
-                            WaitToBecomeActiveError::TenantDropped {
-                                tenant_id: self.tenant_id,
-                            }
-                        },
+                        |_e: tokio::sync::watch::error::RecvError|
+                        // Tenant existed but was dropped: report it as non-existent
+                        GetActiveTenantError::NotFound(GetTenantError::NotFound(self.tenant_id)),
                    )?;
                }
                TenantState::Active { .. } => {
@@ -2041,10 +2053,7 @@ impl Tenant {
                }
                TenantState::Broken { .. } | TenantState::Stopping { .. } => {
                    // There's no chance the tenant can transition back into ::Active
-                    return Err(WaitToBecomeActiveError::WillNotBecomeActive {
-                        tenant_id: self.tenant_id,
-                        state: current_state,
-                    });
+                    return Err(GetActiveTenantError::WillNotBecomeActive(current_state));
                }
            }
        }
@@ -2110,6 +2119,9 @@ where
}

impl Tenant {
+    pub fn get_tenant_id(&self) -> TenantId {
+        self.tenant_id
+    }
+
    pub fn tenant_specific_overrides(&self) -> TenantConfOpt {
        self.tenant_conf.read().unwrap().tenant_conf
    }
@@ -2267,6 +2279,7 @@ impl Tenant {
            initial_logical_size_can_start.cloned(),
            initial_logical_size_attempt.cloned().flatten(),
            state,
+            self.cancel.child_token(),
        );

        Ok(timeline)
@@ -2356,6 +2369,8 @@ impl Tenant {
            cached_synthetic_tenant_size: Arc::new(AtomicU64::new(0)),
            eviction_task_tenant_state: tokio::sync::Mutex::new(EvictionTaskTenantState::default()),
            delete_progress: Arc::new(tokio::sync::Mutex::new(DeleteTenantFlow::default())),
+            cancel: CancellationToken::default(),
+            gate: Gate::new(format!("Tenant<{tenant_id}>")),
        }
    }
@@ -3519,10 +3534,6 @@ pub(crate) mod harness {
        let remote_fs_dir = conf.workdir.join("localfs");
        std::fs::create_dir_all(&remote_fs_dir).unwrap();
        let config = RemoteStorageConfig {
-            // TODO: why not remote_storage::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS,
-            max_concurrent_syncs: std::num::NonZeroUsize::new(2_000_000).unwrap(),
-            // TODO: why not remote_storage::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS,
-            max_sync_errors: std::num::NonZeroU32::new(3_000_000).unwrap(),
            storage: RemoteStorageKind::LocalFs(remote_fs_dir.clone()),
        };
        let remote_storage = GenericRemoteStorage::from_config(&config).unwrap();
@@ -3692,7 +3703,7 @@ mod tests {
    use tokio_util::sync::CancellationToken;

    static TEST_KEY: Lazy<Key> =
-        Lazy::new(|| Key::from_slice(&hex!("112222222233333333444444445500000001")));
+        Lazy::new(|| Key::from_slice(&hex!("010000000033333333444444445500000001")));

    #[tokio::test]
    async fn test_basic() -> anyhow::Result<()> {
@@ -3788,9 +3799,9 @@ mod tests {
        let writer = tline.writer().await;

        #[allow(non_snake_case)]
-        let TEST_KEY_A: Key = Key::from_hex("112222222233333333444444445500000001").unwrap();
+        let TEST_KEY_A: Key = Key::from_hex("110000000033333333444444445500000001").unwrap();
        #[allow(non_snake_case)]
-        let TEST_KEY_B: Key = Key::from_hex("112222222233333333444444445500000002").unwrap();
+        let TEST_KEY_B: Key = Key::from_hex("110000000033333333444444445500000002").unwrap();

        // Insert a value on the timeline
        writer
@@ -4236,11 +4247,7 @@ mod tests {
        metadata_bytes[8] ^= 1;
        std::fs::write(metadata_path, metadata_bytes)?;

-        let err = harness
-            .try_load_local(&ctx)
-            .await
-            .err()
-            .expect("should fail");
+        let err = harness.try_load_local(&ctx).await.expect_err("should fail");

        // get all the stack with all .context, not only the last one
        let message = format!("{err:#}");
        let expected = "failed to load metadata";
@@ -4374,7 +4381,7 @@ mod tests {
        let mut keyspace = KeySpaceAccum::new();

-        let mut test_key = Key::from_hex("012222222233333333444444445500000000").unwrap();
+        let mut test_key = Key::from_hex("010000000033333333444444445500000000").unwrap();
        let mut blknum = 0;
        for _ in 0..50 {
            for _ in 0..10000 {
@@ -4420,7 +4427,7 @@ mod tests {
        const NUM_KEYS: usize = 1000;

-        let mut test_key = Key::from_hex("012222222233333333444444445500000000").unwrap();
+        let mut test_key = Key::from_hex("010000000033333333444444445500000000").unwrap();

        let mut keyspace = KeySpaceAccum::new();
@@ -4501,7 +4508,7 @@ mod tests {
        const NUM_KEYS: usize = 1000;

-        let mut test_key = Key::from_hex("012222222233333333444444445500000000").unwrap();
+        let mut test_key = Key::from_hex("010000000033333333444444445500000000").unwrap();

        let mut keyspace = KeySpaceAccum::new();
@@ -4592,7 +4599,7 @@ mod tests {
        const NUM_KEYS: usize = 100;
        const NUM_TLINES: usize = 50;

-        let mut test_key = Key::from_hex("012222222233333333444444445500000000").unwrap();
+        let mut test_key = Key::from_hex("010000000033333333444444445500000000").unwrap();

        // Track page mutation lsns across different timelines.
        let mut updated = [[Lsn(0); NUM_KEYS]; NUM_TLINES];
@@ -4726,7 +4733,7 @@ mod tests {
        // Keeps uninit mark in place
        let raw_tline = tline.raw_timeline().unwrap();
        raw_tline
-            .shutdown(false)
+            .shutdown()
            .instrument(info_span!("test_shutdown", tenant_id=%raw_tline.tenant_id))
            .await;

        std::mem::forget(tline);


@@ -327,7 +327,7 @@ mod tests {
                let mut sz: u16 = rng.gen();
                // Make 50% of the arrays small
                if rng.gen() {
-                    sz |= 63;
+                    sz &= 63;
                }
                random_array(sz.into())
            })


@@ -21,7 +21,7 @@ use crate::{
};

use super::{
-    mgr::{GetTenantError, TenantsMap},
+    mgr::{GetTenantError, TenantSlotError, TenantSlotUpsertError, TenantsMap},
    remote_timeline_client::{FAILED_REMOTE_OP_RETRIES, FAILED_UPLOAD_WARN_THRESHOLD},
    span,
    timeline::delete::DeleteTimelineFlow,
@@ -33,12 +33,21 @@ pub(crate) enum DeleteTenantError {
    #[error("GetTenant {0}")]
    Get(#[from] GetTenantError),

+    #[error("Tenant not attached")]
+    NotAttached,
+
    #[error("Invalid state {0}. Expected Active or Broken")]
    InvalidState(TenantState),

    #[error("Tenant deletion is already in progress")]
    AlreadyInProgress,

+    #[error("Tenant map slot error {0}")]
+    SlotError(#[from] TenantSlotError),
+
+    #[error("Tenant map slot upsert error {0}")]
+    SlotUpsertError(#[from] TenantSlotUpsertError),
+
    #[error("Timeline {0}")]
    Timeline(#[from] DeleteTimelineError),
@@ -273,12 +282,12 @@ impl DeleteTenantFlow {
    pub(crate) async fn run(
        conf: &'static PageServerConf,
        remote_storage: Option<GenericRemoteStorage>,
-        tenants: &'static tokio::sync::RwLock<TenantsMap>,
-        tenant_id: TenantId,
+        tenants: &'static std::sync::RwLock<TenantsMap>,
+        tenant: Arc<Tenant>,
    ) -> Result<(), DeleteTenantError> {
        span::debug_assert_current_span_has_tenant_id();

-        let (tenant, mut guard) = Self::prepare(tenants, tenant_id).await?;
+        let mut guard = Self::prepare(&tenant).await?;

        if let Err(e) = Self::run_inner(&mut guard, conf, remote_storage.as_ref(), &tenant).await {
            tenant.set_broken(format!("{e:#}")).await;
@@ -378,7 +387,7 @@ impl DeleteTenantFlow {
        guard: DeletionGuard,
        tenant: &Arc<Tenant>,
        preload: Option<TenantPreload>,
-        tenants: &'static tokio::sync::RwLock<TenantsMap>,
+        tenants: &'static std::sync::RwLock<TenantsMap>,
        init_order: Option<InitializationOrder>,
        ctx: &RequestContext,
    ) -> Result<(), DeleteTenantError> {
@@ -405,15 +414,8 @@ impl DeleteTenantFlow {
    }

    async fn prepare(
-        tenants: &tokio::sync::RwLock<TenantsMap>,
-        tenant_id: TenantId,
-    ) -> Result<(Arc<Tenant>, tokio::sync::OwnedMutexGuard<Self>), DeleteTenantError> {
-        let m = tenants.read().await;
-        let tenant = m
-            .get(&tenant_id)
-            .ok_or(GetTenantError::NotFound(tenant_id))?;
+        tenant: &Arc<Tenant>,
+    ) -> Result<tokio::sync::OwnedMutexGuard<Self>, DeleteTenantError> {

        // FIXME: unsure about active only. Our init jobs may not be cancellable properly,
        // so at least for now allow deletions only for active tenants. TODO recheck
        // Broken and Stopping is needed for retries.
@@ -447,14 +449,14 @@ impl DeleteTenantFlow {
            )));
        }

-        Ok((Arc::clone(tenant), guard))
+        Ok(guard)
    }

    fn schedule_background(
        guard: OwnedMutexGuard<Self>,
        conf: &'static PageServerConf,
        remote_storage: Option<GenericRemoteStorage>,
-        tenants: &'static tokio::sync::RwLock<TenantsMap>,
+        tenants: &'static std::sync::RwLock<TenantsMap>,
        tenant: Arc<Tenant>,
    ) {
        let tenant_id = tenant.tenant_id;
@@ -487,7 +489,7 @@ impl DeleteTenantFlow {
        mut guard: OwnedMutexGuard<Self>,
        conf: &PageServerConf,
        remote_storage: Option<GenericRemoteStorage>,
-        tenants: &'static tokio::sync::RwLock<TenantsMap>,
+        tenants: &'static std::sync::RwLock<TenantsMap>,
        tenant: &Arc<Tenant>,
    ) -> Result<(), DeleteTenantError> {
        // Tree sort timelines, schedule delete for them. Mention retries from the console side.
@@ -535,10 +537,18 @@ impl DeleteTenantFlow {
            .await
            .context("cleanup_remaining_fs_traces")?;

-        let mut locked = tenants.write().await;
-        if locked.remove(&tenant.tenant_id).is_none() {
-            warn!("Tenant got removed from tenants map during deletion");
-        };
+        {
+            let mut locked = tenants.write().unwrap();
+            if locked.remove(&tenant.tenant_id).is_none() {
+                warn!("Tenant got removed from tenants map during deletion");
+            };
+
+            // FIXME: we should not be modifying this from outside of mgr.rs.
+            // This will go away when we simplify deletion (https://github.com/neondatabase/neon/issues/5080)
+            crate::metrics::TENANT_MANAGER
+                .tenant_slots
+                .set(locked.len() as u64);
+        }

        *guard = Self::Finished;

File diff suppressed because it is too large

Some files were not shown because too many files have changed in this diff